A Forward Retry is a mechanism in which a failed operation against an external service is automatically attempted again after a delay. The primary purpose of this design is to handle transient network failures and temporary outages, thereby improving the reliability of the system.
I recently faced a similar situation at work where I had to make sure that my service always stayed in sync with the external system, even in case of a temporary failure. I chose this implementation because of its simplicity and its optimistic flow. By optimistic flow, I mean designing the service to proceed assuming the ‘happy path’ will succeed, while ensuring the safety net (the reconciliation service) is in place for when it doesn’t.
That way, if the call fails because of a temporary outage or network failure, the external reconciliation service (we will talk about it in a minute) can look at the recorded intent and retry the flow to fix the state.
Why Use this Pattern?
You would use this pattern when you don’t want your system to be out-of-sync and want a reliable way to reconcile the state on failure instead of blocking it.
It’s not a new pattern, but rather a variant or a pragmatic successor to complex solutions like Two-Phase Commit (2PC), which often introduce blocking and high latency in distributed transactions.
There are many variants of this pattern, which you might know by the following names:
- transactional outbox,
- reconciliation pattern,
- saga/compensating transactions (for multi-step workflows),
- CDC -> publisher approaches.
These patterns became widespread as microservices and highly available web systems replaced heavy distributed locking/2PC solutions.
Idempotency
Another important consideration when implementing this design pattern is to make sure the downstream services are idempotent.
When the downstream services are idempotent, it is much easier to repeat a create call without worrying about duplicates. That is exactly what the retry scenarios below rely on.
stateDiagram-v2
direction TB
%% Start
[*] --> Idle
%% Flow
Idle --> ReceiveRequest: "Client → Request (idempotencyKey)"
ReceiveRequest --> CheckSeen: "Lookup idempotencyKey"
CheckSeen --> NotSeen: "Key NOT found"
CheckSeen --> Seen: "Key found"
NotSeen --> Execute: "Perform action / side-effect"
Execute --> Record: "Persist result ⟶ (idempotencyKey ↦ response)"
Record --> ReturnSuccess: "Return success to client"
ReturnSuccess --> [*]
Seen --> ReturnStored: "Return stored response (no re-run)"
ReturnStored --> [*]
%% Errors & retries
Execute --> Failure: "Transient error"
Failure --> Retry: "Retry (backoff)"
Retry --> Execute
%% Visual styling
classDef stateBlue fill:#e8f0ff,stroke:#1565c0,stroke-width:2;
classDef stateYellow fill:#fff8e6,stroke:#f57c00,stroke-width:2;
classDef stateGreen fill:#e6fff0,stroke:#2e7d32,stroke-width:2;
classDef stateRed fill:#fff0f0,stroke:#c62828,stroke-width:2;
class ReceiveRequest,CheckSeen stateBlue
class NotSeen,Execute,Record stateYellow
class ReturnSuccess,ReturnStored stateGreen
class Failure,Retry stateRed
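The flow in the diagram above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the code from the article's repository: the class name `IdempotentHandler`, the method `handle`, and the choice of a `ConcurrentHashMap` as the key store are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical sketch: remembers one response per idempotency key. */
class IdempotentHandler {
    // idempotencyKey -> stored response
    private final Map<String, String> responses = new ConcurrentHashMap<>();

    /**
     * Runs the action only the first time a key is seen and returns the
     * stored response for every later call that carries the same key.
     */
    String handle(String idempotencyKey, Supplier<String> action) {
        // computeIfAbsent records the result only when the action succeeds.
        // If the action throws, nothing is stored, so a retry runs it again.
        return responses.computeIfAbsent(idempotencyKey, key -> action.get());
    }
}

class IdempotencyDemo {
    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        int[] sideEffects = {0};
        // Two requests with the same key: the side effect happens once
        String first = handler.handle("policy-123", () -> "ext-" + (++sideEffects[0]));
        String second = handler.handle("policy-123", () -> "ext-" + (++sideEffects[0]));
        System.out.println(first + " " + second + " sideEffects=" + sideEffects[0]);
    }
}
```

A real service would keep the key-to-response mapping in durable storage so that it survives restarts; the in-memory map only shows the control flow.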
Takeaway: it’s an industry-standard approach when you need reliability and scalability without global transactions.
How can we develop this and see it in action? Let’s assume a mock scenario.
Problem Statement
Scenario: You are a senior engineer on your team, and you are tasked with developing a Policy Management Service that calls external Policy Provider(s) to store and manage policies for the application teams. Since the external provider can change, we are building a central service to provide that abstraction for our organization.
The Policy Management Service should always be in sync with the external policy provider service.
Demo the following failure scenarios and reconciliation tactics for each:
- Policy Management Service crashes before making the call to external service
- Policy Management Service crashes after making a successful call to external service
- External Service fails to create the record and errors out
The three scenarios above broadly cover different stages of reconciliation. There could be more stages, and the list should be adjusted based on how much robustness is required.
Designing the Architecture
Let’s look at the high-level design. Then we will break each flow into its own sequence diagram so it’s easier to follow along.
- User calls the DNS server with the domain
- DNS returns the IP address of the load balancer of your service
- User calls the Policy Management Service to create the policy
- Policy Management Service creates a Policy:
  - Stores it in Local DB
  - Triggers Delayed Reconciliation Workflow
  - Calls external policy provider service to create the policy
- External service receives the request and either:
  - Creates policy in the database and returns externalPolicyId
  - Fails at validation and returns 400 Bad Request
  - Crashes and returns 500 Internal Server Error
- Reconciliation worker awakes after, let’s say, 5 seconds, reads the policy from the local database by policy id, and tries to reconcile. There are 3 cases:
  - state of the policy is CREATE_PENDING
  - state of the policy is ACTIVE
  - state of the policy is FAILED
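To make the three states concrete, here is a small sketch of the data model the flow above implies. The field order follows the constructor calls and accessors that appear in the article's own snippets (`id`, `externalID`, `status`, then description and statement), but the exact definition, including the last two field names, is an assumption. It requires Java 16 or later for records.

```java
/** Lifecycle states named in the walkthrough above. */
enum Status { CREATE_PENDING, ACTIVE, FAILED }

/**
 * Immutable policy record (assumed shape). The "with" methods return
 * modified copies, so a stored record is never mutated in place.
 */
record Policy(String id, String externalID, Status status, String description, String statement) {

    /** Copy of this policy with a different status. */
    Policy withStatus(Status newStatus) {
        return new Policy(id, externalID, newStatus, description, statement);
    }

    /** Copy of this policy with the id assigned by the external provider. */
    Policy withExternalID(String newExternalID) {
        return new Policy(id, newExternalID, status, description, statement);
    }
}

class PolicyModelDemo {
    public static void main(String[] args) {
        Policy pending = new Policy("p-1", "", Status.CREATE_PENDING,
                "This is a test policy", "permit(principal, action, resource);");
        Policy active = pending.withExternalID("ext-1").withStatus(Status.ACTIVE);
        System.out.println(pending.status() + " -> " + active.status());
    }
}
```

Keeping the record immutable means every state change is an explicit write to the local store, which is what the reconciliation worker later inspects.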
Here’s a high-level flow diagram (top to bottom).
---
title: High Level Architecture Policy Management Service
---
flowchart TD
user["User"] -->|compensatingaction.bemyaficionado.com| dns["DNS"]
dns -->|ip address| user
user --> lb["Load Balancer"]
lb --> pms
pms["Policy Management Service"]
pms -->|initiate delayed reconciliation workflow| reconciliation_service["Reconciliation Service"]
pms -->|create policy with status 'pending'| localdb[("Local Policy Store DB")]
pms -->|create policy| external_provider["External Policy Provider"]
subgraph "Reconciliation Flow"
reconciliation_service -->check_status{"Check Status?"}
check_status -->|CREATE_PENDING|create_policy[["Create Policy"]]
end
reconciliation_service -->|fetch transaction state of 'policy'|localdb
create_policy -->|"Create policy with the same parameters"|external_provider
create_policy -->|"Update external_id in Local DB"| localdb
subgraph "External Policy Provider"
external_provider -->|create and store policy| policy_store_db[("Policy Store DB")]
policy_store_db .->|success| external_provider
end

Scenario 1 & 2/ Crash Before Making Call to External Policy Provider Service
First, let’s tackle the first scenario where the Policy Management Service crashes before making the call to the external policy provider.
- PMS schedules the reconciliation service with a delay of 5 seconds.
  - The 5-second delay is arbitrary and chosen for illustration. In reality, if the current sequence takes less than 200ms to complete, a delay of 500ms or 1000ms is more than enough before triggering the reconciliation process.
  - The main point is that the reconciliation process should start after the current process has completed.
- Policy Management Service (PMS) creates and stores the Policy object in its local db with status='CREATE_PENDING'.
- PMS crashes afterwards.
  - At this point we don’t know whether the policy was created at the External Provider or not. This is where the idempotency of the services (discussed above) becomes useful. Idempotency in this case means I can trigger this call as many times as needed without any additional side effects.
- Reconciliation Service starts after the set delay.
  - Reads the status from the local db with the policyId. It finds: status='CREATE_PENDING'
  - It calls the external policy provider service to create the policy.
  - Updates the external policy id and the status in the database: external_id={ExternalPolicyId}, status='ACTIVE'
- Reconciliation successful
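The reconciliation steps above could look roughly like the sketch below. It is not the `ReconciliationService` from the article's repository: the class name `ReconciliationSketch`, the trimmed-down `Policy` record, the explicit `delayMillis` parameter, and the `Function` standing in for the external provider are all assumptions made to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class ReconciliationSketch {
    enum Status { CREATE_PENDING, ACTIVE, FAILED }
    record Policy(String id, String externalID, Status status) {}

    private final Map<String, Policy> localDb;
    // Hypothetical stand-in for the idempotent external create: policy -> externalPolicyId
    private final Function<Policy, String> externalCreate;
    // Daemon thread so a pending reconciliation never keeps the JVM alive
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "reconciler");
                thread.setDaemon(true);
                return thread;
            });

    ReconciliationSketch(Map<String, Policy> localDb, Function<Policy, String> externalCreate) {
        this.localDb = localDb;
        this.externalCreate = externalCreate;
    }

    /** Schedules one reconciliation pass for the policy after the given delay. */
    void scheduleReconciliation(String policyId, long delayMillis) {
        scheduler.schedule(() -> reconcile(policyId), delayMillis, TimeUnit.MILLISECONDS);
    }

    /** One reconciliation pass; safe to run more than once for the same policy. */
    void reconcile(String policyId) {
        Policy policy = localDb.get(policyId);
        if (policy == null || policy.status() != Status.CREATE_PENDING) {
            return; // never stored, already ACTIVE, or already FAILED: nothing to repair
        }
        // The provider is assumed to be idempotent, so repeating the create is safe
        String externalId = externalCreate.apply(policy);
        localDb.put(policyId, new Policy(policy.id(), externalId, Status.ACTIVE));
    }
}
```

Because `reconcile` only acts on `CREATE_PENDING` records, it covers both crash points: if the first create never reached the provider it is performed now, and if it did, the idempotent provider returns the same external id.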
---
title: Policy Management Service Crash Before Making Call to External Policy Provider Service
---
sequenceDiagram
title: Policy Management Service Crashes before calling external Policy Provider
participant pms as PolicyManagementService
participant localdb as LocalDB
participant policyprovider as ExternalPolicyProviderService
participant externaldb as ExternalDB
participant reconciliation as Reconciliation Service
pms ->> reconciliation: initiate delayed reconciliation<br/> with `PolicyId`<br/>(5 seconds delay)
pms ->>+ localdb: create policy with ID and Status = 'CREATE_PENDING'
localdb -->>- pms: success
rect rgba(230,50,50)
pms -x policyprovider: crashed
end
reconciliation ->>+ localdb: read record by `PolicyId`
localdb -->>-reconciliation: Policy record with status 'CREATE_PENDING'
reconciliation ->>+ policyprovider: create policy with same details <br/>(Idempotent)
policyprovider ->>+ externaldb: create policy
externaldb -->>- policyprovider: success
policyprovider -->>- reconciliation: `ExternalPolicyID`
reconciliation ->>+ localdb: update status='ACTIVE', externalId=`ExternalPolicyID`
localdb -->>- reconciliation: success

This is the implementation of the scenario that mimics the PMS crashing after writing the policy to the local db and calling the external service to create the policy. I return null right after calling the external policy provider to mimic the crash.
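public Policy crashAfterCallingExternalService(CreatePolicyRequest createPolicyRequest) {
    String policyId = UUID.randomUUID().toString();
    // Schedule the delayed reconciliation first so it still runs if we crash below
    this.reconciliationService.scheduleReconciliation(policyId);
    // Record the intent locally: CREATE_PENDING and no external id yet
    var policy = new Policy(policyId, "", Status.CREATE_PENDING, createPolicyRequest.description(), createPolicyRequest.statement());
    db.put(policyId, policy);
    // The external call succeeds, but its result is never written back locally
    String externalId = this.externalService.createPolicy(policy);
    // Returning null mimics the crash; the local record stays CREATE_PENDING
    return null;
}

Here’s the test for that implementation.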
@SneakyThrows
@Test
void it_should_mimic_server_crash_when_the_policy_has_been_created_successfully_in_external_service() {
var externalService = new ExternalService(externalServiceDb, Map.of("CREATE_POLICY", true));
var reconciliationService = new ReconciliationService(policyServiceDb, externalService);
var policyService = new PolicyService(policyServiceDb, externalService, reconciliationService);
CreatePolicyRequest testCreatePolicyRequest = new CreatePolicyRequest("This is a test policy", "permit(principal, action, resource);");
Policy output = policyService.crashAfterCallingExternalService(testCreatePolicyRequest);
assertNull(output);
assertEquals(1, policyServiceDb.estimatedSize());
assertEquals(1, externalServiceDb.estimatedSize());
// verify the policy service crashed with 'CREATE_PENDING' state
var keys = policyServiceDb.asMap().keySet();
assertFalse(keys.isEmpty());
String key = keys.stream().findFirst().orElseThrow();
Policy failedPolicy = policyServiceDb.asMap().get(key);
assertEquals(Status.CREATE_PENDING, failedPolicy.status());
assertTrue(failedPolicy.externalID().isEmpty());
// verify that policy was created by the external service successfully, thus, inconsistent state
Policy createdPolicy = externalServiceDb.asMap().get(failedPolicy.id());
assertFalse(createdPolicy.externalID().isBlank());
assertEquals(Status.ACTIVE, createdPolicy.status());
// verify that the reconciliation service is working properly to reconcile the state
awaitSchedulerExecution();
assertEquals(Status.ACTIVE, policyServiceDb.asMap().get(failedPolicy.id()).status());
}

The above test highlights the state at each step.

Scenario 3/ External Service fails to create the record and errors out
Now, let’s look at the third scenario, where the Policy Management Service was able to call the external policy provider service but that service failed.
%%{
init: {
'theme': 'light',
'themeCSS': '.messageLine0:nth-of-type(4) { stroke: red; textcolor: red;};'
}
}%%
sequenceDiagram
title: External Policy Provider fails to create the policy
participant pms as PolicyManagementService
participant localdb as LocalDB
participant policyprovider as ExternalPolicyProviderService
participant externaldb as ExternalDB
pms ->> localdb: create policy with ID and Status = 'CREATE_PENDING'
localdb -->> pms: success
pms ->> policyprovider: create policy
policyprovider -x pms: FAILURE
pms ->> localdb: update status as 'FAILED'

Let me write a test case to better explain what we are testing here.
class PolicyServiceTest {
private Cache<String, Policy> policyServiceDb;
private Cache<String, Policy> externalServiceDb;
@BeforeEach
void setUp() {
policyServiceDb = Caffeine.newBuilder()
.expireAfterWrite(1, TimeUnit.DAYS)
.maximumSize(1000)
.build();
externalServiceDb = Caffeine.newBuilder()
.expireAfterWrite(1, TimeUnit.DAYS)
.maximumSize(1000)
.build();
}
@AfterEach
void tearDown() {
policyServiceDb.invalidateAll();
externalServiceDb.invalidateAll();
}
@SneakyThrows
@Test
void it_should_update_the_record_as_failed_in_db_if_external_service_fails() {
var externalService = new ExternalService(externalServiceDb, Map.of("THROW_EXCEPTION", true));
var reconciliationService = new ReconciliationService(policyServiceDb, externalService);
var policyService = Mockito.spy(new PolicyService(policyServiceDb, externalService, reconciliationService));
CreatePolicyRequest testCreatePolicyRequest = new CreatePolicyRequest("This is a test policy", "permit(principal, action, resource);");
assertThrows(CreatePolicyException.class, () -> {
Policy output = policyService.compensateActionsIfExternalServiceFailsToCreatePolicy(testCreatePolicyRequest);
});
awaitSchedulerExecution();
assertEquals(1, policyServiceDb.estimatedSize());
var keys = policyServiceDb.asMap().keySet();
assertFalse(keys.isEmpty());
String key = keys.stream().findFirst().orElseThrow();
Policy failedPolicy = policyServiceDb.asMap().get(key);
assertEquals(Status.FAILED, failedPolicy.status());
}
}

Here we make sure that the status of the policy in the database is updated to FAILED. And since the Policy Management Service didn’t fail, it can perform the compensating action itself. There is no need for the reconciliation service here.
Here’s how the code will work.
public Policy compensateActionsIfExternalServiceFailsToCreatePolicy(CreatePolicyRequest createPolicyRequest) {
    String policyId = UUID.randomUUID().toString();
    this.reconciliationService.scheduleReconciliation(policyId);
    // Record the intent locally before calling the external provider
    var policy = new Policy(policyId, "", Status.CREATE_PENDING, createPolicyRequest.description(), createPolicyRequest.statement());
    db.put(policyId, policy);
    try {
        String externalId = this.externalService.createPolicy(policy);
        // Happy path: store the external id and mark the policy ACTIVE
        Policy createdPolicy = policy.withExternalID(externalId).withStatus(Status.ACTIVE);
        this.db.put(policyId, createdPolicy);
        return createdPolicy;
    } catch (CreatePolicyException ex) {
        // Compensating action: the service is still alive, so it marks the record FAILED itself
        this.db.put(policyId, policy.withStatus(Status.FAILED));
        // Rethrow the original exception so the caller keeps the failure details
        throw ex;
    }
}

And when I run the test, it passes. That means the state is correct.

Considerations for Production Systems
1/ Adopt the Outbox Pattern using Change Data Capture (CDC)
Instead of calling the reconciliation service directly like I did in the example above, it would be more reliable to rely on a database-level trigger such as a Change Data Capture (CDC) mechanism. Whenever a record is inserted into the db, the change is sent to a queue, which triggers the reconciliation pipeline, so the pipeline gets the data automatically. That is far more robust and reliable than making a service call at the start of your execution.
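As a rough illustration of that idea, the sketch below simulates it in memory: every insert into the store also lands on a queue, and the reconciler is driven purely by what it reads from that queue. Real CDC reads the database's change log through a connector; the classes `CapturedPolicyStore` and `QueueDrivenReconciler` are hypothetical stand-ins, and the external create call is replaced by a placeholder status update.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** In-memory stand-in for a policy table whose inserts are captured onto a queue. */
class CapturedPolicyStore {
    private final Map<String, String> statusById = new ConcurrentHashMap<>();
    private final BlockingQueue<String> changeQueue = new LinkedBlockingQueue<>();

    /** Inserts a row; the change event is emitted as part of the write itself. */
    void insert(String policyId, String status) {
        statusById.put(policyId, status);
        changeQueue.add(policyId); // a real CDC connector would derive this from the DB log
    }

    void updateStatus(String policyId, String status) {
        statusById.put(policyId, status);
    }

    String status(String policyId) {
        return statusById.get(policyId);
    }

    /** Next captured policy id, or null when the queue is empty. */
    String pollChange() {
        return changeQueue.poll();
    }
}

class QueueDrivenReconciler {
    /** Drains captured inserts and repairs rows still pending; returns how many were repaired. */
    static int drain(CapturedPolicyStore store) {
        int repaired = 0;
        String policyId;
        while ((policyId = store.pollChange()) != null) {
            if ("CREATE_PENDING".equals(store.status(policyId))) {
                // Placeholder for the idempotent create call to the external provider
                store.updateStatus(policyId, "ACTIVE");
                repaired++;
            }
        }
        return repaired;
    }
}
```

The point of the design is that the service no longer has to remember to schedule reconciliation: the write to the database is the trigger, so a crash right after the insert cannot lose the reconciliation request.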
flowchart TD
db[("Database")] -->|CDC|queue[/queue/]
queue -->|"Read CDC records"| ReconciliationService

2/ Ensure Idempotency
- Use a stable business idempotency key (e.g. your PolicyId) in calls to the external provider.
- The external provider should support idempotent creation (either dedupe by client id or return the existing record if already created).
- Locally, the worker must handle duplicate success responses safely (update with ON CONFLICT/upsert).
3/ Retries & backoff
- Implement exponential backoff with a max attempts counter.
- For persistent failures, move to dead-letter / manual reconciliation queue.
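The two points above can be combined in a small helper. This is a sketch under stated assumptions rather than a production retry library: `BackoffRetrier`, its in-memory dead-letter list, and the doubling delay `base * 2^(attempt - 1)` are illustrative choices, and every `RuntimeException` is treated as transient for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

class BackoffRetrier {
    private final int maxAttempts;
    private final long baseDelayMillis;
    // Hypothetical dead-letter store: ids that need manual reconciliation
    private final List<String> deadLetters = new ArrayList<>();

    BackoffRetrier(int maxAttempts, long baseDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** Delay after the given failed attempt (1-based): base, 2*base, 4*base, ... */
    long delayFor(int attempt) {
        return baseDelayMillis << (attempt - 1);
    }

    /** Tries the call up to maxAttempts times; on exhaustion records a dead letter. */
    Optional<String> callWithRetry(String policyId, Supplier<String> call) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.of(call.get());
            } catch (RuntimeException transientFailure) {
                if (attempt == maxAttempts) {
                    break; // out of attempts, fall through to the dead letter
                }
                try {
                    Thread.sleep(delayFor(attempt));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break;
                }
            }
        }
        deadLetters.add(policyId);
        return Optional.empty();
    }

    List<String> deadLetters() {
        return deadLetters;
    }
}
```

For example, `new BackoffRetrier(5, 200)` would wait 200ms, 400ms, 800ms and 1600ms between its five attempts. In practice you would likely also cap the maximum delay and add random jitter so that many workers do not retry in lockstep.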
Conclusion
Today we looked at a widely used mechanism for making a system more robust whenever two services need to stay in sync without the cost and complexity of distributed transactions. There are many variants and flavours of this pattern, which you can adopt based on your requirements.
I’ve not covered all the cases, as our example was quite simple and straightforward, but covering them becomes important when you are dealing with a production use case. Listing all possible failure scenarios up front makes them easier to cover in your reconciliation service.
You can follow the complete code in my GitHub repository here: Forward Retry Mechanism System Design
