Idempotence and Race Conditions
Overview
When receiving messages in a distributed serverless system, a single action might be called multiple times, e.g.:
- If a Lambda halts partway through execution, it will be retried
- SQS guarantees at-least-once delivery, meaning it might deliver a message multiple times
- Two requests from different sources might trigger the same function, but we only want to process one request
- A single source might send the same request multiple times, e.g. if a Lambda halts and is retried
Some of the solutions to idempotence can also be used to check for race conditions (e.g. storing timestamps).
Race conditions in a serverless system can include:
- Two functions running at the same time making adjustments to the same data that overwrite each other
- Messages received out of the order they were sent
DynamoDB conditionals
DynamoDB is well suited to idempotence checks because a single write can evaluate a condition and adjust data atomically, e.g. setting the id of the request that is allowed to process a certain object/flow.
uniquerequestid
Each request that enters a Lambda has a unique id, e.g. the SQS messageId, that we can store in DynamoDB to record which request is allowed to initiate a flow. This works well with SQS triggers: if the Lambda halts, the message is retried automatically with the same messageId. Checking the stored id therefore lets the retry process the flow again while any other request is rejected.
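The claim described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the table name, the flowId key, and both function names are assumptions. build_claim_request only builds the keyword arguments for a DynamoDB UpdateItem call (low-level client format), and try_claim is an in-memory model of the same condition so the behaviour can be followed without AWS.

```python
from typing import Dict


def build_claim_request(table_name: str, flow_id: str, unique_request_id: str) -> dict:
    """Keyword arguments for a DynamoDB UpdateItem call that claims a flow.

    The write only succeeds if no request has claimed the flow yet, or if the
    stored id is our own (a Lambda retry of the same SQS message).
    """
    return {
        "TableName": table_name,
        "Key": {"flowId": {"S": flow_id}},  # hypothetical key attribute
        "UpdateExpression": "SET uniquerequestid = :rid",
        "ConditionExpression": (
            "attribute_not_exists(uniquerequestid) OR uniquerequestid = :rid"
        ),
        "ExpressionAttributeValues": {":rid": {"S": unique_request_id}},
    }


def try_claim(store: Dict[str, str], flow_id: str, unique_request_id: str) -> bool:
    """In-memory model of the condition above; returns True if the claim holds."""
    current = store.get(flow_id)
    if current is None or current == unique_request_id:
        store[flow_id] = unique_request_id
        return True
    return False  # another request already owns this flow


store: Dict[str, str] = {}
first = try_claim(store, "cart-1", "msg-A")   # first delivery claims the flow
retry = try_claim(store, "cart-1", "msg-A")   # retry with the same messageId
other = try_claim(store, "cart-1", "msg-B")   # a different request is rejected
```

Against real DynamoDB, a failed condition surfaces as a ConditionalCheckFailedException, which the caller would treat as "another request owns this flow".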
Record Handler Handling Errors
Record Handler re-sends an SQS trigger message by creating a new message with an increasing delay interval. Because it is a new message, it has a new messageId. We solve this by placing the uniquerequestid into the message attributes; on retry, the middleware finds it there and uses it instead of the new message's id.
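A rough sketch of that hand-off, under stated assumptions: the function names, the retrycount attribute, and the doubling delay are illustrative choices, not the Record Handler's real behaviour. The record shape follows the Lambda SQS event (messageId, body, messageAttributes with stringValue), and the returned dictionary holds keyword arguments for an SQS SendMessage call.

```python
MAX_SQS_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes


def resolve_unique_request_id(record: dict) -> str:
    """Prefer a uniquerequestid carried in the attributes over the messageId."""
    attrs = record.get("messageAttributes") or {}
    carried = attrs.get("uniquerequestid", {}).get("stringValue")
    return carried or record["messageId"]


def build_retry_message(queue_url: str, record: dict, retry_count: int,
                        base_delay: int = 10) -> dict:
    """Keyword arguments for SendMessage that re-send the record with a delay.

    The delay doubles with each retry (an assumed policy) up to the SQS limit.
    """
    delay = min(base_delay * (2 ** retry_count), MAX_SQS_DELAY_SECONDS)
    return {
        "QueueUrl": queue_url,
        "MessageBody": record["body"],
        "DelaySeconds": delay,
        "MessageAttributes": {
            "uniquerequestid": {
                "DataType": "String",
                "StringValue": resolve_unique_request_id(record),
            },
            "retrycount": {  # hypothetical attribute name
                "DataType": "Number",
                "StringValue": str(retry_count + 1),
            },
        },
    }


original = {"messageId": "m-1", "body": "{}", "messageAttributes": {}}
resent = build_retry_message("https://example.invalid/queue", original, 0)

# The re-sent message arrives with a new messageId but the original id attached.
redelivered = {
    "messageId": "m-2",
    "body": "{}",
    "messageAttributes": {
        "uniquerequestid": {"stringValue": "m-1", "dataType": "String"}
    },
}
```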
Direct SQS vs SNS>SQS messages
A direct SQS initial request received by the Lambda has the sent message attributes at the top level and its properties in the body object. Messages that pass from SNS to SQS have the original message attributes and properties pushed down into lower-level properties under body, with the SQS message's own attributes added at the top level.
Direct SQS messages get reformatted to mimic SNS>SQS messages. The first request will be at the top level, but retries are moved into body. This matches SNS>SQS handling and also allows extra details to be sent with the retried message, such as the number of retries and the original uniquerequestid.
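To illustrate the two shapes, here is a small sketch that reads the original payload and attributes from either kind of record. It assumes the payload is JSON and that SNS raw message delivery is disabled, so the SQS body holds the SNS envelope (Type, Message, MessageAttributes). The function name and the returned structure are hypothetical and do not reproduce the project's middleware.

```python
import json


def extract_original(record: dict) -> dict:
    """Return the original payload and attributes from an SQS event record."""
    body = json.loads(record["body"])
    is_sns = (
        isinstance(body, dict)
        and body.get("Type") == "Notification"
        and "Message" in body
    )
    if is_sns:
        # SNS>SQS: the original message and its attributes live under body.
        attributes = {
            name: value.get("Value")
            for name, value in body.get("MessageAttributes", {}).items()
        }
        return {
            "payload": json.loads(body["Message"]),
            "attributes": attributes,
            "via_sns": True,
        }
    # Direct SQS: attributes are at the top level, properties are the body.
    attributes = {
        name: value.get("stringValue")
        for name, value in (record.get("messageAttributes") or {}).items()
    }
    return {"payload": body, "attributes": attributes, "via_sns": False}


direct_record = {
    "messageId": "m-1",
    "body": json.dumps({"cartId": "c-1"}),
    "messageAttributes": {
        "uniquerequestid": {"stringValue": "m-0", "dataType": "String"}
    },
}
sns_record = {
    "messageId": "m-2",
    "body": json.dumps({
        "Type": "Notification",
        "MessageId": "sns-1",
        "Message": json.dumps({"cartId": "c-1"}),
        "MessageAttributes": {"source": {"Type": "String", "Value": "checkout"}},
    }),
    "messageAttributes": {},
}
```

Normalising both shapes in one place means the rest of the handler does not need to know how the message arrived.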
Checking uniquerequestid
At the start of any flow or function that must not be processed by multiple requests, we can check the stored uniquerequestid for that flow. It is stored in a suitable DynamoDB table and reset once processing is complete or, for a cache workflow, when the cache is reset to be processed.
A flow might have multiple stages that each must not be processed by multiple requests. Each stage can have its own uniquerequestid, all of which are removed once processing is complete.
Ensuring old processing does not overwrite new processing
Some flows might reset while halfway through processing, e.g. TriggeredCache, meaning an old flow might try to update data after a new flow has begun working on that data.
We can protect against this by passing along the flow's uniquerequestid, which is checked each time data/state is updated, preferably using a conditional update. If the uniquerequestid has changed (or been reset/removed), the update does not execute. The calling code should also catch this failure and stop the old flow's logic.
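The guard can be sketched like this. The names (StaleFlowError, build_guarded_update, guarded_update, run_step) and the flowId and status attributes are assumptions for illustration. build_guarded_update builds keyword arguments for a DynamoDB UpdateItem call, and the in-memory functions model the same check, including catching the failure so the old flow stops.

```python
class StaleFlowError(Exception):
    """The flow's uniquerequestid no longer matches the stored one."""


def build_guarded_update(table_name: str, flow_id: str,
                         unique_request_id: str, status: str) -> dict:
    """UpdateItem arguments that only apply while this request owns the flow."""
    return {
        "TableName": table_name,
        "Key": {"flowId": {"S": flow_id}},  # hypothetical key attribute
        "UpdateExpression": "SET #status = :status",
        # False when the stored id differs or the attribute has been removed.
        "ConditionExpression": "uniquerequestid = :rid",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":status": {"S": status},
            ":rid": {"S": unique_request_id},
        },
    }


def guarded_update(item: dict, unique_request_id: str, changes: dict) -> None:
    """In-memory model of the conditional update."""
    if item.get("uniquerequestid") != unique_request_id:
        raise StaleFlowError("flow was reset or claimed by a newer request")
    item.update(changes)


def run_step(item: dict, unique_request_id: str, changes: dict) -> bool:
    """Apply a step's changes; return False so the caller can stop the flow."""
    try:
        guarded_update(item, unique_request_id, changes)
    except StaleFlowError:
        return False
    return True


item = {"uniquerequestid": "msg-A", "status": "started"}
ok = run_step(item, "msg-A", {"status": "validated"})   # current flow updates
item["uniquerequestid"] = "msg-B"                        # flow reset by new request
stale = run_step(item, "msg-A", {"status": "complete"})  # old flow is blocked
```

With real DynamoDB the failed condition is reported as a ConditionalCheckFailedException, which plays the role of StaleFlowError here.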
Passing on the uniquerequestid to external flows
External logic needed in a flow, e.g. a SearchResult or ComplexFilter request, will not be able to pass the uniquerequestid back in the Complete message, but we need to know the flow's uniquerequestid to determine whether to continue updating data.
We can solve this by, for example, saving the uniquerequestid into the AwaitingStep table. When the flow returns from the external logic, we find the pending steps and also get the saved uniquerequestid, which can then be checked when updating data.
This only works if the AwaitingStep ids are unique per processing request. This can be achieved by adding the uniquerequestid into the flow's unique object id, e.g. when we ValidateCart we add the uniquerequestid into the object that creates the ... id, so that new AwaitingStep records do not overwrite old ones. If we did not do this, new requests would overwrite the old AwaitingStep records and external flow responses would always continue, even if they returned stale data.
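As an illustration of why including the uniquerequestid keeps records apart, the sketch below derives an id by hashing the object together with the uniquerequestid. How the real id is created from the object is not stated here, so the hashing scheme, the function name, and the example fields are all assumptions.

```python
import hashlib
import json


def awaiting_step_id(flow_object: dict, unique_request_id: str) -> str:
    """Derive an id that is unique per flow object and per processing request."""
    keyed = {**flow_object, "uniquerequestid": unique_request_id}
    # Sorted keys give the same id regardless of the order fields were added.
    canonical = json.dumps(keyed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cart = {"flow": "ValidateCart", "cartId": "c-1"}  # hypothetical fields
old_id = awaiting_step_id(cart, "msg-A")
new_id = awaiting_step_id(cart, "msg-B")

# Records are keyed by id, so the newer request does not overwrite the older one.
awaiting_steps = {
    old_id: {"uniquerequestid": "msg-A"},
    new_id: {"uniquerequestid": "msg-B"},
}
```

When an external response arrives for the old id, the saved uniquerequestid (msg-A in this example) no longer matches the flow's current one, so the stale response can be detected instead of continuing.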