Idempotence and Race Conditions
Overview
When receiving messages in a distributed serverless system, a single action might be called multiple times, eg:
- If Lambda halts part way through execution it will retry
- SQS guarantees at-least-once delivery, meaning it might deliver a message multiple times
- Two requests from different sources might trigger the same function, but we only want to process one request
- A single source might send the same request multiple times, eg if a Lambda halts and is retried
Some of the solutions to idempotence can also be used to check for race conditions (eg storing timestamps).
Race conditions in a serverless system can include:
- Two functions running at the same time making adjustments to the same data that overwrite each other
- Messages received out of the order they were sent
DynamoDB conditionals
DynamoDB is a good fit for idempotence checks because a write can carry a condition that is evaluated atomically with the update, eg setting the id of the request that is allowed to process a certain object/flow only if no other request has already claimed it.
uniquerequestid
Each request that enters a Lambda has a unique id, eg the SQS messageId, that we can store in DynamoDB so that only one request can initiate a flow. This works well with SQS triggers: if the Lambda halts, the message is retried automatically with the same messageId. The check can test for this, allowing the retry to run the Lambda again while any other request is rejected.
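The claim described above can be sketched as a single conditional write. This is a minimal sketch, not the project's implementation: the key name flowId and both function names are assumptions made for illustration. build_claim_params returns arguments in the shape boto3's Table.update_item accepts, and claim mirrors the same condition in memory so the behaviour can be tried without AWS.

```python
def build_claim_params(flow_id: str, request_id: str) -> dict:
    """Arguments for boto3 Table.update_item (key and attribute names are hypothetical)."""
    return {
        "Key": {"flowId": flow_id},
        "UpdateExpression": "SET uniquerequestid = :rid",
        # Passes if nobody holds the flow, or if we already hold it (a retry).
        "ConditionExpression": (
            "attribute_not_exists(uniquerequestid) OR uniquerequestid = :rid"
        ),
        "ExpressionAttributeValues": {":rid": request_id},
    }


def claim(table: dict, flow_id: str, request_id: str) -> bool:
    """In-memory stand-in for the conditional write above."""
    item = table.setdefault(flow_id, {})
    holder = item.get("uniquerequestid")
    if holder is not None and holder != request_id:
        # DynamoDB would reject the write with ConditionalCheckFailedException.
        return False
    item["uniquerequestid"] = request_id
    return True
```

With this condition the first request claims the flow, a retry carrying the same messageId passes again, and a request with a different id is rejected.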
Record Handler Handling Errors
When handling errors, the Record Handler re-sends an SQS trigger message by creating a new message with an increasing delay interval. Because it is a new message it will have a new messageId. We solve this by placing the uniquerequestid into the message attributes; on retry, middleware finds this value and uses it instead of the new message's id.
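A rough sketch of that lookup, assuming the standard Lambda SQS event record shape (messageId, messageAttributes with stringValue entries). The function names, the retryCount attribute and the doubling delay schedule are assumptions for illustration, not the Record Handler's actual behaviour.

```python
def resolve_unique_request_id(record: dict) -> str:
    """Prefer the id carried in the message attributes, else the SQS messageId."""
    attributes = record.get("messageAttributes") or {}
    carried = attributes.get("uniquerequestid", {}).get("stringValue")
    return carried or record["messageId"]


def build_retry_attributes(record: dict, retry_count: int) -> dict:
    """MessageAttributes for the re-sent message, in the shape SQS send_message expects."""
    return {
        "uniquerequestid": {
            "DataType": "String",
            "StringValue": resolve_unique_request_id(record),
        },
        # Hypothetical attribute: lets the receiver see how many retries have happened.
        "retryCount": {"DataType": "Number", "StringValue": str(retry_count)},
    }


def retry_delay_seconds(retry_count: int, base: int = 30) -> int:
    """Increasing delay (assumed doubling), capped at the SQS DelaySeconds maximum of 900."""
    return min(900, base * 2 ** retry_count)
```

Because the retry attributes are built from the resolved id, a message that has been re-sent several times still carries the id of the very first message.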
Direct SQS vs SNS>SQS messages
A direct SQS request received by the Lambda has the sent message attributes at the top level and the properties in the body object. Messages that pass from SNS to SQS have the original message attributes and properties bumped into lower-level properties under body, with the SQS message's own attributes at the top level.
Direct SQS messages get reformatted to mimic SNS>SQS messages. The first request will be at the top level, but retries are bumped into body. This matches SNS>SQS handling and also allows extra details to be sent with the retried message, such as the number of retries and the original uniquerequestid.
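To illustrate where the attributes live in each case, here is a small sketch that reads both shapes into one flat result. It is not the reformatting the middleware performs. It assumes raw message delivery is disabled on the SNS subscription (so the SNS envelope with Type, Message and MessageAttributes is present in body), and normalize_record is a hypothetical name.

```python
import json


def normalize_record(record: dict) -> dict:
    """Return {'payload', 'attributes'} for a direct SQS or an SNS>SQS record."""
    try:
        body = json.loads(record["body"])
    except (TypeError, ValueError):
        body = record["body"]  # body was not JSON; keep it as-is

    if isinstance(body, dict) and body.get("Type") == "Notification":
        # SNS envelope: the original attributes and message sit inside body.
        attributes = {
            name: value.get("Value")
            for name, value in (body.get("MessageAttributes") or {}).items()
        }
        return {"payload": body.get("Message"), "attributes": attributes}

    # Direct SQS: the sent attributes are at the top level of the record.
    attributes = {
        name: value.get("stringValue")
        for name, value in (record.get("messageAttributes") or {}).items()
    }
    return {"payload": body, "attributes": attributes}
```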
Checking uniquerequestid
At the start of any flow or function that must not be processed by multiple requests, we can check the stored uniquerequestid for that flow. It is stored in a suitable DynamoDB table and reset once processing is complete or, for a cache workflow, when the cache is reset to be processed.
A flow might have multiple stages that must not be processed by multiple requests. Each stage can have its own uniquerequestid, all of which are removed once processing is complete.
Ensuring old processing does not overwrite new processing
Some flows might reset while the flow is half way through processing, eg TriggeredCache, meaning an old flow might want to update data after a new flow has begun working on that data.
We can protect against this by passing along the flow's uniquerequestid, which gets checked each time data/state is updated, preferably using a conditional update. If the uniquerequestid has changed (or been reset/removed) then the query does not execute. The calling code should catch the failed condition and stop the rest of the logic as well.
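A minimal sketch of such a guarded update, assuming a hypothetical table keyed by flowId with a status attribute. build_guarded_update shows the boto3 update_item arguments (status is aliased because it is a DynamoDB reserved word); the in-memory functions mirror the same check so the stop-on-stale behaviour can be run locally.

```python
class StaleFlowError(Exception):
    """The stored uniquerequestid no longer matches the one this flow carries."""


def build_guarded_update(flow_id: str, request_id: str, status: str) -> dict:
    """Arguments for boto3 Table.update_item (names are hypothetical)."""
    return {
        "Key": {"flowId": flow_id},
        "UpdateExpression": "SET #status = :status",
        # Fails if the id has changed or has been removed.
        "ConditionExpression": "uniquerequestid = :rid",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {":status": status, ":rid": request_id},
    }


def guarded_update(table: dict, flow_id: str, request_id: str, changes: dict) -> None:
    """In-memory stand-in for the conditional update."""
    item = table.get(flow_id, {})
    if item.get("uniquerequestid") != request_id:
        raise StaleFlowError(flow_id)
    item.update(changes)


def run_step(table: dict, flow_id: str, request_id: str, changes: dict) -> bool:
    """Apply one step; return False so the caller stops when the flow is stale."""
    try:
        guarded_update(table, flow_id, request_id, changes)
    except StaleFlowError:
        return False  # a newer flow owns this data, so stop quietly
    return True
```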
Passing on the uniquerequestid to external flows
External logic needed in a flow, eg a SearchResult or ComplexFilter request, will not be able to pass the uniquerequestid back in the Complete message, but we need to know the flow's uniquerequestid to determine whether to continue updating data or not.
We can solve this by, for example, saving the uniquerequestid into the AwaitingStep table. When the flow returns from the external logic we find the pending steps and also get the uniquerequestid that was saved, which can then be checked when updating data.
This only works if the AwaitingStep ids are unique per processing request. This can be achieved by adding the uniquerequestid into the flow's unique object id, eg when we ValidateCart we add the uniquerequestid into the object that creates the ... id, so that new AwaitingStep records do not overwrite old ones. If we did not do this, new requests would overwrite the old AwaitingStep records and external flow responses would always continue, even if they returned stale data.
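As an illustration only: the scheme below hashes the step object together with the uniquerequestid. The hashing approach, the function name make_step_id and the example object fields are all assumptions, since the real id scheme is not described here. The point it shows is that two requests for the same object end up with different AwaitingStep ids.

```python
import hashlib
import json


def make_step_id(step_object: dict, unique_request_id: str) -> str:
    """Derive an AwaitingStep id that is unique per processing request (hypothetical scheme)."""
    keyed = {**step_object, "uniquerequestid": unique_request_id}
    # Canonical JSON so the same input always produces the same id.
    canonical = json.dumps(keyed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the uniquerequestid alongside each record under that id lets a late external response be matched to the request that started it, and then be rejected if that request is no longer current.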
When to reset the uniquerequestid
Reset/remove the saved uniquerequestid only after all processing is complete to protect against a Lambda halting after resetting the uniquerequestid and before the final processing is complete.
For example, once complete, cache flows send a message that they are complete and update the data status so that processing will not happen again until the cache is reset. If we reset the uniquerequestid and cache status before sending the complete message and the Lambda halted, the retried request would see that the cache is not ready to be processed and would be dropped, meaning the complete message may never be sent.
If the flow spans multiple Lambdas, each Lambda might check that a passed-on uniquerequestid matches the data's saved uniquerequestid. If we reset before completing all processing and a Lambda halts, then when that request is retried the uniquerequestid will not match and the request will be dropped, meaning the remaining processing never happens.
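The ordering can be sketched with an in-memory item and a simulated halt. The function name, the status attribute and the halt_before_reset flag are assumptions made for the illustration. In DynamoDB the status change and the removal of uniquerequestid would normally be done in one update.

```python
def complete_flow(item: dict, request_id: str, send_complete,
                  halt_before_reset: bool = False) -> bool:
    """Final step of a flow. Returns False if this request no longer owns the flow."""
    if item.get("uniquerequestid") != request_id:
        return False  # dropped: already reset, or another request owns the flow
    send_complete(request_id)            # 1. remaining side effects first
    if halt_before_reset:
        raise RuntimeError("simulated Lambda halt")
    item["status"] = "complete"          # 2. mark the data as processed
    item.pop("uniquerequestid", None)    # 3. release the flow last
    return True
```

If the Lambda halts between steps 1 and 3, the retry still matches the saved uniquerequestid and runs the final step again. The complete message may then be sent twice, which is why the receiver has to be idempotent, but it is never lost.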
Ensuring idempotence inside a single Lambda
A Lambda could be halted at any point during its execution, eg if a resource or external request took a long time and used up the Lambda's time limit. The request would be retried by Lambda, but we do not know whether the request has already been partially executed. For this reason we want to run through the code again so that any remaining logic is executed.
This means every action (side-effect) the Lambda performs must be idempotent.
DynamoDB queries
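Use conditionals to ensure CRUD queries are idempotent. For updates that accumulate (eg incrementing a total), store a timestamp with the data and compare it to the timestamp in the request, which stays constant across retries, so that a retried request does not apply the same change twice.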
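A sketch of an accumulating update guarded this way. The attribute names recordId, runningTotal and lastUpdatedAt are hypothetical, and "stored timestamp is older than the request's" is one possible reading of the comparison. build_accumulate_params shows boto3 update_item arguments; apply_increment mirrors the condition in memory.

```python
def build_accumulate_params(record_id: str, amount: int, request_ts: int) -> dict:
    """Arguments for boto3 Table.update_item (names are hypothetical)."""
    return {
        "Key": {"recordId": record_id},
        "UpdateExpression": "SET lastUpdatedAt = :ts ADD runningTotal :amount",
        # A retry carries the same timestamp, so the condition fails the second time.
        "ConditionExpression": "attribute_not_exists(lastUpdatedAt) OR lastUpdatedAt < :ts",
        "ExpressionAttributeValues": {":ts": request_ts, ":amount": amount},
    }


def apply_increment(item: dict, amount: int, request_ts: int) -> bool:
    """In-memory stand-in: add amount once per request timestamp."""
    last = item.get("lastUpdatedAt")
    if last is not None and last >= request_ts:
        return False  # already applied, so the retry becomes a no-op
    item["runningTotal"] = item.get("runningTotal", 0) + amount
    item["lastUpdatedAt"] = request_ts
    return True
```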
SNS / SQS messages
We have no way to record whether a message has already been sent, so we rely on the receiving function being idempotent so that it can handle multiple messages for the same request. Each message has its own uniquerequestid, so we can use the above method of storing the uniquerequestid for the receiving Lambda's processing to ensure duplicate requests are discarded.
Direct Lambda invocations
.. probably same as SNS/SQS?
Receiving messages out of order
ie: Race condition in the messaging service.
For some tasks, such as a client sending a request to update a record, we do not want to process old messages if a newer message has already been processed. That is, when messages arrive out of order, we want the most recent request to remain and older requests to be ignored.
We can achieve this by saving a timestamp into the data record and checking it conditionally when performing the update query. The calling request includes a timestamp set by the calling logic, so that it reflects when the request was initially sent (not the time it arrives at the Lambda). This timestamp is compared to the timestamp saved with the data; if it is less than or equal to the saved value we do not update, and can stop processing.
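The comparison above, sketched in memory. The attribute name requestSentAt and the function name are assumptions. The docstring gives a DynamoDB condition expression that would express the same rule.

```python
def apply_client_update(record: dict, new_values: dict, sent_at: int) -> bool:
    """Apply the update only if it was sent after the last applied one.

    Equivalent DynamoDB condition (attribute name is hypothetical):
    attribute_not_exists(requestSentAt) OR requestSentAt < :sentAt
    """
    stored = record.get("requestSentAt")
    if stored is not None and sent_at <= stored:
        return False  # older or duplicate request: ignore it and stop processing
    record.update(new_values)
    record["requestSentAt"] = sent_at
    return True
```

If the request sent at 200 arrives before the one sent at 100, the later-sent values are written first and the older request is then ignored.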
Race conditions within a Lambda
Use conditional updates to ensure an update is only performed if no race condition exists, eg by adding a timestamp or unique id to the data that changes whenever the data is updated.
When we query data, perform logic, then update data, we check that the timestamp is the same as when we queried the data. We can place the process in a loop so that if the timestamp has changed we re-query the data, re-run the logic, and try the update again, up to a set number of retries. If the retries are exceeded the Lambda can throw an error, so that the request is either retried later or placed into a DLQ to investigate why it was unable to update.
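The query, logic, update loop can be sketched as follows, using a unique id (here called updateToken, a hypothetical attribute) rather than a timestamp. The store is a plain dict standing in for the table, and conditional_put stands in for a conditional write. All names are assumptions for illustration.

```python
import uuid


class UpdateConflictError(Exception):
    """The record kept changing underneath us for every attempt."""


def conditional_put(store: dict, key: str, new_item: dict, expected_token) -> bool:
    """Write only if the stored token is still the one we read."""
    if store.get(key, {}).get("updateToken") != expected_token:
        return False
    store[key] = new_item
    return True


def update_with_retries(store: dict, key: str, transform, max_attempts: int = 3) -> dict:
    for _ in range(max_attempts):
        snapshot = dict(store.get(key, {}))           # query
        seen_token = snapshot.get("updateToken")
        updated = transform(dict(snapshot))           # logic
        updated["updateToken"] = uuid.uuid4().hex     # changes on every update
        if conditional_put(store, key, updated, seen_token):
            return updated
    # Let the error propagate so the request is retried later or ends up in the DLQ.
    raise UpdateConflictError(f"gave up on {key!r} after {max_attempts} attempts")
```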