Latest revision as of 13:45, 5 August 2022

Overview

When receiving messages in an distributed serverless system a single action might be called multiple times, eg:

If Lambda halts part way through execution it will retry
SQS guarantees at least once delivery, meaning it might deliver a message multiple times
Two requests from different sources might trigger the same function, but we only want to process one request
I single source might send the same request multiple times, eg if a Lambda halts and is retried

Some of the solutions to idempotence can also be used to check for race conditions (eg storing timestamps).

Race conditions in a serverless system can include:

Two functions running at the same time making adjustments to the same data that overwrite each other
Messages received out of the order they were sent

DynamoDB conditionals

DynamoDB is a good way to do idempotence checks because we can check a conditional at the same time as adjusting data, eg setting the id of the request that can process a certain object/flow.

uniquerequestid

Each request that enters a Lambda will have a unique id that we can store in DynamoDB to set one request to initiate a flow, eg SQS messageId. This works well with SQS triggers because if Lambda halts it will automatically retry, with the same messageId, which can be tested, allowing the retry to process the Lambda again but any other request to be rejected.

Record Handler Handling Errors

Record Handler re-send an SQS trigger message by creating a new message using an increasing delay interval, because it is a new message it will have a new messageId, we solve this by placing the uniquerequestid into the message attributes and middleware finding this on retry and using it instead of the new messages id.

Direct SQS vs SNS>SQS messages

A Direct SQS initial request received into the Lambda will have the sent message attributes at the top level and properties in the body object. Messages that pass SNS to SQS will bump the original message attributes and properties into lower level properties under body, and add top level message attributes for the SQS message.

Direct SQS messages get reformated to mimic SNS>SQS messages, the first request will be at the top level, but retries will be bumped into body, to match SNS>SQS handling and also allow for extra details to be sent with the retried message, such as number of retries, and original uniquerequestid.

Checking uniquerequestid

At the start of any flow or function that we do not want to allow multiple requests to process we can check the stored uniquerequestid for that flow, this will be stored in a suitable DynamoDB table and be reset once processing is complete, or if is a cache workflow, when the cache is reset to be processed.

A flow might have multiple stages that do not want to allow multiple requests to process, each stage can have it's own uniquerequestid, all being removed once the processing is complete.

Ensuring old processing does not overwrite new processing

Some flows might reset while the flow is half way through processing, eg TriggeredCache, meaning an old flow might want to update data after a new flow has begun working on that data.

option 1: Allowing both requests to process at same time

do not use, planning on standardizing to use option 2 only

This will only work if the AwaitingStep ids are unique per processing request, this can be achieved by adding the uniquerequestid into the flows unique object id, eg when we ValidateCart we add the uniqueRequestId into the object that creates the cartOrderId, this way new AwaitingStep records will not overwrite existing ones. If we did not do this then new requests would overwrite the old AwaitingStep records and external flow responses would always continue, even if they returned stale data. If the old AwaitingStep record is found when an external flow completes, the uniqueRequestId saved in the AwaitingStep record will be compared to the main objects uniqueRequestId (eg: ValidateCartOrder cache), if it does not match then we discard the AwaitingStep. In addition we remove all existing AwaitingSteps when we reset ValidateCartOrder flow, but there could be race conditions for logic already executing so we add the conditional check when updating the main object's data to ensure the uniqueRequestIds still match.

Pass along the flow's uniquerequestid, which gets checked each time data/state is updated, preferably using a conditional update. If the uniquerequestid has changed (or been reset/removed) then the query does not execute, and try to catch this and stop the logic as well.

option 2: Initial request restarts processing

When resetTrigger we reset the uniqueRequestId but leave the status, if it was processing it remains so, and not begin processing the second request.

When checking triggeredCache, if it is processing we do not process again, discard the request (calling function will be triggered when cache completes).

Similar to timeCacheComplete check, at the end of the work we need to check if uniqueRequestId has been reset, if it has then instead of completing the task, we restart it. This check could be made at multiple points if makes sense to do so (eg: likely will be reset, and saves a lot of processing), but must be made at the end after all data saved when updating the work to status complete (ie conditional update checks uniqueRequestId when changing status).

The initial request flow is responsible for restarting the task when uniqueRequestId has been reset, subsequent requests do not start the processing when they see the tasks status is processing.

This will remove the danger of two flows overwriting each other, eg in option1 where a subsequent request will overwrite the uniqueRequestId saved in AwaitingStep table to itself it the tasks unique id does not change between processing, meaning stale results from the initial request could be processed for the second requests flow.

StoredCache have decided not to use any sort of check at the end of processing to see if reset, because storedCache uses expiry time that should ensure processing is always complete before it is reset to process again, there is code that resets a cache that is taking too long to process which could result in parallel requests but this is an exceptional event and reported to admins.

Passing on the uniquerequestid to external flows

External logic needed in a flow, eg to a SearchResult or ComplexFilter request, will not be able to pass on the uniqueRequestId to come back in the Complete message, but we need to know the flow's uniqueRequestId to determine whether to continue updating data or not.

We can solve this by example saving the uniqueRequestId into the AwaitingStep table, when the flow returns from the external logic we will find pending steps and also get the uniqueRequestId that was saved, which can then be checked when updating data.

When to reset the uniquerequestid

Reset/remove the saved uniquerequestid only after all processing is complete to protect against a Lambda halting after resetting the uniquerequestid and before the final processing is complete.

For example cache flows once complete will send a message they are complete and update the data status so processing will not happen until the cache is reset, if we reset the uniquerequestid and cache status before sending the complete message and Lambda halted, when the request is retried by Lambda it would see the cache is not ready to be processed and the request dropped, meaning the complete message may never be sent.

If the flow spans multiple Lambdas, each Lambda might check a passed on uniquerequestid matches the data's saved uniquerequestid, if we reset before completing all processing and Lambda halts, when that request is retried the uniquerequestid will not match and the request would be dropped, meaning the remaining processing never happens.

Ensuring idempotence inside a single Lambda

A Lambda could be halted at any point during it's execution, eg if a resource or external request took a long time using up the Lambda's time limit, the request would be retried by Lambda, but we do not know whether the request has already partially been executed. For this reason we want to run through the code again so that any remaining logic is executed.

This means every action (side-effect) the Lambda performs must be idempotent.

DynamoDB queries

Use conditionals to ensure CRUD queries are idempotent, eg by storing a timestamp for any updates that accumulate that can be compared to a constant timestamp in the request.

SNS / SQS messages

We have no way to record if a message has already been sent, so rely on the receiving function to be idempotent so it can handle multiple messages for the same request. Each message would have it's own uniquerequestid so we can use above method of storing uniquerequestid for receiving Lambda's processing to ensure duplicate requests are discarded.

Direct Lambda invocations

.. probably same as SNS/SQS?

Receiving messages out of order

ie: Race condition in the messaging service.

For some tasks such as a client sending a request to update a record we do not want to process old messages if a new message has already been processed, ie the messages arrive out of order, we want the most recent request to remain, and older requests to be ignored.

We can achieve this by saving a timestamp into the data record that we conditionally check when performing the update query. The calling request includes a timestamp, set by the calling logic so the timestamp matches when the request was initially sent by the calling logic (not the time it arrives at the Lambda), this timestamp is compared to the timestamp saved with the data and if it is less than equal we do not update, and can stop logic processing.

Race conditions within a Lambda

Use conditional updates to ensure any updates only perform if no race condition exists, eg adding a timestamp or uniqueid to the data that gets updated whenever data is updated.

When we query data, perform logic, then update data, we check that the timestamp is the same as when we queried the data. We can place the process in a loop so that if the timestamp changes we re-query the data, logic, and try updating again, with a set number of retries, if exceeded the Lambda can throw an error, either re-trying the request later or placing into DLQ to be checked why it was unable to update.

Idempotence and Race Conditions: Difference between revisions