2020-11-28 - Handling Message Batches from SQS Queues

Communication between services

Overview

For inter-service communication we plan to use SNS In and Out topics, which will be subscribed to by SQS queues that trigger Lambda functions. Intra-service flows may also use SQS queues to trigger same-service Lambda functions.

When a Lambda function is triggered by an SQS queue it can be configured to receive more than one message at a time (a batch). If we are confident that processing a batch will not exceed the Lambda's resource limits (eg timeout and memory), this is more efficient than invoking one Lambda per message, so we will design for it.

Messages arrive at the Lambda function in the Records array of the event.

The Lambda's Handler (Hdr) function invokes the core logic function once for each record. The core logic function has no knowledge that multiple records were received.
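The split between the Handler and the core logic could look roughly like the sketch below. The function names and the `orderId` field are hypothetical, and the sketch assumes each record body is a JSON string.

```python
import json


def process_record(message):
    """Hypothetical core logic: handles one message and knows nothing about batches."""
    return {"orderId": message["orderId"], "status": "processed"}


def handler(event, context):
    """Hdr function: invoke the core logic once for each record in the batch."""
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])  # assumes the body is a JSON string
        results.append(process_record(message))
    return results
```

Because `process_record` only ever sees one message, the batch size can be changed on the queue trigger without touching the core logic.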

How SQS queue messages are removed or retried

When SQS triggers a Lambda it is considered a synchronous invocation, so the Lambda's own retry and DLQ settings are ignored (they only apply to asynchronous invocations).

The SQS queue's retry (maximum receive count), visibility timeout (the time a received message stays hidden before it can be delivered again), and DLQ settings are used.

If the Lambda throws an error or times out, the invocation is considered a failure and the messages go through the SQS queue's retry settings. Any other outcome (eg returning a value) deletes the messages from the queue.

If a batch of messages is sent to the Lambda, the messages are retried or deleted as a batch: if the function throws or times out all messages are retried, otherwise all messages are deleted.

Idempotent logic design

Because Lambda and SQS cannot guarantee that a message will be delivered only once, and because functions might throw unhandled exceptions and be retried, we need to design our code so that the output and results are the same no matter how many times a single request is processed.

Extra care should be taken with sections that might fail halfway through execution, for example if a database query or a request to an external service does not respond. We can either code for these cases or have the Lambda throw an error so the message is retried after a delay. If we throw and retry, the request will start again from the beginning, so any steps that already completed must be safe to repeat.
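One common way to make a step safe to repeat is to record a key for each completed request and skip the side effect when the key has been seen before. The sketch below is illustrative only: the in-memory dict stands in for a durable store such as a database table, and `apply_payment`, `charges`, and the key format are hypothetical.

```python
processed = {}  # stand-in for a durable store; in-memory for illustration only
charges = []    # records each side effect so duplicates would be visible


def apply_payment(message):
    """Hypothetical side effect that must happen only once per request."""
    charges.append(message["amount"])
    return f"charged {message['amount']}"


def handle_once(idempotency_key, message, store=processed):
    """Run the side effect only if this key has not already been completed."""
    if idempotency_key in store:
        return store[idempotency_key]  # duplicate delivery: reuse the earlier result
    result = apply_payment(message)
    store[idempotency_key] = result    # mark as done only after the work succeeded
    return result
```

The key should come from the request itself (eg an order ID) rather than the SQS message ID, since a re-queued message receives a new message ID. A failure between the side effect and the store write is still possible, so the side effect itself should tolerate repeats where it can.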

Methods of retrying only messages in a batch that fail

1) Lambda manages retries of unsuccessful records

We plan to use this method.

The Lambda Handler always returns without error, which removes all messages in the batch from the queue. When a record does not process successfully, the Lambda Handler creates a new message and places it into the queue.

When creating the new message, add a message attribute that records how many attempts have been made at processing the message: if the attribute is not set on the failed message, set it to 1; if it is set, add 1 each time the message is retried.
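A minimal sketch of the attempt counter follows. The attribute name `attemptCount` is hypothetical. The sketch assumes the record shape Lambda passes for SQS events, where attributes appear under `messageAttributes` with a `stringValue` key, and builds the new attribute in the `DataType`/`StringValue` shape that the SQS SendMessage call expects.

```python
ATTEMPT_ATTRIBUTE = "attemptCount"  # hypothetical attribute name


def next_attempt_count(record):
    """Return 1 if the attribute is not set, otherwise the stored count plus 1."""
    attributes = record.get("messageAttributes") or {}
    current = attributes.get(ATTEMPT_ATTRIBUTE)
    if current is None:
        return 1
    return int(current["stringValue"]) + 1


def retry_attributes(record):
    """Build the message attributes for the re-queued message."""
    return {
        ATTEMPT_ATTRIBUTE: {
            "DataType": "Number",
            "StringValue": str(next_attempt_count(record)),
        }
    }
```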

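When adding the retry message to the SQS queue, set its delay so it is not retried immediately. SQS supports a per-message delay (DelaySeconds) of up to 15 minutes on standard queues, so we could stagger this, eg 1 minute for the first retry and 10 minutes for the second. Longer delays (eg 3 hours for the third) exceed this limit and would need another mechanism.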

Moving the message to a DLQ will also need to be handled by the Lambda Handler, eg after x number of retries have occurred.

We could create a shared module to manage this process in a standard way.
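As a rough sketch of what such a shared module might contain, the function below processes each record, re-queues failures with an incremented attempt count and a staggered delay, and sends the message to the DLQ once the retries are used up. The names `handle_batch_with_requeue`, `attemptCount`, and the delay values are assumptions. The `sqs` argument is assumed to be an object with a `send_message` method that accepts the same keyword arguments as the boto3 SQS client.

```python
import json

ATTEMPT_ATTRIBUTE = "attemptCount"  # hypothetical attribute name
RETRY_DELAYS = [60, 600, 900]       # seconds per retry; SQS caps DelaySeconds at 900


def attempts_so_far(record):
    """Read the attempt count from the record, or 0 if the attribute is not set."""
    attribute = (record.get("messageAttributes") or {}).get(ATTEMPT_ATTRIBUTE)
    return int(attribute["stringValue"]) if attribute else 0


def handle_batch_with_requeue(records, process, sqs, queue_url, dlq_url):
    """Process every record; re-queue failures instead of raising."""
    for record in records:
        try:
            process(json.loads(record["body"]))
        except Exception:
            attempt = attempts_so_far(record) + 1
            if attempt > len(RETRY_DELAYS):
                # Retries exhausted: the Handler moves the message to the DLQ itself.
                sqs.send_message(QueueUrl=dlq_url, MessageBody=record["body"])
            else:
                sqs.send_message(
                    QueueUrl=queue_url,
                    MessageBody=record["body"],
                    DelaySeconds=RETRY_DELAYS[attempt - 1],
                    MessageAttributes={
                        ATTEMPT_ATTRIBUTE: {"DataType": "Number", "StringValue": str(attempt)}
                    },
                )
```

If `send_message` itself fails, the exception propagates and the whole batch is retried by SQS, which is one more reason the core logic needs to be idempotent. The sketch also drops any other message attributes on the original record; a real module would probably copy them across.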

Benefits:

  • More control over how retries are handled
  • Lambda never throws an error, making for cleaner logging, middleware handling, etc.
  • If we expect records to fail rarely, this design will result in fewer resource calls, eg fewer calls to SQS to delete messages

2) Lambda removes successful records

Have the Lambda Handler be responsible for removing successful messages from the SQS queue, then throw an error so the SQS queue handles retrying the unsuccessful messages. Because the successful messages have already been deleted, they will not be retried.

Benefits:

  • Code is simpler, relying more on SQS functionality
  • Can trust retry messages will be created in the correct structure
  • SQS queue manages passing to DLQ
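Method 2 could be sketched as follows. The function name is hypothetical, and the `sqs` argument is assumed to be an object with a `delete_message` method that accepts the same keyword arguments as the boto3 SQS client. Successful messages are only deleted explicitly when at least one record failed; if everything succeeds, returning normally lets SQS delete the whole batch.

```python
import json


def handle_batch_delete_successes(records, process, sqs, queue_url):
    """Delete successful records ourselves, then raise so SQS retries the rest."""
    succeeded = []
    failed_ids = []
    for record in records:
        try:
            process(json.loads(record["body"]))
            succeeded.append(record)
        except Exception:
            failed_ids.append(record["messageId"])

    if failed_ids:
        for record in succeeded:
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=record["receiptHandle"])
        # Raising marks the invocation as failed, so only undeleted messages are retried.
        raise RuntimeError(f"{len(failed_ids)} record(s) failed: {failed_ids}")
```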