2024-05-10 - Saving Feeds to S3
Chose S3 Multipart
- each part will be reasonably large (processing from one Lambda invocation)
- can control ordering of connected parts across async processing
Process
Superseded: the steps below are not used. Instead, one thread processes all SortResultData records; see Multi Thread Invocation.
- have a set number of records to process per invocation
- count number of records in SortResultData to decide how many processing threads to begin
- will also calculate how many Lambda invocations per thread
- each invocation is a part, so each thread will save multiple parts
- AwaitingMultipleSteps for each thread: when a thread completes, check whether all steps have finished and, if so, complete the multipart upload
- errors can be saved to the main ExportMain record
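The thread and invocation counts described above could be worked out along these lines. This is only a rough sketch: `plan_threads`, `ThreadPlan`, `records_per_invocation` and `max_threads` are hypothetical names, not part of the existing export code. Each thread is given a contiguous block of part numbers so the final order does not depend on which thread finishes first.

```python
import math
from dataclasses import dataclass


@dataclass
class ThreadPlan:
    thread_index: int
    first_record: int       # inclusive offset into SortResultData
    last_record: int        # exclusive offset into SortResultData
    invocations: int        # Lambda invocations (= parts) for this thread
    first_part_number: int  # S3 part numbers start at 1


def plan_threads(total_records, records_per_invocation, max_threads):
    """Split the records across threads; one invocation saves one part."""
    if total_records <= 0:
        return []
    total_invocations = math.ceil(total_records / records_per_invocation)
    thread_count = min(max_threads, total_invocations)
    # Spread the invocations as evenly as possible across the threads.
    base, extra = divmod(total_invocations, thread_count)
    plans = []
    next_record = 0
    next_part = 1
    for i in range(thread_count):
        invocations = base + (1 if i < extra else 0)
        last_record = min(
            next_record + invocations * records_per_invocation, total_records
        )
        plans.append(
            ThreadPlan(i, next_record, last_record, invocations, next_part)
        )
        next_record = last_record
        next_part += invocations
    return plans
```

For example, 1050 records at 100 records per invocation with at most 4 threads gives 11 parts, split 3, 3, 3 and 2 across the threads.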
Alternative
- do not control number of records per Lambda invocation
- each processing thread saves its data somewhere temporary, e.g. ElastiCache
- only save processing thread's part to S3 after it completes all records allocated to it
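A minimal sketch of this alternative, assuming a simple append/read interface on the temporary store. `TempPartBuffer` is an in-memory stand-in for something like ElastiCache, and `handle_invocation` and `upload_part` are hypothetical names used only for illustration.

```python
class TempPartBuffer:
    """In-memory stand-in for a temporary store such as ElastiCache."""

    def __init__(self):
        self._chunks = {}

    def append(self, thread_id, data):
        # Keep chunks in arrival order for each thread.
        self._chunks.setdefault(thread_id, []).append(data)

    def pop_all(self, thread_id):
        # Return everything buffered for the thread and clear it.
        return b"".join(self._chunks.pop(thread_id, []))


def handle_invocation(buffer, thread_id, data, is_last, upload_part):
    """Buffer this invocation's output; upload one part only at the end.

    upload_part is a hypothetical callable (part_number, body) that saves
    the part to S3 and returns whatever the caller needs (e.g. an ETag).
    """
    buffer.append(thread_id, data)
    if not is_last:
        return None
    body = buffer.pop_all(thread_id)
    # Assumption: thread N always owns part number N + 1.
    return upload_part(thread_id + 1, body)
```

With this approach the number of parts equals the number of threads, so part numbers are known up front, at the cost of running and paying for the temporary store.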
S3 Multipart Uploads
- can begin a multipart upload and hold that process open until complete
- assume file will not be available until all parts saved and multipart upload set to complete
- must pay standard S3 charges for pending data
- must complete the process or pending parts will remain and be charged, but not be accessible
- can batch handle records, ordering them according to part reference numbers
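The completion step could look roughly like the sketch below. It assumes `s3_client` is a boto3 S3 client and that each finished part has been recorded somewhere as a `PartNumber`/`ETag` pair. The function names and the `expected_part_count` argument are hypothetical. S3 expects the parts list in ascending part number order, so parts that finished out of order are sorted first.

```python
def complete_when_ready(s3_client, bucket, key, upload_id,
                        finished_parts, expected_part_count):
    """Complete the multipart upload once every expected part is saved.

    finished_parts: list of {"PartNumber": int, "ETag": str}, any order.
    Returns True if the upload was completed, False if parts are missing.
    """
    if len(finished_parts) < expected_part_count:
        return False
    ordered = sorted(finished_parts, key=lambda part: part["PartNumber"])
    s3_client.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={"Parts": ordered},
    )
    return True


def abort_upload(s3_client, bucket, key, upload_id):
    """Abort a failed upload so pending parts stop being charged."""
    s3_client.abort_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id
    )
```

Calling `abort_upload` on the error path is one way to avoid leaving pending parts behind. A bucket lifecycle rule that cleans up incomplete multipart uploads is another option.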
Kinesis Firehose Batching
- S3 calls are expensive, so we build larger parts in Lambda; not sure how this would compare to Firehose limits
- Firehose limits: 1 MB per record, 4 MB per batch
- Can compress final file before saving to S3
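To get a feel for how the Firehose limits would constrain batching, here is a small sketch that groups records into batches under the per-record and per-batch sizes noted above. `batch_records` is a hypothetical helper. The default of 500 records per batch is an added assumption based on the PutRecordBatch limit and should be checked against the current quotas.

```python
def batch_records(records, max_record_bytes=1_000_000,
                  max_batch_bytes=4_000_000, max_batch_count=500):
    """Group byte-string records into batches that respect the limits."""
    batches = []
    current = []
    current_size = 0
    for record in records:
        size = len(record)
        if size > max_record_bytes:
            raise ValueError("record of %d bytes exceeds the limit" % size)
        # Start a new batch if this record would break either batch limit.
        if current and (current_size + size > max_batch_bytes
                        or len(current) >= max_batch_count):
            batches.append(current)
            current = []
            current_size = 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With records close to the 1 MB limit, a batch holds only about four records, which is far smaller than the parts built in Lambda.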
References
- https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPartCopy.html
- https://stackoverflow.com/questions/41783903/append-data-to-an-s3-object
- https://www.trek10.com/blog/exploring-the-depths-of-kinesis-data-streams---part-1-partitioning
- https://www.reddit.com/r/aws/comments/7a5sb8/firehose_vs_putting_directly_to_s3/
- https://www.reddit.com/r/aws/comments/smyq86/aws_kinesis_firehose_or_direct_put_to_s3_for_data/
- https://www.reddit.com/r/aws/comments/gjm2uf/kinesis_to_s3_vs_direct_writes_to_s3/
- https://www.webagesolutions.com/blog/introduction-to-pyspark