I am probably ten years too late in writing about the oldest AWS service, but the power of the AWS SDK's asynchronous file transfer is irresistible. Moving a 300MB file takes just a few seconds, and it is easy to plug this API into reactive backend services. For this exercise, assume that one service uploads a file to S3 and another service periodically checks for new files in that location.
Fundamentally, asynchronous operations require an instance of the TransferManager class and a callback to process status notifications. I wrapped the whole process into a few classes representing abstractions for uploading to, downloading from, and detecting newly uploaded files at a pre-configured location in an S3 bucket.
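Before getting to those classes, the raw mechanics look roughly like this. This is a minimal sketch against the AWS SDK for Java 1.x; the bucket name, key, and local path are made up for illustration.

```java
import com.amazonaws.event.ProgressEventType;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class UploadSketch {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(AmazonS3ClientBuilder.defaultClient())
                .build();

        // The status listener rides on the request and fires on TransferManager threads
        PutObjectRequest request = new PutObjectRequest(
                "my-bucket", "prod/somesvc/v2/relative_path/file.ext", new File("/tmp/file.ext"))
                .withGeneralProgressListener(event -> {
                    if (event.getEventType() == ProgressEventType.TRANSFER_COMPLETED_EVENT) {
                        System.out.println("upload finished");
                    } else if (event.getEventType() == ProgressEventType.TRANSFER_FAILED_EVENT) {
                        System.out.println("upload failed");
                    }
                });

        // upload() returns immediately; the transfer runs on background threads
        Upload upload = tm.upload(request);

        // Blocking here is only for the demo; a real service would rely on the listener
        upload.waitForCompletion();
        tm.shutdownNow();
    }
}
```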
The TransferManager API typically takes a Request object and an asynchronous status listener, and returns a Transfer instance that can be used to retrieve the error message in case of failure. Polling an S3 location for available files requires a loop because the results are returned in batches. Checking whether a file exists at a given S3 path is implemented as an attempt to fetch the corresponding file metadata, treating a thrown exception as "FileNotFound".
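The polling loop and the existence check are short enough to sketch as well. Again this assumes SDK 1.x; the class and method names below are mine, not part of the SDK.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

public class PollingSketch {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // S3 returns listings in batches (up to 1000 keys), so keep asking
    // for the next batch until the result is no longer truncated
    public List<String> listKeys(String bucket, String prefix) {
        List<String> keys = new ArrayList<>();
        ListObjectsV2Request request = new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(prefix);
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(request);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                keys.add(summary.getKey());
            }
            request.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
        return keys;
    }

    // "Does the file exist?" becomes a metadata fetch; a 404 means "no"
    public boolean fileExists(String bucket, String key) {
        try {
            s3.getObjectMetadata(bucket, key);
            return true;
        } catch (AmazonS3Exception e) {
            if (e.getStatusCode() == 404) {
                return false;
            }
            throw e; // anything else is a genuine failure
        }
    }
}
```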
S3 is a simple (duh!) service, so there are only two additional notes. First, it is a good idea to encode some metadata into file names on S3, such as a tenant id, a version, or a video resolution; it helps the receiver decide how to handle a downloaded file simply by parsing its name. Second, it is convenient to superimpose a "directory structure" onto the flat namespace of an S3 bucket.
So a reasonable file naming convention might include a three-part prefix prepended to all relative file paths: a backend service name, a file schema version, and a namespace representing either an environment (e.g. PROD) or a developer (in development deployments). The version part in particular makes production upgrades much easier. For example,
"$BUCKET/prod/somesvc/v2/relative_path/file.ext".
The diagram below shows a typical sequence of operations for uploading a file and then finding it with S3 polling from a different service.
"$BUCKET/prod/somesvc/v2/relative_path/file.ext".
The digram below shows a typical sequence of operations for uploading a file and then finding it with S3 polling from a different service.
In my example,
- FileDownloader / FileUploader - abstractions used by the client to start a file transfer operation
- TransferCallback - the callback interface called by file transfer operations to report final status to the client asynchronously
- S3Destination - a way to specify a common bucket "subdirectory" for multiple files
- S3TransferProgressListener - converts progress events to success/failure notifications; makes it possible to extract an error message
- S3FileTransferClient / S3Client - TransferManager-based implementation of file transfer operations
- S3Uploader / S3Downloader - S3Client-based implementation of file transfer API
- S3ChangeDetector - a job to be run periodically (e.g. with ScheduledExecutorService::scheduleWithFixedDelay; see the sketch after this list) on the receiver side to look for new files on S3
- FileHandler - the callback interface called by S3ChangeDetector for every found file not seen before (the way of keeping track of previously seen files is likely to be application-specific)
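To show the scheduling part mentioned above: since S3ChangeDetector's constructor arguments are application-specific, it appears below only as a Runnable, and the 30-second interval is just an example value.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ChangeDetectionScheduler {
    // detector stands in for an S3ChangeDetector instance; scheduleWithFixedDelay
    // waits for one scan to finish before counting down to the next, so a slow
    // S3 listing never overlaps with the following run
    public static void schedule(Runnable detector) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(detector, 0, 30, TimeUnit.SECONDS);
    }
}
```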