May 22, 2016

Asynchronous AWS S3 file transfer

I am probably ten years too late in writing about the oldest AWS service but the power of AWS SDK asynchronous file transfer is irresistible. Moving a 300MB file takes just a few seconds. It is very easy to plug this API into reactive backend services. For this exercise we assume that one service uploads a file to S3 and another service periodically checks for new files in that location.

Fundamentally, asynchronous operations require an instance of the TransferManager class and a callback to process status notifications. I wrapped the whole process into a few classes representing abstractions for uploading to, downloading from, and detecting newly uploaded files in a pre-configured location in an S3 bucket.
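To make the callback idea concrete, here is a minimal sketch of how a stream of progress events might be collapsed into a single success/failure notification. The shape of TransferCallback, the EventType enum and the ProgressAdapter class are stand-ins invented for this illustration. They only mimic the role that the SDK's progress listener and event types play, so the snippet compiles without the AWS SDK on the classpath.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical callback: reports the final status of one transfer.
interface TransferCallback {
    void onSuccess(String key);
    void onFailure(String key, String errorMessage);
}

// Stand-in for the SDK's progress event types (these names are assumptions).
enum EventType { STARTED, BYTES_TRANSFERRED, COMPLETED, FAILED }

// Collapses many progress events into exactly one callback invocation.
final class ProgressAdapter {
    private final String key;
    private final TransferCallback callback;
    // Supplies the error text on failure, e.g. by asking the transfer handle.
    private final Supplier<String> errorSource;
    private final AtomicBoolean reported = new AtomicBoolean(false);

    ProgressAdapter(String key, TransferCallback callback, Supplier<String> errorSource) {
        this.key = key;
        this.callback = callback;
        this.errorSource = errorSource;
    }

    void progressChanged(EventType type) {
        if (type != EventType.COMPLETED && type != EventType.FAILED) {
            return; // intermediate events carry no final status
        }
        if (!reported.compareAndSet(false, true)) {
            return; // the final status was already reported once
        }
        if (type == EventType.COMPLETED) {
            callback.onSuccess(key);
        } else {
            callback.onFailure(key, errorSource.get());
        }
    }
}
```

The error message is fetched lazily through the Supplier because it is only needed on the failure path.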

The TransferManager API typically takes a Request object and an asynchronous status listener. It returns a Transfer instance that can be used to retrieve an error message in case of failure. Polling an S3 location for available files requires a loop because the results are returned in batches. Checking if a file exists at a given S3 path is implemented as an attempt to fetch the corresponding file metadata and treating a thrown exception as "FileNotFound".
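The batched listing loop and the metadata-based existence check can be sketched as follows. The ObjectStore interface, ListingPage and S3Polling are hypothetical stand-ins and not the AWS SDK API. In the SDK for Java the same roles are played, as far as I recall, by the truncation flag on the listing result with a "next batch" call, and by the object metadata lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// One batch of listing results plus a token for the next batch (null when done).
final class ListingPage {
    final List<String> keys;
    final String nextToken;

    ListingPage(List<String> keys, String nextToken) {
        this.keys = new ArrayList<>(keys);
        this.nextToken = nextToken;
    }
}

// Hypothetical stand-in for the storage client; not the AWS SDK interface.
interface ObjectStore {
    // token == null requests the first batch
    ListingPage list(String prefix, String token);

    // throws NoSuchElementException when the key is absent
    long contentLength(String key);
}

final class S3Polling {
    // Keeps requesting batches until the store reports that none are left.
    static List<String> listAll(ObjectStore store, String prefix) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            ListingPage page = store.list(prefix, token);
            all.addAll(page.keys);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    // Existence check: try to read metadata, treat "not found" as false.
    static boolean exists(ObjectStore store, String key) {
        try {
            store.contentLength(key);
            return true;
        } catch (NoSuchElementException e) {
            return false;
        }
    }
}
```

Catching only the "not found" case, rather than every exception, keeps permission or network errors from being silently reported as a missing file.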

S3 is a simple (duh!) service so there are only two additional notes. First, it is a good idea to encode some metadata, such as tenant id, version, or video resolution, into file names on S3. Parsing the name of a downloaded file then helps with deciding how to handle it. Second, it's convenient to superimpose a "directory structure" onto the flat namespace of the S3 bucket abstraction.

So a reasonable file naming convention might include a three-part prefix prepended to all relative file paths: a namespace representing either an environment (e.g. PROD) or a developer (in development deployments), a backend service name, and a file schema version. The version part in particular makes upgrades much easier in production. For example,
"$BUCKET/prod/somesvc/v2/relative_path/file.ext".
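A small helper along these lines can build and parse such keys. S3KeyName and its field names are made up for this sketch, and it assumes the namespace/service/version order shown in the example path, with the key taken relative to the bucket.

```java
// Hypothetical value class for keys shaped like <namespace>/<service>/v<N>/<relative path>.
final class S3KeyName {
    final String namespace;
    final String service;
    final int version;
    final String relativePath;

    S3KeyName(String namespace, String service, int version, String relativePath) {
        this.namespace = namespace;
        this.service = service;
        this.version = version;
        this.relativePath = relativePath;
    }

    // Builds the key (relative to the bucket) from its parts.
    static String build(String namespace, String service, int version, String relativePath) {
        return namespace + "/" + service + "/v" + version + "/" + relativePath;
    }

    // Splits off the three-part prefix; the relative path may itself contain slashes.
    static S3KeyName parse(String key) {
        String[] parts = key.split("/", 4);
        if (parts.length < 4 || !parts[2].matches("v\\d+") || parts[3].isEmpty()) {
            throw new IllegalArgumentException("Unexpected key layout: " + key);
        }
        int version = Integer.parseInt(parts[2].substring(1));
        return new S3KeyName(parts[0], parts[1], version, parts[3]);
    }
}
```

A receiver can then branch on the parsed version field to decide which file schema to expect.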

The diagram below shows a typical sequence of operations for uploading a file and then finding it with S3 polling from a different service.


In my example,
  • FileDownloader / FileUploader - abstractions used by the client to start a file transfer operation
  • TransferCallback - the callback interface called by file transfer operations to report final status to the client asynchronously
  • S3Destination - a way to specify a common bucket "subdirectory" for multiple files
  • S3TransferProgressListener - converts progress events to success/failure notifications; makes it possible to extract an error message
  • S3FileTransferClient / S3Client - TransferManager-based implementation of file transfer operations
  • S3Uploader / S3Downloader - S3Client-based implementation of file transfer API
  • S3ChangeDetector - a job to be run periodically (e.g. with ScheduledExecutorService::scheduleWithFixedDelay) on the receiver side to look for new files on S3
  • FileHandler - the callback interface called by S3ChangeDetector for every found file not seen before (the way of keeping track of previously seen files is likely to be application-specific)
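
The receiver side of this list can be sketched roughly as below. This is not my actual S3ChangeDetector: the ChangeDetector class, the Supplier used as the listing source, and the in-memory "seen" set are simplifying assumptions, and a real service would likely persist that set somewhere application-specific.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical handler invoked once per newly discovered key.
interface FileHandler {
    void handle(String key);
}

// Periodic job: lists the keys and forwards the ones not seen before.
final class ChangeDetector implements Runnable {
    private final Supplier<List<String>> lister; // stand-in for the S3 listing call
    private final FileHandler handler;
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    ChangeDetector(Supplier<List<String>> lister, FileHandler handler) {
        this.lister = lister;
        this.handler = handler;
    }

    @Override
    public void run() {
        try {
            for (String key : lister.get()) {
                // add() returns false for known keys. The key is marked as seen
                // before handling, so a failing handler is not retried here.
                if (seen.add(key)) {
                    handler.handle(key);
                }
            }
        } catch (RuntimeException e) {
            // An exception escaping run() would cancel all further scheduled runs.
            System.err.println("Polling failed, will retry on next run: " + e);
        }
    }
}
```

Scheduling it could then look like Executors.newSingleThreadScheduledExecutor().scheduleWithFixedDelay(detector, 0, 30, TimeUnit.SECONDS), where the 30-second delay is an arbitrary example value.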