s3downloader - an AWS S3 CLI tool

The Onefootball app is designed to deliver the most suitable football content to each individual user. To achieve this, our backend handles large volumes of data every hour. It is good practice to keep track of processed data, especially when it comes to football results, where the sequence of parsed files is just as important as their content.

s3downloader, written in Go, comes in handy when you need to quickly find or download a file from S3.

Case study

Imagine you have a football match between teams A and B. The backend updates the result based on a third-party provider feed with the naming pattern match-a-b-results.xml. Let's assume the feed is updated every 30 seconds and that we archive every delta update into a separate sub-folder, without overwriting the file. At the end of the game our S3 archive would look like this:

- 2016-01-05/143000/match-a-b-results.xml
- 2016-01-05/143030/match-a-b-results.xml
- 2016-01-05/143100/match-a-b-results.xml
- 2016-01-05/143130/match-a-b-results.xml
- ...
- ...
- ...
- 2016-01-05/171430/match-a-b-results.xml
- 2016-01-05/171500/match-a-b-results.xml

Getting up and running

To run the command you need to have Go installed on your machine. You can install Go either from source or via an installation package.

Once you have installed and configured Go, run go get -t -u github.com/motain/s3downloader - this will fetch the project source and install it on your machine.

To make sure you have installed the tool correctly, run the following command from the terminal:

s3downloader -h

The expected output is:

Usage of s3downloader:
  -bucket string
        Download bucket
  -dir string
        Target local dir (default "downloads-s3")
  -dry-run
        Find only flag - no download
  -p    Prepend downloaded file name with last-modified timestamp
  -prefix string
        Bucket download path
  -regexp string
        Item name regular expression (default ".*")

Configuration

Create a config file based on template.

cd $GOPATH/src/github.com/motain/s3downloader
cp config.json.dist config.json

Provide your valid AWS credentials in config.json, and make sure to set a valid S3 region.
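The exact fields are defined by config.json.dist in the repository; the field names below are only a guess at a typical shape, so copy them from the template rather than from here:

```json
{
  "AccessKey": "YOUR_AWS_ACCESS_KEY_ID",
  "SecretKey": "YOUR_AWS_SECRET_ACCESS_KEY",
  "Region": "eu-west-1"
}
```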

Example 1 - find all

Find all archived feeds from 2016-01-05:

s3downloader -bucket=archive -prefix=2016-01-05 -dry-run

would output:

INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/171500/match-a-b-results.xml
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143000/
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143030/
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143100/match-a-b-results.xml
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/171400/
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143030/match-a-b-results.xml
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143130/
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/143130/match-a-b-results.xml
INFO: 2016/01/05 15:13:58 Found: s3://archive/2016-01-05/171400/match-a-b-results.xml

Example 2 - download all

Download all files from 2016-01-05 to a local directory $HOME/s3test (it will be created if it does not exist), prepending each downloaded file name with its last-modified timestamp.

s3downloader -bucket=archive -prefix=2016-01-05 -p -dir=/home/username/s3test

You should see info messages similar to these in the terminal:

INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/171400/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:11:26Z_match-a-b-results.xml...
INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/143030/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:10:33Z_match-a-b-results.xml...
INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/143100/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:10:48Z_match-a-b-results.xml...
INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/143130/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:11:09Z_match-a-b-results.xml...
INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/143000/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:10:18Z_match-a-b-results.xml...
INFO: 2016/01/05 15:20:16 Downloading s3://archive/2016-01-05/171500/match-a-b-results.xml to /Users/username/s3test/2016-01-05T14:12:00Z_match-a-b-results.xml...

Example 3 - download or find by regexp

# Find results for all games
s3downloader -bucket=archive -regexp=^.*results\\.xml$ -dry-run

# Download all results archived after 17:00:00
s3downloader -bucket=archive -p -regexp=^.*17.*\\/.*results\\.xml$
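The -regexp value is matched against each object key. A minimal sketch of that filtering with Go's regexp package; filterKeys is a hypothetical helper, not s3downloader's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// filterKeys returns only the keys matching the given pattern,
// mimicking how a -regexp filter could be applied to a listing.
// (Hypothetical helper, for illustration only.)
func filterKeys(keys []string, pattern string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, k := range keys {
		if re.MatchString(k) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{
		"2016-01-05/143000/match-a-b-results.xml",
		"2016-01-05/171500/match-a-b-results.xml",
		"2016-01-05/143000/",
	}
	// Keep only result feeds archived after 17:00:00.
	fmt.Println(filterKeys(keys, `^.*17.*/.*results\.xml$`))
	// [2016-01-05/171500/match-a-b-results.xml]
}
```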

Summary

We wrote s3downloader because we were not happy with the performance and complexity of the AWS web interface and other clients. Thanks to Go's concurrency patterns, s3downloader can quickly scan even very large S3 buckets. The tool helps us a lot in our everyday work, and we hope it can be useful for you as well.

Written on January 8, 2016 by Siamion Makarski