Parallelizing S3 Workloads with s5cmd

This open source project comes from our customer community. It was developed by Peak Games to assist with their own S3 workflow, and includes features such as tab completion and built-in wildcard support for files in S3 commands. Enjoy!

Background

Until now, working on multiple objects in Amazon S3 from the command line meant either invoking multiple commands or relying on the limited wildcard support that some tools offer. Each command invocation is another fork/exec at the system level, and that overhead adds up when you need to run a few hundred or more operations.

The Tool

s5cmd lets you run multiple operations (with wildcards or not) using a single executable invocation. For example, if you have to delete (or copy) a few million objects, you don’t have to invoke the CLI tool a few million times. By piping the commands into s5cmd, you invoke the tool just once and let it run a few hundred workers to do the given work.
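As a rough sketch of the piping approach described above, the following shell snippet turns a list of object keys into one s5cmd command per line. The helper name make_get_commands and the sample keys are hypothetical; the get command, the -n (no-clobber) option, and the -f - form for reading commands from standard input are the ones used in the examples later in this post.

```shell
# Sketch only: build a batch of s5cmd commands from a list of object keys.
# BUCKET and the keys below are made-up sample values.
BUCKET="reports-bkt"

# Hypothetical helper: reads one object key per line on stdin and
# prints one "get -n" command per non-empty key on stdout.
make_get_commands() {
  while IFS= read -r key; do
    if [ -n "$key" ]; then
      printf 'get -n s3://%s/%s\n' "$BUCKET" "$key"
    fi
  done
}

# Preview the generated commands without contacting S3.
make_get_commands <<'EOF'
a/2018/03/14/reports_1.csv.gz
b/2018/03/14/reports_2.csv.gz
EOF

# To execute the batch in a single invocation (needs s5cmd and credentials),
# assuming keys.txt holds one key per line:
#   make_get_commands < keys.txt | s5cmd -f -
```

Printing the commands first is a cheap way to check a large batch before any worker touches S3.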

Since s5cmd already has a worker pool, wildcard operations can be accomplished by using a single worker for the ListObjects call (which can match further wildcards) and letting the other workers do the actual processing. It also supports shell autocompletion for bash and zsh, so if you’d like to use it as a more conventional CLI tool, you can just hit TAB and let it autocomplete options, buckets, and paths/objects for you.

Installation

Install s5cmd on Mac OS X:

$ brew tap peakgames/s5cmd https://github.com/peakgames/s5cmd
$ brew install s5cmd

The tool is written in Go. On other platforms, you can compile and install it using:

$ go get -u github.com/peakgames/s5cmd

Set up credentials just as you would for the awscli tool: Use the ~/.aws/credentials file or environment variables, or a combination of both. (If you’re running on EC2, roles are also supported.)
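Before starting a large batch, it can help to confirm which of these credential sources is present. The check below is only a sketch: credential_source is a hypothetical helper, not part of s5cmd, and it assumes the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and the default ~/.aws/credentials location used by awscli.

```shell
# Hypothetical helper (not part of s5cmd): report which credential source
# is visible to the current shell. Assumes the standard AWS variable names
# and the default credentials file location.
credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment"
  elif [ -f "${HOME}/.aws/credentials" ]; then
    echo "credentials-file"
  else
    # Neither was found; on EC2 an instance role may still provide credentials.
    echo "none"
  fi
}

credential_source
```

The helper only checks that a source exists; it does not validate the keys themselves.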

Usage

Commands are in “command [command options] argument1 [argument2]” format. s5cmd also takes options, which affect all commands run. To get the list of s5cmd options:

$ s5cmd -help

To get a list of available commands, run without arguments:

$ s5cmd

s5cmd in Action

Say we have a bucket named “reports-bkt”, and we have some files inside. First, let’s download one:

$ s5cmd get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz
                     # Downloading reports_19_13716285583145.csv.gz...
2018/03/21 11:46:05 +OK "get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz ./reports_19_13716285583145.csv.gz"

Now, let’s scan all of last month’s CSV reports and match each day’s report with a wildcard:

$ s5cmd du -g -h "s3://reports-bkt/a/2018/02/*/reports*csv.gz"
                            + 10.7M bytes in 367 objects: s3://reports-bkt/a/2018/02/*/reports*csv.gz [STANDARD]
2018/03/21 11:46:24 +OK "du s3://reports-bkt/a/2018/02/*/reports*csv.gz" (1)

Looks like there are 367 reports for the whole month, taking up about 10MB, all using standard storage. Let’s download them all:

$ s5cmd cp --parents "s3://reports-bkt/a/2018/02/*/reports*csv.gz" .

Using the --parents option, each day of the month will be downloaded to its own directory. (This option creates directory structure starting from the first wildcard specified.)
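To make the rule concrete, here is a small shell model of the path mapping as described above. It illustrates the stated behavior and is not s5cmd's actual implementation; parents_path is a hypothetical helper and the object name is a made-up sample.

```shell
# Hypothetical model of the --parents rule described in this post:
# keep the part of the object path that starts at the path segment
# containing the first wildcard.
parents_path() {
  pp_pattern="$1"
  pp_url="$2"
  pp_prefix="${pp_pattern%%[*]*}"   # text before the first '*'
  pp_prefix="${pp_prefix%/*}/"      # cut back to the last '/' before it
  printf '%s\n' "${pp_url#"$pp_prefix"}"
}

# Wildcard in a directory position: the day directory is kept.
parents_path 's3://reports-bkt/a/2018/02/*/reports*csv.gz' \
             's3://reports-bkt/a/2018/02/07/reports_1.csv.gz'
# prints 07/reports_1.csv.gz

# Wildcard only in the file name: nothing but the file name remains.
parents_path 's3://reports-bkt/a/2018/03/14/reports*csv.gz' \
             's3://reports-bkt/a/2018/03/14/reports_1.csv.gz'
# prints reports_1.csv.gz
```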

These examples are just the tip of the iceberg. Piping commands using a file or another command’s output is another option.

You might have noticed that our bucket has the usual “letter prefix” scheme. Let’s say you want to download all files for a given date, for all prefixes. This structure is something like:

- a/[yyyy]/[mm]/[dd]/object_unique_id.gz
- b/[yyyy]/[mm]/[dd]/object_unique_id.gz
- c/[yyyy]/[mm]/[dd]/object_unique_id.gz
... up to ...
- z/[yyyy]/[mm]/[dd]/object_unique_id.gz

If you have hundreds of days and billions of objects, specifying the wildcard at the first level won’t really work. Since you already know the range (letters a to z), you can generate commands for each of the prefixes. Invoke the tool just once, and let it do the work. Try this:

$ for X in {a..z}; do echo "get -n s3://reports-bkt/${X}/2018/03/14/reports*csv.gz"; done | s5cmd -f -
2018/03/21 11:48:03 # Stats: Total             379 281 ops/sec 1.350311978s

The for loop generates one “get” command per prefix, and the pipe passes those commands to s5cmd, which does the work. Notice that we’ve used the “-n” (no-clobber) option to prevent overwriting if the object names are not really unique. We can’t use the --parents option here because the wildcard is not in the directory name. You can see how many operations were done (and how much time they took) by checking the stat counters.
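The same idea extends to several dates. The sketch below wraps the loop in a hypothetical function, gen_gets, and spells out the letters instead of using {a..z} so that it also runs in shells without brace expansion; the dates are sample values. Printing the commands first lets you check them before handing them to s5cmd.

```shell
# Hypothetical helper: print one "get -n" command per letter prefix
# for a date path given as yyyy/mm/dd.
gen_gets() {
  day="$1"
  for prefix in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    printf 'get -n s3://reports-bkt/%s/%s/reports*csv.gz\n' "$prefix" "$day"
  done
}

# Preview: show the first two commands, then the total count, for two sample dates.
{ gen_gets 2018/03/14; gen_gets 2018/03/15; } | sed -n '1,2p'
{ gen_gets 2018/03/14; gen_gets 2018/03/15; } | wc -l

# To run them all in one invocation (needs s5cmd and credentials):
#   { gen_gets 2018/03/14; gen_gets 2018/03/15; } | s5cmd -f -
```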

Contributing

All contributions to the project are welcome, and managed using the issue tracker at github.com/peakgames/s5cmd. If you are going to submit a PR, we suggest you open an issue first to discuss it with the team.

This is a guest post from Peak Games, which leverages S3 as part of a comprehensive pipeline that distills data into knowledge, further enhancing user experience of their world-class mobile games.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
