Parallelizing S3 Workloads with s5cmd
This open source project comes from our customer community. It was developed by Peak Games to assist with their own S3 workflow, and includes features such as tab completion and built-in wild card support for files in S3 commands. Enjoy!
Background
Up until now, working on multiple objects in S3 from the command line has generally meant issuing requests one at a time, which quickly becomes a bottleneck as the number of objects grows. s5cmd was built to parallelize these workloads.
The Tool
Since s5cmd already has a worker pool, wildcard operations can be accomplished by using a single worker for the ListObjects call (which can match further wildcards) while the other workers do the actual processing. It also supports shell autocompletion for bash and zsh, so if you’d like to use it as a more conventional CLI tool, you can just hit TAB and let it autocomplete options, buckets, and paths/objects for you.
Installation
Install s5cmd on Mac OS X:
$ brew tap peakgames/s5cmd https://github.com/peakgames/s5cmd
$ brew install s5cmd
The tool is written in Go; on other platforms, you can compile and install it using:
$ go get -u github.com/peakgames/s5cmd
Set up credentials just as you would for the awscli tool: Use the ~/.aws/credentials file or environment variables, or a combination of both. (If you’re running on EC2, roles are also supported.)
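For example, a minimal credentials setup might look like the following sketch (the key values are placeholders; substitute your own, and note that environment variables take precedence over the file):

```shell
# Option 1: the shared credentials file, same as the AWS CLI uses
mkdir -p ~/.aws
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF

# Option 2: environment variables
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
```

Either mechanism (or both together) is picked up automatically; on EC2, an attached instance role works with no local configuration at all.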
Usage
Commands are in “command [command options] argument1 [argument2]” format. s5cmd also takes global options, which affect all commands in a run. To get the list of s5cmd options:
$ s5cmd -help
To get a list of available commands, run without arguments:
$ s5cmd
s5cmd in Action
Say we have a bucket named “reports-bkt”, and we have some files inside. First, let’s download one:
$ s5cmd get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz
# Downloading reports_19_13716285583145.csv.gz...
2018/03/21 11:46:05 +OK "get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz ./reports_19_13716285583145.csv.gz"
Now, let’s scan all of last month’s CSV reports and match each day’s report with a wildcard:
$ s5cmd du -g -h s3://reports-bkt/a/2018/02/*/reports*csv.gz
+ 10.7M bytes in 367 objects: s3://reports-bkt/a/2018/02/*/reports*csv.gz [STANDARD]
2018/03/21 11:46:24 +OK "du s3://reports-bkt/a/2018/02/*/reports*csv.gz" (1)
Looks like there are 367 reports for the whole month, taking up about 10.7 MB, all in standard storage. Let’s download them all:
$ s5cmd cp --parents s3://reports-bkt/a/2018/02/*/reports*csv.gz .
Using the --parents option, each day of the month is downloaded to its own directory. (This option creates the directory structure starting from the first wildcard specified.)
These examples are just the tip of the iceberg. Piping commands using a file or another command’s output is another option.
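As a sketch of the file-based form, you can put one s5cmd command per line in a file and feed it to the tool with the -f option (the bucket and object paths below are the same hypothetical ones used above; the final s5cmd invocation is commented out since it needs real credentials and a real bucket):

```shell
# Write one s5cmd command per line to a batch file
cat > /tmp/s5cmd-commands.txt <<'EOF'
get -n s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz
get -n s3://reports-bkt/b/2018/03/14/reports*csv.gz
EOF

# Feed the file to s5cmd; each line runs as its own command,
# distributed across the worker pool:
# s5cmd -f /tmp/s5cmd-commands.txt
```

Reading from standard input works the same way with “-f -”, as the prefix-generation example below demonstrates.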
You might have noticed that our bucket has the usual “letter prefix” scheme. Let’s say you want to download all files for a given date, for all prefixes. This structure is something like:
- a/[yyyy]/[mm]/[dd]/object_unique_id.gz
- b/[yyyy]/[mm]/[dd]/object_unique_id.gz
- c/[yyyy]/[mm]/[dd]/object_unique_id.gz
... up to ...
- z/[yyyy]/[mm]/[dd]/object_unique_id.gz
If you have hundreds of days and billions of objects, specifying the wildcard at the first level won’t really work. But since you already know the range (letters a to z), you can generate a command for each prefix, invoke the tool just once, and let it do the work. Try this:
$ for X in {a..z}; do echo get -n s3://reports-bkt/${X}/2018/03/14/reports*csv.gz; done | s5cmd -f -
2018/03/21 11:48:03 # Stats: Total 379 281 ops/sec 1.350311978s
The first command generates a batch of “get” commands, then pipes them to s5cmd to do the work. Notice that we’ve used the “-n” (no-clobber) option to prevent overwriting in case the object names are not actually unique; we can’t use the --parents option here because the wildcard is not in the directory name. You can see how many operations were performed (and how long they took) by checking the stat counters.
Contributing
All contributions to the project are welcome, and managed using the issue tracker at github.com/peakgames/s5cmd. If you are going to submit a PR, we suggest you open an issue first to discuss it with the team.
This is a guest post from Peak Games, which leverages S3 as part of a comprehensive pipeline that distills data into knowledge, further enhancing user experience of their world-class mobile games.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.