Accelerating Precision Medicine at Scale
This post is courtesy of Aaron Friedman, Healthcare and Life Sciences Partner Solutions Architect at AWS, and Angel Pizarro, Genomics and Life Sciences Senior Solutions Architect at AWS.
Precision medicine is tailored to individuals based on quantitative signatures, including genomics, lifestyle, and environment. It is often considered to be the driving force behind the next wave of human health. Through new initiatives and technologies such as population-scale genomics sequencing and IoT-backed wearables, researchers and clinicians in both commercial and public sectors are gaining new, previously inaccessible insights.
Many of these precision medicine initiatives are already happening on AWS. A few of these include:
- PrecisionFDA – This initiative is led by the US Food and Drug Administration. The goal is to define the next-generation standard of care for genomics in precision medicine.
- Deloitte ConvergeHEALTH – Gives healthcare and life sciences organizations the ability to analyze their disparate datasets on a single real-world evidence platform.
Central to many of these initiatives is genomics, which gives healthcare organizations the ability to establish a baseline for longitudinal studies. Because of its wide applicability in precision medicine initiatives, from rare disease diagnosis to improving the outcomes of clinical trials, genomics data is growing globally at a rate faster than Moore’s law. Many expect these datasets to grow to tens of exabytes by 2025.
Genomics data is also regularly re-analyzed by the community as researchers develop new computational methods or compare older data with newer genome references. These trends are driving innovations in data analysis methods and algorithms to address the massive increase of computational requirements.
Edico Genome, an AWS Partner Network (APN) Partner, has developed a novel solution that accelerates genomics analysis using field-programmable gate arrays (FPGAs). Historically, Edico Genome deployed their FPGA appliances on-premises. When AWS announced the Amazon EC2 F1 FPGA-based instance family in December 2016, Edico Genome adopted a cloud-first strategy, became an F1 launch partner, and was one of the first partners to deploy FPGA-enabled applications on AWS.
On October 19, 2017, Edico Genome partnered with the Children’s Hospital of Philadelphia (CHOP) to demonstrate DRAGEN, their FPGA-accelerated genomic pipeline software, which can significantly reduce time-to-insight for patient genomes. Using 1,000 EC2 f1.2xlarge instances in a single AWS Region, they analyzed 1,000 whole human genomes from the Center for Applied Genomics Biobank, setting a Guinness World Record for the fastest analysis of 1,000 whole human genomes. Not only were they able to analyze genomes at high throughput, they did so at an average of approximately $3 of AWS compute per whole human genome.
The version of DRAGEN that Edico Genome used for this analysis was also the same one used in the precisionFDA Hidden Treasures – Warm Up challenge, where they were one of the top performers in every assessment.
In the remainder of this post, we walk through the architecture used by Edico Genome, combining EC2 F1 instances and AWS Batch to achieve this milestone.
EC2 F1 instances and Edico’s DRAGEN
EC2 F1 instances provide access to programmable hardware acceleration using FPGAs at cloud scale. AWS customers use F1 instances for a wide variety of applications, including big data, financial analytics and risk analysis, image and video processing, engineering simulations, AR/VR, and accelerated genomics.
Edico Genome’s FPGA-backed DRAGEN Bio-IT Platform is now integrated with EC2 F1 instances. You can access the accuracy, speed, flexibility, and low compute cost of DRAGEN through a number of third-party platforms, AWS Marketplace, and Edico Genome’s own platform. The DRAGEN platform offers a scalable, accelerated, and cost-efficient secondary analysis solution for a wide variety of genomics applications. Edico Genome also provides a highly optimized mechanism for the efficient storage of genomic data.
Scaling DRAGEN on AWS
Edico Genome used 1,000 EC2 F1 instances to help their customer, the Children’s Hospital of Philadelphia (CHOP), process and analyze all 1,000 whole human genomes in parallel. They used AWS Batch to provision compute resources and orchestrate DRAGEN compute jobs across the 1,000 EC2 F1 instances. This solution addressed the challenge of creating a genomic processing pipeline that can easily scale to thousands of engines running in parallel.
Architecture
A simplified view of the architecture used for the analysis is shown in the following diagram:
- DRAGEN’s portal uses Elastic Load Balancing and Auto Scaling groups to scale out the EC2 instances that submit jobs to AWS Batch.
- Job metadata is stored in their Workflow Management (WFM) database, built on top of Amazon Aurora.
- The DRAGEN Workflow Manager API submits jobs to AWS Batch.
- These jobs are executed in an AWS Batch managed compute environment that is responsible for launching the EC2 F1 instances.
- These jobs run as Docker containers that have the requisite DRAGEN binaries for whole genome analysis.
- As each job runs, it retrieves and stores genomics data that is staged in Amazon S3.
The steps listed previously can also be bucketed into the following higher-level layers:
- Workflow: Edico Genome used their Workflow Management API to orchestrate the submission of AWS Batch jobs. Metadata for the jobs, such as the S3 locations of the input genomes, resides in the Workflow Management Database backed by Amazon Aurora. A minimal sketch of this submission step follows this list.
- Batch execution: AWS Batch launches EC2 F1 instances and coordinates the execution of DRAGEN jobs on these compute resources. AWS Batch enabled Edico to quickly and easily scale up to the full number of instances they needed as jobs were submitted. They also scaled back down as each job was completed, to optimize for both cost and performance.
- Compute/job: Edico Genome stored their binaries in a Docker container that AWS Batch deployed onto each of the F1 instances, giving each instance the ability to run DRAGEN without pre-installing the core executables. The AWS-based DRAGEN solution streams all genomics data from S3 for local computation and then writes the results to a destination bucket. An AWS Batch job role specified the IAM permissions, ensuring that DRAGEN had access only to the buckets or S3 key space it needed for the analysis, without embedding AWS credentials in the jobs.
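To make the workflow layer concrete, here is a minimal sketch of how a workflow manager could submit a single analysis to AWS Batch with boto3. It uses the job queue and job definition names created later in this walkthrough; the sample ID, S3 paths, and the arguments passed to the container are hypothetical placeholders, not Edico Genome’s actual Workflow Management API.

# Minimal sketch of the workflow layer: submit one analysis job to AWS Batch.
# The S3 URIs and container arguments below are hypothetical placeholders.
import boto3

batch = boto3.client("batch")

def submit_genome_job(sample_id, input_s3_uri, output_s3_uri):
    """Submit a single whole-genome analysis job to the DRAGEN job queue."""
    response = batch.submit_job(
        jobName="dragen-{}".format(sample_id),
        jobQueue="DRAGEN-OnDemand",
        jobDefinition="dragen-wgs",
        containerOverrides={
            # Arguments are passed to the container entrypoint; the real
            # DRAGEN parameters depend on your licensed DRAGEN release.
            "command": ["--input", input_s3_uri, "--output", output_s3_uri]
        },
    )
    return response["jobId"]

job_id = submit_genome_job(
    "NA12878",
    "s3://my-genomics-input/NA12878.fastq.gz",
    "s3://my-genomics-output/NA12878/",
)
print("Submitted AWS Batch job {}".format(job_id))

Because the job definition carries an IAM job role, the container picks up scoped S3 permissions automatically, so no credentials need to be embedded in the submission itself.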
Walkthrough
In the following sections, we dive deeper into several tasks that enabled Edico Genome’s scalable FPGA genome analysis on AWS:
- Prepare your Amazon FPGA Image for AWS Batch
- Create a Dockerfile and build your Docker image
- Set up your AWS Batch FPGA compute environment
Prerequisites
In brief, you need a modern Linux distribution (kernel 3.10+), the Amazon ECS container agent, the awslogs log driver, and Docker configured on your image. There are additional recommendations in the Compute Resource AMI specification.
Preparing your Amazon FPGA Image for AWS Batch
You can use any Amazon Machine Image (AMI) or Amazon FPGA Image (AFI) with AWS Batch, provided that it meets the Compute Resource AMI specification. This gives you the ability to customize any workload by increasing the size of root or data volumes, adding instance stores, and connecting with the FPGA (F) and GPU (G and P) instance families.
Next, install the AWS CLI:
pip install awscli
Add any additional software required to interact with the FPGAs on the F1 instances.
As a starting point, AWS publishes an FPGA Developer AMI in the AWS Marketplace. It is based on a CentOS Linux image and includes pre-integrated FPGA development tools. It also includes the runtime tools required to develop and use custom FPGAs for hardware acceleration applications.
For more information about how to set up custom AMIs for your AWS Batch managed compute environments, see Creating a Compute Resource AMI.
Building your Dockerfile
There are two common methods for connecting to AWS Batch to run FPGA-enabled algorithms. The first method, which is the route Edico Genome took, involves storing your binaries in the Docker container itself and running that on top of an F1 instance with Docker installed. The following code example is what a Dockerfile to build your container might look like for this scenario.
# DRAGEN_EXEC Docker image generator --
# Run this Dockerfile from a local directory that contains the latest release of
# - Dragen RPM and Linux DMA Driver available from Edico
# - Edico's Dragen WFMS Wrapper files
FROM centos:centos7
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# Install Basic packages needed for Dragen
RUN yum -y install \
    perl \
    sos \
    coreutils \
    gdb \
    time \
    systemd-libs \
    bzip2-libs \
    R \
    ca-certificates \
    ipmitool \
    smartmontools \
    rsync
# Install the Dragen RPM
RUN mkdir -m777 -p /var/log/dragen /var/run/dragen
ADD . /root
RUN rpm -Uvh /root/edico_driver*.rpm || true
RUN rpm -Uvh /root/dragen-aws*.rpm || true
# Auto generate the Dragen license
RUN /opt/edico/bin/dragen_lic -i auto
#########################################################
# Now install the Edico WFMS "Wrapper" functions
# Add development tools needed for some util
RUN yum groupinstall -y "Development Tools"
# Install necessary standard packages
RUN yum -y install \
    dstat \
    git \
    python-devel \
    python-pip \
    time \
    tree && \
    pip install --upgrade pip && \
    easy_install requests && \
    pip install psutil && \
    pip install python-dateutil && \
    pip install constants && \
    easy_install boto3
# Setup Python path used by the wrapper
RUN mkdir -p /opt/workflow/python/bin
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python2.7
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python
# Install d_haul and dragen_job_execute wrapper functions and associated packages
RUN mkdir -p /root/wfms/trunk/scheduler/scheduler
COPY scheduler/d_haul /root/wfms/trunk/scheduler/
COPY scheduler/dragen_job_execute /root/wfms/trunk/scheduler/
COPY scheduler/scheduler/aws_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/constants.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/job_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/logger.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/scheduler_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/webapi.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/wfms_exception.py /root/wfms/trunk/scheduler/scheduler/
RUN touch /root/wfms/trunk/scheduler/scheduler/__init__.py
# Landing directory should be where DJX is located
WORKDIR "/root/wfms/trunk/scheduler/"
# Debug print of container's directories
RUN tree /root/wfms/trunk/scheduler
# Default behaviour. Over-ride with --entrypoint on docker run cmd line
ENTRYPOINT ["/root/wfms/trunk/scheduler/dragen_job_execute"]
CMD []
Note: Edico Genome’s custom Python wrapper functions for its Workflow Management System (WFMS) in the latter part of this Dockerfile should be replaced with functions that are specific to your workflow.
The second method is to install the binaries on the underlying instance and then use Docker as a lightweight connector between AWS Batch and the AFI. For example, this might be the route you choose if you provision DRAGEN from the AWS Marketplace.
In this case, the Dockerfile would not contain the installation of the DRAGEN binaries, but would contain any other packages necessary for job completion. When you run your Docker container, you enable it to access the underlying host file system, where those binaries reside.
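As a rough illustration of this second method, the following boto3 sketch registers a job definition whose container runs privileged and mounts a host path where the DRAGEN binaries and driver are assumed to live (for example /opt/edico, the install path used in the Dockerfile above). The image name, host path, and role ARN are placeholders; consult Edico Genome’s documentation for the exact requirements of a Marketplace-based deployment.

# Sketch of the "lightweight connector" approach: DRAGEN is installed on the
# host AMI, and a thin container is given privileged access plus a bind mount
# so that the host binaries and FPGA driver are visible inside the container.
# The image, paths, and role ARN below are hypothetical placeholders.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="dragen-wgs-host-binaries",
    type="container",
    containerProperties={
        "image": "amazonlinux:2",   # thin connector image, not a DRAGEN image
        "vcpus": 8,                 # claim the whole f1.2xlarge
        "memory": 120000,
        "privileged": True,         # needed for direct access to the FPGA device
        "volumes": [
            {"name": "edico", "host": {"sourcePath": "/opt/edico"}}
        ],
        "mountPoints": [
            {"sourceVolume": "edico", "containerPath": "/opt/edico", "readOnly": False}
        ],
        "jobRoleArn": "arn:aws:iam::012345678910:role/dragen-batch-job-role"
    }
)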
Connecting to AWS Batch
AWS Batch provisions compute resources and runs your jobs, choosing the right instance types based on your job requirements and scaling down resources as work is completed. AWS Batch users submit a job, based on a template called a job definition, to an AWS Batch job queue.
Job queues are mapped to one or more compute environments that describe the quantity and types of resources that AWS Batch can provision. In this case, Edico created a managed compute environment that was able to launch 1,000 EC2 F1 instances across multiple Availability Zones in us-east-1. As jobs are submitted to a job queue, the service launches the required quantity and types of instances that are needed. As instances become available, AWS Batch then runs each job within appropriately sized Docker containers.
The Edico Genome workflow manager API submits jobs to an AWS Batch job queue. This job queue maps to an AWS Batch managed compute environment containing On-Demand F1 instances. The following steps show how to set this up yourself.
To create the compute environment that DRAGEN can use:
aws batch create-compute-environment --cli-input-json file://<path_to_json_file>/F1OnDemand.json
Where your JSON file contains the following code (replace with your own resource IDs):
{
    "computeEnvironmentName": "F1OnDemand",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 128,
        "desiredvCpus": 0,
        "instanceTypes": [
            "f1.2xlarge",
            "f1.16xlarge"
        ],
        "subnets": [
            "subnet-220c0e0a",
            "subnet-1a95556d",
            "subnet-978f6dce"
        ],
        "securityGroupIds": [
            "sg-cf5093b2"
        ],
        "ec2KeyPair": "id_rsa",
        "instanceRole": "ecsInstanceRole",
        "tags": {
            "Name": "Batch Instance - F1OnDemand"
        }
    },
    "serviceRole": "arn:aws:iam::012345678910:role/service-role/AWSBatchServiceRole"
}
And the corresponding job queue:
aws batch create-job-queue --cli-input-json file://<path_to_json_file>/dragen.json
Where dragen.json is as follows:
{
    "jobQueueName": "DRAGEN-OnDemand",
    "state": "ENABLED",
    "priority": 100,
    "computeEnvironmentOrder": [
        {
            "order": 1,
            "computeEnvironment": "F1OnDemand"
        }
    ]
}
An f1.2xlarge EC2 instance contains one FPGA, eight vCPUs, and 122 GiB of RAM. Because DRAGEN requires an entire FPGA to run, Edico Genome needed to ensure that only one analysis executed on an instance at a time. By using the full f1.2xlarge vCPU and memory allocation as a proxy in their AWS Batch job definition, Edico Genome could ensure that only one job runs on an instance at a time. Here’s what that looks like in the AWS CLI:
aws batch register-job-definition --job-definition-name dragen-wgs --type container --container-properties "{\"image\": \"${DRAGEN_IMAGE}\", \"vcpus\": 8, \"memory\": 120000}"
Now, you can submit jobs easily to your DRAGEN environment:
aws batch submit-job --job-name dragen-run1 --job-queue DRAGEN-OnDemand --job-definition dragen-wgs --container-overrides "command=${RUN_PARAMETERS}"
You can query the status of your DRAGEN job with the following command:
aws batch describe-jobs --jobs <the job ID from the above command>
The logs for your job are written to the /aws/batch/job CloudWatch log group.
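If you are submitting many genomes at once, as Edico Genome did, polling job status from a script is often more convenient than calling the CLI repeatedly. The following is a minimal boto3 sketch that waits for a list of job IDs (whatever submit-job returned) to reach a terminal state; production code would add timeouts and error handling.

# Minimal sketch: poll AWS Batch until every job in a list has finished.
import time
import boto3

batch = boto3.client("batch")

def wait_for_jobs(job_ids, poll_seconds=60):
    """Block until each job ID has reached SUCCEEDED or FAILED."""
    pending = set(job_ids)
    while pending:
        # describe_jobs accepts up to 100 job IDs per request
        response = batch.describe_jobs(jobs=list(pending)[:100])
        for job in response["jobs"]:
            if job["status"] in ("SUCCEEDED", "FAILED"):
                print("{} ({}): {}".format(job["jobName"], job["jobId"], job["status"]))
                pending.discard(job["jobId"])
        if pending:
            time.sleep(poll_seconds)

# Example:
# wait_for_jobs(["<the job ID from the submit-job command above>"])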
Conclusion
In this post, we demonstrated how to set up an environment with AWS Batch that can run DRAGEN on EC2 F1 instances at scale. If you followed the walkthrough, you’ve replicated much of the architecture Edico Genome used to set the Guinness World Record.
There are several ways in which you can harness the computational power of DRAGEN to analyze genomes at scale. First, DRAGEN is available through several different genomics platforms, such as the DNAnexus Platform. DRAGEN is also available on the AWS Marketplace. You can apply the architecture presented in this post to build a scalable solution that is both performant and cost-optimized.
For more information about how AWS Batch can facilitate genomics processing at scale, be sure to check out our aws-batch-genomics GitHub repo on high-throughput genomics on AWS.