Interactive Analysis of Genomic Datasets Using Amazon Athena
Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services
The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost of sequencing a genome, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with the variation data of other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our public datasets program.
To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets of interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.
Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena, there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g., Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don't have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data processing on Amazon EMR.
In this post, I discuss how to prepare genomic data for analysis with Amazon Athena and demonstrate how Athena is well suited to address common genomics query paradigms. I use the Thousand Genomes dataset, from a seminal genomics study and hosted on Amazon S3, to demonstrate these approaches. All code used in this post is available in our GitHub repository.
Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.
Select, aggregate, annotate query pattern in genomics
Genomics researchers may ask different questions of their dataset, such as:
- What variations in a genome may increase the risk of developing disease?
- What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
- What variations in a genome influence how an individual may respond to a specific drug treatment?
- Does a group of individuals contain a higher frequency of a genomic variant known to alter drug response, relative to the general population?
All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.
In each of the above queries, you execute the following steps:
SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).
AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.
ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.
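To make the pattern concrete before you see the real queries later in this post, here is a minimal sketch of how it maps onto SQL, expressed as a Python query template. All table and column names here (population_variants, annotations, variant_key, cohort_filter) are placeholders for illustration, not the schema defined below.

# A minimal sketch of the Select-Aggregate-Annotate pattern as a SQL template.
# All table and column names are placeholders, not the actual schema used later.
PATTERN_TEMPLATE = """
SELECT count(*) / cast(n.numsamples AS DOUBLE) AS frequency,
       anno.annotation
FROM population_variants pv
CROSS JOIN (
    SELECT count(DISTINCT sampleid) AS numsamples   -- SELECT: size the cohort
    FROM population_variants
    WHERE {cohort_filter}
) n
JOIN annotations anno                               -- ANNOTATE: join known facts
  ON pv.variant_key = anno.variant_key
WHERE {cohort_filter}
GROUP BY anno.annotation, n.numsamples              -- AGGREGATE: per-variant stats
ORDER BY frequency DESC
"""

# Example: restrict the cohort to samples whose IDs begin with 'NA12'.
print(PATTERN_TEMPLATE.format(cohort_filter="sampleid LIKE 'NA12%'"))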
Dataset preparation
Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analyses. Based on the query pattern I just described, the table representing your population needs to have the following information:
- A unique sample ID corresponding to each sample in your population
- Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
- Information about how many times a variant occurs in a sample (0, 1, or 2 times), as well as whether there are multiple variants at the same site. This is known as a genotype.
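For intuition, a couple of hypothetical records in this representation might look like the following sketch. The positions and genotype encoding are assumptions for illustration; ADAM, the tool used below, labels the two called alleles per site as Ref/Alt.

# Hypothetical example records illustrating the target representation.
# The genotype encoding is an assumption: ADAM labels the two called alleles
# as "Ref"/"Alt", so a heterozygous site carries one of each.
example_rows = [
    {"sampleid": "NA12878", "chromosome": "22",
     "startposition": 16050074, "endposition": 16050075,
     "referenceallele": "A", "alternateallele": "G",
     "genotype0": "Alt", "genotype1": "Ref"},   # heterozygous: one alt copy
    {"sampleid": "NA12878", "chromosome": "22",
     "startposition": 16050114, "endposition": 16050115,
     "referenceallele": "G", "alternateallele": "A",
     "genotype0": "Alt", "genotype1": "Alt"},   # homozygous alt: two copies
]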
The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing in a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the fields that you need for analysis with Athena.
First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.
After you SSH into your master node, clone the git repository. You can also run these commands as a bootstrap action when spinning up your cluster, as shown in the sketch after the commands below.
sudo yum -y install git
git clone https://github.com/awslabs/aws-big-data-blog.git
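If you prefer to script the cluster launch, the following is a hedged sketch using boto3 that performs the clone as a bootstrap action. The key pair, instance count, and bootstrap script path are placeholders you would adapt; this is not the exact launch configuration used in this post.

# A hedged sketch: launching the EMR cluster with boto3 and running the git
# clone as a bootstrap action. Placeholders: key pair, bootstrap script path.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="athena-genomics-etl",
    ReleaseLabel="emr-5.0.3",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "c4.xlarge",
        "SlaveInstanceType": "m4.4xlarge",
        "InstanceCount": 4,  # 1 master + 3 core; size to your workload
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "<my-key-pair>",
    },
    BootstrapActions=[{
        "Name": "clone-blog-repo",
        # Placeholder: a small script on S3 containing the two commands above.
        "ScriptBootstrapAction": {"Path": "s3://<mytestbucket>/bootstrap/clone_repo.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])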
You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:
# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH
# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
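# Build ADAM against Hadoop 2.7.2 to match the Hadoop version on the cluster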
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests
export PATH=/home/hadoop/adam/bin:$PATH
Now that ADAM is installed, enter the working directory of the repo that you cloned earlier, which contains the code relevant to this post. Your next step is to convert a VCF file containing all the variant information about your population.
For this post, you are going to focus on a single chromosome, chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is the first one here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22 or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.
PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH
After this conversion completes, you can reduce your Parquet file to only the fields you need.
TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH
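For reference, the trimming step is conceptually similar to the following sketch. The actual create_trimmed_parquet.py in the repository is the authoritative version; the nested field names here assume ADAM's genotype schema and may differ across ADAM versions.

# A simplified sketch of the trimming step, assuming ADAM's genotype schema.
# The repository's create_trimmed_parquet.py is the authoritative version.
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input_s3_path", required=True)
parser.add_argument("--output_s3_path", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("trim-genotypes").getOrCreate()

genotypes = spark.read.parquet(args.input_s3_path)

# Keep only the fields the Athena table needs, flattened and lowercased.
trimmed = genotypes.select(
    genotypes["variant.alternateAllele"].alias("alternateallele"),
    genotypes["variant.contigName"].alias("chromosome"),
    genotypes["variant.end"].alias("endposition"),
    genotypes["alleles"].getItem(0).cast("string").alias("genotype0"),
    genotypes["alleles"].getItem(1).cast("string").alias("genotype1"),
    genotypes["variant.referenceAllele"].alias("referenceallele"),
    genotypes["sampleId"].alias("sampleid"),
    genotypes["variant.start"].alias("startposition"),
)

trimmed.write.parquet(args.output_s3_path)
spark.stop()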
Annotation data preparation
For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.
In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):
ANNOPATH=s3://<mytestbucket>/clinvar/
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt
aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse
Creating tables using Amazon Athena
In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses Hive DDL when you create, alter, or drop a table, which means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.
First, create your database:
CREATE DATABASE demo;
Next, define your tables with the appropriate mappings from the previous ETL processes. Be sure to change the LOCATION values to your appropriate bucket name.
For your population data:
If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.
CREATE EXTERNAL TABLE demo.samplevariants
(
alternateallele STRING
,chromosome STRING
,endposition BIGINT
,genotype0 STRING
,genotype1 STRING
,referenceallele STRING
,sampleid STRING
,startposition BIGINT
)
STORED AS PARQUET
LOCATION 's3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';
For your annotation data:
CREATE EXTERNAL TABLE demo.clinvar
(
alleleId STRING
,variantType STRING
,hgvsName STRING
,geneID STRING
,geneSymbol STRING
,hgncId STRING
,clinicalSignificance STRING
,clinSigSimple STRING
,lastEvaluated STRING
,rsId STRING
,dbVarId STRING
,rcvAccession STRING
,phenotypeIds STRING
,phenotypeList STRING
,origin STRING
,originSimple STRING
,assembly STRING
,chromosomeAccession STRING
,chromosome STRING
,startPosition INT
,endPosition INT
,referenceAllele STRING
,alternateAllele STRING
,cytogenetic STRING
,reviewStatus STRING
,numberSubmitters STRING
,guidelines STRING
,testedInGtr STRING
,otherIds STRING
,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';
You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.
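You can also run the same sanity check programmatically. Here is a minimal sketch using boto3; the OutputLocation bucket is a placeholder for any S3 location that Athena can write results to.

# A minimal sketch of running the sanity-check query through the Athena API.
# The OutputLocation bucket is a placeholder; adapt it to your account.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT * FROM samplevariants LIMIT 10",
    QueryExecutionContext={"Database": "demo"},
    ResultConfiguration={"OutputLocation": "s3://<mytestbucket>/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])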
Applying the select, aggregate, annotate paradigm
In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, where a genotype is a distinct combination of all fields in the samplevariants table except sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.
Population drug response
What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?
In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples whose IDs share the prefix "NA12" are part of a specific demographic.
In this query, sampleid serves as your selection predicate. The general steps are:
- Filter by the samples in your subpopulation
- Aggregate variant frequencies for the subpopulation-of-interest
- Join on ClinVar dataset
- Filter by variants that have been implicated in drug-response
- Order by highest frequency variants
To answer this question, you can craft the following query:
SELECT
count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
,cv.rsid
,cv.phenotypelist
,sv.chromosome
,sv.startposition
,sv.endposition
,sv.referenceallele
,sv.alternateallele
,sv.genotype0
,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
(SELECT count(1) AS numsamples
FROM
(SELECT DISTINCT sampleid
FROM demo.samplevariants
WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
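-- ClinVar positions are 1-based, while the ADAM-derived Parquet stores 0-based starts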
AND sv.startposition = cv.startposition - 1
AND sv.endposition = cv.endposition
AND sv.referenceallele = cv.referenceallele
AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
AND cv.clinicalsignificance LIKE '%response%'
AND sampleid LIKE 'NA12%'
GROUP BY sv.chromosome
,sv.startposition
,sv.endposition
,sv.referenceallele
,sv.alternateallele
,sv.genotype0
,sv.genotype1
,cv.clinicalsignificance
,cv.phenotypelist
,cv.rsid
,numsamples
ORDER BY genotypefrequency DESC LIMIT 50
Enter the query into the console and run it.
When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.
Quality control
Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in the quality of sequencing or errors in the reference? Are any of these variants clinically implicated? If so, could they be false positive clinical findings?
In this query, the entire population is needed, so there is no SELECT predicate. The general steps are:
- Aggregate variant frequencies for the entire Thousand Genomes population
- Join on the ClinVar dataset
- Filter by variants that have been implicated in disease
- Order by the highest frequency variants
This translates into the following query:
SELECT
count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
,cv.clinicalsignificance
,cv.phenotypelist
,sv.chromosome
,sv.startposition
,sv.endposition
,sv.referenceallele
,sv.alternateallele
,sv.genotype0
,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
(SELECT count(1) AS numsamples
FROM
(SELECT DISTINCT sampleid
FROM demo.samplevariants))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
AND sv.startposition = cv.startposition - 1
AND sv.endposition = cv.endposition
AND sv.referenceallele = cv.referenceallele
AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
AND cv.clinsigsimple='1'
GROUP BY sv.chromosome
,sv.startposition
,sv.endposition
,sv.referenceallele
,sv.alternateallele
,sv.genotype0
,sv.genotype1
,cv.clinicalsignificance
,cv.phenotypelist
,numsamples
ORDER BY genotypefrequency DESC LIMIT 50
When you inspect the results, you can see that the highest frequency results have conflicting information, being listed both as potentially causing disease and as benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.
Summary
I hope you have seen how easily you can integrate datasets of different types and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore, or want to learn more about Amazon Athena? Check out our Getting Started Guide, and happy querying!
If you have questions or suggestions, please comment below.
About the author
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he's out hiking or learning to cook something new.