Running R on Amazon Athena
Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL queries from R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL, without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.
In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.
Prerequisites
Before you get started, complete the following steps.
- Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) service.
- Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this bucket to stage query data and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow.
NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.
Set up R and RStudio on EC2
- Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or larger) running Amazon Linux. Read the next step before you begin.
- In that blog post, under “Advanced Details,” when you reach step 3, use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd
Install Java 8
- SSH into this EC2 instance.
- Remove older versions of Java.
- Install Java 8, which is required to work with Athena.
- Run the following commands on the command line.
#install Java 8, select 'y' from the options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select 'y' from the options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure Java, choose 1 as your selection option for Java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf
#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel
Set up .Renviron
You need to configure the R environment file .Renviron with the required Athena credentials.
- Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- Type the following command at the Linux command prompt to bring up the vi editor.
sudo vim /home/rstudio/.Renviron
In the editor, provide your Athena credentials in the following form:
ATHENA_USER=<AWS_ACCESS_KEY_ID>
ATHENA_PASSWORD=<AWS_SECRET_ACCESS_KEY>
- Save this file and exit from the editor.
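After you restart your R session (or start RStudio), the variables defined in .Renviron are loaded automatically. As an optional quick check (a minimal sketch using the base R functions readRenviron() and Sys.getenv()), you can confirm the credentials are visible to R:
#--optional: reload .Renviron for the current session and confirm the variables are set
readRenviron("/home/rstudio/.Renviron")
nchar(Sys.getenv("ATHENA_USER")) > 0       #should return TRUE
nchar(Sys.getenv("ATHENA_PASSWORD")) > 0   #should return TRUE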
Log in to RStudio
Next, you’ll log in to RStudio on your EC2 instance.
- Confirm that your IP address is whitelisted for inbound access on port 8787 in the security group associated with your EC2 instance.
- Get the public IP address of your instance from the EC2 dashboard and paste it, followed by :8787 (the port RStudio Server listens on), into your browser window.
- Log in to RStudio with the username and password you provided previously.
Install R packages
Next, you’ll install and load the required R packages.
#--following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)
#--following R packages are required for the interactive application we build later
#--steps below might take several minutes to complete
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)
Connect to Athena
The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.
#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
list.files()
#set up driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena
#replace 'athenauser' below with the prefix you have set up for your S3 bucket
con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir="s3://athenauser-athena-r",
                 user=Sys.getenv("ATHENA_USER"),
                 password=Sys.getenv("ATHENA_PASSWORD"))
#in case of error or warning from the step above, ensure the rJava and RJDBC packages have been loaded
#also ensure you have Java 8 running and configured for R as outlined earlier
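If the connection still fails, one more sanity check (a minimal sketch using standard rJava calls; adjust for your setup as needed) is to confirm which Java version R is actually bound to:
#--optional sanity check: the version string should start with "1.8" for Java 8
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")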
Now you’re ready to start querying Athena from RStudio.
Sample Queries to test
# get a list of all tables currently in Athena
dbListTables(con)
# run a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb,2)
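For larger result sets, you may prefer to fetch rows in chunks instead of pulling everything at once. A minimal sketch using the standard DBI/RJDBC functions dbSendQuery(), fetch(), and dbClearResult() against the same sample table looks like this:
#--fetch query results in chunks instead of all at once
res <- dbSendQuery(con, "SELECT * FROM sampledb.elb_logs limit 100")
chunk <- fetch(res, 10)    #retrieve the first 10 rows
dim(chunk)
dbClearResult(res)         #release the result set when done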
Interactive Use Case
Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.
Create a table in Athena from R using the GDELT dataset. You can also perform this step from the AWS Management Console, as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”
#---sql create table statement in Athena
dbSendQuery(con,
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
;
"
)
dbListTables(con)
You should see the newly created table, gdeltmaster, listed in your RStudio console after you execute the statement above.
Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.
#--get count of all CAMEO events that took place in US in year 2015
#--save results in R dataframe
dfg<-dbGetQuery(con,"SELECT eventcode,count(*) as count
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
group by eventcode
order by eventcode desc"
)
str(dfg)
head(dfg,2)
#--get list of top 5 most frequently occurring events in US in 2015
dfs=head(arrange(dfg,desc(count)),5)
dfs
From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”
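If you want to see this distribution visually, an optional bar chart of the top five event codes (a sketch using ggplot2, which is attached along with ggmap) can be generated from the dfs data frame:
#--optional: visualize the top 5 event codes as a bar chart
ggplot(dfs, aes(x = reorder(eventcode, -count), y = count)) +
  geom_bar(stat = "identity") +
  xlab("CAMEO event code") +
  ylab("Count of events in the US, 2015")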
Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.
#--get a list of latitude and longitude associated with event “042”
#--save results in R dataframe
dfgeo<-dbGetQuery(con,"SELECT actiongeo_lat,actiongeo_long
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
and eventcode = '042'
"
)
#--duration of above query will depend on factors like size of chosen EC2 instance
#--now rename columns in dataframe for brevity
names(dfgeo)[names(dfgeo)=="actiongeo_lat"]="lat"
names(dfgeo)[names(dfgeo)=="actiongeo_long"]="long"
names(dfgeo)
#let us inspect this R dataframe
str(dfgeo)
head(dfgeo,5)
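Depending on the rows returned, some records may have missing coordinates, which would trigger warnings when plotted. As an optional cleanup step (a sketch assuming you only want complete latitude/longitude pairs), you can drop those rows before mapping:
#--optional: drop rows with missing latitude or longitude before plotting
dfgeo <- dfgeo[complete.cases(dfgeo), ]
nrow(dfgeo)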
Next, generate a map for the United States.
#--generate map for the US using the ggmap package
map=qmap('USA',zoom=3)
map
Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events occurred in 2015.
#--plot our geo-coordinates on the US map
map + geom_point(data = dfgeo, aes(x = long, y = lat), color="blue", size=0.5, alpha=0.5)
By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.
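When you have finished exploring, it’s good practice to close the connection to Athena. The standard DBI function dbDisconnect() handles this:
#--close the connection to Athena when you are done
dbDisconnect(con)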
Conclusion
You’ve learned how to build a simple interactive application with Athena and R. Athena lets you query the underlying data for your big data applications directly in Amazon S3 using standard SQL, while R can interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.
If you have questions or suggestions, please leave your feedback in the comments.
About the Author
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.