1. 程式人生 > >Running R on Amazon Athena

Running R on Amazon Athena

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.

In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.

Prerequisites

Before you get started, complete the following steps.

    1. Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s
      Identity and Access Management
      (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.

RAthena_1

  1. Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow.

NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.

Set up R and RStudio on EC2

  1. Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or greater) running Amazon Linux . Read the step below before you begin.
  1. In that blog post under “Advanced Details,” when you reach step 3 use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd

Install Java 8

  1. SSH into this EC2 instance.
  2. Remove older versions of Java.
  3. Install Java 8. This is required to work with Athena.
  4. Run the following commands on the command line.
#install Java 8, select ‘y’ from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select ‘y’ from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf

#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel

Set up .Renviron

You need to configure the R environment variable .Renviron with the required Athena credentials.

  1. Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  1. Type the following command from the Linux command prompt and bring up the vi editor.
sudo vim /home/rstudio/.Renviron

Provide your Athena credentials in the following form into the editor:
ATHENA_USER=< AWS_ACCESS_KEY_ID >
ATHENA_PASSWORD=< AWS_SECRET_ACCESS_KEY>
  1. Save this file and exit from the editor.

Log in to RStudio

Next, you’ll log in to RStudio on your EC2 instance.

  1. Get the public IP address of your instance from the EC2 dashboard and paste it followed by :8787 (port number for RStudio) into your browser window.
  1. Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance.
  1. Log in to RStudio with the username and password you provided previously.

Install R packages

Next, you’ll install and load the required R packages.

#--following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)

#--following R packages are required for the interactive application we build later
#--steps below might take several minutes to complete
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)

Connect to Athena

The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.

#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
list.files()
#set up driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena 
#replace ‘athenauser’ below with prefix you have set up for your S3 bucket
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://athenauser-athena-r",
user=Sys.getenv("ATHENA_USER"),
password=Sys.getenv("ATHENA_PASSWORD"))
#in case of error or warning from step above ensure rJava and RJDBC packages have #been loaded 
#also ensure you have Java 8 running and configured for R as outlined earlier

Now you’re ready to start querying Athena from RStudio.

Sample Queries to test

# get a list of all tables currently in Athena 
dbListTables(con)
# run a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb,2)

RAthena_2

Interactive Use Case

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

#---sql  create table statement in Athena
dbSendQuery(con, 
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
;
"
)



dbListTables(con)

You should see this newly created table named ‘gdeltmaster’ appear in your RStudio console after executing the statement above.

RAthena_3

Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.

#--get count of all CAMEO events that took place in US in year 2015 
#--save results in R dataframe
dfg<-dbGetQuery(con,"SELECT eventcode,count(*) as count
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
group by eventcode
order by eventcode desc"
)
str(dfg)
head(dfg,2)

RAthena_4

#--get list of top 5 most frequently occurring events in US in 2015
dfs=head(arrange(dfg,desc(count)),5)
dfs

RAthena_5

From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”

Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.

#--get a list of latitude and longitude associated with event “042” 
#--save results in R dataframe
dfgeo<-dbGetQuery(con,"SELECT actiongeo_lat,actiongeo_long
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
and eventcode = '042'
"
)
#--duration of above query will depend on factors like size of chosen EC2 instance
#--now rename columns in dataframe for brevity
names(dfgeo)[names(dfgeo)=="actiongeo_lat"]="lat"
names(dfgeo)[names(dfgeo)=="actiongeo_long"]="long"
names(dfgeo)
#let us inspect this R dataframe
str(dfgeo)
head(dfgeo,5)

RAthena_6
Next, generate a map for the United States.

#--generate map for the US using the ggmap package
map=qmap('USA',zoom=3)
map

RAthena_7

Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events had occurred in 2015.

#--plot our geo-coordinates on the US map
map + geom_point(data = dfgeo, aes(x = dfgeo$long, y = dfgeo$lat), color="blue", size=0.5, alpha=0.5)

RAthena_8

By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.

Conclusion

You’ve learned how to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.

If you have questions or suggestions, please leave your feedback in the comments.

About the Author

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

Related

o_IoT_Minutes_1

相關推薦

Running R on Amazon Athena

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is

Resolve "Access Denied" Errors When Running Amazon Athena Queries

Athena reads data from Amazon Simple Storage Service (Amazon S3) buckets using the AWS Identity and Access Management (IAM) credentials of the

Running Windows Containers on Amazon ECS

This post was developed and written by Jeremy Cowan, Thomas Fuller, Samuel Karp, and Akram Chetibi. — Containers have revolutioni

RAmazon Athena を活用する

データサイエンティストはしばしば、R から SQL クエリを投げるときに、その裡側のビッグデータ基盤のインフラ管理を気に掛けなければなりません。Amazon Athena はインフラ管理の必要がなく、標準 SQL で簡単に S3 上のデータを直接分析できる、インタラクティブクエリサービスです

Best Practices for Running Apache Cassandra on Amazon EC2

Apache Cassandra is a commonly used, high performance NoSQL database. AWS customers that currently maintain Cassandra on-premises may want to take

Have You Tried Delphi on Amazon Linux? (就是AWS用的Linux)

enables custom customers servers nbsp ble exists compile targe The new Delphi Linux compiler enables customers to take new or existing Wi

install r on ubuntu

Language ase example plot ubuntu http dpkg .org library install r-base on ubuntu: sudo apt-get install -y r-base download rstudio: https:

Apache Spark 2.2.0 中文文檔 - SparkR (R on Spark) | ApacheCN

機器學習 matrix ren mes 網頁 eve growth ear 統計 SparkR (R on Spark) 概述 SparkDataFrame 啟動: SparkSession 從 RStudio 來啟動 創建 SparkDataFrames 從本地

Running Kubernetes on windows

docke .com target ase top mmu tps release targe docker-for-desktop minikube GKE cluster(?) docker-for-desktop https://doc

How to force https on amazon elastic beanstalk

conf 使用 quest The 沒有 last director 改變 etc 假設您已在負載平衡器安全組中啟用https,將SSL證書添加到負載平衡器,將443添加到負載平衡器轉發的端口,並使用Route 53將您的域名指向Elastic Beanstalk環境(或等

Analyze and visualize your VPC network traffic using Amazon Kinesis and Amazon Athena

Network log analysis is a common practice in many organizations.  By capturing and analyzing network logs, you can learn how devices on your netwo

How to use common workflows on Amazon SageMaker notebook instances

Amazon SageMaker notebook instances provide a scalable cloud based development environment to do data science and machine learning. This blog post

Accelerate model training using faster Pipe mode on Amazon SageMaker

Amazon SageMaker now comes with a faster Pipe mode implementation, significantly accelerating the speeds at which data can be streamed from Amazon

Connect to Amazon Athena with federated identities using temporary credentials

Many organizations have standardized on centralized user management, most commonly Microsoft Active Directory or LDAP.  Access to AWS resources is

Migrate to Apache HBase on Amazon S3 on Amazon EMR: Guidelines and Best Practices

This blog post provides guidance and best practices about how to migrate from Apache HBase on HDFS to Apache HBase on Amazon S3 on Amazon EMR.

How SimilarWeb analyze hundreds of terabytes of data every month with Amazon Athena and Upsolver

This is a guest post by Yossi Wasserman, a data collection & innovation team leader at Similar Web. SimilarWeb, in their own words: Si

time bushfire alerting with Complex Event Processing in Apache Flink on Amazon EMR and IoT sensor network | AWS Big Data Blog

Bushfires are frequent events in the warmer months of the year when the climate is hot and dry. Countries like Australia and the United States are

Selling plants on Amazon: A forest of untapped opportunity

Lauri Baker, Cheryl Boyer, Hikaru Peterson, and Audrey King researched this facet of the $14 billion US horticulture industry to establish a starting poin

Start Cyber Monday early and save $60 on Beats Solo3 Headphones on Amazon

We continue to find deals to be thankful for this weekend, and right now our eyes are on the Beats Solo3 Wireless headphones which are now $60 off on Amazo

The Guardian’s Migration from MongoDB to PostgreSQL on Amazon RDS

轉載一片mongodb 遷移pg 資料庫的文章 原文: https://www.infoq.com/news/2019/01/guardian-mongodb-postgresql The Guardian migrated their CMS's datastor