Strategies for Reducing Your Amazon EMR Costs

This is a guest post by Prateek Gupta, a lead engineer at BloomReach.

BloomReach has built a personalized discovery platform with applications for organic search, site search, content marketing, and merchandising. BloomReach ingests data from a variety of sources, such as merchant inventory feeds, sitefetch data from merchants’ websites, and pixel data. The data is collected, parsed, stored, and used to match user intent to content on merchants’ websites and to provide merchants with insights into consumer behavior and the performance of products on their sites.

A sample data ingestion flow for merchant data is shown in the figure below. BloomReach ingests merchant data including crawled merchant pages, merchant feeds, and pixel data. ETL (extract-transform-load) flows clean, filter, and normalize the data and load it into the product database. Individual applications may use this data to produce derived relations. The product database also supports many applications, including the “What’s Hot” application, which displays relevant trending products to users on the merchant’s website.

Below is a sample workflow for personalization:

At BloomReach, we launch 1,500 to 2,000 Amazon EMR clusters and run 6,000 Hadoop jobs every day. As a growing company, we’ve seen our use of Amazon EMR rise dramatically in a short time:

It is critical that we keep our Amazon EMR costs down as we scale up. To that end, we’ve adopted the following strategies:

  1. Use AWS Spot Instances rather than On-Demand Instances whenever possible. Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances are unused Amazon EC2 capacity that you bid on; the price you pay is determined by the supply and demand for Spot Instances. The cost of using Spot Instances can be 80% less than using On-Demand Instances. Spot Instances need to be managed carefully, because they can be terminated if the Spot market price exceeds your bid price. At BloomReach, we have written an orchestration system that schedules jobs on Amazon EMR. The system implements a Hartmann pipeline that can run a variety of jobs both locally and on Amazon EMR. It can also detect failures such as Spot Instance termination and reschedule jobs on different clusters as needed.
  1. Create a system that shares clusters among several small jobs rather than launching a separate cluster for every job. Remember, whether your job takes 10 minutes or 60 minutes, you’re paying for an hour of access. If you have four 10-minute jobs, you could share one cluster to do them all and be charged for one hour. Or you could employ one cluster for each and be charged for four hours. Sharing clusters among jobs also allows you to save the time and cost of bootstrapping a new cluster. The time savings alone can be a significant factor for real-time jobs.
  1. Use Amazon EMR tags for cost tracking. Using EMR tags lets you track the cost of your cloud usage by project or by department, which gives you deeper insight into return on investment and provides transparency for budgeting purposes.
  1. Create a lifecycle management system that allows you to track clusters and eliminate idle clusters.
  1. Use the right instance types for your jobs. For example, use the c3 instance type for compute-heavy jobs. Depending on the scale of your jobs, this can significantly reduce waste and cost. Below is an algorithm we have found useful for selecting the instance type with the best value for compute capacity based on its Spot price:
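maxCpuPerUnitPrice = 0
optimalInstanceType = null
For each instance_type in (Availability Zone, Region) {
  // CPU cores obtained per unit of Spot price
  cpuPerUnitPrice = instance_type.cpuCores / instance_type.spotPrice
  if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
     // remember the best ratio seen so far, and the type that achieved it
     maxCpuPerUnitPrice = cpuPerUnitPrice;
     optimalInstanceType = instance_type;
  }
}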
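
The selection loop above can be sketched in Python as follows. This is a minimal illustration, not BloomReach's actual implementation: the `InstanceOffer` record, the `best_value_instance` function, and the prices in the example are all hypothetical, and the caller is assumed to have already fetched current Spot prices for the Availability Zone and Region of interest. The sketch updates the running maximum each time a better cores-per-dollar ratio is found, and skips offers with a non-positive price to avoid dividing by zero.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceOffer:
    """Hypothetical record: one instance type in one Availability Zone."""
    instance_type: str
    cpu_cores: int
    spot_price: float  # current Spot price in USD per hour, supplied by the caller


def best_value_instance(offers: Iterable[InstanceOffer]) -> Optional[InstanceOffer]:
    """Return the offer with the most CPU cores per unit of Spot price."""
    best_offer = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            # A missing or zero price cannot be compared; skip it.
            continue
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            # Track both the best ratio and the offer that achieved it.
            best_ratio = ratio
            best_offer = offer
    return best_offer


if __name__ == "__main__":
    # Illustrative prices only; real Spot prices vary by zone and over time.
    offers = [
        InstanceOffer("c3.xlarge", 4, 0.05),
        InstanceOffer("m3.xlarge", 4, 0.08),
        InstanceOffer("c3.2xlarge", 8, 0.08),
    ]
    choice = best_value_instance(offers)
    print(choice.instance_type if choice else "no usable offer")
```

With the illustrative prices above, the sketch picks `c3.2xlarge`, because 8 cores at $0.08 per hour gives more cores per dollar than either 4-core option. Cores per dollar is only one possible measure of value; jobs bound by memory or I/O would need a different numerator.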

Incorporating these Amazon EMR strategies can help you increase efficiency, contain costs, and make a good thing even better.

If you are interested in working on these challenges at BloomReach, contact us at www.bloomreach.com/careers.

If you have questions or suggestions, please add a comment below.

Prateek Gupta is not an Amazon employee and does not represent Amazon.
