AWS Big Data and Analytics Sessions at re:Invent 2018
re:Invent 2018 is around the corner! This year, data and analytics tracks are bigger than ever.
This blog post highlights the data and analytics sessions at re:Invent 2018. If you’re attending this year, you’ll want to check out the sessions, workshops, chalk talks, and builder sessions that we have at the conference. As in previous years, you can find these events in various topic categories, such as Analytics, Deep Learning, AI Summit, Serverless, Architecture, and Databases.
We have great sessions from Intuit, Nike, Intel, WuXi NextCODE, Warner Bros., Autodesk, NFL, SendGrid, McDonald’s, Airbnb, Hilton, Guardian Life, Amazon Go, Pfizer, and many more.
These sessions will be recorded and available on YouTube after the conference. Also, all slide decks from these sessions will be available on SlideShare.net after the conference.
Choose any of the links in this post to learn more about a breakout session.
Note: If you’re interested in machine learning, check out the AI Summit, and machine learning and AI workshops and sessions.
The following breakout Analytics sessions compose this year’s session catalog.
There are two sessions led by Anurag Gupta, VP for AWS Analytics and DB Services and Swami Sivasubramanian, VP of
In this talk, Anurag Gupta, VP for AWS Analytic and Transactional Database Services, talks about some of the key trends we see in data lakes and analytics, and he describes how they shape the services we offer at AWS. Specific trends include the rise of machine-generated data and semi-structured/unstructured data as dominant sources of new data, the move towards serverless, API-centric computing, and the growing need for local access to data from users around the world.
AIM202-L – Leadership Session: Machine Learning
Amazon has a long history in AI, from personalization and recommendation engines to robotics in fulfillment centers. Amazon Go, Amazon Alexa, and Amazon Prime Air are also examples. In this session, learn more about the latest machine learning services from AWS, and hear from customers who are partnering with AWS for innovative AI.
Deep dive customer use cases
Amazon Elasticsearch Service (Amazon ES) provides powerful, natural-language-based search features and a rich API to enable relevant search for applications like ecommerce, data lakes, and your application data. Nike upgraded the search engines for its web properties, including the Nike online store, standardizing on Amazon ES for these mission-critical workloads. With Amazon ES, Nike can focus on its core mission—enabling customers to find and purchase its products—without worrying about the hassle of deploying and scaling hardware, deploying Elasticsearch, configuring and securing its clusters, upgrading with security patches, or any of the low-value, operational tasks necessary to keep Elasticsearch maintained. Come to this session to learn the factors that Nike used in choosing Amazon ES. Get an overview of its architecture, and hear about the results of its migration.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. In this session, we live demo exciting new capabilities that the team has been heads-down building. SendGrid, a leader in trusted email delivery, discusses how they used Athena to reinvent a popular feature of their platform.
Amazon Kinesis makes it easy to speed up the time it takes for you to get valuable, real-time insights from your streaming data. In this session, we walk through the most popular applications that customers implement using Amazon Kinesis, including streaming extract-transform-load, continuous metric generation, and responsive analytics. Our customer Autodesk joins us to describe how they created real-time metrics generation and analytics using Amazon Kinesis and Amazon Elasticsearch Service. They walk us through their architecture and the best practices they learned in building and deploying their real-time analytics solution.
ANT301 – Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics
Companies have valuable data that they might not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. With the right tools, you can extend your analytics to query data in your data lake, with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with an Amazon Redshift lead engineer to ask questions and learn more about how you can extend your analytics beyond your data warehouse.
ANT383 – Migrate from Teradata to Amazon Redshift: Best Practices with McDonald’s
Modernizing your data warehouse can unlock new insights while substantially improving query and data load performance, increasing scalability, and saving costs. In this chalk talk, we discuss how to leverage the AWS Database Migration Service and AWS Schema Conversion Tool to migrate from Teradata to Amazon Redshift. McDonald’s joins us to share their migration journey, after which they were able to run ~7,000 reports across four AWS Regions, enabling new reporting capabilities for marketing, franchises, supply chain, pricing, and many more business units.
ANT312 – Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop/Spark to AWS in order to save costs, increase availability, and improve performance. In this session, AWS customers Airbnb and Guardian Life discuss how they migrated their workload to Amazon EMR. This session focuses on key motivations to move to the cloud. It details key architectural changes and the benefits of migrating Hadoop/Spark workloads to the cloud.
ANT406 – Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfizer
Modernizing your data warehouse can unlock new insights while substantially improving query and data load performance, increasing scalability, and saving costs. In this chalk talk, we discuss how to migrate your Oracle data warehouse to Amazon Redshift and achieve agility and faster time to insights while reducing costs. Pfizer joins us to share their journey in building the Scientific Data Cloud—a Redshift-powered data lake that provides unprecedented analytical capabilities in R&D as well as a focus on near real-time access to R&D continuous manufacturing.
ANT311 – NFL and Forwood Safety Deploy Business Analytics at Scale with Amazon QuickSight
Enabling interactive data and analytics for thousands of users can be expensive and challenging—from having to forecast usage, provisioning and managing servers, to securing data, governing access, and ensuring auditability. In this session, learn how Amazon QuickSight’s serverless architecture and pay-per-session pricing enabled the National Football League (NFL) and Forwood Safety to roll out interactive dashboards to hundreds and thousands of users. Understand how the NFL uses embedded Amazon QuickSight dashboards to provide clubs, broadcasters, and internal users with Next Gen Stats data collected from games. Also, learn about Forwood’s journey to enabling dashboards for thousands of Rio Tinto users worldwide, utilizing Amazon QuickSight readers, federated single sign-on, dynamic defaults, email reports, and more.
ANT389 – Ask an Amazon Redshift Customer Anything
Learn best practices from Hilton Hotels Worldwide as they built an Enterprise Data Lake/Management (EDM) platform on AWS to drive insights and analytics for their business applications, including worldwide hotel booking and reservation management systems. The EDM architecture is built with Hadoop clusters running on Amazon EC2 combined with Amazon Redshift and Amazon Athena for data warehousing and ad hoc SQL analytics. This is a great opportunity to get an unfiltered customer perspective on their road to data nirvana!
ANT208 – Serverless Video Ingestion & Analytics with Amazon Kinesis Video Streams
Amazon Kinesis Video Streams makes it easy to capture live video, play it back, and store it for real-time and batch-oriented ML-driven analytics. In this session, we first dive deep on the top five best practices for getting started and scaling with Amazon Kinesis Video Streams. Next, we demonstrate a streaming video from a standard USB camera connected to a laptop, and we perform a live playback on a standard browser within minutes. We also have on stage members of Amazon Go, who are building the next generation of physical retail store experiences powered by their “just walk out” technology. They walk through the technical details of their integration with Kinesis Video Streams and highlight their successes and difficulties along the way.
As Amazon’s consumer business continues to grow, so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines, such as Amazon EMR and Amazon Redshift.
ANT210-S – WuXi NextCODE Scales up Genomic Sequencing on AWS
Genomic sequencing is growing at a rate of 100 million sequences a year, translating into 40 exabytes by the year 2025. Handling this level of growth and performing big data analytics is a massive challenge in scalability, flexibility, and speed. In this session, learn from pioneering genomic sequencing company WuXi NextCODE, which handles complex and performance-heavy database and genomic sequencing workloads, about moving from on premises to all-in on the public cloud. Discover how WuXi NextCODE was able to achieve the performance that its workloads demand and surpass the limits of what it was able to achieve previously in genomic sequencing. This session is brought to you by AWS partner, NetApp, Inc.
Data has traditionally been analyzed using batch processing in DWH/Hadoop environments. Common use cases include data lakes, data science, and machine learning (ML). Creating serverless data-driven architecture and serverless streaming solutions with services like Amazon Kinesis, AWS Lambda, and Amazon Athena can solve real-time ingestion, storage, and analytics challenges, and help you focus on application logic without managing infrastructure. In this session, we introduce design patterns and best practices, and we share customer journeys from batch to real-time insights in building modern serverless data-driven architecture applications. Hear how Intel built the Intel Pharma Analytics Platform using a serverless architecture. This AI cloud-based offering enables remote monitoring of patients using an array of sensors, wearable devices, and ML algorithms to objectively quantify the impact of interventions and power clinical studies in various therapeutic conditions.
ANT334 – Get the Most out of Your Amazon Elasticsearch Service Domain
Amazon Elasticsearch Service (Amazon ES) makes it easy to deploy and use Elasticsearch in the AWS Cloud to search your data and analyze your logs. In this session, you get key insights into Elasticsearch, including information on how you can optimize your expenditure, minimize your index sizes to lower costs, as well as best practices for keeping your data secure. Also hear from youth sports technology company SportsEngine, about their experience engineering a member-management product of over 260 million documents on top of Elasticsearch. Relive their harrowing journey through tens of thousands of shards, crushed clusters, mountains of pending tasks, and never-ending snapshots. Hear how they went from disaster to delight with Amazon ES.
With the simplicity of Amazon Elasticsearch Service (Amazon ES), there are many opportunities to use it as a backend for real-time application and infrastructure monitoring. With this wealth of opportunities comes sprawl; developers in your organization are deploying Amazon ES for many different workloads and many different purposes. Should you centralize into one Amazon ES domain? What are the tradeoffs in scale and cost? How do you control access to the data and dashboards? How do you structure your indexes, by single tenant or multi-tenant? In this session, we explore whether, when, and how to centralize logging across your organization to minimize costs and maximize value. We also discuss how Autodesk built a unified log analytics solution using Amazon ES. Please join us for a speaker meet-and-greet following this session at the Speaker Lounge (ARIA East, Level 1, Willow Lounge). The meet-and-greet starts 15 minutes after the session and runs for half an hour.
Builder Sessions and Chalk Talks
This section describes AWS Analytics Services Sessions for data lake architecture and best practices.
ANT364 – Best Practices in Streaming Data with Amazon Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this builders session, we walk through a common use case for Amazon Kinesis Data Streams and the top five best practices we see customers implement when processing data in real time.
ANT348 – [BS] Amazon EMR: Optimize Transient Clusters for Data Processing and ETL
Amazon EMR was built for agility, enabling you to spin up and down resources for big data processing and analytics on demand, and realize the flexible potential of cloud. In this builders session, we detail how to efficiently start, stop, and resize your clusters for Apache Spark and Hadoop, reducing your costs, and accelerating your “time-to-completion” for jobs. Join us to hear expert advice on how to optimize your “one and done” workloads.
ANT333 – [BS] Building Advanced Workflows with AWS Glue
AWS Glue makes it easy to incorporate data from a variety of sources into your data lake on Amazon S3. In this builders session, we demonstrate building complex workflows using AWS Glue orchestration capabilities. Learn about different types of AWS Glue triggers to create workflows for scheduled and event-driven processing. We start with a customer scenario and build it step by step using AWS Glue capabilities.
ANT346 – [BS] Lock It Down: Configure End-to-End Security & Access Control on Amazon EMR
Amazon EMR helps you process all your data for analytics, but with great scale comes great responsibility—you need to make sure that data is secured by design. In this builders session, we walk through how to configure your environment to take full advantage of comprehensive security controls: including identifying sensitive data, encrypting your data and managing keys, authenticating and authorizing users, using fine-grained access controls, and using audit logs to demonstrate compliance.
ANT331 – [BS] Metrics-Driven Performance Tuning for AWS Glue ETL Jobs
AWS Glue provides a horizontally scalable platform for running ETL jobs against a wide variety of data sources. In this builders session, we cover techniques for understanding and optimizing the performance of your jobs using Glue job metrics. Learn how to identify bottlenecks on the driver and executors, identify and fix data skew, tune the number of DPUs, and address common memory errors.
ANT381 – Build Advanced Workflows with AWS Glue
AWS Glue makes it easy to incorporate data from a variety of sources into your data lake on Amazon S3. In this builders session, we demonstrate building complex workflows using AWS Glue orchestration capabilities. Learn about different types of AWS Glue triggers to create workflows for scheduled as well as event-driven processing. We start with a customer scenario and build it step by step using AWS Glue capabilities.
ANT344 – [BS] One Data Lake, Many Uses: Enable Multi-Tenant Analytics with Amazon EMR
One of the benefits of having a data lake is that the same data can be consumed by multi-tenant groups—an efficient way to share a persistent Amazon EMR cluster. The same business data can be safely used for many different analytics and data processing needs. In this builders session, we discuss steps to make an Amazon EMR cluster multi-tenant for analytics, best practices for a multi-tenant cluster, and solving common challenges. We also address security and governance aspects of a multi-tenant Amazon EMR cluster.
ANT363 – Build a Streaming Application Using Amazon Kinesis
Amazon Kinesis Data Analytics enables you to quickly build and easily manage applications that process streaming data in real time. In this builders session, we walk through the steps required to build a streaming application, including the most common issues and best practices.
ANT368 – Delivering Fresh Data to Your Data Lake Using Amazon Kinesis
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data stores and analytics tools. In this builders session, we discuss how to use Kinesis Data Firehose to ingest, transform, and deliver data to Amazon S3 in a format that you can easily process.
ANT382 – Building Rich and Interactive Business Dashboards in Amazon QuickSight
Are you ready to move past static email reports, Excel spreadsheets, and one-time queries? In this builders session, learn how to build a rich and interactive business dashboard in Amazon QuickSight that allows your business stakeholders to filter, slice and dice, and deep-dive on their own. We demonstrate advanced Amazon QuickSight capabilities such as creating on-sheet filter controls, parameters, custom URLs, and table calculations to create rich and attractive executive dashboards.
ANT343 – Get the Most out of AWS Glue Data Catalog and Crawlers for Data Lake Analytics
In this builders session, we discuss common use cases across various AWS data analytics platforms that are integrated with AWS Glue Data Catalog, and we share best practices for using AWS Glue Data Catalog and crawlers on services such as Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR. Participants set up and launch crawlers on sample datasets and execute queries on various analytics services.
ANT390 – Getting Started with Streaming Video Using Amazon Kinesis Video Streams
In this builders session, we discuss how to capture, process, and analyze video streams using Amazon Kinesis Video Streams. We walk through a high-level, end-to-end architecture, and we discuss the first steps to start streaming video in real time.
ANT366 – Real-Time Machine Learning Using Amazon Kinesis and Amazon SageMaker
Amazon SageMaker is a fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning (ML) models at any scale. Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this builders session, we walk through how the two services can be used in conjunction to perform real-time ML at any scale.
ANT378 – Serverless Analytics with Amazon QuickSight
Querying and analyzing big data can be complicated and expensive. It requires you to set up and manage databases, data warehouses, and business intelligence (BI) applications—all of which require time, effort, and resources. Using Amazon Athena and Amazon QuickSight, you can avoid the cost and complexity by creating a fast, scalable, and serverless cloud analytics solution without the need to invest in databases, data warehouses, complex ETL solutions, and BI applications. In this builders session, we demonstrate how you can build a serverless big data analytics solution using Amazon Athena and Amazon QuickSight.
Streaming data ingestion and near real-time analysis give you immediate insights into your data. By using AWS Lambda with Amazon Kinesis, you can obtain these insights without the need to manage servers. In this builders session, we discuss how you can use Lambda and Kinesis together to build an end-to-end serverless solution.
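As a rough illustration of the Lambda-plus-Kinesis pattern described above, here is a minimal sketch of a Python Lambda handler that consumes a batch of Kinesis records. It relies on the documented event shape, in which each record carries its payload base64-encoded under `kinesis.data`. The JSON payload format, the `amount` field, and the `make_record` helper are assumptions made up for this example, not something taken from the session.

```python
import base64
import json


def handler(event, context):
    """Aggregate one batch of Kinesis records delivered to AWS Lambda.

    Assumes each record's payload is a JSON object with a hypothetical
    numeric 'amount' field; records without it count as zero.
    """
    total = 0.0
    count = 0
    for record in event.get("Records", []):
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        total += float(item.get("amount", 0))
        count += 1
    return {"records": count, "total_amount": total}


def make_record(obj):
    """Hypothetical helper: wrap a dict the way Kinesis would for local testing."""
    data = base64.b64encode(json.dumps(obj).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


if __name__ == "__main__":
    sample_event = {"Records": [make_record({"amount": 1.5}),
                                make_record({"amount": 2.5})]}
    print(handler(sample_event, None))
```

Running the file locally exercises the handler with a hand-built event, which can be a convenient way to check the decoding logic before wiring up a real stream.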
ANT347 – Use Auto Scaling, Spot Pricing, and More Expert Strategies
Amazon EMR is a powerful service, enabling you to process and analyze big data at any scale. In this builders session, we share proven strategies to maximize your utilization while minimizing your costs for long-running clusters. We cover how to get the most leverage from features like Auto Scaling and Spot pricing. We also discuss how changing your design architecture by decoupling compute and storage impacts TCO. Finally, we show you how appropriately sizing instances, clusters, and jobs helps you save.
As customers look to build data lakes on AWS, managing security, the catalog, and data quality becomes a challenge. Once data is in Amazon S3, multiple processing engines can access it, whether through a SQL interface, programmatically, or through an API. Customers require federated access to their data with strong controls around authentication, authorization, encryption, and audit. In this session, we explore the major AWS analytics services and platforms that customers can use to access data in the data lake, and we provide best practices for securing them.
Amazon Elasticsearch Service (Amazon ES) makes it easy to deploy and operate your Elasticsearch cluster in the AWS Cloud. Amazon ES provides you a great deal of flexibility in selecting the instance type and count for your data and master instances. Your choices of instance type and count are critical to your success, and the right answer is not obvious. In this chalk talk, learn how to size your Amazon ES cluster for your workload. We discuss the underlying processing model, and how to handle concurrency and tenancy for read- and write-heavy workloads.
Amazon Elasticsearch Service (Amazon ES) provides security at several different layers. Let’s talk about how you can integrate with Amazon Cognito to provide single sign-on for Kibana, how to deploy into your VPC, and how to get data into Amazon ES domains in VPC. We also talk about how to write your IAM policies to scope your users’ interaction with the service.
The Elasticsearch query DSL is feature-rich, providing you with the ability to write queries that quickly isolate and expose the information that’s most important to you. Whether you’re searching log data or application data, you need to understand the capabilities and how to use them. Let’s dig in together and build the queries and ranking that you want.
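To give a flavor of the kind of query the DSL allows, the sketch below builds a `bool` query as a plain Python dictionary and prints it as JSON. The `bool`, `must`, `filter`, `should`, `match`, `match_phrase`, `term`, and `range` clauses are standard query DSL constructs; the field names (`message`, `status`, `@timestamp`) and the function name are hypothetical and chosen only for illustration.

```python
import json


def build_log_query(text, status_code, since="now-1h"):
    """Return an Elasticsearch bool query as a plain dict.

    Field names (message, status, @timestamp) are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # must: scored full-text match on the message field
                "must": [{"match": {"message": text}}],
                # filter: exact and range conditions that do not affect scoring
                "filter": [
                    {"term": {"status": status_code}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
                # should: optional clause that boosts exact-phrase hits
                "should": [
                    {"match_phrase": {"message": {"query": text, "boost": 2.0}}}
                ],
            }
        },
        "size": 10,
    }


if __name__ == "__main__":
    print(json.dumps(build_log_query("connection timeout", 503), indent=2))
```

The resulting JSON body could be sent to a domain's `_search` endpoint. Putting the exact-match conditions in `filter` rather than `must` keeps them out of relevance scoring, so ranking is driven by the text clauses alone.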
Workshops
ANT307 – Enabling Your Organization’s Redshift Adoption – Going from Zero to Hero
Ever wonder why some companies are able to achieve business goals around Amazon Redshift adoption at breakneck speed? Does figuring out the right architecture for an Amazon Redshift deployment for your organization keep you up at night? Proven patterns and “quickstart” environments are the keys to success. As a stakeholder in your company’s success, you want to bring a clear and concise business solution to the table that fits the business need. In this session, we focus on using infrastructure as code to present a variety of common Amazon Redshift deployment patterns used across other AWS customers so that you can hit the ground running. Additionally, presentations coupled with hands-on labs reinforce the patterns presented in this session.
ANT303 – Have Your Front End and Monitor It, Too
Amazon Elasticsearch Service (Amazon ES) is both a search solution and a log monitoring solution. In this session, we address both. We build a front-end, PHP web server that provides a search experience on movie data as well as backend monitoring to send Apache web logs, syslogs, and application logs to Amazon ES. We tune the relevance for the search experience and build Kibana visualizations for the log data. In addition, we use security best practices and deploy everything into a VPC.
ANT371 – Migrate Your On-Premises Data Warehouse to Amazon Redshift with AWS DMS and AWS SCT
Customers with on-premises data warehouses find them complex and expensive to manage, especially with respect to data load and performance. Amazon Redshift is a fast, simple, cost-effective data warehouse service that can extend queries to your data lake using your existing business intelligence tools. Migrating your on-premises data warehouse to Amazon Redshift can substantially improve query and data load performance, increase scalability, and save costs. This workshop leverages AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) to migrate an existing Oracle data warehouse to Amazon Redshift. Prerequisites: an AWS account with IAM admin permissions and sufficient limits for the AWS resources above; a comfortable working knowledge of the AWS Management Console, relational databases, and Amazon Redshift.
ANT325 – One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR
One of the benefits of having a data lake is that the same data can be consumed by multi-tenant groups, an efficient way to share a persistent Amazon EMR cluster. The same business data can be safely used for many different analytics and data processing needs. In this session, we discuss steps to make an Amazon EMR cluster multi-tenant for analytics, best practices for a multi-tenant cluster, and solutions to common challenges. We also address the security and governance aspects of a multi-tenant Amazon EMR cluster.
ANT302 – Search Your DynamoDB Data with Amazon Elasticsearch Service
Both Amazon DynamoDB and Amazon ES are database technologies. Their strengths are different and complementary. DynamoDB is an excellent, durable store, providing high throughput at reliable latencies with nearly infinite scale. Elasticsearch provides a rich query API, supporting high-throughput, low-latency search across numeric and string data, with a built-in capability of bringing relevant results for your queries. In this lab, we explore the joint power of these technologies. You deploy a DynamoDB table, bootstrap it with data, and then, using DynamoDB Streams, replicate that bootstrapped data to Amazon ES. You use Elasticsearch’s query language to query your data directly. Finally, you send updates to your DynamoDB table and use Elasticsearch analytics capabilities to monitor changes occurring in your table.
In this workshop, learn how to automatically catalog datasets in your Amazon S3 data lake using AWS Glue crawlers. Also, learn how to interactively author ETL scripts in an Amazon SageMaker notebook connected to an AWS Glue development endpoint. Finally, learn how to deploy your ETL scripts into production by turning your ETL script into managed AWS Glue jobs and add appropriate AWS Glue scheduling and triggering conditions. The resulting datasets will automatically get registered in the AWS Glue Data Catalog, and you can then query these new datasets from Amazon Athena. Knowledge of Python and familiarity with big data applications is preferred but not required. Attendees must bring their own laptops.
Realizing the value of social media analytics can bolster your business goals. This type of analysis has grown in recent years due to the large amount of available information and the speed at which it can be collected and analyzed. In this workshop, we build a serverless data processing and machine learning (ML) pipeline that provides a multi-lingual social media dashboard of tweets within Amazon QuickSight. We leverage API-driven ML services, AWS Glue, Amazon Athena and Amazon QuickSight. These building blocks are put together with very little code by leveraging serverless offerings within AWS.
Video is ‘big data.’ Image sensors—in our smartphones, smart home devices, traffic cameras—are getting Internet-connected. Massive streams of video data are generated, but currently not mined for real-time insights to drive businesses forward. In this workshop, learn to capture, process, and analyze video streams. Build and configure your camera device’s media pipeline to start streaming video into the AWS Cloud using Amazon Kinesis Video Streams. Next, build and deploy your own machine learning (ML) model in Amazon SageMaker to generate inferences about objects or activities in your video stream. Finally, build a browser-based web player to view the video in Live and On-Demand modes, including the analyzed video stream. In this workshop, you use Amazon Kinesis Video Streams, Amazon SageMaker, Amazon Rekognition Video, and Amazon ECS.
Amazon Redshift offers a common query interface against data stored in fast, local storage as well as data from high-capacity, inexpensive storage (S3). This workshop will cover the basics of this tiered storage model and outline the design patterns you can leverage to get the most from large volumes of data. You will build out your own Redshift cluster with multiple data sets to illustrate the trade-offs between the storage systems. By the time you leave, you’ll know how to distribute your data and design your DDL to deliver the best data warehouse for your business.
A modern application service consists of many microservices working together. But how do you get visibility on how they are interconnected into a larger application service and how well they are working together, or whether they’re working together at all? How can you get better visibility into your microservices environment and outwit entropy? The answer is logs, plus a strong and automated log analyzer. In this lab, you deploy a containerized application on Amazon Elasticsearch Service (Amazon ES). You use a combination of Fluentd and Beats to send your instance, container, and application logs to Amazon ES. You then explore these logs with Kibana, building a dashboard to gain visibility into your application service and monitor key parameters of your application.
ANT362 – Use Streaming Data to Gain Real-Time Insights into Your Business
In recent years, there has been an explosive growth in the number of connected devices and real-time data sources. Because of this, data is being continuously produced, and its production rate is accelerating. Businesses can no longer wait for hours or days to use this data. To gain the most valuable insights, they must use this data immediately so they can react quickly to new information. In this workshop, you will learn how to take advantage of streaming data sources to analyze and react in near real time. We provide several requirements for a real-world streaming data scenario, and you’re tasked with creating a solution that successfully satisfies the requirements using services such as Amazon Kinesis, AWS Lambda and Amazon SNS.
As data exponentially grows in organizations, there is an increasing need to use machine learning (ML) to gather insights from this data at scale and to use those insights to perform real-time predictions on incoming data. In this workshop, we walk you through how to train an Apache Spark model using Amazon SageMaker that is pointed to Apache Livy and running on an Amazon EMR Spark cluster. We also show you how to host the Spark model on Amazon SageMaker to serve a RESTful inference API. Finally, we show you how to use the RESTful API to serve real-time predictions on streaming data from Amazon Kinesis Data Streams.
We are looking forward to meeting you at re:Invent 2018!
About the Author
Roy Ben-Alta is the head of the global Big Data & Analytics practice in AWS Professional Services. He focuses on data analytics and ML technologies, working with AWS customers to build innovative data-driven products.