1. 程式人生 > >Samza 1.0: Stream Processing at Massive Scale

Samza 1.0: Stream Processing at Massive Scale

Samza SQL also supports implementing custom user logic by specifying user-defined functions (UDFs) in Java. Our support for SQL leverages Apache Calcite for its implementation and builds on the foundations offered by Samza’s core engine.

Joining streams with tables made easy

Event-driven applications typically need access to additional data (in databases or in other REST-based services) to process their events. For example, consider a streaming pipeline that ranks notifications to be sent to LinkedIn members. Sending a notification requires access to the member’s profile—which devices they have the LinkedIn app installed on, their notification settings, etc. The “Samza Table API” simplifies scenarios like these where data in a stream needs to be joined with additional data from other sources. It provides common features like throttling and caching when accessing datasets.

The Table API also allows for composition—i.e., you can build composite tables by combining individual ones. For example, if you already have a table backed by a remote web service, you can add a Couchbase as a cache in front of it. At LinkedIn, we have added integrations with Couchbase, Espresso, and

Rest.li. We are excited about the endless possibilities this unlocks in simplifying access to your data.

Features provided by the Table API include:
Throttling:
Streaming systems can usually ingest and process messages at a high rate. Making remote calls to external services at the same rate when accessing datasets could bring them down. For this reason, Samza Tables enforce quotas on the client-side, allowing you to specify read and write limits for your services.

Caching: Samza Tables can also provide caching to further lower access latencies. In-memory and disk-backed options are currently offered for caching data locally. Alternatively, you can use a remote cache if your storage requirements are more than a single disk.

Async IO: When accessing remote tables or datasets exposed through web services, we can often issue non-blocking requests and improve overall throughput. Samza Tables natively support async-interactions when accessing remote sources.

Samza standalone: Bring your own Cluster Manager

Prior to Samza 1.0, Samza required YARN for resource management and distributed execution of applications. This worked well when running stream processing as a managed service on a YARN cluster. But as Samza gained momentum, our users desired the flexibility to run stream processing in any environment —Kubernetes, Mesos, or on the cloud. Samza 1.0 addresses this by offering a standalone mode of running applications.

This mode allows Samza to be embedded as a lightweight library within an application and run on any resource manager of your choice. You can increase parallelism by simply spinning up more instances of your application. The individual instances will then coordinate among themselves using Zookeeper to distribute their tasks. When an instance fails, its tasks are assigned to the remaining instances that are live.

The standalone mode does not yet support stateful stream processing like windowing and joins. We are actively working to address this by taking data locality into account when assigning tasks to hosts.

Improving testability of Samza applications

Testability is one of the key challenges when building any data processing framework. Samza users have typically tested their applications by spinning up a local Kafka cluster, producing a few messages to it, and verifying their output results by consuming from Kafka. This usually involved starting multiple components to set up the test environment. It also meant that the tests themselves ran for a longer duration.

Starting with the 1.0 release, we are excited to announce a new framework for unit-testing Samza applications. This is a significant step towards improving developer productivity. The framework allows you to provide inputs to your application using in-memory collections and run your logic through them. You can also run assertions on the contents of these collections and inspect results.

Samza is 1.0, but we are far from being done

With an ever-expanding list of use cases, we are at an exciting juncture in stream processing. While 1.0 is a significant milestone for the project, there is still a lot more to be done on improving our ease of use. It is a great time to be involved in the community. You can read up on Samza, check out our hello-samza tutorials, or even contribute some bug-fixes.

Here are some areas we are actively investing in:

  • Adding support for other languages, like Python

  • Hot-standby containers to support applications with strict downtime requirements

  • Making it easy to auto-scale and auto-tune Samza applications

  • Supporting machine learning related use cases

Want to work on similar problems in large-scale distributed systems? The Streams Infrastructure team at LinkedIn is hiring engineers at all levels!

Acknowledgements