Traditional data management software and algorithms are built on the concept of persistent data sets that are reliably stored in stable storage and can be queried and updated several times throughout their lifetimes. In reality, however, data is rarely complete and stationary. In its latest report, the International Data Corporation predicts that within five years, as much as 30% of all data will be real-time: data continuously produced by mobile devices, sensors, web activity, and monitoring applications. This prediction poses a grand challenge to data management systems and makes stream processing more relevant than ever.
A data stream is a data set that is produced incrementally over time, rather than being available in full before its processing begins. Data streams have unknown, possibly unbounded length; thus, storing the entire stream in an accessible way is impossible. Further, as streams are produced by external sources, the consumer has no control over the arrival order or the input rate. The field of stream processing develops methods and tools to manage and analyze such continuous and unbounded data sets. Stream processing lifts the requirement to first store and then process data, and enables immediate actions, alerts, and continuous analytics.
This post briefly examines the history of stream processing systems, provides an overview of recent advances, and highlights emerging applications and future directions.
Stream processing at the heart of the modern data analytics stack
Today, stream processing is widely deployed and used by all kinds of businesses and organizations. A big part of this success is due to major efforts by open-source communities. There exist tens of stream processing frameworks under the Apache Software Foundation umbrella alone. These projects have global communities with hundreds of contributors and thousands of users, building and maintaining robust, battle-tested streaming systems that power remarkable applications: real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and fraud detection in financial transactions.
All major cloud providers offer streaming dataflow pipelines and online analytics as managed services: AWS Kinesis and Alibaba Realtime Compute are based on Apache Flink; Microsoft offers Azure Stream Analytics, which relies on Trill; and Google Cloud provides streaming solutions with Dataflow and Apache Beam. Besides cloud providers, in the past few years we have also witnessed an increasing number of companies building their own in-house streaming platforms so that their entire organization can leverage the power of online analytics. Some examples include Uber's AthenaX, the Netflix Keystone platform, and Turbine by Facebook.
Evolution of Streaming Systems
It comes as a surprise to many, but stream processing is not such a recent field: it is almost 30 years old. "Continuous Queries over Append-Only Databases," a paper proposing queries that are issued once and henceforth run continually over the database, appeared in the proceedings of SIGMOD 1992.
Data Stream Management Systems. The first specialized data stream managers were developed in the early 2000s. These systems coined the term "data stream management system" (DSMS) and their design followed closely that of a DBMS, with some distinct features. A DSMS accepts both ad-hoc and continuous queries and executes them in a query execution engine, similar to that of a DBMS. An input manager component is responsible for ingesting streams, while a quality monitor observes stream input rates and query performance. The load shedder selectively drops records in order to meet target latency requirements. The purpose of a DSMS was to provide query results as fast as possible, sacrificing some accuracy if necessary. During this early period of stream processing, researchers devoted their efforts to defining fundamental notions and systems aspects, such as streaming representations, operator and time semantics, late data handling, synopsis maintenance, load management, and high availability.
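The two DSMS ideas above, a continuous query that re-emits results as records arrive, and a load shedder that drops records to bound latency, can be illustrated with a minimal, hypothetical sketch (the names `load_shedder` and `continuous_avg` are illustrative, not from any real DSMS):

```python
import random

def load_shedder(stream, shed_probability):
    """Sampling-based load shedding: randomly drop a fraction of records
    so the query can keep up with the input rate (trading accuracy for latency)."""
    for record in stream:
        if random.random() >= shed_probability:
            yield record

def continuous_avg(stream):
    """A continuous query: emit an updated running average after every record,
    instead of a single answer over a stored data set."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

readings = iter([10, 20, 30, 40, 50])
# With shed_probability=0.0 nothing is dropped and results are exact.
results = list(continuous_avg(load_shedder(readings, shed_probability=0.0)))
print(results)  # [10.0, 15.0, 20.0, 25.0, 30.0]
```

Raising `shed_probability` under overload yields approximate averages computed over a sample of the stream, which is exactly the accuracy-for-latency trade-off a DSMS makes.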
The MapReduce era. After the foundations had been laid out, MapReduce/Hadoop essentially marked the beginning of the golden open-source era for large-scale data management. MapReduce shifted the focus towards data parallelism, fault tolerance, transparent distribution, and the use of general-purpose programming languages, such as C++ and Java, for expressing analytics on large unstructured data. Apache S4 and Apache Storm were among the first MapReduce-like systems for stream processing. Around that time, stream processing technology witnessed its first production deployments alongside batch processors. This setup, known as the "lambda architecture", routes incoming data to two processing layers. The speed layer processes records as they arrive with a best-effort, low-latency stream processing system. The batch layer stores data persistently and periodically submits batch jobs for offline analytics that complement the often lossy online results. Stream processing started being perceived as a synonym for approximate results.
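The two layers of the lambda architecture can be sketched in a few lines. This is a toy illustration of the idea, not any real system's API: the speed layer updates its view incrementally per event, while the batch layer periodically recomputes an exact view from the persisted data.

```python
# A stream of (key, increment) events, e.g. click counts per user.
events = [("user1", 1), ("user2", 1), ("user1", 1)]

# Speed layer: incremental, per-event updates for low latency
# (in production this view may be lossy or approximate).
speed_view = {}
for key, n in events:
    speed_view[key] = speed_view.get(key, 0) + n

# Batch layer: a periodic full recomputation over the persisted data set,
# producing an exact view that corrects the speed layer's results.
def batch_recompute(persisted_events):
    view = {}
    for key, n in persisted_events:
        view[key] = view.get(key, 0) + n
    return view

batch_view = batch_recompute(events)
# A serving layer would merge both views, preferring fresh batch results.
print(speed_view)  # {'user1': 2, 'user2': 1}
```

The operational cost of this design, maintaining the same logic in two systems, is part of what motivated the next generation of streaming systems.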
Distributed Dataflow Systems. The MapReduce philosophy significantly affected the design and implementation of the next generation of stream processing systems. These systems, which I call "distributed dataflow systems", originated either in large companies (Samza at LinkedIn, MillWheel at Google, Naiad at Microsoft Research) or at universities (Spark Streaming at UC Berkeley and Apache Flink at TU Berlin). The most fundamentally novel characteristic of this generation of systems is their philosophy about the capabilities of a stream processor. Distributed dataflow systems rejected the notion that stream processing is a synonym for approximate and lossy results. Instead, they demonstrated that stream processors can produce exact and correct results, even under failures.
Distributed dataflow systems are commonly deployed on shared-nothing clusters of machines. Workers are connected over the network and execute operators in a data-parallel fashion. Computations are represented as dataflow graphs whose nodes define transformations on streams and edges denote data dependencies. Programs are written in a high-level declarative API in a general-purpose language like Java, Scala, or Python. Modern streaming systems are no longer extensions of relational execution engines. They are efficient, scalable, and robust distributed frameworks offering state management, exactly-once processing guarantees, reconfiguration capabilities, and out-of-order computation.
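The dataflow programming style described above can be illustrated with a toy sketch. The `Stream` class and its methods here are hypothetical, loosely modeled on the high-level APIs of systems like Flink, Spark, and Beam; each method call builds a transformation node, and records flow along the edges between them:

```python
# A toy, in-memory sketch of the dataflow style (hypothetical API).
class Stream:
    def __init__(self, records):
        self.records = list(records)

    def flat_map(self, fn):
        # One input record may produce many output records.
        return Stream(out for r in self.records for out in fn(r))

    def map(self, fn):
        # Transform each record independently (data-parallel by nature).
        return Stream(fn(r) for r in self.records)

    def reduce_by_key(self, fn):
        # Fold values per key, as a keyed (stateful) operator would.
        result = {}
        for key, value in self.records:
            result[key] = fn(result[key], value) if key in result else value
        return result

lines = Stream(["to be or not", "to be"])
counts = (lines
          .flat_map(str.split)                 # lines -> words
          .map(lambda w: (w, 1))               # words -> (word, 1) pairs
          .reduce_by_key(lambda a, b: a + b))  # sum counts per word
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

A real distributed dataflow system runs each transformation as a parallel operator across workers and keeps the keyed reduction state in its managed, fault-tolerant store; this sketch only shows the programming model.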
What’s next for streaming systems?
As more and more people realize how many tasks can be modeled as streaming computations, there is a growing demand for stream processing systems to serve use cases beyond data analytics.
Streaming systems are being tested as backends for general event-driven architectures, cloud services, and microservice orchestration. Flink Stateful Functions demonstrates how a modern stream processor can be leveraged for Function-as-a-Service applications. It essentially uses the Flink runtime as an orchestrator and its state management capabilities as a distributed store that manages stateful functions. Function invocations are modeled as events, and dataflow operators encapsulate and run functions, which can exchange arbitrary messages with each other via dataflow channels.
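The core idea, invocations arriving as events and the runtime handing each function instance its own slice of managed state, can be sketched as follows. This is a simplified illustration of the concept, not the Stateful Functions API; `invoke` and `greeter` are hypothetical names:

```python
# State store managed by the "runtime", keyed by function instance.
runtime_state = {}

def invoke(function, key, event):
    # Route the event to the function instance identified by `key`,
    # handing it its own isolated slice of managed state.
    state = runtime_state.setdefault(key, {})
    return function(state, event)

def greeter(state, name):
    # A stateful function: remembers how many times this instance was invoked.
    state["seen"] = state.get("seen", 0) + 1
    return f"Hello {name}, visit #{state['seen']}"

print(invoke(greeter, "user-1", "Alice"))  # Hello Alice, visit #1
print(invoke(greeter, "user-1", "Alice"))  # Hello Alice, visit #2
print(invoke(greeter, "user-2", "Bob"))    # Hello Bob, visit #1
```

In the real system, the state store is the stream processor's fault-tolerant managed state, so function state survives failures without the application having to run a separate database.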
Stream processing can also address the velocity dimension of scientific and training data. Existing streaming systems lack support for complex data streams (hierarchical, nested, connected data), and when a stream processor is needed for more complex analytics, such as ML or graph analysis, users must build ad-hoc, specialized solutions or resort to snapshot-based analysis. While streaming systems are already being used for model serving, training and inference are still performed by external frameworks: dataflow operators either issue RPCs to an ML library or maintain pre-trained models in their managed state. For stream processors to fully serve online training and continuous graph analytics, they need to provide additional features such as iteration support, dynamic task execution, shared state, and support for hardware acceleration.
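The second serving pattern mentioned above, an operator keeping a pre-trained model in its managed state and scoring each incoming record, can be sketched minimally. The `ScoringOperator` class is hypothetical, and the "model" is just a linear scorer for illustration:

```python
# Hypothetical sketch: a dataflow operator that holds a pre-trained model
# in its (managed) state and scores records as they stream through.
class ScoringOperator:
    def __init__(self, weights):
        # In a real system, these weights would live in the processor's
        # fault-tolerant managed state and could be updated via a second
        # input stream carrying new model versions.
        self.weights = weights

    def process(self, features):
        # Score one record: dot product of model weights and features.
        return sum(w * x for w, x in zip(self.weights, features))

op = ScoringOperator(weights=[0.5, -0.2])
score = op.process([2.0, 1.0])
print(score)  # 0.8
```

Note that this covers inference only; training the model inside the dataflow would require the iteration and shared-state support discussed above.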
Over the past decades, streaming technology has evolved significantly; however, emerging applications are once more challenging the design decisions of modern streaming systems. Many exciting problems and ideas are ahead of us, and I look forward to following the progress in this evolving field.
Bio: Vasiliki (Vasia) Kalavri is an Assistant Professor in Computer Science at Boston University. Her main interests are in distributed stream processing and large-scale graph analytics.
Disclaimer: This post was written by Vasiliki Kalavri for the SIGOPS blog. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGOPS, or their parent organization, ACM.