Data Streaming: Real-time data for real-time decisions (Palantir RFx Blog Series, #8)
Often the most business-critical decisions are also the most time-sensitive. Data streaming technologies let organizations act on information (almost) as quickly as it comes in.
Editor’s note: This is the eighth post in the Palantir RFx Blog Series, which explores how organizations can better craft RFIs and RFPs to evaluate digital transformation software. Each post focuses on one key capability area within a data ecosystem, with the goal of helping companies ask the right questions to better assess technology. Previous installments have included posts on Ontology, Data Connection, Version Control, Interoperability, Operational Security, Privacy-Enhancing Technologies, and the Modeling Objective.
Introduction
“Real-time data” has transformed the way people live and businesses operate. The ability to deliver data as soon as it is collected has touched and improved so many of our most basic functions — from checking the weather to ordering food to sending money — and, seemingly overnight, changed age-old expectations about how quickly information, products, and services can be delivered and consumed. For the purposes of this blog post, we define real-time data as any data that can be ingested, processed, and acted on with low latency, and data streaming as the set of capabilities needed to incorporate real-time data into an enterprise data ecosystem.
For enterprises, real-time data technologies have taken longer to reach their full effect, but they are no less transformative to the overall effectiveness of a given business. As data ecosystems have evolved to process data more and more quickly, so have organizational expectations about how those data streams can be used to inform critical decisions. In practical terms, food and beverage manufacturers can now halt production of a given product minutes after identifying a defect. Airlines can swap aircraft and adjust crew schedules minutes after identifying an adverse weather pattern. Banks can suspend an account for review seconds after flagging a suspicious transaction. By reducing the gap between data collection and decision-making to close to zero, real-time data has driven material improvements to countless business processes and, in some cases, enabled entirely new capabilities.
As with so many other transformative technologies, the gulf between the promise of real-time data and its implementation can be quite vast. This is especially true in the context of RFPs, where promises of “gigabyte throughput” and “millisecond latencies” are common but impossible to verify on paper. In this blog post, we explore some of the important considerations around real-time data and the streaming technologies that enable it, drawing from our experience helping many organizations in industry and government generate value from real-time data.
What is Data Streaming?
Data streaming refers to the set of capabilities needed to process real-time information within a data ecosystem.¹ Streaming can refer to any number of operations, including data collection, ingestion, transformation, modeling, contextualization, write-back, analysis, and actions — and it’s critical for organizations to define exactly what they mean by data streaming before seeking out possible solutions. After all, “data streaming” and “real-time data” can mean very different things in different contexts. What’s considered an “acceptable lag” for ordering lunch or transferring money to a friend is very different from acceptable lags for dispatching paramedics or flagging a malfunction in a submarine’s GPS system. With every use case, the data sources, system architectures, and consumer expectations vary widely, which in turn changes the very definition of data streaming when applied to those contexts.
Streamed data sources are typically contrasted with “batch-processed” data sources, which, as the name suggests, describe data that is processed in discrete groupings and on a specific schedule.² Broadly speaking, the defining elements of data streams are speed and continuity. Data streaming pipelines process data at low latencies (often in seconds or less) and at rapid, continuous intervals. Streaming data typically enters the data ecosystem as a continuous series of time-stamped “events,” which is why it’s often referred to as “event stream processing” or “real-time data streams.”
In this sense, the definition of data streaming is simple — move information around a data ecosystem quickly and continuously. But defining data streaming in the context of real-world data ecosystems can be difficult when you consider all the operations that comprise a streaming pipeline, and more generally all the things a data ecosystem does. So far in this RFx Blog Series, we’ve discussed a few key components of a data ecosystem — ontology, data integration, data pipeline version control, interoperability, operational security, privacy, AI/ML modeling — and for a data ecosystem to truly incorporate streamed data, it must be able to apply each of these capabilities (and more) to high-volume, low-latency data. The table below highlights some of the key capabilities of a functional data streaming architecture, separated into three phases of data ingestion, data transformation, and data consumption.
Data Streaming — Key Capabilities
From a platform architecture standpoint, the big question is not how to design a standalone system for streamed data but rather how to design a system that allows end users to engage with streamed data sources alongside conventional batch data sources. In other words, the most effective data streaming solution is the one where system users can interact with all integrated data sources in the same way, with the same governance policies, without regard as to whether the data is from one kind of data source or another. What makes this challenging is that streamed data and batch data require very different backend technologies. One cannot just pipe streamed data into a platform designed for conventional data sources any more than one can strap bigger engines onto a 737 and expect it to fly safely at supersonic speed.
Data Streaming — Functional Architecture
Given the wide range of use cases and data sources related to streaming, there are different ways to construct an effective data streaming architecture. But the main functional components tend to be similar from system to system. The diagram above presents some of these key components and how they fit together.
- Data Ingestion brings streaming data from source systems into the data ecosystem. The Data Connection API executes long-running tasks for ingestion, while the Stream Proxy exposes endpoints to receive new records from external systems before validating and ingesting the information into the platform for further processing.
- Data Transformation applies several processes to the ingested data. The data is stored in a clustered, highly available data storage solution such as Kafka. A Compute Engine such as Apache Flink is used to run distributed streaming transformations and computations on the stored data. This engine comprises a Job Manager to coordinate work done in the cluster and a Task Manager to execute the transformations directly on the data. The Cluster Manager manages lifecycle tasks such as keeping Flink versions up to date and downloading user code jars, and the Stream Worker helps spin up new clusters. The Stream Archiver writes streaming data into cold storage, organizing the information into known topics and their corresponding data sets. How the system balances storage considerations (low-latency but expensive hot storage vs. slow but cost-effective cold storage) is a critical consideration for any streaming solution, as discussed in the Requirements section below. The Stream Catalog maintains a data mapping between streaming (hot) storage and archival (cold) storage to ensure efficient storage operations.
- Data Consumption includes mechanisms for users to engage with the streaming data according to their specific needs and workflows. Pluggable Sinks provide an opinionated framework for ultra-low-latency data flows, receiving data from the Compute Engine and sending it to a set destination for use in downstream applications. Ontology write-back connects streaming data to an ontology’s objects and actions, allowing end users to understand streaming data in context. Rules and alerts are a popular streaming workflow, as organizations will establish logic to monitor systems or assets and alert users when some threshold is met. Users can set up sinks to write streaming data directly to target databases such as Time Series Databases and other Customer Data Stores.
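To make these three phases concrete, the snippet below sketches a single ingest, transform, and consume loop over a Kafka stream. It is a minimal sketch, assuming a broker at localhost:9092 and the open-source kafka-python client; the topic names, event fields, and threshold rule are illustrative placeholders rather than any particular product's API.

```python
# Minimal ingest -> transform -> consume loop over a Kafka stream.
# Assumes a broker at localhost:9092 and the kafka-python client;
# topic names, event fields, and the threshold rule are illustrative.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "sensor-readings",                      # raw ingest topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRESSURE_LIMIT = 350.0                      # hypothetical operating limit

for message in consumer:
    event = message.value                   # e.g. {"asset_id": ..., "pressure": ..., "ts": ...}

    # Transformation: validate and enrich each record as it arrives.
    if "pressure" not in event:
        continue                            # drop malformed records
    event["exceeds_limit"] = event["pressure"] > PRESSURE_LIMIT

    # Consumption: route the enriched record to a downstream sink topic,
    # where alerting or ontology write-back would pick it up.
    sink = "pressure-alerts" if event["exceeds_limit"] else "pressure-readings"
    producer.send(sink, event)
```

In a real deployment these phases run as separate, independently scaled services (stream proxy, compute engine, sinks), but the shape of the data flow is the same.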
Defining data streaming turns out to be quite difficult and expansive because it refers to “all the things” a data ecosystem does, only with streamed data. And because most data ecosystems were designed for more common, batched data sources, the ability to incorporate streamed data in a seamless way is a major challenge for many enterprises. For this reason, it’s important to evaluate streaming technologies based not only on whether they can meet strict standards related to latency, availability, and scalability, but also on how well they interact with other data assets and platform components.
Why does Data Streaming matter?
Streaming matters when seconds matter. The faster an enterprise can process a real-time data feed, the faster it can fashion a response to whatever that data feed is telling it.
Exactly how real-time data streams are integrated, transformed, and acted upon of course differs across organizations and industries. A shipping company could use emergent weather data to inform the path and schedule of a particular fleet of container ships; a financial services company could use real-time transaction data to estimate the likelihood of fraud or money laundering for a given set of accounts or transactions; an auto manufacturer could use sensor data from welding robots on the factory floor to identify a particular defect and pause production accordingly. For many of our customers, data streaming has significantly reduced the timeframes of critical workflows. For others, it has enabled entirely new workflows that were not possible beforehand.
Take the construction of offshore oil platforms as an example. Starting up an oil rig is a dangerous and capital-intensive endeavor. It requires many teams of engineers and construction workers to establish both the well itself and the ecosystem of pipes, cranes, storage, and living quarters above it. Modern platforms have sensors installed all along the primary well pipeline and the platform itself, which stream large quantities of time-stamped data related to temperatures, pressures, and operating limits. Data streaming technologies allow oil and gas companies to ingest, harmonize, and run complex computational tasks on this data, offering live-monitoring capabilities to identify well performance issues in real-time. Combined with a robust dashboarding and alerting infrastructure, data streaming makes oil platforms significantly safer and more effective by preventing injury, environmental accidents, and well shutdowns.
More broadly, organizations have come to expect real-time data processing. Much like individual consumers have embraced the “now economy” for things like news, food, and ride-sharing, so too have enterprises come to demand immediate access to the most current information about their business, regardless of scale or complexity. To use the example above, customers in the oil and gas industry now rely on streamed sensor data to ensure safety and performance along their pipeline networks because these capabilities represent a transformative shift in their ability to keep workers safe and oil wells online. Similarly, customers in other industries have baked expectations of data streaming into their most critical workflows because it enables workflows and outcomes that are now deemed essential.
Data streaming can be seen as a necessary and inevitable step forward for modern data ecosystems. Technology gets better; computers get faster; data gets more varied and complex, creating the need for more immediate computation and delivery.
Requirements
First, a note about latency requirements. The most common (and sometimes only) technical requirement related to data streaming concerns latency SLAs (Service Level Agreements). For example, RFPs commonly include a streaming requirement that reads something like: “the solution must have sub-two-second end-to-end latency for streamed data sources.” While this “need for speed” is real and important, it ignores some critical considerations that affect the effectiveness of a streaming solution. More specifically, latency SLAs must be specific and realistic about which operations are expected within a given time window. Does the two-second SLA include reading, ingesting, and harmonizing the data to an ontology? Does it include transformations, joins, and write-back to the ontology? As data pipelines grow and the list of operational expectations grows along with them, sub-second SLAs can lack utility at best or become prohibitive at worst, crowding out important data pipeline functions that cannot realistically fit within an arbitrary temporal window.
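One way to make such an SLA testable is to stamp each record as it clears every stage and report per-stage latencies, so that the requirement states exactly which stages it covers. The sketch below is purely illustrative; the stage names and the sample event are hypothetical.

```python
# Per-stage latency accounting for a single streamed record. The stage
# names and the sample event are hypothetical; in practice each service
# in the pipeline would add its own timestamp as the record passes through.
import time

def stamp(event, stage):
    """Record the wall-clock time at which a pipeline stage completed."""
    event.setdefault("stamps", {})[stage] = time.time()
    return event

event = {"asset_id": "pump-17", "pressure": 342.1,
         "stamps": {"collected": time.time()}}

stamp(event, "ingested")        # after the stream proxy accepts the record
stamp(event, "transformed")     # after validation and enrichment
stamp(event, "ontology_bound")  # after write-back to the ontology

stages = ["collected", "ingested", "transformed", "ontology_bound"]
for start, end in zip(stages, stages[1:]):
    delta_ms = (event["stamps"][end] - event["stamps"][start]) * 1000
    print(f"{start} -> {end}: {delta_ms:.2f} ms")
```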
The streaming solution must support comprehensive integration with structured/batch data sources. If one of the primary goals of a data ecosystem is to unify all enterprise data assets and provide a harmonized point of access for system users, that system must be able to integrate streamed data with batch data. This fusion of data sources helps users explore real-time data in context with other data sources, providing the most complete picture possible for subsequent action or decision-making.
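As a small illustration of that fusion, the sketch below enriches each streamed event with attributes from a batch-maintained reference table before it reaches users. The asset table and event fields are hypothetical; the point is only that the join happens inside the pipeline rather than being left to the user.

```python
# Enrich streamed events with a batch-maintained reference table, so that
# consumers see real-time readings in context. The reference table and
# event fields are illustrative placeholders.

# Batch side: a reference table refreshed on a schedule (e.g. nightly).
asset_reference = {
    "pump-17": {"platform": "North Sea Alpha", "max_pressure": 350.0},
    "pump-18": {"platform": "North Sea Alpha", "max_pressure": 410.0},
}

def enrich(event):
    """Join one streamed event against the batch reference table."""
    asset = asset_reference.get(event["asset_id"], {})
    return {
        **event,
        "platform": asset.get("platform", "unknown"),
        "near_limit": event["pressure"] > 0.9 * asset.get("max_pressure", float("inf")),
    }

streamed_event = {"asset_id": "pump-17", "pressure": 342.1, "ts": "2024-05-01T12:00:00Z"}
print(enrich(streamed_event))   # includes platform name and a near_limit flag
```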
The solution must provide immediate access to the complete history of streamed data. One major challenge for streamed data is the cost of storage; data volumes and update frequencies are so high that it often becomes prohibitively costly to make the entire data history available to system users. Still, organizations need reliable and persistent access to historical versions of streamed data sources, otherwise the streaming system will become siloed in ways the data ecosystem was set up to prevent. To make historical versions of streamed data persistently available, streaming solutions should include automatic processes that move data from hot storage to cold storage, as well as a well-defined library that allows for seamless reading of data between the two.
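A minimal sketch of that hot-to-cold handoff is below: aged records are periodically rewritten from the streaming store into dated cold-storage files, and a small catalog records where they went so reads can span both tiers. The retention threshold and file layout are illustrative, and a production archiver would typically use a columnar format such as Parquet rather than JSON lines.

```python
# Sketch of a stream archiver: move records older than a cutoff from hot
# (streaming) storage into dated cold-storage files, and keep a catalog so
# reads can span both tiers. Threshold and layout are illustrative; JSON
# lines is used only to keep the sketch dependency-free.
import json
import time
from pathlib import Path

HOT_RETENTION_SECONDS = 3600                # illustrative threshold
COLD_DIR = Path("cold_storage")
COLD_DIR.mkdir(exist_ok=True)

hot_store = [                               # stand-in for the streaming store
    {"asset_id": "pump-17", "pressure": 342.1, "ts": time.time() - 7200},
    {"asset_id": "pump-17", "pressure": 339.8, "ts": time.time() - 30},
]
catalog = {}                                # topic -> list of cold files

def archive(topic, records):
    """Move aged records to cold storage and return what stays hot."""
    cutoff = time.time() - HOT_RETENTION_SECONDS
    aged = [r for r in records if r["ts"] < cutoff]
    if aged:
        path = COLD_DIR / f"{topic}-{int(cutoff)}.jsonl"
        path.write_text("\n".join(json.dumps(r) for r in aged))
        catalog.setdefault(topic, []).append(str(path))
    return [r for r in records if r["ts"] >= cutoff]

hot_store = archive("sensor-readings", hot_store)
print(len(hot_store), "record(s) still hot;", catalog)
```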
The solution must enable flexible retention and deletion policies on streaming data. The nature of streaming data — massive, noisy, and often relevant for only a short period of time — heightens the need for a strong retention service. For this reason, streaming solutions must include a rule-based retention service that can be configured to automatically identify and, if desired, delete information according to customer-specific requirements. This service must enable granular policies (i.e., to specify which policies apply to which data sources) and a fully-customizable range of data retention/deletion policies. For example, a user should be able to implement an automatic data retention policy that identifies a specific subset of streaming data with particular metadata characteristics, makes that subset inaccessible to specific groups of users for a defined period of time, and then permanently purges that data from the system.
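In schematic form, such a retention service is a small policy engine: each rule names the data it applies to, how long that data is kept, and what happens afterwards. The sketch below shows one hypothetical shape for such policies, not any particular product's configuration.

```python
# Schematic rule-based retention: each policy names the data it applies to,
# how long the data is retained, and the action taken afterwards. Policy
# fields and actions are hypothetical illustrations of granular rules.
import time

policies = [
    {"source": "sensor-readings", "match": {"region": "EU"},
     "retain_seconds": 30 * 24 * 3600, "action": "purge"},
    {"source": "sensor-readings", "match": {},
     "retain_seconds": 365 * 24 * 3600, "action": "restrict"},   # hide, don't delete
]

def retention_action(source, record, now=None):
    """Return the action to take on a record, or None if it is kept as-is."""
    now = now or time.time()
    for policy in policies:
        if policy["source"] != source:
            continue
        if any(record.get(k) != v for k, v in policy["match"].items()):
            continue
        # First matching policy wins; act only once the retention window lapses.
        if now - record["ts"] > policy["retain_seconds"]:
            return policy["action"]
        return None
    return None

old_eu_record = {"region": "EU", "ts": time.time() - 60 * 24 * 3600, "pressure": 301.2}
print(retention_action("sensor-readings", old_eu_record))   # -> "purge"
```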
The solution must connect data streams to ontology objects. Contextualizing real-time data requires more than just combining data streams with other data sources; the data streams must also be systematically mapped to meaningful semantic concepts via an ontology. In a previous blog post, we discussed how the notion of an ontology — a map that links together data and meaning by defining objects, properties, and links that are meaningful to an organization — is a foundational element of any effective data ecosystem. By extension, the process of contextualizing streamed data within the broader ecosystem involves binding real-time data to their relevant ontological objects — such as an oil rig, aircraft, or person. Only then can organizations have a full contextual view of their real-time data.
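The sketch below shows the shape of that binding step: each incoming event is resolved to the ontology object it describes (here, a hypothetical oil rig) and attached as a time-stamped update. The object types, identifiers, and in-memory registry are illustrative stand-ins for a real object store.

```python
# Binding streamed events to ontology objects: resolve each event to the
# object it describes and attach the reading as a time-stamped update.
# Object types, identifiers, and the registry are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class OntologyObject:
    object_type: str
    object_id: str
    properties: dict = field(default_factory=dict)
    timeseries: list = field(default_factory=list)

# A tiny registry standing in for the ontology's object store.
registry = {
    ("oil_rig", "rig-042"): OntologyObject(
        "oil_rig", "rig-042", {"name": "North Sea Alpha"}),
}

def bind_event(event):
    """Map a raw sensor event onto the ontology object it belongs to."""
    obj = registry.get(("oil_rig", event["rig_id"]))
    if obj is None:
        return None            # unmapped events need triage, not silent drops
    obj.timeseries.append({"ts": event["ts"], "pressure": event["pressure"]})
    return obj

event = {"rig_id": "rig-042", "ts": "2024-05-01T12:00:00Z", "pressure": 342.1}
rig = bind_event(event)
print(rig.properties["name"], "now has", len(rig.timeseries), "reading(s)")
```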
The solution must support versioning and branching of streamed data, including granular replays of timed events. In the context of a data pipeline, version control refers to a system to help track and manage changes to the data, while branching refers to a separate (or “branched”) environment where a developer can write, test, and revise code before it is approved for production and merged back into the main branch. In a previous blog post, we discussed how important these technologies are for a data ecosystem, enabling collaboration, experimentation, and granular histories of all changes while engendering collective trust in the ecosystem itself. All of these arguments about why versioning and branching capabilities are critical apply to streamed data sources — perhaps especially so, as streaming data tends to arrive as time series, which makes replays of timed events all the more useful and important.
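In practice, replay support often reduces to being able to re-read an archived, time-ordered event log between two timestamps and run it back through a (possibly branched) transformation. A minimal sketch, with an illustrative log and candidate rule:

```python
# Replay a window of timed events from an archived, time-ordered log so a
# branched version of a transformation can be evaluated against real history.
# The log contents and the candidate rule are illustrative.
from datetime import datetime, timezone

def ts(s):
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc)

event_log = [                               # append-only, time-ordered archive
    {"ts": ts("2024-05-01T11:58:00"), "pressure": 330.0},
    {"ts": ts("2024-05-01T11:59:00"), "pressure": 346.5},
    {"ts": ts("2024-05-01T12:00:00"), "pressure": 352.2},
]

def replay(log, start, end, transform):
    """Re-run a transformation over every event with start <= ts < end."""
    return [transform(e) for e in log if start <= e["ts"] < end]

# A candidate ("branched") alerting rule being tested against history.
def candidate_rule(event):
    return {**event, "alert": event["pressure"] > 345.0}

window = replay(event_log,
                ts("2024-05-01T11:59:00"), ts("2024-05-01T12:01:00"),
                candidate_rule)
print([e["alert"] for e in window])         # -> [True, True]
```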
The solution must enable real-time data transformation on streamed data, enabling rapid integration and continuous builds. Data pipelines involve multiple transformations, where a series of incremental changes are applied to raw data in order to make it useful for system users downstream. Streaming data presents multiple data transformation challenges due to the volume and frequency of incoming data updates. For a streaming architecture to be effective, it must enable complex and persistent data transformations that take one or more streaming datasets as input and produce one or more streaming datasets as output. Developers also need a robust authoring environment to define these continuous builds and create user-defined functions (UDFs).
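As a small illustration of a continuous transformation, the sketch below applies a user-defined function over a sliding window of incoming events, emitting a derived stream as new records arrive. The window size and the UDF are illustrative, and a production system would run the equivalent logic on a distributed engine such as Flink rather than in a single process.

```python
# A continuous transformation: consume an input stream, maintain a sliding
# window per asset, and emit a derived stream produced by a user-defined
# function (UDF). Window size and the UDF are illustrative; a production
# deployment would run equivalent logic on a distributed streaming engine.
from collections import defaultdict, deque

WINDOW_SIZE = 5                                     # last N readings per asset
windows = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))

def rolling_mean_udf(values):
    """User-defined function applied to each window of readings."""
    return sum(values) / len(values)

def transform(event):
    """Take one input event, return one derived output event."""
    window = windows[event["asset_id"]]
    window.append(event["pressure"])
    return {"asset_id": event["asset_id"],
            "ts": event["ts"],
            "pressure_rolling_mean": rolling_mean_udf(window)}

input_stream = [
    {"asset_id": "pump-17", "ts": 1, "pressure": 330.0},
    {"asset_id": "pump-17", "ts": 2, "pressure": 346.5},
    {"asset_id": "pump-17", "ts": 3, "pressure": 352.2},
]
output_stream = [transform(e) for e in input_stream]    # derived streaming dataset
print(output_stream[-1])                                # rolling mean over the window
```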
The solution must include pre-built, out-of-the-box connectors to common streaming data sources. Data streams come from many different sources. Common system types include Apache Kafka, Google Pub/Sub, OSI PI, Amazon Kinesis, and Azure Event Hub. One of the major costs of integrating streamed sources into a data ecosystem is the custom development required to establish a data connection. These costs can be reduced or eliminated by a system with pre-built connectors to common streaming data sources and a guided UI that walks users through the integration process.
Conclusion
Evaluating streaming technologies can be very difficult given the wide range of components necessary in a robust solution and the difficulty of assessing these components on paper. Effective solutions require both a dedicated streaming architecture to run ingestion, transformation, and consumption tasks, as well as a mechanism to integrate that data into the broader ecosystem. This integration component — where organizations must connect streaming data to other, non-streaming data sources via a defined data model — is central to the streaming solution and often overlooked in RFPs, which tend to focus on time latencies and data source compatibilities. Evaluators would be well served to gain a deep understanding of how streaming systems contextualize real-time data within the broader data ecosystem, and how these streaming technologies have achieved impact in real-world environments.
[1] “Real-time data” is, of course, a misnomer, as it’s impossible to process data with zero lag. One of our goals in writing this post is to define this widely accepted term more precisely. “Near-real-time data” or “low-latency data” are more accurate phrasings (though also less catchy), as they imply minimal lag between the time data is collected and when it is used.
[2] This distinction between streamed and batched data is not always accurate or helpful, as streaming data is also delivered in batches and on a defined schedule — only those batches are typically smaller and far more frequent. For this reason, it’s usually more helpful to see streamed and batch data on a continuum, rather than as a difference in kind.