Real-time data processing with data streaming: new tools for a new era

Real-time data streaming is still early in its adoption, but over the next few years organizations with successful rollouts will gain a competitive advantage

Today, many data sources, such as IoT devices, user interaction events from mobile applications, financial services transactions, and health monitoring systems, broadcast critical information in real time. Developers working with these data sources need to think about the architecture required to capture real-time streaming data at varying scales and complexities.

It used to be that processing real-time information at significant scale was hard to implement. Hardware architectures needed to be engineered for low latency, and software required more advanced programming techniques that combined receiving, processing, and shipping data efficiently.

The paradigm shift in data processing

I recently attended the Strata Data Conference and discovered a paradigm shift: There are multiple frameworks (both open source and commercial) that let developers handle data streaming or real-time data-processing payloads. There are also commercial tools that simplify the programming, scaling, monitoring, and data management of data streams.

The world isn’t batch anymore, and the tools to process data streams are a lot more accessible today than they were just two or three years ago.

Today’s tools, architectures, and approaches are all very different from those used historically for data integration and data warehousing, which grew up during an era of batch processing. You developed scripts or jobs that extracted data mostly from flat files, transformed it into a usable structure, and loaded it into a database or other data-management system. These ETL (extract, transform, load) scripts were deployed directly to servers and scheduled to run with tools like Unix cron, or they were services that ran when new data was available, or they were engineered in an ETL platform from Informatica, Talend, IBM, Microsoft, or another provider.

Today, once data is captured, there is a growing need to run analytical and machine learning functions in real time. Some of this is done for competitive advantage, such as banks that need to process news, social media, and financial information and enable their traders to respond to market conditions with real-time analytics. It is also used to deliver real-time customer experiences, such as consumer retail platforms that recognize customers when they walk into a store and suggest personalized product offerings as they navigate the merchandise. It can also be a matter of life and death in hospitals, airports, construction zones, and power plants, where critical information analyzed in real time can identify anomalies or safety hazards and alert people to act.

Determining technical requirements for data streaming

Before selecting technologies for managing data streams, it’s important to understand the data sources, data-processing requirements, and targeted analytics to help select the architecture, platforms, and implementation approach. Based on my discussions on streaming with several practitioners and solution providers at the Strata Data Conference, here are some factors to consider:

  • The number of data sources, their data formats (JSON, XML, CSV, etc.), their interfaces (API, flat files, source databases), schema complexity, data-quality factors, and the velocity of data are all factors when designing data-stream processors. It’s also good to know whether data sources publish full records or only broadcast changed records and modified fields. Developers should review any data dictionaries or other documentation provided by the data source’s publisher to gain a firm understanding of the meaning and business rules around the data.
  • When selecting and configuring data-streaming platforms, it’s essential to consider the volume and velocity of data, as well as how long data must be retained for the targeted analytics. In addition, it’s important to have defined and realistic requirements around latency, which is the delay from when the source shares new data to when the data or analytics is fully processed by the data stream. Higher volumes, velocities, and storage needs, and lower latency requirements, will drive platform and architecture choices and be factors in the scale and cost of the underlying infrastructure.
  • Focus on the type of analytics that will be done, the size of the data it will access, and how frequently it needs to be updated. Developers should also consider how frequently the analytics will change and whether there are any reprocessing requirements for when new versions of the algorithms are deployed.
  • Developers should consider whether the data stream will be deployed to public clouds, to private clouds, or on edge devices. Many IoT use cases require a subset of the data processing to be performed on the device or locally to a group of devices before sending aggregate data to centralized analytic systems, as sketched below. One example is autonomous cars that process data locally to make driving decisions and then share traffic or road conditions with a centralized analytics processor.
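
To make the edge-processing pattern in the last point concrete, here is a minimal sketch of local pre-aggregation before publishing to a central stream. It assumes the kafka-python client, a reachable broker address, a device-metrics topic, and a simulated read_sensor() helper; all of these names are illustrative assumptions, not part of any specific platform.

    import json
    import random
    import time
    from kafka import KafkaProducer  # pip install kafka-python

    def read_sensor():
        """Stand-in for a real device read; returns a simulated temperature."""
        return 20.0 + random.random() * 5

    # Assumed broker address; only windowed summaries will be published to it
    producer = KafkaProducer(
        bootstrap_servers="broker.example.com:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    WINDOW_SECONDS = 10

    while True:
        # Aggregate raw readings locally for one window instead of streaming every sample
        readings = []
        window_end = time.time() + WINDOW_SECONDS
        while time.time() < window_end:
            readings.append(read_sensor())
            time.sleep(0.1)

        summary = {
            "device_id": "device-42",
            "window_end": int(window_end),
            "count": len(readings),
            "avg": sum(readings) / len(readings),
            "max": max(readings),
        }
        # Only the aggregate leaves the device, reducing bandwidth to the central stream
        producer.send("device-metrics", summary)

The same idea applies whatever the transport: raw samples stay local, and only the summary travels to the centralized analytics system.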

Data-streaming platforms: Kafka, Spark, and alternatives

These requirements help determine a high-level architecture to support data streaming and shape low-volume pilots to validate the approach. The data-streaming architecture often consists of three components, illustrated in the sketch after this list:

  • A messaging component that captures and begins processing data from data sources. Apache Kafka was very dominant at the Strata conference and was discussed in many sessions. Alternatives include Apache Pulsar and Amazon Kinesis.
  • A distributed, fault-tolerant compute system that can run the analytics. Apache Spark Streaming and the newer APIs for Spark Structured Streaming were the technologies most discussed at the conference, but alternatives include Apache Storm, Kafka Streams, Apache Flink, and Apache Heron.
  • Downstream systems to share or store the results. These can be big data platforms like Hadoop or Cassandra, SQL relational databases, flat files, data-visualization tools, low-latency storage such as Apache BookKeeper, GPU databases such as Kinetica, or distributed databases such as MemSQL.
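
To make the three components concrete, here is a minimal sketch that reads JSON events from a Kafka topic (the messaging component), computes a windowed aggregate with Spark Structured Streaming (the compute component), and writes results to a console sink standing in for a database, file store, or visualization feed (the downstream component). The broker address, topic name, and event schema are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window, avg
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    # Requires the Kafka connector for your Spark version, for example:
    #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pipeline.py
    spark = SparkSession.builder.appName("sensor-pipeline").getOrCreate()

    # Hypothetical schema for the JSON events published to the topic
    schema = StructType([
        StructField("device_id", StringType()),
        StructField("temperature", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # 1. Messaging component: consume from a Kafka topic (broker and topic are assumptions)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker.example.com:9092")
           .option("subscribe", "sensor-events")
           .load())

    # 2. Compute component: parse the JSON payload and compute a per-device
    #    average temperature over one-minute event-time windows
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    stats = (events
             .withWatermark("event_time", "2 minutes")
             .groupBy(window(col("event_time"), "1 minute"), col("device_id"))
             .agg(avg("temperature").alias("avg_temperature")))

    # 3. Downstream component: the console sink stands in for a real store or dashboard
    query = (stats.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()

Swapping the console sink for a JDBC database, Parquet files, or a Kafka output topic changes only the last step; the messaging and compute layers stay the same.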

One critical design factor in considering Kafka, Storm, Flink, and Spark Streaming is whether your application requires native streaming that processes data as it arrives, or whether you can tolerate some latency and micro-batch the processing. If your processing requirements are basic, using Kafka with Kafka Streams may be sufficient. If you need native stream processing, Storm and Flink are more mature than Spark Streaming. The combination of Kafka and Spark Streaming was the common architecture discussed at the Strata conference, with presenters citing its ease of use, scalability, and versatility.
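
To illustrate the micro-batch versus native-streaming trade-off in Spark’s own terms, the sketch below contrasts Spark Structured Streaming’s default micro-batch trigger with its experimental continuous trigger (available since Spark 2.3), using the built-in rate source so it runs without external dependencies; the interval values are arbitrary examples.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("trigger-comparison").getOrCreate()

    # The built-in "rate" source emits timestamped rows, handy for latency experiments
    events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

    # Micro-batch mode: records accumulate and are processed as a batch every 10 seconds
    micro_batch = (events.writeStream
                   .format("console")
                   .trigger(processingTime="10 seconds")
                   .start())

    # Continuous mode (experimental since Spark 2.3): rows are processed as they arrive,
    # with checkpoints written every second; only map-like queries are supported
    # continuous = (events.writeStream
    #               .format("console")
    #               .trigger(continuous="1 second")
    #               .start())

    micro_batch.awaitTermination()

Storm, Flink, and Kafka Streams process records one at a time by design, so they need no equivalent switch; with Spark, the trigger choice determines how much latency the micro-batch model introduces.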

Data-streaming architecture options

I judge a maturing architecture by the size of its ecosystem. As more teams achieve success with a platform, the platform becomes stronger and support from providers increases. The providers not only offer expertise; their tools also make the technology easier and more accessible to a wider range of organizations and use cases.

You can configure the architecture yourself using Amazon Web Services, Microsoft’s Azure HDInsight, Google Cloud’s stream analytics solution, or IBM Cloud’s Streaming Analytics. With these services, you are more likely to take on the work of setting up, configuring, and maintaining the different architecture components.

Larger enterprises can obtain data-streaming capabilities and support from big data platform vendors like Cloudera, MapR, and Hortonworks. In addition, enterprises that are heavily invested in ETL can review data-streaming capabilities from products such as Informatica Big Data Streaming and Talend Data Streams.

Other vendors are betting on alternative architectures. Streamlio, for example, uses a combination of Apache Pulsar for messaging, Apache Heron for stream processing, and Apache BookKeeper for storage, and it claims this architecture is easier to build and support than Apache Spark.

If you’re just getting started with these technologies, you might want to try the free Databricks Community Edition or StreamAnalytix, which offers a free trial. With these tools, you can start loading data and developing streaming algorithms without having to configure any infrastructure.

Start with a short list of requirements

Whatever approach you select, a best practice is to start by defining the technical requirements and short-listing an approach based on these factors, costs, and other considerations.

With a short list, development teams should implement proofs of concept with lower volumes and velocities of data. A key success factor for these proofs of concept is to evaluate the ease of development and versatility in delivering the desired analytics. After that, development teams should scale up the volume and velocity of the data streams to evaluate performance and stability.

Real-time data streaming is still relatively early in its adoption, but there’s no doubt that over the next few years, organizations with successful rollouts will gain a competitive advantage.

Copyright © 2018 IDG Communications, Inc.