Enterprise applications belong to a vibrant ecosystem, and consequently the data they generate is large and varied. Enterprises both benefit and suffer from this nature of applications and data. Whenever a new application that integrates with other applications in the ecosystem is to be deployed in an enterprise, the precondition is an “expansive data definition with referential value” on day 1 before integration can start. Traditionally, this approach to data integration involves identifying a target data structure and force-fitting data from all sources into it to ensure a ‘seamless’ integration, never mind the loss of data considered irrelevant.
Traditional Data warehouses use the following approach for data ingestion:
Ⅰ. Fixed target structure in which to ingest data
Ⅱ. Source data is transformed to fit into the target structure
Ⅲ. Any “alien” data is just ignored and dropped
Ⅳ. Unstructured data is sparingly allowed
Ⅴ. Reporting and analytics are run on top of this target structure
Ⅵ. Structure is reviewed periodically for change in definitions
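The fixed-target approach above can be sketched as follows. This is a minimal illustration, not a real warehouse loader; the column names and the sample record are hypothetical:

```python
# Minimal sketch of fixed-schema ingestion (hypothetical column names).
# Only the columns declared in the target structure survive; anything
# else in the source record is silently dropped.

TARGET_COLUMNS = ["customer_id", "txn_amount", "txn_date"]

def ingest(record: dict) -> dict:
    """Force-fit a source record into the fixed target structure."""
    row = {}
    for col in TARGET_COLUMNS:
        # Missing source fields simply become NULLs in the warehouse
        row[col] = record.get(col)
    return row

record = {"customer_id": "C42", "txn_amount": 99.5,
          "txn_date": "2024-01-15", "device_type": "mobile"}
print(ingest(record))  # "device_type" is lost
```

Note how the "alien" variable `device_type` never reaches the target structure, which is exactly the loss described in steps Ⅲ and Ⅳ above.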
So how do we absorb data without defining it up front, so that we can discover the structure and value of the data on an ongoing basis?
Stream Based Data Integration
In the real world, data is collected or elicited through events (enterprise-initiated or customer-initiated). Data collection from transaction systems into analytical systems can follow a similar approach if event data is allowed a variable structure. This works better if data flows as streams from multiple sources. This approach is called event-based data modelling using data streams.
It provides the following benefits:
Ⅰ. The unit of integration is a data packet, which contains a set of connected name-value pairs.
Ⅱ. Data packets flow into the data ingestion environment and onto one or more target streams. A packet is considered for processing on a stream if the minimum variables required for that stream are present in the packet.
Ⅲ. Each data packet can provide foundation information for multiple different events.
Ⅳ. The data gets stored as events, and data for the same event can be provided incrementally.
Ⅴ. The variables in a packet that are not known to the data ingestion environment are not ignored or dropped; they are retained against all events identified from the data packet.
Ⅵ. Since events are real-world concepts, they form an excellent foundation for analytical models targeted towards behavioral outcomes.
Ⅶ. Data ingestion can start very quickly, with conformance only to the minimum set of variables required per stream.
Ⅷ. Adjunct variables can be discovered after ingestion and used for analytical value.
Ⅸ. Works very well with the real-time paradigm in which today's businesses compete.
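The routing and retention behaviour described above can be sketched as follows. The stream definitions and variable names are hypothetical; the point is that a packet joins every stream whose minimum variables it carries, and unknown variables travel along instead of being dropped:

```python
# Sketch of stream-based ingestion (hypothetical stream definitions).
# A packet is routed onto every stream whose minimum required
# variables it carries; unknown ("adjunct") variables are retained.

STREAMS = {
    "transactions": {"customer_id", "txn_amount"},
    "interactions": {"customer_id", "channel"},
}

def route(packet: dict) -> dict:
    """Return the streams this packet qualifies for, keeping all variables."""
    events = {}
    for stream, required in STREAMS.items():
        if required <= packet.keys():
            events[stream] = dict(packet)  # adjunct variables travel along
    return events

packet = {"customer_id": "C42", "txn_amount": 99.5, "device_type": "mobile"}
events = route(packet)
# Qualifies for "transactions" only; "device_type" is retained, not dropped
```

Unlike the fixed-target approach, `device_type` here survives ingestion even though no stream declared it, so it remains available for later analytical discovery.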
In consumer-based enterprise businesses, where relationships are long-term and influenced by experience, the following streams are essential:
These streams abstract the natural structures of the processes that govern the operations of the business. And to top it all, they can be kick-started quickly and improved upon on an ongoing basis.
Advantages of Stream Based Integration
The following are some of the technical advantages of stream-based integration:
★ Real-time integration
o Enterprise environments with an ESB can easily connect in real time to the web service endpoint of the related stream
o Applications that can call web services can also connect in real time
★ Batch based upload
o Data that becomes available after EOD processes, or that arrives from external systems as flat files, can also be integrated onto streams
★ Legacy application integration
o Legacy applications that allow connectivity via message queues or flat files can also integrate in a stream-based environment
★ Cloud Technologies
o Cloud-hosted applications like Salesforce can be connected via their native APIs, or through integration platforms like MuleSoft, using adapters
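The real-time and batch paths above can converge on the same stream, as this sketch illustrates. The `publish()` function and the field names are hypothetical stand-ins for a web-service call, not a real API:

```python
# Sketch showing how real-time calls and batch flat files can feed the
# same stream endpoint. publish() stands in for a web-service call.
import csv
import io

published = []  # collected packets, in place of a real endpoint

def publish(stream: str, packet: dict) -> None:
    """Stand-in for posting one packet to a stream's web-service endpoint."""
    published.append({"stream": stream, "packet": dict(packet)})

# Real-time: an application posts one packet per event as it happens
publish("transactions", {"customer_id": "C42", "txn_amount": 99.5})

# Batch: an EOD flat file is replayed packet by packet onto the same stream
flat_file = io.StringIO("customer_id,txn_amount\nC1,10.0\nC2,20.0\n")
for row in csv.DictReader(flat_file):
    publish("transactions", row)
```

Because both paths produce the same unit of integration (a packet of name-value pairs), the ingestion environment does not need to distinguish real-time sources from batch or legacy ones.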
A stream-based approach to data integration preserves the sanctity of data throughout its lifecycle.