Bigger Data, Better Architectures

FractalWorks
May 4, 2018

Are you leaving behind valuable data insights because your data architecture isn’t robust enough to handle your data volume? You may need to upgrade to a mixture of processing models to realize the value hidden in your data. We offer our insights on the technology architectures required to sustain and enable new business use cases.

Data technology has evolved over the last three decades

Initially, data was processed either by a request-response system or by an overnight batch reporting process (e.g. powerful mainframe systems with simple user terminals presenting reports after a request was submitted).

As technology advanced, data processing tasks migrated from the mainframe to commodity-based platforms. Business users started to demand sophisticated intra-day reporting and ad-hoc insights to drive business opportunities. These platforms were powered by relational databases and their variants, primed by schedule-based batch processing systems implemented using the ETL (Extract, Transform and Load) methodology. These systems are still used for business intelligence reporting and intra-day querying, but they lack the ability to provide "on-event" business intelligence, known as Streaming Intelligence.

Streaming platforms require a different storage and processing architecture due to three non-functional characteristics (velocity, volume and variety), further driven by latency requirements. To enable real-time decision making, the underlying data architecture has to scale to answer "what is happening now" in addition to "what happened earlier". This has led to more complex data processing requirements.

This rapid evolution of data technologies presents a plethora of options, potentially leaving you overwhelmed by choice and puzzled about the best approach moving forward.

How does a big data architecture work?

The most important thing to remember about data architecture: one size doesn’t fit all! A good data architecture should have three broad, crucial layers:

The Three Cs

A) Capture and ingest data (Ingestion layer)

B) Curate data (Transformation and storage layer)

C) Consume (Presentation layer)

Together, these layers form the architectural backbone that ensures data from various sources can be ingested, processed and accessed reliably.
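To make the hand-off between the layers concrete, here is a minimal Python sketch; all names and record shapes are hypothetical, and a real system would back each layer with dedicated infrastructure (e.g. a message queue, a processing engine, a warehouse or serving store).

# Minimal sketch of the three Cs; names and record shapes are hypothetical.

def ingest(raw_lines):
    """Ingestion layer: capture raw records from a source."""
    for line in raw_lines:
        yield line.strip().split(",")

def curate(records):
    """Transformation and storage layer: validate and enrich."""
    for rec in records:
        if len(rec) == 3:  # drop malformed rows
            yield {"zone": rec[0], "ts": rec[1], "value": float(rec[2])}

def consume(curated):
    """Presentation layer: expose an aggregate for reporting."""
    totals = {}
    for row in curated:
        totals[row["zone"]] = totals.get(row["zone"], 0.0) + row["value"]
    return totals

raw = ["north,2018-05-04T09:00,1.5", "south,2018-05-04T09:01,2.0", "bad,row"]
print(consume(curate(ingest(raw))))  # {'north': 1.5, 'south': 2.0}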

As business requirements evolved, various software architectural patterns emerged to cater to these business intelligence processing needs. The following sections describe three major data architectures that address the batch and stream processing platform characteristics.

Lambda architecture

This is an ideal fit for augmenting traditional batch-based systems with streaming data use cases, as both batch and streaming layers co-exist within the data ecosystem. The batch layer collects, processes and exposes data and analytics at fixed processing cycles, whereas the streaming layer applies the same analytics to data collected between batch layer runs. As an example, the batch layer produces the start-of-day data image, while the streaming layer modifies that image with delta records throughout the day. A key consideration of this architecture is the need to manage multiple disparate software and hardware components, which can become complex to govern over time.
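As a rough illustration of the pattern (not of any specific product), the sketch below recomputes a batch view at fixed cycles, applies the same analytics to intraday deltas in a speed layer, and merges the two on query; all names and data shapes are hypothetical.

# Hedged sketch of the Lambda pattern: batch view plus speed layer.

def batch_view(all_events):
    """Batch layer: full recomputation over the master dataset."""
    view = {}
    for zone, delta in all_events:
        view[zone] = view.get(zone, 0) + delta
    return view

def speed_view(recent_events):
    """Streaming layer: identical analytics, applied only to deltas
    received since the batch layer last ran."""
    return batch_view(recent_events)

def serve(batch, speed):
    """Serving layer: merge the batch and real-time views on query."""
    merged = dict(batch)
    for zone, delta in speed.items():
        merged[zone] = merged.get(zone, 0) + delta
    return merged

start_of_day = batch_view([("north", 10), ("south", 5)])
intraday = speed_view([("north", 2)])
print(serve(start_of_day, intraday))  # {'north': 12, 'south': 5}

Here the speed layer simply reuses the batch function, but in practice the two layers typically run on different engines, and keeping their logic consistent is precisely the governance burden mentioned above.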

Kappa architecture

This architectural pattern is more suitable when building a greenfield data architecture that primarily processes streaming data. The streaming layer replaces the batch layer by switching the processing paradigm to an "everything is a stream" model, i.e. batch processing is treated as a subset of stream processing that enables the core business intelligence functionality. It also adds capabilities such as data replay from a single code base; a minimal sketch of this single-code-path idea follows the list below.

Some key aspects to be mindful of:

A) Data elements become events; not all data can or should be abstracted as events.

B) A complete re-tooling is required to upgrade.

C) Business rules applied in the real-time path and in batch-style replays of historical data should be identical.
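A minimal sketch of the single-code-path idea, assuming a durable event log that can be replayed from offset zero; all names are illustrative.

# Hedged sketch of Kappa: one processor, applied identically to
# replayed history ("batch") and to live events.

from collections import defaultdict

def process(events, state=None):
    """Single stream processor; because replay and live traffic share
    this code path, business rules cannot diverge between them."""
    state = state if state is not None else defaultdict(int)
    for zone, delta in events:
        state[zone] += delta
    return state

event_log = [("north", 10), ("south", 5), ("north", 2)]  # durable log

# "Batch" view: replay the whole log through the stream processor.
replayed = process(event_log)

# Live view: keep folding new events into the same state.
live = process([("south", 1)], state=replayed)
print(dict(live))  # {'north': 12, 'south': 6}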

Unified architecture

This evolving architectural pattern brings together batch and streaming data within a unified platform. Standardization of services is at the core of the architecture. This eliminates unnecessary components while decreasing architectural complexity. The attraction of this architecture is that complex business use cases can be implemented with fewer moving components, e.g. computing real-time population density within geo-zones of interest, while also serving standard reporting needs.
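Apache Beam is one well-known embodiment of this unified model: the same pipeline code runs over bounded (batch) and unbounded (streaming) sources. A minimal sketch follows; the file name and record layout are hypothetical.

# Unified model sketch with Apache Beam: the transforms are identical
# whether the source is bounded (a file, as here) or unbounded (e.g.
# a message queue). File name and record layout are hypothetical.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("events.csv")  # bounded source
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeyByZone" >> beam.Map(lambda rec: (rec[0], 1))
        | "CountPerZone" >> beam.CombinePerKey(sum)  # e.g. events per geo-zone
        | "Write" >> beam.io.WriteToText("zone_counts")
    )

Swapping the file read for an unbounded source (plus a windowing step) turns the same pipeline into a streaming job, which is the "fewer moving components" appeal of the unified approach.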

Key takeaways

Data processing architectures will continue to improve and evolve. These guidelines should help you make sound decisions.

A) Scale to future requirements: Architectural decisions should future-proof the data landscape with standard, common interfaces and modular design.

B) Leverage your current investments: Investments already made in enterprise data platforms need not be thrown away while embracing upcoming technologies; multiple options are available that can help scale incrementally.

C) Keep it simple and pragmatic: Flexibility and the use of fit-for-purpose data architecture components should be the key guiding principles.

D) Keep pace with evolving technologies and stay ahead: Data technologies will keep evolving; understanding these trends and experimenting with them is important to stay ahead of the curve.

E) Build a strong data architect pool: Data transformation is a journey; continuous improvement and customization of design are crucial factors for success.

Originally published at https://medium.com on May 4, 2018.
