A Data Engineer cryopreserved in 2010 and woken up by mischief today would no longer understand much about modern stack data.
Remember, in only a few years, the way of collecting, extracting, transporting, storing, preparing, transforming, redistributing, and activating data has completely changed.
We changed the world, and opportunities to generate business through data have never been greater.
What does modern stack data look like?
We can start with a macro diagram.
The most striking development is the centralized place occupied (gradually) by the Data Warehouse Cloud, which has become the pivotal system of the IT infrastructure. From this flow all other notable transformations:
Let’s get to the heart of the matter.
We’ll introduce you to the modern stack data outlines; we chose two angles:
🌱 The changes behind modern stack data
Modern stack data defines the set of tools and databases used to manage the data that feed business applications.
The Stack data’s architecture has undergone profound transformations in recent years, marked by:
The rise of DataWarehouse Cloud (DWH)
Data Warehouse technology is as old as the world, or almost. In any case, it’s not a new word. And yet, we’ve seen a dramatic transformation of the Data Warehousing landscape over the past decade.
Traditional DWH solutions gradually give way to cloud solutions: the Data Warehouse Cloud. We can precisely date this evolution: October 2012, the date of marketing Amazon’s DWH Cloud solution: Redshift. There’s a clear before and an after, even if Redshift is losing ground today.
The main impetus that birthed the modern stack data came from Amazon, with Redshift. All the other solutions on the market that followed owe a debt to the American giant: Google BigQuery, Snowflake, and a few others. This development is linked to the difference between MPP (Massively parallel processing) or OLAP systems like Redshift and OLTP systems like PostgreSQL. But this discussion deserves a whole article that we’ll probably produce one day.
In short, Redshift can process SQL queries and perform joins on huge data volumes 10 to 10,000 times faster than OLTP databases.
But note that Redshift is not the first MPP database. The first ones appeared a decade earlier, but on the other hand, Redshift is:
In recent years, BigQuery and especially Snowflake have risen in power. These two solutions now have the best offers on the market, both in terms of price and computing power. Special mention for Snowflake, which offers a very interesting pricing model since storage billing is independent of computing billing.
But because we have to give back to Caesar his due – Caesar being Redshift here, let’s remember these few figures:
What changes with Cloud Data Warehouse?
The advent of Redshift and other Data Warehouse Cloud solutions that followed have made it possible to improve on several levels:
The transition from ETL solutions to EL(T)
Extract-Transform-Load = ETL while Extract-Load-(Transform) = EL(T).
Listing these acronyms makes the difference quite easy to understand:
What are the underlying issues of such an inversion? It’s quite simple.
Transformations consist of:
That is the challenge.
Transforming before Loading helps to evacuate part of data and therefore, avoid overloading the master database too much.
Indeed, that is why all traditional Data Warehouse solutions worked with heavy ETL solutions. It was vital to sort before loading into the DWH with limited storage capacities.
With Cloud Data Warehouses, storage’s cost has become a commodity, and computing power has increased dramatically.
Result: No need to transform before loading.
The DWH On-Premise – ETL On-Premise combination gradually gives way to the modern DWH Cloud – EL(T) Cloud combination.
Loading data into the Data Warehouse before any transformations avoids asking strategic and business questions when extracting and integrating data into the DWH.
Pipeline management’s cost is considerably reduced; we can afford to load everything into the DWH “without getting carried away” – and thus, we do not deprive ourselves of future use cases for the data.
The trend towards self-service Analytics
We have talked about the Data Warehouse Cloud, which is becoming the backbone of the modern data stack. Upstream, we have the EL(T) tools that connect the multiple data systems and the data warehouse. Data Warehouse Cloud data is then used for BI, data analysis, dashboarding, and reporting.
The advent of DWH Cloud has contributed to “cloudify” not only integration solutions (ETL/ELT) but also BI solutions.
Today, we have dozens of cloud-based BI solutions vendors on the market that are affordable and designed for business users. These easy-to-use solutions offer native connectors with the main Cloud Data Warehouses on the market.
Power BI, Looker, or Tableau are reference Cloud BI solutions:
🔎 Zoom on modern data stack’s bricks
We will now review the main bricks that make up the modern data stack in more detail, starting from the diagram presented in the introduction.
Typical diagram of a modern stack data
We appreciate this modern stack data diagram proposed by a16z.
From left to right, we find:
The Data Lake is a bathtub in which all data is poured in bulk without any transformations or processing, in its raw state. This Data Scientists’ tool serves for very advanced data use cases such as Machine Learning.
The Data Warehouse remains a “warehouse” organizing data in a structured way, even if its capacities to integrate semi-structured data are clearly increasing. These capabilities’ development elsewhere makes DWH increasingly pivotal – unlike the “pure” Data Lake, which increasingly plays a secondary role. We shall return to it.
Let’s now review each of these bricks of the modern stack data.
The Data Warehouse Cloud
The DWH Cloud is the foundation of the modern data stack, the pivotal solution around which all other tools revolve.
It stores structured and semi-structured enterprise data and is not just a database. It’s also a data laboratory, a veritable machine. It is a place of data preparation and transformation via one main tool: SQL, even if Python is increasingly used (but that is another subject).
The Data Warehouse Cloud is sometimes built downstream of a Data Lake which serves, as a catch-all, a tub of data stored in their raw state.
We can well use both Data Lake and Data Warehouse Cloud. You don’t necessarily have to choose between the two technologies. To be honest, they fulfill different roles and can be complementary…even if it’s a safe bet that the two technologies are called to merge.
Also, note that some actors, such as Snowflake, offer integrated Data Warehouse Cloud and Data Lake solutions. It’s a possible article’s title: Are Data Lake and Data Warehouse Cloud destined to merge? Though it’s not the subject of this article, it is a debate that stirs many expert heads!
However, the entire modern data stack is organized around the Cloud Data Warehouse, connected or merged with the Data Lake.
As seen in the first section of the article, EL(T) solutions prevail over traditional ETL tools. This evolution reflects a transformation in the data integration process, a significant evolution in building the data pipeline.
A question you may have asked yourself: Why put “T” into parentheses?
For the simple and good reason, the tool used to build the data pipeline between the source systems and the Cloud Data Warehouse no longer needs to transform the data.
EL(T) Cloud solutions (Fivetran, Stitch Data, etc.) are primarily used to organize piping. This is their main role. It is now Data Warehouse Cloud solutions and third-party tools that support the transformation phases.
A DWH Cloud solution helps transform data tables with a few lines of SQL.
We’ll have to talk about this evolution again: Most Data Preparation and Data Transformation operations can now be performed in the Cloud Data Warehouse itself, using SQL.
Data Engineers (in Java, Python, and other Scala) are increasingly leaving Data transformations to Data Analysts and business teams using SQL. This also raises a real question: What role for the Data Engineer tomorrow? Its role in the organization and maintenance of modern stack data is not assured.
The goal of modern stack data is to empower the end-users of the data. It’s increasingly at the service of the business teams and the marketing stack they handle.
The modern data stack breaks down barriers between data and marketing; it is the sine qua non of an efficient Data-Marketing, the condition for becoming truly “data-driven.”
Preparation / transformation solutions
In a modern data stack, data preparation and transformation take place:
Or the most frequent case, by the DWH reinforced by third-party tools.
Data preparation or transformation is the art of making data usable. This phase consists of answering a simple question: How to transform raw data into a data set that the business can use?
Data transformation is a multifaceted process involving different processing types:
Preparation and transformation tools are also used to maintain Data Quality. Because data “transformation” refers to operations of a different nature, it’s not surprising that the modern stack data hosts several tools belonging to this large multifaceted family.
Data management solutions (Data governance)
Data Warehouse Cloud’s accessibility and usability by a large number of users is, of course, a positive thing. The potential problem is the chaos these expanded accesses can cause in terms of Data Management.
To avoid falling into this trap, the company must absolutely:
Issues around Data Governance are more topical than ever
The first reason we have just mentioned is open access to and editing stack data solutions.
The second reason is the explosion of data volumes which requires the implementation of strict governance rules.
The third reason is the strengthening of the rules governing the use of personal data. The famous GDPR in particular…
Data governance is a sensitive subject that organizations generally deal with inadequately. It’s not the sexiest subject, but it clearly needs to be integrated into the roadmap.
Reverse ETL solutions
Let’s end with a whole new family of data solutions, much sexier and with a promising future: the Reverse ETL. We will soon publish a complete article on the Reverse ETL, its role, its place in the modern stack data. Let’s summarize here in a few words the issues and functionalities offered by these new kinds of solutions.
The challenge is straightforward: Data from various and varied data sources goes up in the Cloud Data Warehouse, but still has a lot of trouble going back down into the management activation tools: CRM, Marketing Automation, ticketing solution, e-commerce, etc.
Reverse ETL is the solution that organizes and facilitates the transfer of DWH data into the tools used by operational teams.
With a Reverse ETL, data from the Cloud Data Warehouse is no longer just used to feed the BI solution; it is also used to benefit the business teams in their daily tools.
This is why we speak of “Reverse ETL”.
Where the ETL (or ELT) moves the data up in the DWH, the Reverse ETL does the opposite. It pushes the data down from the DWH into the tools.
The Reverse ETL is the solution that connects the stack data and the marketing stack in the broad sense. It’s at the interface of the two.
An example? With a Reverse ETL, you can feed web activity data (stored in the DWH) into the CRM software to help sales reps improve their prospect/customer relationship. But it is one use case among many others… The multiple cases are likely to increase in the coming months and years.
Census, HighTouch, and of course Octolis are three examples of Reverse ETL.
Infrastructures, technologies, practices, and even Data marketing professions have evolved at an incredible speed in the broadest sense. We’ve seen the central place this modern data stack gives to the Data Warehouse Cloud. Everything revolves around this point of gravity.
Certain recent developments and this particularly include the fashion for on-the-shelf Customer Data Platforms, somewhat distort the understanding of what’s really going on.
But make no mistake, the arrow of the future is clearly pointing towards Data Warehouses (which no longer have anything to do with their On-Premise ancestors).
Towards the DWH Cloud side…and the whole ecosystem of tools revolving around it: EL(T), BI Cloud solutions…and of course Reverse ETL.