Definition and analysis of the Modern Data Stack

A Data Engineer cryopreserved in 2010 and woken up today would no longer recognize much of the modern data stack.

In just a few years, the way we collect, extract, transport, store, prepare, transform, redistribute, and activate data has changed completely.

The world has changed, and the opportunities to generate business value from data have never been greater.

What does the modern data stack look like?

We can start with a macro diagram.

[Diagram: the modern data stack]

The most striking development is the central place gradually taken by the Cloud Data Warehouse, which has become the pivotal system of the IT infrastructure. All the other notable transformations flow from it:

  • The exponential increase in computing power and the collapse of storage costs.
  • The replacement of traditional ETL tools by cloud EL(T) solutions.
  • The development of “self-service” Cloud BI solutions.
  • The recent emergence of Reverse ETLs, which send data from the Cloud Data Warehouse back down into business tools, finally putting the data stack at the service of the marketing stack.
[Diagram: the modern data stack]
Source: Snowplow Analytics.

Let’s get to the heart of the matter.

We’ll introduce you to the outlines of the modern data stack from two angles:

  • The historical angle: What led to the emergence of the modern data stack?
  • The geographic/topographic angle: We’ll review the different bricks that make up this modern data stack.

🌱 The changes behind the modern data stack

The modern data stack refers to the set of tools and databases used to manage the data that feeds business applications.

The data stack’s architecture has undergone profound transformations in recent years, marked by:

  • The rise of Cloud Data Warehouses (DWH), which are gradually becoming the main source of data. The DWH is destined to become the pivot of the data stack, and we’ll have the opportunity to talk about it at length in our blog posts. If you still believe in the off-the-shelf Customer Data Platform, abandon all hope.
  • The switch from ETL (Extract-Transform-Load) to EL(T): Extract – Load – (Transform). “ETL” refers both to a process and to a category of tools (ETL software). In a modern data stack, data is loaded into the master database before being transformed, via cloud EL(T) solutions that are lighter than traditional ETL tools.
  • The growing use of self-service analytics solutions (like Tableau) to do BI, generate reports, and build data visualizations.

The rise of Cloud Data Warehouses (DWH)

Data Warehouse technology is as old as the world, or almost; in any case, it’s not a new concept. And yet we’ve seen a dramatic transformation of the Data Warehousing landscape over the past decade.

Traditional DWH solutions are gradually giving way to cloud solutions: the Cloud Data Warehouse. We can date this evolution quite precisely: late 2012, when Amazon announced its Cloud DWH solution, Redshift. There’s a clear before and after, even if Redshift is losing ground today.

The main impetus behind the modern data stack came from Amazon, with Redshift. All the other solutions that followed on the market owe a debt to the American giant: Google BigQuery, Snowflake, and a few others. This development is linked to the difference between MPP (Massively Parallel Processing) or OLAP systems like Redshift and OLTP systems like PostgreSQL. But that discussion deserves a whole article, which we’ll probably write one day.

In short, Redshift can process SQL queries and perform joins on huge data volumes 10 to 10,000 times faster than OLTP databases.
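To make that claim concrete, here is the kind of analytical query an MPP warehouse shines at: a join and aggregation over very large tables, distributed across many nodes. The table and column names below are purely illustrative.

```sql
-- Illustrative only: the kind of query an MPP/OLAP warehouse like Redshift
-- parallelizes across its nodes, and that a row-oriented OLTP database
-- would struggle with at this scale.
SELECT
    c.country,
    DATE_TRUNC('month', o.ordered_at) AS order_month,
    COUNT(*)                          AS nb_orders,
    SUM(o.amount)                     AS revenue
FROM orders o          -- hundreds of millions of rows
JOIN customers c       -- tens of millions of rows
  ON c.customer_id = o.customer_id
GROUP BY 1, 2
ORDER BY 1, 2;
```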

Note that Redshift was not the first MPP database; the first ones appeared a decade earlier. Redshift is, however:

  • The first cloud-based MPP database solution.
  • The first MPP database solution that’s financially accessible to all companies. A small business with limited needs can use Redshift for a few hundred euros per month, whereas classic On-Premise solutions cost close to €100k in annual license fees.

In recent years, BigQuery and especially Snowflake have gained ground. These two solutions now have the best offers on the market, both in terms of price and computing power. Special mention for Snowflake, which offers a very interesting pricing model since storage is billed independently of compute.

But to render unto Caesar what is Caesar’s (Caesar here being Redshift), let’s remember a few figures:

  • Redshift was launched in 2012.
  • BigQuery, Google’s Cloud DWH solution, only adopted standard SQL in 2016.
  • Snowflake only became mature in 2017-2018.

What changes with the Cloud Data Warehouse?

The advent of Redshift, and of the other Cloud Data Warehouse solutions that followed, has brought improvements on several levels:

  • Speed. This is what we have just seen. A Cloud DWH can significantly reduce the processing time of SQL queries. The slowness of calculations was the main obstacle to the massive exploitation of data. Redshift broke many barriers.
  • Connectivity. The Cloud makes it much easier to connect data sources to the Data Warehouse. More generally, a Cloud DWH manages far more formats & data sources than a traditional data warehouse installed on company servers (On-Premise).
  • User access. In a classic, “heavy” Data Warehouse installed on the company’s servers, the number of users is deliberately limited to reduce the number of queries and save server resources. This technological choice therefore has repercussions at the organizational level:
    • DWH On-Premise: Managed by a central team. Restricted/indirect access for end-users.
    • Cloud DWH: Accessible and usable by all target users. Virtual servers allow the launching of simultaneous SQL queries on the same database.
  • Flexibility & Scalability. Cloud Data Warehouse solutions are much more affordable than traditional On-Premise solutions (such as Informatica or Oracle). They are also and above all far more flexible, with pricing models based on the volume of data stored and/or the computing resources consumed. In this sense, the advent of Cloud Data Warehouses has made it possible to democratize access to this type of solution. While classic DWHs were cumbersome solutions accessible only to large companies, Cloud DWHs are lightweight, flexible solutions accessible to a very small business/startup.

The transition from ETL solutions to EL(T)

Extract-Transform-Load = ETL, while Extract-Load-(Transform) = EL(T).

Listing these acronyms makes the difference quite easy to understand:

  • When using an ETL process (and the ETL tools that allow this process to operate), the data is transformed before loading into the target database: the Data Warehouse.
  • When you use an EL(T) process, you start by loading all the structured or semi-structured data into the master database (DWH) before considering the transformations.

What are the underlying issues of such an inversion? It’s quite simple.

Transformations consist of:

  • adapting the format of data to the target database,
  • cleaning,
  • deduplicating,
  • applying some processing to the source data to fit the Data Warehouse’s design and avoid cluttering it too much.

That is the challenge.

Transforming before loading discards part of the data and therefore avoids overloading the master database.

That is why all traditional Data Warehouse solutions worked hand in hand with heavy ETL tools: with limited storage capacity, it was vital to sort the data before loading it into the DWH.

With Cloud Data Warehouses, storage has become a commodity and computing power has increased dramatically.

Result: No need to transform before loading.

The On-Premise DWH / On-Premise ETL combination is gradually giving way to the modern Cloud DWH / Cloud EL(T) combination.

Loading data into the Data Warehouse before any transformation means you don’t have to settle strategic and business questions at the very moment you extract and integrate the data into the DWH.

The cost of managing pipelines drops considerably; we can afford to load everything into the DWH without overthinking it, and thus we don’t deprive ourselves of future use cases for the data.
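To illustrate the idea, here is a minimal sketch of the “T” happening after the “E” and the “L”: the raw data has already been loaded as-is into the warehouse, and a simple SQL statement produces a clean table from it. All schema, table, and column names are hypothetical.

```sql
-- Minimal EL(T) sketch: the EL(T) tool has already copied source data, untouched,
-- into a "raw" schema. The transformation happens afterwards, inside the
-- warehouse, with plain SQL. All names below are hypothetical.
CREATE TABLE analytics.orders_clean AS
SELECT
    order_id,
    customer_id,
    CAST(ordered_at AS TIMESTAMP)  AS ordered_at,
    CAST(amount AS DECIMAL(12, 2)) AS amount,
    LOWER(TRIM(email))             AS email
FROM raw.shopify_orders
WHERE order_id IS NOT NULL;  -- light cleaning only; nothing was filtered before loading
```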

The trend toward self-service Analytics

We have talked about the Cloud Data Warehouse, which is becoming the backbone of the modern data stack. Upstream, EL(T) tools connect the many source systems to the data warehouse. Downstream, the data stored in the Cloud Data Warehouse is used for BI, data analysis, dashboarding, and reporting.

The advent of the Cloud DWH has helped “cloudify” not only integration solutions (ETL/ELT) but also BI solutions.

Today, dozens of cloud-based BI vendors offer solutions that are affordable and designed for business users. These easy-to-use tools provide native connectors to the main Cloud Data Warehouses on the market.

Power BI, Looker, and Tableau are the reference Cloud BI solutions:

Source: Medium.
  • A solution like Tableau lets you connect all your data sources in a few clicks and create tailor-made reports based on simplified data models.
  • Unlike the reporting modules offered by business applications or web analytics solutions (Google Analytics, etc.), a BI solution allows overall performance management based on omnichannel attribution models.
  • A tool like Looker, connected to the Data Warehouse, disrupts data analysis. BI is one of the main use cases of a Data Warehouse, so with the advent of the Cloud DWH, the development of SaaS BI solutions was inevitable. And it happened.

Cloud Data Warehouse, EL(T), and “self-service” analytics solutions: closely linked, these three families of tools are the cornerstones of a modern data stack.

🔎 Zoom in on the modern data stack’s bricks

We will now review the main bricks that make up the modern data stack in more detail, starting from the diagram presented in the introduction.

Typical diagram of a modern data stack

We appreciate this modern data stack diagram proposed by a16z.

[Diagram: mapping of the modern data stack]
Source: a16z.

From left to right, we find:

  • The data sources – all the systems, databases, and tools providing data, whether internal or external (enrichment solutions, etc.). With the growth of digital, we’re seeing an explosion not only of data volumes but also of data sources, and therefore of formats and data structures. This effervescence represents both enormous potential and a great challenge.
  • Data ingestion and/or transformation solutions. Here we find all the technologies for carrying out the Extract – Load (and possibly Transform) process, i.e. EL(T): the solutions that route data, with or without transformation, from the sources into the master database(s).
  • The master database(s) for storing data. There are two families of solutions here: Cloud Data Warehouses and Data Lakes. The DWH stores structured or semi-structured data, while the Data Lake can store any data type.

The Data Lake is a bathtub into which all the data is poured in bulk, in its raw state, without any transformation or processing. It’s a Data Scientists’ tool, used for very advanced data use cases such as Machine Learning.

The Data Warehouse remains a “warehouse” that organizes data in a structured way, even if its capacity to integrate semi-structured data is clearly increasing. The development of these capabilities, moreover, makes the DWH increasingly pivotal, unlike the “pure” Data Lake, which increasingly plays a secondary role. We shall return to this.

  • Data preparation and processing tools. We have seen that the Cloud Data Warehouse tends to become the reference tool for transforming data via SQL. Many solutions can support the DWH in this data transformation work, whether for BI or business uses. Preparation and transformation tools make up the widest and most heterogeneous family of data solutions.
  • BI tools and activation tools, which are the destinations of Cloud Data Warehouse data. The DWH, originally used for BI, is increasingly used to feed business applications in near real-time. This is where Reverse ETLs like Octolis come in. We will come back to the role of Reverse ETLs in the modern data stack in a few moments.

Let’s now review each of these bricks of the modern data stack.

The Cloud Data Warehouse

The Cloud DWH is the foundation of the modern data stack, the pivotal solution around which all other tools revolve.

It stores structured and semi-structured enterprise data, but it is not just a database. It’s also a data laboratory: a place where data is prepared and transformed using one main tool, SQL, even if Python is increasingly used (but that is another subject).

Caption: Medium, May 2020. Redshift plateaus, BigQuery climbs, Snowflake explodes.

The Cloud Data Warehouse is sometimes built downstream of a Data Lake, which serves as a catch-all: a tub of data stored in its raw state.

You can perfectly well use both a Data Lake and a Cloud Data Warehouse; you don’t necessarily have to choose between the two technologies. To be honest, they fulfill different roles and can be complementary… even if it’s a safe bet that the two technologies will eventually merge.

Also note that some players, such as Snowflake, offer integrated Cloud Data Warehouse and Data Lake solutions. There’s a possible article title: Are the Data Lake and the Cloud Data Warehouse destined to merge? Though it’s not the subject of this article, it’s a debate that stirs many expert minds!

In any case, the entire modern data stack is organized around the Cloud Data Warehouse, whether connected to or merged with the Data Lake.

EL(T) solutions

As seen in the first section of this article, EL(T) solutions are prevailing over traditional ETL tools. This shift reflects a deep transformation of the data integration process and of the way data pipelines are built.

Source: AWS. ETL Vs ELT.

A question you may have asked yourself: why put the “T” in parentheses?

For a simple reason: the tool used to build the data pipeline between the source systems and the Cloud Data Warehouse no longer needs to transform the data.

Cloud EL(T) solutions (Portable, Fivetran, Stitch Data, etc.) are primarily used to organize the plumbing; that is their main role. It is now the Cloud Data Warehouse solutions themselves, and third-party tools, that handle the transformation phases.

A Cloud DWH solution lets you transform data tables with a few lines of SQL.

We’ll have to talk about this evolution again: Most Data Preparation and Data Transformation operations can now be performed in the Cloud Data Warehouse itself, using SQL.
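What do “a few lines of SQL” look like in practice? Here is a small sketch: deduplicating a contacts table directly in the warehouse with a window function. The syntax should work, with at most minor adjustments, on Redshift, BigQuery, or Snowflake; the table and column names are hypothetical.

```sql
-- A few lines of SQL run inside the warehouse: keep only the most recent
-- record per email address. Table and column names are hypothetical.
CREATE TABLE analytics.contacts_dedup AS
SELECT *
FROM (
    SELECT
        c.*,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(email)
            ORDER BY updated_at DESC
        ) AS rn
    FROM raw.contacts c
) t
WHERE t.rn = 1;
```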

Data Engineers (who work in Java, Python, Scala, and the like) are increasingly leaving data transformations to Data Analysts and business teams working in SQL. This raises a real question: what role will the Data Engineer play tomorrow? Their role in organizing and maintaining the modern data stack is no longer guaranteed.

The goal of a modern data stack is to empower the end-users of the data. It’s increasingly at the service of business teams and the marketing stack they use.

The modern data stack breaks down barriers between data and marketing; it is the sine qua non of efficient Data-Marketing, the condition for becoming truly “data-driven.”

Preparation/transformation solutions

In a modern data stack, data preparation and transformation take place:

  • Either in the Cloud Data Warehouse itself, as we have seen.
  • Or upstream of the Cloud Data Warehouse, via ETL tools.
  • Or, in the most frequent case, in the DWH, reinforced by third-party tools.

Data preparation, or transformation, is the art of making data usable. This phase answers a simple question: how do you turn raw data into a dataset the business can use?

An example of a raw data preparation solution: Dataform.

Data transformation is a multifaceted process involving different processing types:

  • data cleaning,
  • deduplication,
  • setting up tailor-made data update rules,
  • data enrichment,
  • creation of aggregates,
  • creation of dynamic segments, etc.

Preparation and transformation tools are also used to maintain Data Quality. Because data “transformation” covers operations of very different natures, it’s not surprising that the modern data stack hosts several tools belonging to this large, multifaceted family.
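As an illustration of the kinds of transformations listed above, here is a sketch of a customer-level aggregate and a simple dynamic segment computed directly in the warehouse. The DATEADD syntax follows Snowflake/Redshift conventions, and all table and column names are hypothetical.

```sql
-- Sketch: customer-level aggregates plus a simple lifecycle segment,
-- computed directly in the warehouse. Names are hypothetical; DATEADD is
-- Snowflake/Redshift syntax (BigQuery would use DATE_SUB).
CREATE TABLE analytics.customer_aggregates AS
SELECT
    customer_id,
    COUNT(*)        AS nb_orders,
    SUM(amount)     AS lifetime_value,
    MAX(ordered_at) AS last_order_at,
    CASE
        WHEN MAX(ordered_at) >= DATEADD(day, -90, CURRENT_DATE)  THEN 'active'
        WHEN MAX(ordered_at) >= DATEADD(day, -365, CURRENT_DATE) THEN 'at_risk'
        ELSE 'churned'
    END AS lifecycle_segment
FROM analytics.orders_clean
GROUP BY customer_id;
```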

Data management solutions (Data governance)

The Cloud Data Warehouse’s accessibility and usability by a large number of users is, of course, a good thing. The potential problem is the chaos this expanded access can cause in terms of Data Management.

To avoid falling into this trap, the company must absolutely:

  • Integrate one or more Data Management tools into the data stack
  • Document and implement data governance rules.

Issues around Data Governance are more topical than ever, for three reasons.

The first reason, which we have just mentioned, is the broad access to the data stack’s solutions, both for reading and editing data.

The second reason is the explosion of data volumes which requires the implementation of strict governance rules.

The third reason is the strengthening of the rules governing the use of personal data, the famous GDPR in particular…

Data governance is a sensitive subject that organizations generally handle poorly. It’s not the sexiest topic, but it clearly needs to be part of the roadmap.

Reverse ETL solutions

Let’s end with a whole new family of data solutions, much sexier and with a promising future: the Reverse ETL. We will soon publish a complete article on the Reverse ETL, its role, and its place in the modern data stack. Let’s summarize here in a few words the issues and functionalities offered by these new kinds of solutions.

The challenge is straightforward: data from all kinds of sources flows up into the Cloud Data Warehouse, but it still has a lot of trouble flowing back down into the management and activation tools: CRM, Marketing Automation, ticketing solutions, e-commerce, etc.

Reverse ETL is the solution that organizes and facilitates the transfer of DWH data into the tools used by operational teams.

With a Reverse ETL, data from the Cloud Data Warehouse is no longer just used to feed the BI solution; it also benefits the business teams in their daily tools.

This is why we speak of “Reverse ETL”.

Where the ETL (or ELT) moves data up into the DWH, the Reverse ETL does the opposite: it pushes data down from the DWH into the business tools.

The Reverse ETL is the solution that connects the data stack and the marketing stack in a broad sense. It’s at the interface of the two.

An example? With a Reverse ETL, you can feed web activity data (stored in the DWH) into the CRM software to help sales reps improve their prospect and customer relationships. But that is just one use case among many others… and the use cases are likely to multiply in the coming months and years.
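Conceptually, a Reverse ETL sync boils down to a query run against the warehouse, whose result is then pushed to the destination tool through its API. The sketch below shows what such a query might look like; it is not the configuration format of Octolis or any other vendor, and all table and column names are hypothetical.

```sql
-- Conceptual sketch of the query a Reverse ETL would run before syncing the
-- result to a CRM through its API. Not any vendor's actual configuration;
-- all names are hypothetical.
SELECT
    c.email,                    -- matching key on the CRM side
    a.lifetime_value,
    a.lifecycle_segment,
    w.last_visited_page,
    w.sessions_last_30_days
FROM analytics.customer_aggregates a
JOIN analytics.contacts_dedup c USING (customer_id)
LEFT JOIN analytics.web_activity w USING (customer_id);
-- The Reverse ETL maps each column to a CRM field and syncs rows incrementally.
```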

Census, Hightouch, and of course Octolis are three examples of Reverse ETL solutions.

🏁 Conclusion

Infrastructures, technologies, practices, and even data and marketing professions in the broadest sense have evolved at incredible speed. We’ve seen the central place the modern data stack gives to the Cloud Data Warehouse. Everything revolves around this center of gravity.

Certain recent developments, particularly the fashion for off-the-shelf Customer Data Platforms, somewhat distort the understanding of what’s really going on.

But make no mistake, the arrow of the future is clearly pointing toward Data Warehouses (which no longer have anything to do with their On-Premise ancestors).

Toward the Cloud DWH… and the whole ecosystem of tools revolving around it: EL(T), Cloud BI solutions… and of course Reverse ETL.