Back to the future of the Datawarehouse Episode 1/3

Antoine Castex
Beauty Tomorrow
Published in
4 min readMay 31, 2021

--

Traveling through time in a modified DeLorean car I have been accidentally sent back to a few years ago..

When Datawarehouse was one of the top famous discussion around the table !(before hashtag, like #data)

I was already part of deploying a dw in another company, but now things have changed !Wait, what ?

I come from an ETL world where the most important part was the “T”

ETL describes the end-to-end process by which a company takes its full breadth of data—structured and unstructured and managed by any number of teams from anywhere in the world—and gets it to a state where it’s actually useful for business purposes.

Now things have changed and we are more seeing ELT, because the data warehouse can also transform the data himself by using SQL, the usage of specific tool to transform seems to be useless in most of the cases.

ELT offers a modern alternative to ETL in which analysts load data into the warehouse before transforming it. This supports a more flexible and agile way of working. Since the transformation happens in the warehouse, it can be defined using SQL. This allows analysts to contribute to (or entirely own) the transformation logic.

So why is it beneficial to transform your data after loading it into the data warehouse, rather than before?

  • Agility: all the data is stored in the warehouse and readily available to use. You don’t have to think about how to structure the data before you load it into the warehouse. The data modelling to transform the raw data can be set up as and when it is needed.
  • Simplicity: Transformations in the data warehouse are generally written in SQL, a language that the entire data team (data engineers, data scientists, data analyst) understands. This enables the entire team to contribute to the transformation logic.
  • Self service analytics: If all of your raw data is within the warehouse, you can use BI tools to drill down from aggregated summary statistics to the raw data underlying them.
  • Fixing bugs: If you find errors in your transformation pipeline, you can fix the bug and re-run just the transformations to fix your data. With an ETL approach, the entire extract-load-transform process may need to be re-run.

Fundamentally the main advantage of using an ELT approach is that it makes life easier for the data team. All of the data is easily accessible in one place, and mistakes in your transformation logic can be fixed easily because you can always get access to the raw underlying data when needed.

The technology that was used before to transform before the loading was Dataflow on Google Cloud Platform, the best #cloud service provider, now the transformation is done inside BigQuery himself, the famous big data product of Google Cloud.

Because the product matures,

Because it has a lot of functionality and features for that,

And of course, because it’s very powerful, efficient and cheap !

To improve data users’ experience at L’Oreal involves addressing the following technological challenges:

  • Security & Access Management: Configured for sensitive data with role-based access control (RBAC) for data access.
  • Scalability on Demand: Timely access to capacity is very important to satisfy workload constraints. Hence, the analytics data infrastructure needs to scale up and down as demand grows and shrinks.
  • Standardisation: Analyst and data scientist productivity would be improved if data types and access patterns were provided with one and only format/language , here it’s SQL of course!
  • Access from BI tools: Since business intelligence is crucial to communicate insights, the analytics infrastructure should plug and play well with the tools that we use : Power BI or Looker.
  • Showback: Data users and manager need to have a clear view of their resource consumption.
  • Sustainable solution: The Data platform has to leverage as much as possible the best technologies on the market which are the most efficient, meaning consume only when you really need !
  • Redundancy: The infrastructure should be fully resilient in any case of failure coming from the cloud service provider.

See you in the next episode #2 Secure my Data to discuss about the challenges !

--

--