
We're all largely familiar with the common modern data warehouse pattern in the cloud, which essentially delivers a platform comprising a data lake (based on a cloud storage account like Azure Data Lake Storage Gen2) AND a data warehouse compute engine such as Synapse Dedicated Pools or Redshift on AWS. There are nuances around usage and services, but they largely follow this kind of conceptual architecture –

In the architecture above, the key themes are as follows –

- Ingestion of data into a cloud storage layer, specifically into a "raw" zone of the data lake. The data is untyped, untransformed and has had no cleaning activities performed on it. Batch data typically arrives as csv files.
- A processing engine then handles cleaning and transforming the data through the zones of the lake, going from raw -> enriched -> curated (others may know this pattern as bronze/silver/gold). Enriched is where data is cleaned, deduped etc, whereas curated is where we create our summary outputs, including facts and dimensions, all in the data lake.
- The curated zone is then pushed into a cloud data warehouse such as Synapse Dedicated SQL Pools, which acts as a serving layer for BI tools and analysts.

This pattern is very common, and is still probably the one I see most in the field – a rough sketch of the raw -> enriched -> curated flow is shown below.
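To make the processing step a little more concrete, here is a minimal PySpark sketch of that flow. The storage paths, column names and aggregation are purely illustrative assumptions – the actual cleaning and modelling rules depend entirely on your own data and engine of choice.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical lake paths -- adjust to your own storage account / container layout.
RAW_PATH = "abfss://lake@mystorageaccount.dfs.core.windows.net/raw/sales/"
ENRICHED_PATH = "abfss://lake@mystorageaccount.dfs.core.windows.net/enriched/sales/"
CURATED_PATH = "abfss://lake@mystorageaccount.dfs.core.windows.net/curated/sales_by_day/"

spark = SparkSession.builder.appName("mdw-zones").getOrCreate()

# Raw zone: untyped, untransformed CSV exactly as it landed.
raw_df = spark.read.option("header", "true").csv(RAW_PATH)

# Enriched zone: apply types, deduplicate, do basic cleaning.
enriched_df = (
    raw_df
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("sale_date", F.to_date("sale_date"))
    .dropDuplicates(["order_id"])
)
enriched_df.write.mode("overwrite").parquet(ENRICHED_PATH)

# Curated zone: summary outputs ready for the serving layer.
curated_df = enriched_df.groupBy("sale_date").agg(F.sum("amount").alias("total_sales"))
curated_df.write.mode("overwrite").parquet(CURATED_PATH)
```

The point is simply that the engine reads from one zone, applies progressively more structure, and writes the result into the next – with the curated output then loaded into the warehouse for serving.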
However, whilst this is the most common pattern, it's not the unicorn approach for everyone. There are potential issues with this approach, especially around how we handle data and files in the data lake itself. These perceived challenges led to the "Lakehouse" architecture pattern, which emerged officially on the back of an excellent white paper from Databricks, the founders of the lakehouse pattern in its current guise. The whitepaper highlighted the following main challenges with the current approach –

- It's hard/complicated to mix appends, updates and deletes in the data lake.
- It can lead to challenges around data governance in the lake itself, leading to data swamps rather than data lakes.
- It has multiple storage layers – different zones and file types in the lake, PLUS inside the data warehouse itself, PLUS often in the BI tool as well.

In a nutshell (and this is a bit of an over-simplification, but it will do for this article), the core theme of the Lakehouse is two things –

- There is no separate data warehouse and data lake.
- Delta Lake – this is the secret sauce of the Lakehouse pattern and looks to resolve the challenges highlighted in the Databricks whitepaper.

The conceptual architecture for the Lakehouse is shown below –

You can see that the main difference in terms of approach is that there is no longer a separate compute engine for the data warehouse; instead we serve our outputs from the lake itself using the Delta Lake format.

So what is Delta Lake, and why is it so special? Delta Lake (the secret sauce referred to previously) is an open source project that brings data warehouse-like functionality directly ON the data lake, and is summarised well in the image below –

Delta Lake provides ACID (atomicity, consistency, isolation and durability) transactions to the data lake, allowing you to run secure, complete transactions on the data lake in the same way you would on a database. No longer do you have to manage files and folders or worry about whether data in the lake has been left in a consistent state – a change either fully commits or it doesn't happen at all, as in the sketch below.
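To illustrate, here is a minimal sketch of a transactional upsert against a Delta table using the delta-spark Python API. The path, columns and sample rows are hypothetical, and it assumes a Delta-enabled Spark session (which you get out of the box on Databricks).

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # On Databricks, Delta support is already configured.

# Hypothetical table location in the curated zone of the lake.
DELTA_PATH = "abfss://lake@mystorageaccount.dfs.core.windows.net/curated/customers/"

# Initial batch write creates the Delta table (Parquet data files plus a transaction log).
new_customers = spark.createDataFrame(
    [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")],
    ["customer_id", "name", "email"],
)
new_customers.write.format("delta").mode("overwrite").save(DELTA_PATH)

# Later, updates and inserts arrive together; MERGE applies them in one ACID transaction,
# so readers never see a half-applied change.
changes = spark.createDataFrame(
    [(2, "Bob", "bob.new@example.com"), (3, "Carol", "carol@example.com")],
    ["customer_id", "name", "email"],
)

target = DeltaTable.forPath(spark, DELTA_PATH)
(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

That single atomic MERGE is exactly the kind of mixed append/update workload that was so painful to do reliably with plain files in the lake.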

We’re all largely familiar with the common modern data warehouse pattern in the cloud, which essentially delivers a platform comprising a data lake (based on a cloud storage account like Azure Data Lake Storage Gen2) AND a data warehouse compute engine such as Synapse Dedicated Pools or Redshift on AWS.

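And a Time Travel query is just a read with a version or timestamp option – again, the path, version number and timestamp below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

DELTA_PATH = "abfss://lake@mystorageaccount.dfs.core.windows.net/curated/customers/"  # hypothetical

# Query the table as it was at a specific version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(DELTA_PATH)

# ...or as it was at a specific point in time.
as_of_new_year = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load(DELTA_PATH)
)
```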