Directly Query Databricks’ Delta Lake From Azure Synapse
Azure Databricks and Azure Synapse Analytics are two flagship big data solutions in Azure, and many customers use both. Databricks is commonly used as a scalable engine for complex data transformation and machine learning tasks on Spark and Delta Lake technologies, while Synapse is loved by users who are familiar with SQL and native Microsoft technologies, with great support for high-concurrency, low-latency queries. When the two are used together, output from a Databricks pipeline is sent to Synapse for downstream analytics use cases. Most users store Databricks' output data in the high-performance Delta Lake format, but so far Synapse has not officially supported that format. Customers using both technologies have had to run an extra step in Databricks to export data to a format that Synapse can understand. In this post, I'll demonstrate a simple method that lets Synapse directly query data from a Delta Lake using either Synapse SQL on-demand or a Synapse Spark pool. You can adapt this approach to let other technologies outside Databricks, such as Azure ML, do the same.
First, let's quickly review what Delta Lake is. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Simply put, Delta Lake is built on top of the Apache Parquet format together with a change-log mechanism (the transaction log).
Although Synapse can read the raw Parquet files in a Delta Lake, the data may contain historical updates, so simply reading all of the Parquet files will lead to wrong results. To read the data correctly, Synapse or any technology outside Databricks would need to implement the Delta Lake specification, which is not a small effort.
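To make this concrete, here is a hypothetical layout of a Delta table folder after a couple of updates (the file names are made up for illustration). Only the transaction log in _delta_log knows which Parquet files belong to the current version of the table:
/path_to_your_delta_table/_delta_log/00000000000000000000.json
/path_to_your_delta_table/_delta_log/00000000000000000001.json
/path_to_your_delta_table/part-00000-1111.snappy.parquet   <- superseded by an update but still on disk
/path_to_your_delta_table/part-00000-2222.snappy.parquet   <- current version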
Fortunately, the Databricks engine has a useful feature that simplifies the integration. You can have Azure Databricks output the current list of Parquet files that should be read from the Delta Lake table to get the correct result. Once you have this list, you just need to configure your query engine to parse the data correctly. Here is the end-to-end process with examples:
Step 1: Configuring Azure Databricks to output the current list of Parquet files (the manifest file)
- Enable the feature in Azure Databricks
%sql SET spark.databricks.delta.symlinkFormatManifest.fileSystemCheck.enabled = false
- Manually generate or configure Databricks to automatically update the Manifest file
%sql
GENERATE symlink_format_manifest FOR TABLE delta.`/path_to_your_delta_table`
-- You can generate the manifest every time like the above, or you can enable Databricks to automatically update it for you:
ALTER TABLE delta.`/path_to_your_delta_table` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled = true)
- Note: in Azure Databricks, I only tested with the ABFS driver (the driver for ADLS Gen 2) because ADLS Gen 2 is what I used for Synapse. You need to do the same so that the file paths in the manifest file are in the format abfss://filesystem@storageaccount.dfs.core.windows.net/path_to_delta, which is the same format Synapse uses for ADLS Gen 2.
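To sanity-check the manifest from Databricks, you can read it back as plain text. This is a minimal sketch; the expected output shown in the comment is just an example of what the file should contain:
%sql
SELECT * FROM text.`abfss://filesystem@storageaccount.dfs.core.windows.net/path_to_delta/_symlink_format_manifest/manifest`
-- Expected output: one abfss:// path per Parquet file in the current table version, e.g.
-- abfss://filesystem@storageaccount.dfs.core.windows.net/path_to_delta/part-00000-1111.snappy.parquet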
Step 2 — Option 1: Reading Delta table with Synapse Spark
- Create a table that points to the Delta table's Parquet files
(the columns here are from my example; feel free to modify)
CREATE TABLE `delta_raw_tbl` (`Feb` BIGINT, `Jan` BIGINT, `Mar` BIGINT, `account` STRING) USING parquet OPTIONS (path 'abfss://file_system@adlsgen2account.dfs.core.windows.net/path_to_delta_table')
- Create a table that points to the manifest file
CREATE TABLE `delta_manifest_table` (`file_name` STRING) USING csv OPTIONS (path 'abfss://file_system@adlsgen2account.dfs.core.windows.net/path_to_delta_table/_symlink_format_manifest/manifest')
- Create the final table that represents only the files you actually need to read. The key here is Spark's built-in function input_file_name(), which gives you the full file name behind every row in your table.
%%sql
CREATE TABLE default.net_view_test AS
(SELECT *, input_file_name() AS source_file FROM default.delta_raw_tbl WHERE input_file_name() IN (SELECT file_name FROM default.delta_manifest_table))
Then, when you select from the final table, you'll see the same result as you see in Databricks.
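For example, a quick check using the table name from this walkthrough:
%%sql
SELECT * FROM default.net_view_test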
Note: you need to refresh the tables that point to the Delta location so that, if the underlying data changes, the table definitions are updated.
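In Synapse Spark, that refresh looks like this (using the table names from above; do the same for delta_manifest_table):
%%sql
REFRESH TABLE default.delta_raw_tbl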
Step 2 — Option 2: Reading Delta table with SQL on-demand
With SQL on-demand, you need a bit more work to get this done, although the approach is similar to the Spark pool one. The challenge is that in SQL on-demand, data in ADLS Gen 2 is addressed in the form https://adlsgen2account.dfs.core.windows.net/file_system/path_to_folder, as opposed to Spark's format abfss://file_system@adlsgen2account.dfs.core.windows.net/path_to_folder. So a little more work needs to be done to match a file in Spark's format with the same file in Synapse's format.
Similar to Spark, Synapse's SQL on-demand has a built-in virtual function called filepath() that returns the full URL of the file behind each row. I don't see this in the dedicated SQL pool yet.
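As a quick illustration of filepath(), assuming the same folder as in the query below, this returns each row together with the full URL of the Parquet file it came from:
select top 10 rows.filepath() as full_path, *
from openrowset(
    bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/*.parquet',
    format = 'parquet') as rows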
Here is a query that shows the same result as the Databricks Delta table. It's a little complex because of the path matching mentioned above.
select * from
    (select *,
        SUBSTRING(rows.filepath(), LEN(rows.filepath()) - CHARINDEX('/', REVERSE(rows.filepath())) + 2, LEN(rows.filepath())) filename
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/*.parquet',
        format = 'parquet') as rows) as deltatbl
where filename in
    (select SUBSTRING(C1, LEN(C1) - CHARINDEX('/', REVERSE(C1)) + 2, LEN(C1))
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/_symlink_format_manifest/manifest',
        format = 'csv',
        parser_version = '2.0',
        firstrow = 1) as filelist)
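If you want to reuse this from client tools, one option is to wrap the query in a view inside a user database in the serverless pool. This is a sketch using the example columns from earlier; the view name dbo.delta_current_vw is a placeholder I made up:
CREATE VIEW dbo.delta_current_vw AS
select Feb, Jan, Mar, account from
    (select *,
        SUBSTRING(rows.filepath(), LEN(rows.filepath()) - CHARINDEX('/', REVERSE(rows.filepath())) + 2, LEN(rows.filepath())) filename
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/*.parquet',
        format = 'parquet') as rows) as deltatbl
where filename in
    (select SUBSTRING(C1, LEN(C1) - CHARINDEX('/', REVERSE(C1)) + 2, LEN(C1))
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/_symlink_format_manifest/manifest',
        format = 'csv', parser_version = '2.0', firstrow = 1) as filelist)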
For a partitioned Delta table
In case the Delta table is partitioned, for example by the account column, the query looks like this:
select Feb, Jan, Mar, Account from
    (select *,
        rows.filepath(1) Account,
        SUBSTRING(rows.filepath(), LEN(rows.filepath()) - CHARINDEX('/', REVERSE(rows.filepath())) + 2, LEN(rows.filepath())) filename
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/account=*/*.parquet',
        format = 'parquet') as rows) as deltatbl
where filename in
    (select SUBSTRING(C1, LEN(C1) - CHARINDEX('/', REVERSE(C1)) + 2, LEN(C1))
     from openrowset(
        bulk 'https://adlsgen2account.dfs.core.windows.net/path_to_delta_table/_symlink_format_manifest/*/manifest',
        format = 'csv',
        parser_version = '2.0',
        firstrow = 1) as filelist)
A manifest file is partitioned in the same Hive-partitioning-style directory structure as the original Delta table. This means that each partition is updated atomically, and Synapse will see a consistent view of each partition but not a consistent view across partitions. Furthermore, since all manifests of all partitions cannot be updated together, concurrent attempts to generate manifests can lead to different partitions having manifests of different versions.
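For illustration, with account as the partition column the manifest folder might look like this (the account values are made up):
/path_to_delta_table/_symlink_format_manifest/account=contoso/manifest
/path_to_delta_table/_symlink_format_manifest/account=fabrikam/manifest
The */manifest wildcard in the query above picks up the manifest of every partition.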
So that’s it. You can now query Databricks’ Delta table directly from Synapse Spark/SQL on-demand without any extra copy & transformation step!
Update on May 26, 2021: the Azure Synapse team has released preview support for this scenario; see Query Delta Lake format using serverless SQL pool (preview) — Azure Synapse Analytics | Microsoft Docs.
Credit: This post is based on the original approach and reference materials from Phil Coachman, my colleague at Microsoft.
Disclaimer: This is personal work and is not an official capability from either Synapse or Databricks.