Databricks Ecosystem Deep Dive - Open Source Foundations and Governance
“Databricks looks like an all-in-one platform for data ingestion, storage, analytics, ML, and governance – but under the hood it’s powered by well-known open source projects.”
This post looks at Databricks not just as a “managed Spark service”, but as a unified data & AI platform built on top of open source.
We’ll also cover data governance (especially Unity Catalog) and how Databricks works together with dbt rather than replacing it.
📚 Big Picture: Layered View of the Databricks Platform
It’s easiest to understand Databricks by looking at it layer by layer.
User Layer (User & Tools)
├─ Notebooks, Dashboards, SQL Editor
├─ Databricks SQL / Databricks Workflows
├─ Partner tools: dbt, Power BI, Tableau, Looker, Fivetran, Airflow, etc.
└─ REST / SDK / JDBC / ODBC
Engine & Runtime Layer
├─ Apache Spark (optimized in Databricks Runtime)
├─ Photon (C++ vectorized query engine, commercial)
├─ ML Runtime (Spark + curated ML libraries)
└─ Databricks Model Serving / Vector Search
Storage & Metadata Layer
├─ Delta Lake (open source table format)
├─ Unity Catalog (catalog, permissions, lineage – now open source)
└─ Cloud storage (S3, ADLS, GCS, etc.)
Orchestration & Quality Layer
├─ Databricks Workflows (job / pipeline scheduling)
├─ Delta Live Tables / Lakeflow (declarative pipelines, data quality)
└─ Integration with dbt, Airflow, Dagster and others
ML & MLOps Layer
├─ MLflow (experiments, models, registry – open source)
├─ Feature Store, AutoML
└─ Model Serving, Monitoring
Now let’s see which open source projects power each layer, and what Databricks adds on top.
🔧 The Big Three Open Source Pillars: Spark, Delta Lake, MLflow
1. Apache Spark – The Distributed Compute Engine
- The original core of Databricks and still the main engine
- Handles batch, streaming, SQL, ML, and graph workloads in a single engine
- Databricks optimizes Spark in Databricks Runtime (DBR) with performance patches, stability fixes, and connectors
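from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a `spark` session already exists; getOrCreate() simply returns it.
spark = (
    SparkSession.builder
    .appName("DatabricksExample")
    .getOrCreate()
)

# Read a Delta table and total the sales per country.
df = spark.read.format("delta").load("/mnt/datalake/sales")
agg = (
    df.groupBy("country")
    .agg(F.sum("amount").alias("total_sales"))
)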
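To make the "distributed" part concrete, here is a minimal pure-Python sketch of how a grouped sum can be computed in two stages: a partial sum inside each partition, then a merge of the partial results. It illustrates the idea only and is not Spark's implementation; the sample rows and function names are made up for this example.

```python
from collections import defaultdict

def partial_sum(partition):
    """Sum `amount` per country within a single partition (the local step)."""
    totals = defaultdict(float)
    for row in partition:
        totals[row["country"]] += row["amount"]
    return dict(totals)

def merge_sums(partials):
    """Combine per-partition totals into final totals (the step after the shuffle)."""
    merged = defaultdict(float)
    for partial in partials:
        for country, subtotal in partial.items():
            merged[country] += subtotal
    return dict(merged)

# Hypothetical sample data, split into two partitions as a cluster might hold it.
partitions = [
    [{"country": "KR", "amount": 10.0}, {"country": "US", "amount": 5.0}],
    [{"country": "KR", "amount": 2.5}, {"country": "JP", "amount": 7.0}],
]

total_sales = merge_sums(partial_sum(p) for p in partitions)
```

Because only the small per-partition totals need to be exchanged between workers, this pattern keeps network traffic far below what shipping every row would cost.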
Open Source Spark vs Databricks Runtime
- Open source Spark: You run the vanilla engine, manage clusters, configs, tuning, and connectors yourself.
- Databricks Runtime adds:
- Autoscaling clusters and automatically tuned optimizer settings
- Photon-powered high‑performance SQL
- Native integration with Unity Catalog for permissions
2. Delta Lake – Lakehouse Table Format
- Open source table format originated by Databricks
- Adds a transaction log and metadata layer on top of Parquet files
- Provides ACID transactions, time travel, schema evolution, and Change Data Feed for CDC
-- Databricks SQL example
CREATE TABLE sales_delta
USING DELTA
LOCATION '/mnt/datalake/sales_delta' AS
SELECT * FROM raw_sales;
-- Time Travel
SELECT * FROM sales_delta VERSION AS OF 5;
-- CDC via Change Data Feed (requires delta.enableChangeDataFeed = true on the table)
SELECT * FROM table_changes('sales_delta', 5, 10);
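Time travel follows naturally from the transaction log. The sketch below is a simplified model, assuming each commit just lists the data files it adds and removes; the real Delta log records more (schema, statistics, checkpoints), and the file names here are invented for illustration.

```python
def snapshot(log, version):
    """Return the set of data files visible at `version` by replaying the log."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit.get("remove", []))
        files |= set(commit.get("add", []))
    return files

# Hypothetical log: each entry is one atomic commit.
log = [
    {"add": ["part-000.parquet"]},                                  # version 0: initial load
    {"add": ["part-001.parquet"]},                                  # version 1: append
    {"remove": ["part-000.parquet"], "add": ["part-002.parquet"]},  # version 2: rewrite
]

latest = snapshot(log, 2)    # what a normal query reads
as_of_1 = snapshot(log, 1)   # what VERSION AS OF 1 would read
```

Old Parquet files are never modified in place, so reading an earlier version only means replaying fewer commits.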
Open Source Delta Lake vs Databricks
- Open source Delta Lake
- Core ACID, Time Travel, Schema Evolution
- Readable from many engines (Spark, Flink, Trino, Snowflake, etc.)
- On Databricks
- Photon + Delta for very fast SQL
- Deep integration with Delta Live Tables / Lakeflow for data quality rules
- Centralized governance, lineage, and tagging via Unity Catalog
3. MLflow – ML Lifecycle Management
- Covers the full ML lifecycle: experiments, metrics, artifacts, models, registry
- A fully open source project; Databricks runs it as a managed control plane
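import mlflow
import mlflow.sklearn

# train_model, evaluate, train_df, and test_df are placeholders defined elsewhere.
with mlflow.start_run() as run:
    model = train_model(train_df)
    metrics = evaluate(model, test_df)  # expected to return a dict of metric name -> value
    mlflow.log_params({"max_depth": 5, "n_estimators": 100})
    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model")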
Open Source MLflow vs Databricks
- Open source: You host the tracking server, back-end DB, and artifact storage yourself.
- On Databricks:
- Experiments, runs, and models are integrated into the workspace
- Model registry is tied into Unity Catalog (permissions, lineage, audit)
- First‑class integrations with Feature Store and Model Serving
🧭 Unity Catalog and Data Governance
What is Unity Catalog?
- Databricks’ unified governance layer for data and AI
- Attaches permissions, lineage, and tags to catalogs, schemas, tables, views, models, and features
- Open sourced in June 2024 and evolving into a multi‑engine, multi‑format catalog
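-- Unity Catalog namespace example
-- catalog.schema.table
CREATE CATALOG prod;
CREATE SCHEMA prod.sales;
CREATE TABLE prod.sales.orders
USING DELTA
LOCATION 's3://lakehouse/prod/sales/orders';
-- SELECT alone is not enough: the principal also needs access to the parent catalog and schema
GRANT USE CATALOG ON CATALOG prod TO `analyst_role`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `analyst_role`;
GRANT SELECT ON TABLE prod.sales.orders TO `analyst_role`;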
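The three-level namespace also shapes how access is evaluated: reading a table requires USE CATALOG and USE SCHEMA on its parents in addition to SELECT, and a privilege granted on a parent is inherited by its children. The following is a toy model of that check, not Unity Catalog's implementation; `intern_role` and the function names are hypothetical.

```python
def ancestors_and_self(full_name):
    """Yield the object and its ancestors: table, then schema, then catalog."""
    parts = full_name.split(".")
    for i in range(len(parts), 0, -1):
        yield ".".join(parts[:i])

def has_privilege(grants, principal, privilege, securable):
    """True if the privilege is granted on the securable or inherited from an ancestor."""
    return any(
        (principal, privilege, obj) in grants for obj in ancestors_and_self(securable)
    )

def can_select(grants, principal, table):
    """A read needs USE CATALOG, USE SCHEMA, and SELECT together."""
    catalog, schema, _ = table.split(".")
    return (
        has_privilege(grants, principal, "USE CATALOG", catalog)
        and has_privilege(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and has_privilege(grants, principal, "SELECT", table)
    )

grants = {
    ("analyst_role", "USE CATALOG", "prod"),
    ("analyst_role", "USE SCHEMA", "prod.sales"),
    ("analyst_role", "SELECT", "prod.sales.orders"),
    ("intern_role", "SELECT", "prod.sales.orders"),  # no USE privileges on the parents
}

analyst_ok = can_select(grants, "analyst_role", "prod.sales.orders")
intern_ok = can_select(grants, "intern_role", "prod.sales.orders")
```

Here `analyst_ok` is true and `intern_ok` is false, which is the usual reason a lone `GRANT SELECT` appears not to work.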
Core Capabilities of Unity Catalog
- Centralized access control
- Permissions at catalog / schema / table / column level
- One security model across SQL, Python, R, and Scala
- Data lineage
- Automatic tracking of how tables are produced from sources
- Lineage can extend all the way to BI dashboards and ML models
- Data quality and policy integration
- Quality expectation results from Delta Live Tables / Lakeflow are recorded in the pipeline event log and can be queried alongside governed tables
- Policy changes and audit logs exposed via system tables
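-- Example: query audit logs from Unity Catalog system tables
-- The audit table identifies the caller in user_identity.email and the object in
-- request_params; the key below is the one used by getTable events.
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND request_params['full_name_arg'] = 'prod.sales.orders';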
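The lineage capability listed above is, conceptually, a graph of "produced from" edges that can be walked in either direction. A minimal sketch of the upstream walk, assuming lineage is available as (source, destination) pairs; the dataset names other than `prod.sales.orders` are invented for illustration.

```python
from collections import deque

def upstream(edges, target):
    """Return every dataset that `target` depends on, directly or transitively."""
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, set()).add(source)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Hypothetical lineage edges: (source, destination).
edges = [
    ("raw.orders", "prod.sales.orders"),
    ("prod.sales.orders", "prod.analytics.orders_daily"),
    ("prod.analytics.orders_daily", "dashboard.revenue"),
    ("raw.users", "prod.sales.users"),
]

sources_of_dashboard = upstream(edges, "dashboard.revenue")
```

The same traversal run over reversed edges answers the impact-analysis question: which dashboards and models break if a given table changes.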
🧱 Databricks and dbt: Clear Separation of Responsibilities
What Databricks Is Good At
- Platform layer
- Cluster and workspace management, storage integration, governance (UC)
- Engine layer
- Spark Runtime, Photon, Delta Lake
- Workflow layer
- Databricks Workflows, Delta Live Tables / Lakeflow
What dbt Is Good At
- Modeling and transformation logic (the T in ELT)
- SQL-based model definitions
- Reusable patterns via Jinja and macros
- Testing and documentation
- Tests and descriptions via schema.yml
- Auto-generated docs site
- Environment management
- Clean separation of dev / staging / prod
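-- dbt model example: models/sales/orders_daily.sql
WITH base AS (
    SELECT
        order_id,
        user_id,
        total_amount,
        order_date
    -- The source reference is missing from the original; stg_orders is a hypothetical upstream model.
    FROM {{ ref('stg_orders') }}
    WHERE status = 'COMPLETED'
),
daily AS (
    SELECT
        order_date,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_revenue
    FROM base
    GROUP BY order_date
)
-- No trailing semicolon: dbt wraps the model in its own CREATE statement.
SELECT * FROM daily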
# tests & docs: models/sales/schema.yml
version: 2
models:
  - name: orders_daily
    description: "Daily order counts and revenue"
    columns:
      - name: order_date
        tests:
          - not_null
      - name: order_count
        tests:
          - not_null
          # greater_than is not a built-in dbt test; this assumes the dbt_utils package is installed
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: false
      - name: total_revenue
        tests:
          - not_null
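Much of dbt's value comes from turning `{{ ref('...') }}` calls into a dependency graph and building models in dependency order. The sketch below imitates that idea with a regular expression and the standard library's topological sorter; it is not how dbt parses projects, and the model names and SQL strings are made up for this example.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

def build_order(models):
    """Return model names in an order where every model follows its dependencies."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())

# Hypothetical project: model name -> SQL text.
models = {
    "revenue_report": "SELECT * FROM {{ ref('orders_daily') }}",
    "orders_daily": "SELECT order_date, COUNT(*) FROM {{ ref('stg_orders') }} GROUP BY order_date",
    "stg_orders": "SELECT * FROM raw.orders",
}

order = build_order(models)
```

Because the graph is derived from the SQL itself, adding a new model never requires editing a separate schedule; the build order follows from the references.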
Databricks + dbt Integration Pattern
- Data layer
- S3/ADLS/GCS + Delta Lake + Unity Catalog
- Transform layer
- Databricks SQL Warehouse or all‑purpose clusters as dbt targets
- Governance
- dbt‑created tables/views are registered in Unity Catalog
- Permissions, lineage, and tags are managed in Unity Catalog
# dbt profiles.yml example (Databricks)
databricks_lakehouse:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: prod  # Unity Catalog
      schema: analytics
      host: adb-123.45.azuredatabricks.net
      http_path: /sql/1.0/warehouses/xxxx
      token: ""  # value omitted in the source; supply a token, e.g. with dbt's env_var()
      threads: 8
🧪 Pipelines & Quality: Delta Live Tables / Lakeflow
Databricks is moving toward declarative pipelines via Lakeflow / Delta Live Tables (DLT), where data quality expectations become first‑class metadata on tables.
Declarative Pipelines
import dlt
from pyspark.sql.functions import to_timestamp

# Bronze: ingest raw JSON files incrementally with Auto Loader.
@dlt.table(
    name="raw_orders",
    comment="Raw order data",
)
def raw_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/orders")
    )

# Silver: valid_amount only records violations; valid_status drops failing rows.
@dlt.table(
    name="clean_orders",
    comment="Validated and cleaned order data",
)
@dlt.expect("valid_amount", "amount >= 0")
@dlt.expect_or_drop("valid_status", "status IN ('CREATED','COMPLETED','CANCELLED')")
def clean_orders():
    return (
        dlt.read("raw_orders")
        .withColumn("order_ts", to_timestamp("order_time"))
    )
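The difference between the expectation decorators is easiest to see on plain data. This sketch mimics the three behaviours (record only, drop the row, fail the update) on a list of dicts; it is an illustration rather than the DLT runtime, and it uses Python lambdas where DLT takes SQL expressions. The function name and sample rows are hypothetical.

```python
def apply_expectations(rows, warn=None, drop=None, fail=None):
    """Apply DLT-style expectations to a list of dict rows.

    warn: violations are counted but rows are kept  (like @dlt.expect)
    drop: violating rows are removed                 (like @dlt.expect_or_drop)
    fail: any violation raises an error              (like @dlt.expect_or_fail)
    """
    warn, drop, fail = warn or {}, drop or {}, fail or {}
    violations = {name: 0 for name in {**warn, **drop}}
    kept = []
    for row in rows:
        for name, check in fail.items():
            if not check(row):
                raise ValueError(f"expectation {name!r} failed for {row}")
        for name, check in warn.items():
            if not check(row):
                violations[name] += 1
        dropped = False
        for name, check in drop.items():
            if not check(row):
                violations[name] += 1
                dropped = True
        if not dropped:
            kept.append(row)
    return kept, violations

rows = [
    {"amount": 10, "status": "COMPLETED"},
    {"amount": -5, "status": "CREATED"},   # violates valid_amount, but is kept
    {"amount": 3, "status": "UNKNOWN"},    # violates valid_status, so it is dropped
]

kept, violations = apply_expectations(
    rows,
    warn={"valid_amount": lambda r: r["amount"] >= 0},
    drop={"valid_status": lambda r: r["status"] in {"CREATED", "COMPLETED", "CANCELLED"}},
)
```

Note that the negative amount survives into `kept`: a plain expectation only measures quality, so a rule that must protect downstream tables needs the drop or fail variant.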
Integrating Expectations with Unity Catalog
- DLT/Lakeflow expectation results are recorded in the pipeline event log, which can be queried like a table next to Unity Catalog data
- Quality rules become versioned, auditable, and discoverable
- Governance teams can manage quality as policy, not hidden code
🛡 Databricks Through a Governance Lens
Databricks governance is designed as end‑to‑end governance, not just basic table permissions.
1. Data Governance
- Unity Catalog: fine‑grained permissions (catalog / schema / table / column)
- Lineage: from raw sources through pipelines to reports and ML models
- Policy‑based access control (tag/label‑based masking, etc.)
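Tag-based masking can be pictured as a function applied to every row before it reaches the user. The sketch below is a toy model under stated assumptions: the `pii` tag, the `pii_readers` group, and the `"***"` mask are all hypothetical choices for this example, not Unity Catalog defaults.

```python
def mask_row(row, column_tags, user_groups, unmasked_groups=frozenset({"pii_readers"})):
    """Return a copy of `row` with PII-tagged columns masked for unprivileged users."""
    if unmasked_groups & set(user_groups):
        return dict(row)
    return {
        col: "***" if "pii" in column_tags.get(col, set()) else value
        for col, value in row.items()
    }

# Tags are attached to columns once; the policy then applies wherever the tag appears.
column_tags = {"email": {"pii"}, "user_id": set()}
row = {"user_id": 42, "email": "a@example.com", "total_amount": 19.9}

analyst_view = mask_row(row, column_tags, user_groups=["analysts"])
steward_view = mask_row(row, column_tags, user_groups=["pii_readers"])
```

The design point is that the policy refers to the tag rather than to individual columns, so tagging a new column as PII is enough to bring it under the same rule.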
2. ML Governance
- MLflow integrated with Unity Catalog
- Model registry with versions and aliases (stages such as Staging / Production exist only in the legacy workspace registry)
- Feature Store for reusable, governed features
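A governed model registry can be thought of as append-only versions plus named pointers that decide which version is served. The class below is a toy model of that idea, not MLflow's API; the model name `prod.ml.churn`, the `champion` pointer, and the string artifacts are hypothetical.

```python
class ModelRegistry:
    """Toy registry: versions are append-only, named pointers are movable."""

    def __init__(self):
        self._versions = {}  # model name -> list of artifacts (version = index + 1)
        self._aliases = {}   # (model name, alias) -> version number

    def register(self, name, artifact):
        """Store a new immutable version and return its version number."""
        self._versions.setdefault(name, []).append(artifact)
        return len(self._versions[name])

    def set_alias(self, name, alias, version):
        """Point an alias such as 'champion' at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def resolve(self, name, alias):
        """Return (version, artifact) for the version an alias currently points to."""
        version = self._aliases[(name, alias)]
        return version, self._versions[name][version - 1]

registry = ModelRegistry()
v1 = registry.register("prod.ml.churn", "model-a")
v2 = registry.register("prod.ml.churn", "model-b")

registry.set_alias("prod.ml.churn", "champion", v1)
before = registry.resolve("prod.ml.churn", "champion")

registry.set_alias("prod.ml.churn", "champion", v2)  # promotion only moves the pointer
after = registry.resolve("prod.ml.churn", "champion")
```

Since versions are never overwritten, a rollback is the same operation as a promotion: move the pointer back, and the audit trail records who moved it.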
3. Audit & Compliance
- System tables expose access logs, policy changes, and pipeline definition changes
- Designed to support regulated industries (financial services, healthcare, etc.)
📌 How to Think About Databricks
Databricks is…
- A commercial lakehouse platform built on open source
- Apache Spark, Delta Lake, MLflow, and open‑sourced Unity Catalog
- An enterprise‑grade data & AI operations platform
- Central place for permissions, lineage, quality, and audit
- A hub in a larger ecosystem with dbt, Airflow, Fivetran, etc.
- Databricks is the “platform”; dbt is the “modeling & transformation DSL”
Closing Thoughts
- If you see Databricks only as a “Spark notebook service”, you’re missing half the story.
- The real question is which open source components are combined in which way to solve which problems – especially governance, operations, and productivity.
- In real projects, combinations matter:
- Databricks + dbt + Fivetran + Power BI
- Databricks + Airflow/Dagster + MLflow + Feature Store
Going forward, the open sourcing of Unity Catalog and interoperability across table formats like Iceberg/Delta/Hudi will matter even more.
Hopefully this guide helps you see the Databricks ecosystem as a whole, rather than just one tool in isolation.