Databricks Ecosystem Deep Dive - Open Source Foundations and Governance

“Databricks looks like an all-in-one platform for data ingestion, storage, analytics, ML, and governance – but under the hood it’s powered by well-known open source projects.”

This post looks at Databricks not just as a “managed Spark service”, but as a unified data & AI platform built on top of open source.
We’ll also cover data governance (especially Unity Catalog) and how Databricks works together with dbt rather than replacing it.

📚 Big Picture: Layered View of the Databricks Platform

It’s easiest to understand Databricks by looking at it layer by layer.

User Layer (Users & Tools)
 ├─ Notebooks, Dashboards, SQL Editor
 ├─ Databricks SQL / Databricks Workflows
 ├─ Partner tools: dbt, Power BI, Tableau, Looker, Fivetran, Airflow, etc.
 └─ REST / SDK / JDBC / ODBC

Engine & Runtime Layer
 ├─ Apache Spark (optimized in Databricks Runtime)
 ├─ Photon (C++ vectorized query engine, commercial)
 ├─ ML Runtime (Spark + curated ML libraries)
 └─ Databricks Model Serving / Vector Search

Storage & Metadata Layer
 ├─ Delta Lake (open source table format)
 ├─ Unity Catalog (catalog, permissions, lineage – now open source)
 └─ Cloud storage (S3, ADLS, GCS, etc.)

Orchestration & Quality Layer
 ├─ Databricks Workflows (job / pipeline scheduling)
 ├─ Delta Live Tables / Lakeflow (declarative pipelines, data quality)
 └─ Integration with dbt, Airflow, Dagster and others

ML & MLOps Layer
 ├─ MLflow (experiments, models, registry – open source)
 ├─ Feature Store, AutoML
 └─ Model Serving, Monitoring

Now let’s see which open source projects power each layer, and what Databricks adds on top.

🔧 The Big Three Open Source Pillars: Spark, Delta Lake, MLflow

1. Apache Spark – The Distributed Compute Engine

The original core of Databricks and still the main engine
Handles batch, streaming, SQL, ML, and graph workloads in a single engine
Databricks optimizes Spark in Databricks Runtime (DBR) with performance patches, stability fixes, and connectors

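from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; getOrCreate() returns that session.
spark = (
    SparkSession.builder
    .appName("DatabricksExample")
    .getOrCreate()
)

# Read a Delta table by path (the mount point is illustrative).
df = spark.read.format("delta").load("/mnt/datalake/sales")

# Total sales per country, naming the aggregate column directly.
agg = df.groupBy("country").agg(F.sum("amount").alias("total_sales"))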
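
To make the "distributed" part of that aggregation concrete, here is a rough conceptual sketch in plain Python, not Spark code. It assumes two hypothetical partitions of (country, amount) rows and shows the general shape of a grouped sum: each partition is summed locally, then the partial results are merged. The names partitions, partial_sums, and merge_sums are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows, split across two "partitions" as a cluster might hold them.
partitions = [
    [("US", 100.0), ("DE", 40.0), ("US", 25.0)],
    [("DE", 10.0), ("JP", 70.0)],
]

def partial_sums(rows):
    """Sum amounts per country within a single partition."""
    acc = defaultdict(float)
    for country, amount in rows:
        acc[country] += amount
    return dict(acc)

def merge_sums(partials):
    """Combine per-partition results into final totals."""
    total = defaultdict(float)
    for part in partials:
        for country, amount in part.items():
            total[country] += amount
    return dict(total)

# Step 1 runs independently per partition; step 2 merges the small partial results.
total_sales = merge_sums(partial_sums(p) for p in partitions)
print(total_sales)
```

Only the small per-partition dictionaries need to be combined, which is why grouped aggregations tend to scale well as data grows.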

Open Source Spark vs Databricks Runtime

Open source Spark: You run the vanilla engine, manage clusters, configs, tuning, and connectors yourself.
Databricks Runtime adds:
- Autoscaling clusters and automatic optimization of queries and file layout
- Photon-powered high‑performance SQL
- Native integration with Unity Catalog for permissions

2. Delta Lake – Lakehouse Table Format

Open source table format originated by Databricks
Adds a transaction log and metadata layer on top of Parquet files
Provides ACID transactions, time travel, schema evolution, and Change Data Feed (CDF) for change data capture

-- Databricks SQL example
CREATE TABLE sales_delta
USING DELTA
LOCATION '/mnt/datalake/sales_delta' AS
SELECT * FROM raw_sales;

-- Time Travel
SELECT * FROM sales_delta VERSION AS OF 5;

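-- Change Data Feed: row-level changes between versions 5 and 10
-- (requires delta.enableChangeDataFeed = true on the table)
SELECT * FROM table_changes('sales_delta', 5, 10);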
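
Time travel falls out of the transaction log fairly naturally. The sketch below is a toy model in plain Python, not the Delta Lake implementation: it assumes each commit is just a list of add/remove file actions, and the file names are hypothetical. Replaying commits up to a given version yields the set of Parquet files that make up the table at that version.

```python
# Toy commit log: one list of actions per table version (0, 1, 2, ...).
# File names are hypothetical.
commits = [
    [("add", "part-000.parquet")],
    [("add", "part-001.parquet")],
    [("remove", "part-000.parquet"), ("add", "part-002.parquet")],
]

def files_at_version(commits, version):
    """Replay commits 0..version and return the data files live at that version."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for actions in commits[: version + 1]:
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
    return sorted(live)

print(files_at_version(commits, 1))  # table as of version 1
print(files_at_version(commits, 2))  # table as of version 2
```

Because old files are only logically removed, an earlier version stays readable until its files are physically cleaned up.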

Open Source Delta Lake vs Databricks

Open source Delta Lake
- Core ACID, Time Travel, Schema Evolution
- Readable from many engines (Spark, Flink, Trino, Snowflake, etc.)
On Databricks
- Photon + Delta for very fast SQL
- Deep integration with Delta Live Tables / Lakeflow for data quality rules
- Centralized governance, lineage, and tagging via Unity Catalog

3. MLflow – ML Lifecycle Management

Covers the full ML lifecycle: experiments, metrics, artifacts, models, registry
A fully open source project; Databricks runs it as a managed control plane

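import mlflow
import mlflow.sklearn

# train_model, evaluate, train_df, and test_df are placeholders for your own code;
# evaluate() is expected to return a dict of metric name -> float.
with mlflow.start_run() as run:
    mlflow.log_params({"max_depth": 5, "n_estimators": 100})

    model = train_model(train_df)
    metrics = evaluate(model, test_df)
    mlflow.log_metrics(metrics)

    # Store the fitted scikit-learn model as a run artifact named "model".
    mlflow.sklearn.log_model(model, "model")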

Open Source MLflow vs Databricks

Open source: You host the tracking server, back-end DB, and artifact storage yourself.
On Databricks:
- Experiments, runs, and models are integrated into the workspace
- Model registry is tied into Unity Catalog (permissions, lineage, audit)
- First‑class integrations with Feature Store and Model Serving

🧭 Unity Catalog and Data Governance

What is Unity Catalog?

Databricks’ unified governance layer for data and AI
Attaches permissions, lineage, and tags to catalogs, schemas, tables, views, models, and features
Recently open sourced, evolving into a multi‑engine, multi‑format catalog

-- Unity Catalog namespace example
-- catalog.schema.table

CREATE CATALOG prod;
CREATE SCHEMA prod.sales;

CREATE TABLE prod.sales.orders
USING DELTA
LOCATION 's3://lakehouse/prod/sales/orders';

GRANT SELECT ON TABLE prod.sales.orders TO `analyst_role`;
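
A table-level SELECT grant is usually not the whole story: in Unity Catalog a principal also needs USE CATALOG and USE SCHEMA on the parents, and privileges granted on a parent are inherited by its children. The following is a minimal sketch of that check in plain Python, not the Unity Catalog API. The grants set and the helper names are hypothetical, and ownership, groups, and ALL PRIVILEGES are deliberately left out.

```python
# Toy grant store: (principal, privilege, securable) triples. Names are hypothetical.
grants = {
    ("analyst_role", "USE CATALOG", "prod"),
    ("analyst_role", "USE SCHEMA", "prod.sales"),
    ("analyst_role", "SELECT", "prod.sales.orders"),
}

def has_privilege(principal, privilege, securable):
    """True if the privilege is granted on the securable or on any of its parents."""
    parts = securable.split(".")
    ancestors = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any((principal, privilege, a) in grants for a in ancestors)

def can_select(principal, table):
    """Check the three privileges needed to read catalog.schema.table."""
    catalog, schema, _ = table.split(".")
    return (
        has_privilege(principal, "USE CATALOG", catalog)
        and has_privilege(principal, "USE SCHEMA", f"{catalog}.{schema}")
        and has_privilege(principal, "SELECT", table)
    )

print(can_select("analyst_role", "prod.sales.orders"))     # granted on the table
print(can_select("analyst_role", "prod.sales.customers"))  # no SELECT grant
```

Granting SELECT on prod.sales instead of on the single table would make both checks pass, which is the inheritance behaviour the ancestors lookup is meant to mimic.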

Core Capabilities of Unity Catalog

Centralized access control
- Permissions at catalog / schema / table / column level
- One security model across SQL, Python, R, and Scala
Data lineage
- Automatic tracking of how tables are produced from sources
- Lineage can extend all the way to BI dashboards and ML models
Data quality and policy integration
- Expectation results from Delta Live Tables / Lakeflow are recorded in the pipeline event log and can be queried alongside UC-governed tables
- Policy changes and audit logs exposed via system tables

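-- Example: query reads of a table from the audit log system table
-- (the log records the acting user or service principal; parameter keys vary by action)
SELECT
  event_time,
  user_identity.email AS principal,
  action_name,
  request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND request_params['full_name_arg'] = 'prod.sales.orders';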

🧱 Databricks and dbt: Clear Separation of Responsibilities

What Databricks Is Good At

Platform layer
- Cluster and workspace management, storage integration, governance (UC)
Engine layer
- Spark Runtime, Photon, Delta Lake
Workflow layer
- Databricks Workflows, Delta Live Tables / Lakeflow

What dbt Is Good At

Modeling and transformation logic (the T in ELT)
- SQL-based model definitions
- Reusable patterns via Jinja and macros
Testing and documentation
- Tests and descriptions via schema.yml
- Auto-generated docs site
Environment management
- Clean separation of dev / staging / prod

-- dbt model example: models/sales/orders_daily.sql

WITH base AS (
    SELECT
        order_id,
        user_id,
        total_amount,
        order_date
    FROM {{ ref('stg_orders') }}  -- 'stg_orders' is a placeholder for your upstream model
    WHERE status = 'COMPLETED'
),
daily AS (
    SELECT
        order_date,
        COUNT(*)         AS order_count,
        SUM(total_amount) AS total_revenue
    FROM base
    GROUP BY order_date
)

SELECT * FROM daily

# tests & docs: models/sales/schema.yml

version: 2

models:
  - name: orders_daily
    description: "Daily order counts and revenue"
    columns:
      - name: order_date
        tests:
          - not_null
      - name: order_count
        tests:
          - not_null
          - dbt_utils.expression_is_true:   # requires the dbt_utils package
              expression: "> 0"
      - name: total_revenue
        tests:
          - not_null
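
It helps to know what such a test boils down to: a dbt test is a query that selects failing rows, and the test passes when the query returns nothing. Here is a small sketch of that idea using Python's built-in sqlite3 as a stand-in for the warehouse. The sample rows and the failing_rows helper are hypothetical, and the SQL is hand-written to resemble what dbt would generate rather than copied from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_daily (order_date TEXT, order_count INTEGER, total_revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders_daily VALUES (?, ?, ?)",
    [("2024-01-01", 3, 120.0), ("2024-01-02", 0, 0.0), (None, 5, 80.0)],
)

def failing_rows(conn, sql):
    """Run a test query; a dbt-style test passes when it returns no rows."""
    return conn.execute(sql).fetchall()

# Each test is phrased as "select the rows that violate the rule".
not_null_order_date = "SELECT * FROM orders_daily WHERE order_date IS NULL"
positive_order_count = "SELECT * FROM orders_daily WHERE NOT (order_count > 0)"

results = {
    "not_null_order_date": failing_rows(conn, not_null_order_date),
    "positive_order_count": failing_rows(conn, positive_order_count),
}
for name, rows in results.items():
    print(name, "PASS" if not rows else f"FAIL ({len(rows)} rows)")
```

Framing tests as failing-row queries is what lets the same YAML run unchanged against a Databricks SQL Warehouse or any other target.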

Databricks + dbt Integration Pattern

Data layer
- S3/ADLS/GCS + Delta Lake + Unity Catalog
Transform layer
- Databricks SQL Warehouse or all‑purpose clusters as dbt targets
Governance
- dbt‑created tables/views are registered in Unity Catalog
- Permissions, lineage, and tags are managed in Unity Catalog

# dbt profiles.yml example (Databricks)

databricks_lakehouse:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: prod            # Unity Catalog
      schema: analytics
      host: adb-123.45.azuredatabricks.net
      http_path: /sql/1.0/warehouses/xxxx
      token: "{{ env_var('DATABRICKS_TOKEN') }}"   # read from an environment variable (name is an assumption)
      threads: 8

🧪 Pipelines & Quality: Delta Live Tables / Lakeflow

Databricks is moving toward declarative pipelines via Lakeflow / Delta Live Tables (DLT), where data quality expectations become first‑class metadata on tables.

Declarative Pipelines

import dlt
from pyspark.sql.functions import to_timestamp

@dlt.table(
    name="raw_orders",
    comment="Raw order data",
)
def raw_orders():
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/raw/orders")
    )

@dlt.table(
    name="clean_orders",
    comment="Validated and cleaned order data",
)
@dlt.expect("valid_amount", "amount >= 0")
@dlt.expect_or_drop("valid_status", "status IN ('CREATED','COMPLETED','CANCELLED')")
def clean_orders():
    return (
        dlt.read("raw_orders")
           .withColumn("order_ts", to_timestamp("order_time"))
    )
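
The two decorators on clean_orders behave differently: expect records a violation but keeps the row, while expect_or_drop removes the row. The sketch below imitates that behaviour in plain Python over a list of dicts. It is only an approximation for intuition (no NULL semantics, no streaming), and apply_expectations and the sample rows are hypothetical, not part of the dlt API.

```python
ALLOWED_STATUS = {"CREATED", "COMPLETED", "CANCELLED"}

def apply_expectations(rows):
    """Mimic expect (count, keep row) and expect_or_drop (count, drop row)."""
    kept = []
    violations = {"valid_amount": 0, "valid_status": 0}
    for row in rows:
        if not row["amount"] >= 0:
            violations["valid_amount"] += 1   # expect: recorded, row is kept
        if row["status"] not in ALLOWED_STATUS:
            violations["valid_status"] += 1   # expect_or_drop: recorded, row is dropped
            continue
        kept.append(row)
    return kept, violations

rows = [
    {"order_id": 1, "amount": 10.0, "status": "COMPLETED"},
    {"order_id": 2, "amount": -5.0, "status": "CREATED"},
    {"order_id": 3, "amount": 7.5, "status": "UNKNOWN"},
]
kept, violations = apply_expectations(rows)
print([r["order_id"] for r in kept], violations)
```

Note that order 2 survives despite its negative amount, because valid_amount is a warn-only expectation in the pipeline above.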

Integrating Expectations with Unity Catalog

DLT/Lakeflow expectation results are recorded in the pipeline event log, next to tables published to Unity Catalog
Quality rules become versioned, auditable, and discoverable
Governance teams can manage quality as policy, not hidden code

🛡 Databricks Through a Governance Lens

Databricks governance is designed as end‑to‑end governance, not just basic table permissions.

1. Data Governance

Unity Catalog: fine‑grained permissions (catalog / schema / table / column)
Lineage: from raw sources through pipelines to reports and ML models
Policy‑based access control (tag/label‑based masking, etc.)
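
As a rough illustration of tag-based masking, here is a toy sketch in plain Python. It is not how Unity Catalog is configured (that is done in SQL and is not shown here); column_tags, unmasked_roles, and mask_row are hypothetical names, and the policy is simply "mask any tagged column unless the role is exempt for every tag on it".

```python
# Hypothetical column tags and the roles allowed to see each tag unmasked.
column_tags = {"email": {"pii"}, "total_amount": set(), "order_id": set()}
unmasked_roles = {"pii": {"privacy_officer"}}

def mask_row(row, role):
    """Return a copy of row with tagged columns masked unless the role is exempt."""
    out = {}
    for col, value in row.items():
        tags = column_tags.get(col, set())
        blocked = any(role not in unmasked_roles.get(tag, set()) for tag in tags)
        out[col] = "***" if blocked else value
    return out

row = {"order_id": 1, "email": "a@example.com", "total_amount": 9.5}
print(mask_row(row, "analyst_role"))
print(mask_row(row, "privacy_officer"))
```

The point of driving this from tags is that tagging a new column as pii protects it everywhere, without touching per-table grants.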

2. ML Governance

MLflow integrated with Unity Catalog
Model registry with versions and aliases (e.g. champion / challenger); stages such as Staging / Production belong to the legacy workspace registry
Feature Store for reusable, governed features

3. Audit & Compliance

System tables expose access logs, policy changes, and pipeline definition changes
Designed to support regulated industries (financial services, healthcare, etc.)

📌 How to Think About Databricks

Databricks is…

A commercial lakehouse platform built on open source
- Apache Spark, Delta Lake, MLflow, and open‑sourced Unity Catalog
An enterprise‑grade data & AI operations platform
- Central place for permissions, lineage, quality, and audit
A hub in a larger ecosystem with dbt, Airflow, Fivetran, etc.
- Databricks is the “platform”; dbt is the “modeling & transformation DSL”

Closing Thoughts

If you see Databricks only as a “Spark notebook service”, you’re missing half the story.
The real question is which open source components are combined in which way to solve which problems – especially governance, operations, and productivity.
In real projects, combinations matter:
- Databricks + dbt + Fivetran + Power BI
- Databricks + Airflow/Dagster + MLflow + Feature Store

Going forward, the open sourcing of Unity Catalog and interoperability across table formats like Iceberg/Delta/Hudi will matter even more.
Hopefully this guide helps you see the Databricks ecosystem as a whole, rather than just one tool in isolation.