Managed vs External Tables?

Managed tables store data in the location managed by Databricks (usually DBFS or Unity Catalog managed storage). Dropping the table deletes the data. External tables reference data stored in your own cloud storage paths (S3/ADLS). Dropping the table only removes metadata.

Delta Lake ACID Transactions?

Delta Lake uses a transaction log (Delta Log) to record every change made to the table. This enables atomicity (all or nothing), consistency, isolation (via optimistic concurrency control), and durability.

Z-Ordering is a data skipping technique that co-locates related information in the same set of files. It is most effective for columns that are frequently used in query predicates (WHERE clauses) and have high cardinality.

Photon is a native vectorized query engine written in C++ to improve query performance. It is designed to accelerate SQL workloads and DataFrame API calls by taking advantage of modern hardware instruction sets.

Unity Catalog Governance?

Unity Catalog is the unified governance solution for Data & AI on the Lakehouse. It provides a central place to manage permissions, audit logs, and data lineage across multiple Databricks workspaces.

Intel — Databricks Interview Questions & Knowledge Base

DBSWORDPROJECT ALICE

Missions Start Training

Initializing System...

SYS.STATUS: LOADING

CPU: 98%MEM: 87%NET: SYNC

Databricks Runtime is the set of software artifacts that run on cluster machines. It includes Apache Spark plus additional components that improve usability, performance, and security. Key components include optimized I/O layers, enhanced query execution (Photon engine), integrated libraries for ML (MLlib, scikit-learn, TensorFlow), and native Delta Lake support. Different runtime versions exist for ML, GPU, and genomics workloads.

# Check runtime version in a notebook
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")

# Example: "13.3.x-scala2.12" indicates DBR 13.3

Key Points:

Optimized Spark distribution
Photon engine for faster SQL queries
Pre-installed ML libraries
Multiple flavors: Standard, ML, GPU, Genomics

How do Databricks clusters work?▼

Databricks clusters are sets of computation resources where you run data engineering and data science workloads. A cluster consists of a driver node and worker nodes. You can create all-purpose clusters for interactive analysis or job clusters for automated workloads. Clusters support autoscaling, spot instances for cost savings, and can be configured with specific Databricks Runtime versions.

# Cluster configuration example (JSON)
{
  "num_workers": 4,
  "cluster_name": "my-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}

Key Points:

Driver node coordinates, workers execute tasks
All-purpose clusters for interactive work
Job clusters for automated pipelines
Autoscaling and spot instances for cost optimization

Delta Lake

What is Delta Lake and what problems does it solve?▼

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and data lakes. It solves common data lake problems including: lack of transactions (dirty reads, failed writes), schema enforcement issues, difficulty handling updates/deletes, and no data versioning. Delta Lake stores data in Parquet format with a transaction log that tracks all changes.

Key Points:

ACID transactions on data lakes
Schema enforcement and evolution
Time travel for data versioning
Efficient upserts, updates, and deletes

Explain Delta Lake time travel and its use cases.▼

Time travel allows you to query previous versions of a Delta table using version numbers or timestamps. Each write creates a new version stored in the transaction log. Use cases include: auditing data changes, reproducing ML experiments, recovering from accidental deletes, and debugging data pipelines by comparing before/after states.

# Query by version number
df = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/table")

# Query by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-01-15").load("/path/to/table")

# SQL syntax
SELECT * FROM my_table VERSION AS OF 5
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15'

Key Points:

Query any historical version of data
Use version numbers or timestamps
Retention controlled by table properties
Essential for auditing and reproducibility

What is the MERGE operation in Delta Lake?▼

MERGE (upsert) allows you to update, insert, and delete records in a Delta table based on a matching condition with source data. It's an atomic operation that handles complex scenarios like slowly changing dimensions (SCD Type 2). MERGE is more efficient than separate UPDATE/INSERT operations and is the recommended approach for incremental data loading.

# PySpark MERGE example
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/path/to/target")

deltaTable.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set={
    "name": "source.name",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsert(values={
    "id": "source.id",
    "name": "source.name",
    "created_at": "current_timestamp()"
}).execute()

Key Points:

Atomic upsert operation
Supports UPDATE, INSERT, DELETE in one operation
Match conditions determine action
Ideal for CDC and SCD patterns

How does Delta Lake handle schema evolution?▼

Delta Lake supports automatic schema evolution when writing data. You can add new columns, and Delta will merge the schemas. Schema enforcement prevents writing data that doesn't match the table schema. Use mergeSchema option to add columns automatically or overwriteSchema to replace the schema entirely (use with caution).

# Enable automatic schema merge
df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("/path/to/table")

# Or overwrite entire schema
df.write.format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .save("/path/to/table")

Key Points:

Schema enforcement prevents bad data
mergeSchema adds new columns automatically
overwriteSchema replaces entire schema
Column mapping enables rename/drop without rewrite

What is liquid clustering and how does it improve performance?▼

Liquid clustering is a modern data layout optimization that replaces traditional partitioning and Z-ordering. It automatically clusters data based on specified columns, adapting the layout as data and query patterns change. Unlike partitioning, you can change clustering columns without rewriting data, and it handles high-cardinality columns efficiently.

-- Create table with liquid clustering
CREATE TABLE events (
  event_id BIGINT,
  event_date DATE,
  user_id STRING,
  event_type STRING
) CLUSTER BY (event_date, user_id);

-- Change clustering columns (no data rewrite)
ALTER TABLE events CLUSTER BY (event_type, event_date);

-- Trigger optimization
OPTIMIZE events;

Key Points:

Replaces partitioning and Z-ordering
Adaptive layout without rewrites
Handles high-cardinality columns
Automatically maintained during optimization

Pyspark

What is the difference between transformations and actions in Spark?▼

Transformations are lazy operations that define a new DataFrame/RDD without immediately computing results (e.g., select, filter, groupBy, join). Actions trigger actual computation and return results to the driver or write to storage (e.g., collect, count, show, write). Spark builds a DAG of transformations and only executes when an action is called.

# Transformations (lazy - no computation yet)
df_filtered = df.filter(df.age > 25)  # Nothing happens
df_selected = df_filtered.select("name", "age")  # Still nothing

# Action (triggers computation)
df_selected.show()  # NOW the entire chain executes
result = df_selected.collect()  # Returns data to driver

Key Points:

Transformations are lazy (build execution plan)
Actions trigger actual computation
DAG optimization happens before execution
Understanding this is key for performance tuning

Explain the difference between narrow and wide transformations.▼

Narrow transformations (map, filter, select) process data within the same partition without shuffling data across nodes. Wide transformations (groupBy, join, repartition) require shuffling data across partitions and nodes, which is expensive. Minimizing wide transformations and shuffles is crucial for Spark performance optimization.

# Narrow transformations (no shuffle)
df.filter(col("status") == "active")  # Each partition processed independently
df.select("id", "name")  # No data movement

# Wide transformations (require shuffle)
df.groupBy("department").count()  # Data shuffled to group
df1.join(df2, "id")  # Both DataFrames shuffled for join

Key Points:

Narrow: single partition, no shuffle
Wide: multiple partitions, requires shuffle
Shuffles are expensive network operations
Stage boundaries occur at wide transformations

How do you optimize joins in PySpark?▼

Join optimization strategies include: broadcast joins for small tables (< 10MB by default), using partitioning on join keys, filtering data before joining, and choosing appropriate join types. Broadcast joins avoid shuffles by sending the small table to all workers. For large-large joins, ensure data is pre-partitioned on join keys.

from pyspark.sql.functions import broadcast

# Broadcast join (small table)
result = large_df.join(broadcast(small_df), "customer_id")

# Optimize join with pre-filtering
df1_filtered = df1.filter(col("date") >= "2024-01-01")
result = df1_filtered.join(df2, "id")

# Check join strategy in explain plan
result.explain(mode="extended")

# Adjust broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50MB

Key Points:

Use broadcast() for small dimension tables
Filter before joining to reduce data
Pre-partition large tables on join keys
Use explain() to verify join strategy

What are User Defined Functions (UDFs) and when should you use them?▼

UDFs allow you to extend Spark with custom Python functions. However, Python UDFs serialize data between JVM and Python, causing significant overhead. Prefer built-in Spark functions when possible. Use Pandas UDFs (vectorized) for better performance with Python code, as they use Apache Arrow for efficient data transfer.

from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

# Regular UDF (slower - avoid if possible)
@udf(StringType())
def upper_case(s):
    return s.upper() if s else None

# Pandas UDF (vectorized - much faster)
@pandas_udf(StringType())
def upper_case_vectorized(s: pd.Series) -> pd.Series:
    return s.str.upper()

# Usage
df.select(upper_case_vectorized(col("name")))

# Best: Use built-in functions
from pyspark.sql.functions import upper
df.select(upper(col("name")))  # Fastest

Key Points:

Built-in functions are fastest (pure JVM)
Python UDFs have serialization overhead
Pandas UDFs use Arrow for vectorized processing
Always prefer built-in functions when available

How do you handle skewed data in PySpark?▼

Data skew occurs when some partitions have significantly more data than others, causing slow tasks (stragglers). Solutions include: salting keys by adding random prefixes, using adaptive query execution (AQE), repartitioning with more partitions, and isolating skewed keys for separate processing. Databricks AQE automatically handles many skew scenarios.

from pyspark.sql.functions import col, concat, lit, rand

# Enable AQE (handles skew automatically in most cases)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Manual salting for extreme skew
salted_df = skewed_df.withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"), (rand() * 10).cast("int"))
)

# Expand the other table to match salt values
expanded_df = other_df.crossJoin(
    spark.range(10).withColumnRenamed("id", "salt")
).withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"), col("salt"))
)

# Join on salted keys
result = salted_df.join(expanded_df, "salted_key")

Key Points:

AQE handles most skew automatically
Salting adds random prefix to distribute keys
Monitor Spark UI for uneven task durations
Consider isolating extreme outlier keys

Sql

What are window functions and when would you use them?▼

Window functions perform calculations across a set of rows related to the current row, without collapsing the result into a single row like GROUP BY. Common uses include running totals, rankings, moving averages, and comparing values to previous/next rows. The OVER clause defines the window partition and ordering.

-- Running total partitioned by customer
SELECT 
  order_id,
  customer_id,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as running_total
FROM orders;

-- Rank within department
SELECT 
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
  LAG(salary) OVER (PARTITION BY department ORDER BY salary DESC) as prev_salary
FROM employees;

Key Points:

PARTITION BY groups rows for the calculation
ORDER BY defines the sequence within partitions
Supports ROWS/RANGE for frame specification
Common functions: ROW_NUMBER, RANK, LAG, LEAD, SUM, AVG

Explain the Medallion Architecture (Bronze, Silver, Gold).▼

The Medallion Architecture is a data design pattern for organizing data in a lakehouse. Bronze layer stores raw ingested data with minimal transformation. Silver layer contains cleaned, conformed, and validated data. Gold layer has business-level aggregates and metrics ready for reporting. This progressive refinement ensures data quality while maintaining lineage.

Key Points:

Bronze: Raw data, append-only, preserve source format
Silver: Cleaned, deduplicated, schema-enforced data
Gold: Business aggregates, dimensional models, KPIs
Each layer adds quality and business value

What is the difference between managed and external tables?▼

Managed tables have both metadata and data managed by the metastore. When dropped, both metadata and data files are deleted. External tables only have metadata in the metastore; data files exist in a specified location. Dropping an external table removes only metadata, preserving data files. Use external tables when data is shared across systems or requires specific storage locations.

-- Managed table (data stored in metastore location)
CREATE TABLE managed_sales (
  id INT,
  amount DECIMAL(10,2)
);

-- External table (data at specified location)
CREATE TABLE external_sales (
  id INT,
  amount DECIMAL(10,2)
)
LOCATION 's3://my-bucket/sales/';

-- Check table type
DESCRIBE EXTENDED managed_sales;

Key Points:

Managed: metastore controls data lifecycle
External: data persists after DROP TABLE
Use external for shared or pre-existing data
Unity Catalog prefers managed tables for governance

How do you implement slowly changing dimensions (SCD) in Databricks?▼

SCD Type 1 overwrites old values (use MERGE with UPDATE). SCD Type 2 maintains history with effective dates and current flags. In Delta Lake, implement SCD Type 2 using MERGE with multiple WHEN MATCHED clauses to expire old records and insert new versions. Delta's time travel also provides implicit Type 2 capability.

-- SCD Type 2 implementation with MERGE
MERGE INTO dim_customer AS target
USING staged_updates AS source
ON target.customer_id = source.customer_id AND target.is_current = true

-- Expire existing record
WHEN MATCHED AND target.name != source.name THEN UPDATE SET
  is_current = false,
  end_date = current_date()

-- Insert new version (handled separately or with INSERT)
WHEN NOT MATCHED THEN INSERT (
  customer_id, name, is_current, start_date, end_date
) VALUES (
  source.customer_id, source.name, true, current_date(), null
);

Key Points:

Type 1: Overwrite (no history)
Type 2: Track history with effective dates
Use is_current flag for easy current-version queries
Delta time travel provides natural history

Mlflow

What is MLflow and what are its components?▼

MLflow is an open-source platform for managing the ML lifecycle. It has four components: Tracking (logging parameters, metrics, artifacts), Projects (packaging code for reproducibility), Models (model packaging and deployment), and Model Registry (versioning and staging models). MLflow integrates natively with Databricks for seamless experiment tracking and deployment.

import mlflow
from mlflow.tracking import MlflowClient

# Start an experiment run
with mlflow.start_run(run_name="my_experiment"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    
    # Train model...
    
    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.92)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log artifacts (plots, data samples)
    mlflow.log_artifact("confusion_matrix.png")

Key Points:

Tracking: Log experiments, parameters, metrics
Projects: Reproducible ML code packaging
Models: Standard format for deployment
Registry: Version control and model staging

How do you deploy a model to production in Databricks?▼

Model deployment in Databricks follows a lifecycle: train and log to MLflow, register in Model Registry, transition through stages (None → Staging → Production), then deploy for inference. Options include real-time REST endpoints (Model Serving), batch scoring with Spark UDFs, and streaming inference. Feature Store integration ensures consistent features.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register model from run
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "my_model")

# Transition to production
client.transition_model_version_stage(
    name="my_model",
    version=1,
    stage="Production"
)

# Load production model for batch scoring
model = mlflow.pyfunc.load_model("models:/my_model/Production")
predictions = model.predict(df.toPandas())

# Or use Spark UDF for distributed scoring
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/my_model/Production")
scored_df = df.withColumn("prediction", predict_udf(*feature_columns))

Key Points:

Model Registry manages model versions
Staging workflow: Development → Staging → Production
Model Serving for real-time REST endpoints
Spark UDFs for distributed batch scoring

What is the Feature Store and why use it?▼

Databricks Feature Store is a centralized repository for feature engineering that enables feature reuse across teams and consistent feature computation between training and inference. It solves feature consistency problems, reduces duplicate feature engineering work, and provides feature lineage and discovery. Features are stored as Delta tables with point-in-time lookup support.

Key Points:

Centralized feature repository
Ensures training/serving consistency
Point-in-time feature lookup prevents leakage
Feature lineage and discovery across teams

Architecture

How would you design a real-time streaming pipeline in Databricks?▼

Use Structured Streaming with Delta Lake as source/sink. Ingest from Kafka/Event Hubs/Kinesis, process with DataFrame transformations, and write to Delta tables with exactly-once guarantees. For complex pipelines, use Lakeflow (DLT) which handles checkpointing, retries, and automatic schema evolution. Monitor with Spark UI and Databricks observability tools.

# Structured Streaming with Delta
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "events") \
    .load()

# Process stream
processed = stream_df \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*") \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "5 minutes"), "category") \
    .agg(count("*").alias("event_count"))

# Write to Delta with checkpointing
processed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/events") \
    .trigger(processingTime="1 minute") \
    .start("/delta/event_aggregates")

Key Points:

Structured Streaming for exactly-once processing
Watermarks handle late-arriving data
Checkpoints enable fault-tolerant recovery
Delta Lake as unified batch/streaming sink

What is Auto Loader and when should you use it?▼

Auto Loader is a Databricks feature for incrementally ingesting new files from cloud storage (S3, ADLS, GCS). It automatically discovers and processes new files without listing the entire directory, using file notification services for efficiency. Use Auto Loader for landing zone ingestion, especially with high file volumes. It supports schema inference and evolution.

# Auto Loader with cloudFiles format
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/schema/events") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load("s3://bucket/landing/events/")

# Write to Delta Bronze table
df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/events_bronze") \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True)
    .start("/delta/bronze/events")

Key Points:

Efficient file discovery with notifications
Automatic schema inference and evolution
Exactly-once ingestion guarantees
Scales to millions of files efficiently

How do you optimize costs in Databricks?▼

Cost optimization strategies include: using spot/preemptible instances, right-sizing clusters based on workload, enabling autoscaling with appropriate min/max, using SQL warehouses for BI queries, implementing cluster policies, scheduling jobs during off-peak hours, and optimizing data storage with Delta Lake compaction and Z-ordering. Monitor costs with Databricks account console.

Key Points:

Spot instances save 60-90% on compute
Autoscaling matches resources to demand
SQL Warehouses for concurrent BI queries
Delta Lake optimization reduces I/O costs
Cluster policies enforce cost guardrails

Intel

General

Delta Lake

Pyspark

Sql

Mlflow

Architecture

General

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

Delta Lake

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

Pyspark

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

Sql

Key Points:

Key Points:

Key Points:

Key Points:

Mlflow

Key Points:

Key Points:

Key Points:

Architecture

Key Points:

Key Points:

Key Points:

Intel

General

Delta Lake

Pyspark

Sql

Mlflow

Architecture

🏢General

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

🔷Delta Lake

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

⚡Pyspark

Key Points:

Key Points:

Key Points:

Key Points:

Key Points:

📊Sql

Key Points:

Key Points:

Key Points:

Key Points:

🤖Mlflow

Key Points:

Key Points:

Key Points:

🏗️Architecture

Key Points:

Key Points:

Key Points:

General

Delta Lake

Pyspark

Sql

Mlflow

Architecture