Decrypted Knowledge Base • 25 Entries
Prepare for Databricks interview operations with classified intel, detailed answers, code examples, and tactical key points.
Databricks is a unified data analytics platform built on Apache Spark. While Spark is an open-source distributed computing engine, Databricks provides a managed cloud service that includes optimized Spark runtime (Databricks Runtime), collaborative notebooks, job scheduling, cluster management, Delta Lake integration, and enterprise security features. Databricks simplifies infrastructure management and adds significant performance optimizations.
A Data Lakehouse is a modern data architecture that combines the flexibility and cost-efficiency of data lakes with the data management and ACID transaction capabilities of data warehouses. It enables business intelligence and machine learning on a single platform, storing data in open formats like Delta Lake while providing schema enforcement, governance, and direct BI tool access.
Databricks Runtime is the set of software artifacts that run on cluster machines. It includes Apache Spark plus additional components that improve usability, performance, and security. Key components include optimized I/O layers, enhanced query execution (the Photon engine), integrated libraries for ML (MLlib, scikit-learn, TensorFlow), and native Delta Lake support. Different runtime versions exist for ML, GPU, and genomics workloads.
# Check runtime version in a notebook
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
# Example: "13.3.x-scala2.12" indicates DBR 13.3Unity Catalog is Databricks' unified governance solution for all data and AI assets in the lakehouse. It provides centralized access control, auditing, lineage tracking, and data discovery across workspaces and clouds. Unity Catalog enables fine-grained permissions at the table, column, and row level, making it essential for enterprise data governance and compliance.
Databricks clusters are sets of computation resources where you run data engineering and data science workloads. A cluster consists of a driver node and worker nodes. You can create all-purpose clusters for interactive analysis or job clusters for automated workloads. Clusters support autoscaling, spot instances for cost savings, and can be configured with specific Databricks Runtime versions.
# Cluster configuration example (JSON)
{
  "cluster_name": "my-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and data lakes. It solves common data lake problems, including lack of transactions (dirty reads, failed writes), weak schema enforcement, difficulty handling updates and deletes, and no data versioning. Delta Lake stores data in Parquet format with a transaction log that tracks all changes.
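A minimal sketch of creating and reading a Delta table; the path is hypothetical and df is any existing DataFrame.
# Write a DataFrame in Delta format (Parquet data files plus a _delta_log transaction log)
df.write.format("delta").mode("overwrite").save("/path/to/delta_table")
# Read back a consistent snapshot, even while other writers are active
df_delta = spark.read.format("delta").load("/path/to/delta_table")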
Time travel allows you to query previous versions of a Delta table using version numbers or timestamps. Each write creates a new version stored in the transaction log. Use cases include: auditing data changes, reproducing ML experiments, recovering from accidental deletes, and debugging data pipelines by comparing before/after states.
# Query by version number
df = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/table")
# Query by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-01-15").load("/path/to/table")
# SQL syntax
SELECT * FROM my_table VERSION AS OF 5
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15'
MERGE (upsert) allows you to update, insert, and delete records in a Delta table based on a matching condition with source data. It's an atomic operation that handles complex scenarios like slowly changing dimensions (SCD Type 2). MERGE is more efficient than separate UPDATE/INSERT operations and is the recommended approach for incremental data loading.
# PySpark MERGE example
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/path/to/target")
deltaTable.alias("target").merge(
source_df.alias("source"),
"target.id = source.id"
).whenMatchedUpdate(set={
"name": "source.name",
"updated_at": "current_timestamp()"
}).whenNotMatchedInsert(values={
"id": "source.id",
"name": "source.name",
"created_at": "current_timestamp()"
}).execute()Delta Lake supports automatic schema evolution when writing data. You can add new columns, and Delta will merge the schemas. Schema enforcement prevents writing data that doesn't match the table schema. Use mergeSchema option to add columns automatically or overwriteSchema to replace the schema entirely (use with caution).
# Enable automatic schema merge
df.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/path/to/table")
# Or overwrite entire schema
df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("/path/to/table")Liquid clustering is a modern data layout optimization that replaces traditional partitioning and Z-ordering. It automatically clusters data based on specified columns, adapting the layout as data and query patterns change. Unlike partitioning, you can change clustering columns without rewriting data, and it handles high-cardinality columns efficiently.
-- Create table with liquid clustering
CREATE TABLE events (
  event_id BIGINT,
  event_date DATE,
  user_id STRING,
  event_type STRING
) CLUSTER BY (event_date, user_id);
-- Change clustering columns (no data rewrite)
ALTER TABLE events CLUSTER BY (event_type, event_date);
-- Trigger optimization
OPTIMIZE events;
Transformations are lazy operations that define a new DataFrame/RDD without immediately computing results (e.g., select, filter, groupBy, join). Actions trigger actual computation and return results to the driver or write to storage (e.g., collect, count, show, write). Spark builds a DAG of transformations and only executes when an action is called.
# Transformations (lazy - no computation yet)
df_filtered = df.filter(df.age > 25) # Nothing happens
df_selected = df_filtered.select("name", "age") # Still nothing
# Action (triggers computation)
df_selected.show() # NOW the entire chain executes
result = df_selected.collect() # Returns data to driver
Narrow transformations (map, filter, select) process data within the same partition without shuffling data across nodes. Wide transformations (groupBy, join, repartition) require shuffling data across partitions and nodes, which is expensive. Minimizing wide transformations and shuffles is crucial for Spark performance optimization.
# Narrow transformations (no shuffle)
from pyspark.sql.functions import col
df.filter(col("status") == "active") # Each partition processed independently
df.select("id", "name") # No data movement
# Wide transformations (require shuffle)
df.groupBy("department").count() # Data shuffled to group
df1.join(df2, "id") # Both DataFrames shuffled for joinJoin optimization strategies include: broadcast joins for small tables (< 10MB by default), using partitioning on join keys, filtering data before joining, and choosing appropriate join types. Broadcast joins avoid shuffles by sending the small table to all workers. For large-large joins, ensure data is pre-partitioned on join keys.
from pyspark.sql.functions import broadcast, col
# Broadcast join (small table)
result = large_df.join(broadcast(small_df), "customer_id")
# Optimize join with pre-filtering
df1_filtered = df1.filter(col("date") >= "2024-01-01")
result = df1_filtered.join(df2, "id")
# Check join strategy in explain plan
result.explain(mode="extended")
# Adjust broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) # 50MBUDFs allow you to extend Spark with custom Python functions. However, Python UDFs serialize data between JVM and Python, causing significant overhead. Prefer built-in Spark functions when possible. Use Pandas UDFs (vectorized) for better performance with Python code, as they use Apache Arrow for efficient data transfer.
from pyspark.sql.functions import udf, pandas_udf, col
from pyspark.sql.types import StringType
import pandas as pd
# Regular UDF (slower - avoid if possible)
@udf(StringType())
def upper_case(s):
    return s.upper() if s else None
# Pandas UDF (vectorized - much faster)
@pandas_udf(StringType())
def upper_case_vectorized(s: pd.Series) -> pd.Series:
    return s.str.upper()
# Usage
df.select(upper_case_vectorized(col("name")))
# Best: Use built-in functions
from pyspark.sql.functions import upper
df.select(upper(col("name"))) # FastestData skew occurs when some partitions have significantly more data than others, causing slow tasks (stragglers). Solutions include: salting keys by adding random prefixes, using adaptive query execution (AQE), repartitioning with more partitions, and isolating skewed keys for separate processing. Databricks AQE automatically handles many skew scenarios.
from pyspark.sql.functions import col, concat, lit, rand
# Enable AQE (handles skew automatically in most cases)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Manual salting for extreme skew
salted_df = skewed_df.withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"), (rand() * 10).cast("int"))
)
# Expand the other table to match salt values
expanded_df = other_df.crossJoin(
    spark.range(10).withColumnRenamed("id", "salt")
).withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"), col("salt"))
)
# Join on salted keys
result = salted_df.join(expanded_df, "salted_key")
Window functions perform calculations across a set of rows related to the current row, without collapsing the result into a single row like GROUP BY. Common uses include running totals, rankings, moving averages, and comparing values to previous/next rows. The OVER clause defines the window partition and ordering.
-- Running total partitioned by customer
SELECT
  order_id,
  customer_id,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;
-- Rank within department
SELECT
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
  LAG(salary) OVER (PARTITION BY department ORDER BY salary DESC) AS prev_salary
FROM employees;
The Medallion Architecture is a data design pattern for organizing data in a lakehouse. The Bronze layer stores raw ingested data with minimal transformation. The Silver layer contains cleaned, conformed, and validated data. The Gold layer holds business-level aggregates and metrics ready for reporting. This progressive refinement ensures data quality while maintaining lineage.
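A condensed sketch of the three layers as Delta writes; the paths, the raw_df input, and the event_id/event_type columns are hypothetical.
from pyspark.sql.functions import col
# Bronze: land raw data with minimal transformation
raw_df.write.format("delta").mode("append").save("/delta/bronze/events")
# Silver: clean and validate
bronze_df = spark.read.format("delta").load("/delta/bronze/events")
silver_df = bronze_df.filter(col("event_id").isNotNull()).dropDuplicates(["event_id"])
silver_df.write.format("delta").mode("append").save("/delta/silver/events")
# Gold: business-level aggregates for reporting
gold_df = silver_df.groupBy("event_type").count()
gold_df.write.format("delta").mode("overwrite").save("/delta/gold/event_counts")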
Managed tables have both metadata and data managed by the metastore. When dropped, both metadata and data files are deleted. External tables only have metadata in the metastore; data files exist in a specified location. Dropping an external table removes only metadata, preserving data files. Use external tables when data is shared across systems or requires specific storage locations.
-- Managed table (data stored in metastore location)
CREATE TABLE managed_sales (
  id INT,
  amount DECIMAL(10,2)
);
-- External table (data at specified location)
CREATE TABLE external_sales (
  id INT,
  amount DECIMAL(10,2)
)
LOCATION 's3://my-bucket/sales/';
-- Check table type
DESCRIBE EXTENDED managed_sales;
SCD Type 1 overwrites old values (use MERGE with UPDATE). SCD Type 2 maintains history with effective dates and current flags. In Delta Lake, implement SCD Type 2 using MERGE with multiple WHEN MATCHED clauses to expire old records and insert new versions. Delta's time travel also provides implicit Type 2 capability.
-- SCD Type 2 implementation with MERGE
MERGE INTO dim_customer AS target
USING staged_updates AS source
ON target.customer_id = source.customer_id AND target.is_current = true
-- Expire existing record
WHEN MATCHED AND target.name != source.name THEN UPDATE SET
  is_current = false,
  end_date = current_date()
-- Insert new version (handled separately or with INSERT)
WHEN NOT MATCHED THEN INSERT (
  customer_id, name, is_current, start_date, end_date
) VALUES (
  source.customer_id, source.name, true, current_date(), null
);
MLflow is an open-source platform for managing the ML lifecycle. It has four components: Tracking (logging parameters, metrics, artifacts), Projects (packaging code for reproducibility), Models (model packaging and deployment), and Model Registry (versioning and staging models). MLflow integrates natively with Databricks for seamless experiment tracking and deployment.
import mlflow
# Start an experiment run
with mlflow.start_run(run_name="my_experiment"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    # Train model...
    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.92)
    # Log model
    mlflow.sklearn.log_model(model, "model")
    # Log artifacts (plots, data samples)
    mlflow.log_artifact("confusion_matrix.png")
Model deployment in Databricks follows a lifecycle: train and log to MLflow, register in the Model Registry, transition through stages (None → Staging → Production), then deploy for inference. Options include real-time REST endpoints (Model Serving), batch scoring with Spark UDFs, and streaming inference. Feature Store integration ensures consistent features.
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Register model from run
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "my_model")
# Transition to production
client.transition_model_version_stage(
    name="my_model",
    version=1,
    stage="Production"
)
# Load production model for batch scoring
model = mlflow.pyfunc.load_model("models:/my_model/Production")
predictions = model.predict(df.toPandas())
# Or use Spark UDF for distributed scoring
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/my_model/Production")
scored_df = df.withColumn("prediction", predict_udf(*feature_columns))
Databricks Feature Store is a centralized repository for feature engineering that enables feature reuse across teams and consistent feature computation between training and inference. It solves feature consistency problems, reduces duplicate feature engineering work, and provides feature lineage and discovery. Features are stored as Delta tables with point-in-time lookup support.
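A minimal sketch using the FeatureStoreClient API (newer workspaces expose the same ideas via databricks.feature_engineering); the table name, key, and customer_features_df are hypothetical.
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# Create a feature table backed by Delta, keyed by customer_id
fs.create_table(
    name="feature_store.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Aggregated customer behavior features"
)
# Upsert refreshed features on later runs
fs.write_table(name="feature_store.customer_features", df=customer_features_df, mode="merge")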
Use Structured Streaming with Delta Lake as source/sink. Ingest from Kafka/Event Hubs/Kinesis, process with DataFrame transformations, and write to Delta tables with exactly-once guarantees. For complex pipelines, use Lakeflow (DLT) which handles checkpointing, retries, and automatic schema evolution. Monitor with Spark UI and Databricks observability tools.
# Structured Streaming with Delta
from pyspark.sql.functions import from_json, col, window, count
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "events") \
    .load()
# Process stream
processed = stream_df \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*") \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "5 minutes"), "category") \
    .agg(count("*").alias("event_count"))
# Write to Delta with checkpointing
processed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/events") \
    .trigger(processingTime="1 minute") \
    .start("/delta/event_aggregates")
Auto Loader is a Databricks feature for incrementally ingesting new files from cloud storage (S3, ADLS, GCS). It automatically discovers and processes new files without listing the entire directory, using file notification services for efficiency. Use Auto Loader for landing zone ingestion, especially with high file volumes. It supports schema inference and evolution.
# Auto Loader with cloudFiles format
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/schema/events") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load("s3://bucket/landing/events/")
# Write to Delta Bronze table
df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/events_bronze") \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True) \
    .start("/delta/bronze/events")
Cost optimization strategies include: using spot/preemptible instances, right-sizing clusters based on workload, enabling autoscaling with appropriate min/max limits, using SQL warehouses for BI queries, implementing cluster policies, scheduling jobs during off-peak hours, and optimizing data storage with Delta Lake compaction and Z-ordering. Monitor costs with the Databricks account console.