
Data Engineer Interview Questions: From Associate to Senior (2026)

Comprehensive interview questions for data engineers at all levels. Covers SQL, data pipelines, Spark, data warehousing, and system design.

BetaStudy Team
March 2, 2026
16 min read

Introduction

Data engineering interviews vary significantly by level. Associate roles focus on SQL and basic pipeline concepts. Senior roles expect system design, optimization, and architectural thinking. This guide covers questions for both levels, clearly marked by difficulty.

SQL & Data Modeling

1. [Associate] Write a query to find the second-highest salary in each department.

Answer:

```sql
SELECT department_id, salary
FROM (
    SELECT department_id, salary,
           DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;
```

DENSE_RANK handles ties correctly: tied top salaries all get rank 1, so rank 2 is the true second-highest. With ROW_NUMBER, a tie at the top would make row number 2 one of the tied top earners rather than the second-highest; with RANK, a first-place tie skips rank 2 entirely.
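A quick sketch of this difference, using Python's built-in sqlite3 (SQLite 3.25+ supports window functions) with a deliberate tie at the top of the department:

```python
import sqlite3

# Second-highest salary per department, illustrating DENSE_RANK vs ROW_NUMBER
# on tied data. Requires SQLite >= 3.25 (window function support).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 90000), (1, 90000),  -- tie at the top of dept 1
        (1, 80000), (1, 70000);
""")

dense = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               DENSE_RANK() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = 2
""").fetchall()

row_num = conn.execute("""
    SELECT department_id, salary FROM (
        SELECT department_id, salary,
               ROW_NUMBER() OVER (PARTITION BY department_id
                                  ORDER BY salary DESC) AS rn
        FROM employees
    ) WHERE rn = 2
""").fetchall()

print(dense)    # [(1, 80000)]  -- the true second-highest
print(row_num)  # [(1, 90000)]  -- one of the tied top earners
```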

2. [Associate] Explain the difference between WHERE and HAVING.

Answer: WHERE filters rows before aggregation. HAVING filters groups after aggregation.

```sql
-- WHERE: filter individual rows
SELECT * FROM orders WHERE amount > 100;

-- HAVING: filter aggregated results
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```

3. [Senior] Design a data model for an e-commerce analytics platform.

Answer: Use dimensional modeling (star schema):

Fact tables:

  • fact_orders (order_id, customer_key, product_key, date_key, quantity, revenue)
  • fact_page_views (session_id, user_key, page_key, date_key, duration)

Dimension tables:

  • dim_customer (customer_key, name, segment, acquisition_date)
  • dim_product (product_key, name, category, price, supplier)
  • dim_date (date_key, date, month, quarter, year, is_holiday)

Design considerations:

  • Slowly Changing Dimensions Type 2 for customer/product history
  • Pre-aggregated tables for common queries
  • Partitioning on date for large fact tables

4. [Senior] How do you handle slowly changing dimensions?

Answer:

  • Type 1: Overwrite old value. Simple but loses history.
  • Type 2: Add new row with version/dates. Preserves history, increases storage.
  • Type 3: Add column for previous value. Limited history.
  • Type 6: Hybrid of 1, 2, 3. Current value + history rows.

Choose based on analytical needs. Type 2 is most common for full history tracking.
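The Type 2 mechanics can be sketched in a few lines of plain Python. The `scd2_upsert` helper, the column names, and the sentinel end date are all illustrative, not from any particular warehouse:

```python
from datetime import date

# Minimal Type 2 SCD sketch: instead of overwriting a changed attribute,
# close (expire) the current row and append a new versioned one.
OPEN_END = date(9999, 12, 31)  # sentinel meaning "current row"

def scd2_upsert(dim_rows, key, new_attrs, as_of):
    """Expire the current row for `key` if its attributes changed, then append."""
    current = next((r for r in dim_rows
                    if r["key"] == key and r["end_date"] == OPEN_END), None)
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return dim_rows  # no change: nothing to do
        current["end_date"] = as_of  # close the old version
    dim_rows.append({"key": key, **new_attrs,
                     "start_date": as_of, "end_date": OPEN_END})
    return dim_rows

dim_customer = []
scd2_upsert(dim_customer, "c1", {"segment": "consumer"}, date(2024, 1, 1))
scd2_upsert(dim_customer, "c1", {"segment": "enterprise"}, date(2025, 6, 1))
# Two rows survive: full history preserved, exactly one marked current.
```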

Data Pipelines

5. [Associate] What is ETL vs ELT? When would you use each?

Answer:

  • ETL (Extract, Transform, Load): Transform data before loading into destination. Good when destination has limited compute (traditional data warehouses).
  • ELT (Extract, Load, Transform): Load raw data, transform in destination. Modern approach with powerful cloud warehouses (Snowflake, BigQuery).

ELT is increasingly preferred because cloud warehouses scale compute cheaply, and keeping raw data in the lake allows flexible re-transformation later.

6. [Associate] How do you handle pipeline failures?

Answer:

  • Idempotency: Running twice produces same result (use MERGE, not INSERT)
  • Retries: Exponential backoff for transient failures
  • Checkpointing: Resume from last successful point
  • Alerting: PagerDuty/Slack notifications on failure
  • Dead letter queues: Capture failed records for later processing
  • Monitoring: Track success rates, latency, data quality
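The retry-with-backoff pattern above can be sketched as follows; `load_batch` is a hypothetical stand-in for any call prone to transient failures:

```python
import random
import time

# Sketch of retrying a flaky pipeline step with exponential backoff.
def with_retries(fn, max_attempts=5, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure for alerting
            # Exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))

calls = {"n": 0}
def load_batch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "loaded"

result = with_retries(load_batch, base_delay=0.01)
print(result)  # "loaded" after two retries
```

Note the function being retried must be idempotent, or each retry risks duplicating side effects.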

7. [Senior] Design a real-time data pipeline for fraud detection.

Answer:

Ingestion: Kafka for transaction stream, Kafka Connect for source integration

Processing: Flink or Spark Structured Streaming for:

  • Feature computation (velocity, deviation from normal)
  • ML model scoring (deployed via ONNX or custom UDF)
  • Rule-based checks

Storage:

  • Redis for feature store (low-latency lookups)
  • Kafka for scoring results
  • Delta Lake for historical analysis

Latency budget: <100ms end-to-end

Challenges: Late data, exactly-once semantics, model updates without downtime

8. [Senior] How do you ensure data quality in pipelines?

Answer:

Prevention:

  • Schema enforcement at ingestion
  • Contract testing between producer/consumer

Detection:

  • Great Expectations or dbt tests
  • Statistical anomaly detection (z-scores on counts)
  • Freshness checks (data should arrive by time X)

Response:

  • Circuit breakers to stop bad data propagation
  • Quarantine tables for invalid records
  • Automated alerts and incident response

Observability:

  • Data quality dashboards
  • SLAs with measurement
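The z-score check mentioned under Detection is simple to implement. A minimal sketch, with an illustrative 3-sigma threshold on daily row counts:

```python
import statistics

# Flag today's row count if it deviates more than `threshold` standard
# deviations from recent history.
def count_is_anomalous(history, today, threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = (today - mean) / stdev
    return abs(z) > threshold

recent_counts = [10_100, 9_900, 10_050, 9_950, 10_000, 10_080, 9_920]
print(count_is_anomalous(recent_counts, 10_020))  # False: within normal range
print(count_is_anomalous(recent_counts, 1_200))   # True: likely upstream failure
```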

Big Data & Spark

9. [Associate] Explain the difference between map and flatMap in Spark.

Answer:

  • map: 1:1 transformation, one input row produces one output row
  • flatMap: 1:N transformation, one input row can produce zero or more output rows

```python
# map: transform each element
rdd.map(lambda x: x * 2)             # [1, 2, 3] -> [2, 4, 6]

# flatMap: flatten nested results
rdd.flatMap(lambda x: x.split(" "))  # ["hello world"] -> ["hello", "world"]
```

10. [Associate] What is a shuffle in Spark and why is it expensive?

Answer: Shuffle redistributes data across partitions (required for joins, groupBy, repartition). It's expensive because:

  • Data written to disk
  • Data transferred over network
  • Serialization/deserialization overhead
  • Causes stage boundaries

Minimize shuffles by: broadcast joins for small tables, proper partitioning, avoiding unnecessary groupBys.
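Why a broadcast join avoids the shuffle can be illustrated in plain Python (a conceptual sketch, not Spark code): the small table is copied to every worker, so each partition of the large table joins locally with no data movement.

```python
# The small dimension table is "broadcast": every worker holds a full copy.
small_table = {"p1": "Books", "p2": "Games"}

# The large fact table stays where it is, already split into partitions.
large_partitions = [
    [("p1", 10.0), ("p2", 5.0)],  # (product_id, revenue) rows
    [("p1", 7.5)],
]

# Each partition joins against its local copy: no network shuffle needed.
joined = [
    (pid, small_table[pid], rev)
    for partition in large_partitions
    for pid, rev in partition
    if pid in small_table
]
print(joined)  # [('p1', 'Books', 10.0), ('p2', 'Games', 5.0), ('p1', 'Books', 7.5)]
```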

11. [Senior] How do you optimize a slow Spark job?

Answer:

  • Check Spark UI: Identify longest stages, shuffle sizes, task skew
  • Data skew: Salting keys, broadcast join, adaptive query execution
  • Partitioning: Right number of partitions (2-3x cores), partition by join/filter keys
  • Serialization: Use Kryo, avoid UDFs when possible
  • Memory: Tune spark.memory.fraction, check for spills
  • Joins: Broadcast hint for small tables, sort-merge for large
  • Caching: Cache frequently accessed data (storage level appropriate to size)
  • Predicate pushdown: Filter early, use columnar formats
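Of these, key salting is the least intuitive. A conceptual sketch in plain Python (the two stages mirror what the salted Spark aggregation would do across partitions):

```python
import random

# One "hot" key dominates; appending a random salt spreads it across
# several partitions, then a second pass merges the partial results.
NUM_SALTS = 4
rows = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 5

# Stage 1: salt the key and partially aggregate (skew is broken up here)
partial = {}
for key, val in rows:
    salted = (key, random.randrange(NUM_SALTS))
    partial[salted] = partial.get(salted, 0) + val

# Stage 2: strip the salt and combine at most NUM_SALTS partials per key
final = {}
for (key, _salt), val in partial.items():
    final[key] = final.get(key, 0) + val

print(final)  # {'hot_key': 1000, 'rare_key': 5}
```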

12. [Senior] Explain Spark's execution model and memory management.

Answer:

Execution:

  • Driver creates DAG of stages
  • Tasks are units of work (one per partition per stage)
  • Executors run tasks in parallel

Memory (unified memory model):

  • Execution: Shuffles, joins, sorts, aggregations
  • Storage: Cached RDDs/DataFrames
  • Dynamic allocation between execution and storage

Spilling: When memory is insufficient, data spills to disk, significantly slowing operations.

Key configs: spark.executor.memory, spark.memory.fraction, spark.sql.shuffle.partitions

Data Warehousing

13. [Associate] What is the difference between OLTP and OLAP?

Answer:

  • OLTP (Online Transaction Processing): Operational databases. Many small transactions, normalized schemas, row-oriented. Example: PostgreSQL for app backend.
  • OLAP (Online Analytical Processing): Analytics databases. Complex queries on large datasets, denormalized schemas, columnar storage. Example: Snowflake, BigQuery.
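The row-oriented vs columnar distinction is easy to picture with toy data structures (a sketch, not how any real engine lays out bytes):

```python
# OLTP-style row store: one complete record per row.
row_store = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 5.0},
    {"id": 3, "name": "c", "amount": 7.5},
]

# OLAP-style column store: one contiguous array per column.
column_store = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 5.0, 7.5],
}

# Row store: summing one column still touches every field of every row.
print(sum(r["amount"] for r in row_store))  # 22.5
# Column store: only the "amount" array is read, which is why analytical
# scans are so much cheaper in columnar formats.
print(sum(column_store["amount"]))          # 22.5
```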

14. [Senior] Compare Snowflake, BigQuery, and Redshift.

Answer:

| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Storage/Compute | Separated | Separated | Together (RA3 separated) |
| Scaling | Automatic | Automatic | Manual clusters |
| Pricing | Per-second compute | Per-query bytes scanned | Per-hour node |
| Best for | Multi-cloud, varied workloads | Google ecosystem, serverless | AWS-heavy, predictable workloads |

Choose Snowflake for multi-cloud flexibility, BigQuery for serverless simplicity, Redshift for deep AWS integration.

15. [Senior] How do you design a data lakehouse architecture?

Answer:

Storage: S3/ADLS with Delta Lake, Iceberg, or Hudi format

Benefits: ACID transactions, time travel, schema evolution on data lake

Layers:

  • Bronze: Raw data as ingested
  • Silver: Cleaned, conformed, deduplicated
  • Gold: Business-level aggregates, feature stores

Compute: Spark/Databricks for processing, Trino/Presto for ad-hoc queries

Governance: Unity Catalog or AWS Lake Formation for access control
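A typical bronze-to-silver step is deduplication: keep the latest record per key, collapsing re-delivered batches. A minimal sketch with illustrative column names:

```python
# Raw bronze events: possibly duplicated and out of order.
bronze = [
    {"order_id": "o1", "status": "created", "event_ts": 1},
    {"order_id": "o1", "status": "created", "event_ts": 1},  # duplicate delivery
    {"order_id": "o1", "status": "shipped", "event_ts": 2},
    {"order_id": "o2", "status": "created", "event_ts": 5},
]

# Keep only the latest event per order_id.
latest = {}
for event in bronze:
    key = event["order_id"]
    if key not in latest or event["event_ts"] > latest[key]["event_ts"]:
        latest[key] = event

silver = sorted(latest.values(), key=lambda e: e["order_id"])
print(silver)  # o1 resolves to its latest state ("shipped"); duplicates collapse
```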

System Design

16. [Senior] Design a data platform for a company with 10 billion events per day.

Answer:

Ingestion:

  • Kafka (MSK or Confluent) with multiple partitions
  • ~115K events/second sustained, plan for 3x peaks
  • Multiple consumer groups for different use cases

Streaming layer:

  • Flink for real-time transformations and aggregations
  • Output to Kafka topics for downstream consumers

Batch layer:

  • Spark on EMR/Databricks for historical reprocessing
  • Delta Lake for storage (compaction, Z-ordering)

Serving:

  • Snowflake/BigQuery for SQL analytics
  • Druid/ClickHouse for real-time dashboards
  • Redis for feature serving

Data quality: Great Expectations, Monte Carlo

Orchestration: Airflow/Dagster

Cost optimization: Spot instances, auto-scaling, data lifecycle policies
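The throughput figures above come straight from back-of-envelope arithmetic, which is worth showing in an interview:

```python
# 10 billion events/day converted to a per-second rate, with 3x headroom.
events_per_day = 10_000_000_000
sustained = events_per_day / 86_400  # 86,400 seconds in a day
peak = sustained * 3                 # plan for 3x bursts

print(round(sustained))  # ~115,741 events/second sustained
print(round(peak))       # ~347,222 events/second at peak
```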

17. [Senior] How do you handle late-arriving data in streaming systems?

Answer:

  • Watermarks: Define how late data can arrive (e.g., 10 minutes)
  • Windowing: Tumbling, sliding, or session windows with allowed lateness
  • State management: Keep state long enough for late data
  • Reprocessing: Lambda architecture or Kappa with compacted topics
  • Business rules: Define what "late" means for your use case

Trade-off: Longer lateness allowance = more complete data but higher latency and state.
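The interplay of watermarks, windows, and allowed lateness can be sketched in plain Python (a conceptual model; real engines like Flink manage this state per key and operator):

```python
# Tumbling windows with allowed lateness: an event joins the window its
# timestamp falls in; a window only closes once the watermark (max event
# time seen minus allowed lateness) passes its end.
WINDOW = 60    # 1-minute tumbling windows
LATENESS = 10  # accept events up to 10 seconds late

windows = {}       # window_start -> event count (open windows)
closed = {}        # window_start -> final count (emitted results)
max_event_ts = 0

def process(event_ts):
    global max_event_ts
    max_event_ts = max(max_event_ts, event_ts)
    watermark = max_event_ts - LATENESS
    start = (event_ts // WINDOW) * WINDOW
    if start + WINDOW <= watermark:
        return "dropped"  # window already closed: event is too late
    windows[start] = windows.get(start, 0) + 1
    # Emit any window whose end has now passed the watermark
    for s in [s for s in windows if s + WINDOW <= watermark]:
        closed[s] = windows.pop(s)
    return "accepted"

print(process(5))    # accepted into window [0, 60)
print(process(62))   # accepted into window [60, 120)
print(process(58))   # accepted: late, but within the 10s lateness allowance
print(process(135))  # accepted; watermark 125 closes windows [0,60) and [60,120)
print(process(30))   # dropped: window [0, 60) is already closed
```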

Behavioral Questions

18. [All levels] Tell me about a data pipeline you built that you're proud of.

Answer: Use STAR format emphasizing:

  • Scale: Data volumes processed
  • Impact: Business value delivered
  • Challenges: Technical problems solved
  • Decisions: Trade-offs you made and why

19. [Senior] How do you prioritize data requests from multiple stakeholders?

Answer:

  • Understand impact: Revenue, users affected, strategic importance
  • Quick wins: Some requests take 30 minutes but add immediate value
  • Batch similar work: Group related requests
  • Self-service: Build tools so stakeholders can answer questions themselves
  • Communication: Set expectations, provide estimates, update on progress
  • Documentation: Capture requirements to avoid rework

20. [Senior] How do you ensure data security and compliance?

Answer:

  • Access control: Role-based, column-level security, row-level security
  • Encryption: At rest (managed keys), in transit (TLS)
  • PII handling: Masking, tokenization, separate processing pipelines
  • Audit logging: Who accessed what when
  • Data retention: Automated deletion per policy
  • Compliance: GDPR right to deletion, SOC2 controls

Conclusion

Data engineering interviews assess both technical depth and practical experience. Junior candidates should nail SQL and understand pipeline basics. Senior candidates need system design skills and the ability to discuss trade-offs at scale.

Certifications for Data Engineers

Stand out in your interview with industry-recognized credentials at both the associate and senior level.

BetaStudy's practice questions include real-world scenarios and detailed explanations to build interview confidence.

Start your data engineering journey with a free trial.


Ready to Start Practicing?

Apply what you learned with 250,000+ practice questions across 50+ certifications.