Wednesday, August 13, 2025

Data Engineering Role: Scenario-Based Questions and Responsibilities

 

🔧 

1. SQL & Data Modeling

  • What is normalization and denormalization?

  • How do you handle slowly changing dimensions (SCD Type 1/2/3)?

  • Write a SQL query to find the second highest salary from an employee table.

  • What is a CTE (Common Table Expression)? When would you use it?

  • What is the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?
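
The second-highest-salary question above can be answered in several ways. The following is one possible sketch that also shows a CTE in action. It runs against an in-memory SQLite database through Python's sqlite3 module; the employee table layout and the sample rows are assumptions made for illustration only.

```python
import sqlite3

def second_highest_salary(conn):
    """Return the second highest distinct salary, or None if there is none."""
    query = """
        WITH distinct_salaries AS (      -- CTE: names an intermediate result
            SELECT DISTINCT salary FROM employee
        )
        SELECT salary
        FROM distinct_salaries
        ORDER BY salary DESC
        LIMIT 1 OFFSET 1                 -- skip the highest, take the next one
    """
    row = conn.execute(query).fetchone()
    return row[0] if row else None

# Hypothetical sample data; two employees share the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [("Ana", 120), ("Ben", 150), ("Cy", 150), ("Dee", 90)],
)
print(second_highest_salary(conn))  # 120
```

Using DISTINCT matters here: without it, ties at the top salary would make the query return the top salary again. Other dialects offer alternatives such as DENSE_RANK() or a MAX() subquery, and an interviewer may ask for one of those instead.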


🏗️ 

2. Data Warehousing

  • What is the difference between a star schema and a snowflake schema?

  • Explain ETL vs ELT – which one do you prefer and why?

  • What are fact tables and dimension tables?

  • How do you handle late arriving facts?

  • What are OLTP and OLAP?
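
To make the fact table and dimension table questions concrete, here is a minimal star-schema sketch, again using an in-memory SQLite database from Python. The table names (fact_sales, dim_product), columns, and rows are hypothetical; a real warehouse would have more dimensions (date, customer, store) around the same fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes, one row per product.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- Fact table: numeric measures plus foreign keys to dimensions.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Laptop', 'Electronics'),
        (2, 'Desk',   'Furniture'),
        (3, 'Phone',  'Electronics');
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 2000.0),
        (2, 2, 1,  300.0),
        (3, 3, 1,  800.0);
""")

def revenue_by_category(conn):
    """Typical OLAP-style query: join the fact to a dimension and aggregate."""
    rows = conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_key = f.product_key
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()
    return dict(rows)

print(revenue_by_category(conn))  # {'Electronics': 2800.0, 'Furniture': 300.0}
```

In a snowflake schema, category would typically move out of dim_product into its own table, adding one more join to the same query.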


🛠️ 

3. Big Data & Distributed Systems

  • What is Hadoop? How does HDFS work?

  • Explain the MapReduce paradigm.

  • What is partitioning and bucketing in Hive?

  • What is the difference between Hive and Presto (or Athena)?

  • How do you handle large datasets that don’t fit in memory?
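
For the MapReduce question, a single-process word count is a common way to explain the three phases. The sketch below is plain Python, not Hadoop code, and the function names are chosen for illustration; in a real cluster the map and reduce steps run in parallel on different nodes and the shuffle moves data across the network.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_count(["a b a", "B c"]))  # {'a': 2, 'b': 2, 'c': 1}
```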


🔄 

4. Data Pipelines & Orchestration

  • What tools have you used for data orchestration? (e.g., Airflow, AWS Glue, Step Functions)

  • How would you design a pipeline to process real-time data?

  • How do you handle failures in ETL pipelines?

  • How do you monitor pipeline health and performance?
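
One common answer to the failure-handling question is retrying transient errors with exponential backoff. The following is a minimal sketch under stated assumptions: run_with_retries and flaky_extract are hypothetical names, and orchestrators such as Airflow provide their own retry settings that would normally be used instead of hand-written loops.

```python
import logging
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**(attempt - 1) seconds and retry.

    The last failure is re-raised so the caller (or scheduler) can alert on it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("attempt %d failed (%s); retrying in %ss", attempt, exc, delay)
            sleep(delay)

# Hypothetical task that fails twice before succeeding.
attempts = {"count": 0}

def flaky_extract():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source unavailable")
    return "42 rows loaded"

# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = run_with_retries(flaky_extract, max_attempts=5, sleep=delays.append)
print(result, delays)  # 42 rows loaded [1.0, 2.0]
```

Retries only help with transient failures. A good answer usually also covers idempotent loads, so that a rerun does not duplicate data, and alerting once retries are exhausted.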


☁️ 

5. Cloud Services (AWS/Azure/GCP)

  • What is the difference between S3 and EBS?

  • How does AWS Glue work? How does a Glue job differ from a Glue crawler?

  • How would you implement CI/CD for AWS Glue or Redshift?

  • What is Redshift Spectrum?

  • Compare Amazon Redshift and Snowflake.


⚡ 

6. Programming & Data Transformation

  • What’s the difference between Pandas and PySpark?

  • How do you handle null values in Spark?

  • What is the difference between RDDs and DataFrames in PySpark?

  • How do you optimize PySpark jobs?

  • Explain broadcast joins and when to use them.
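
The idea behind a broadcast join can be shown without a cluster. In the plain-Python sketch below (not PySpark code), the small table is turned into an in-memory lookup that every worker would receive a copy of, so the large table is streamed once and never shuffled. The function name and sample data are hypothetical, and the sketch assumes join keys are unique in the small table. In PySpark itself, the hint is given with pyspark.sql.functions.broadcast, for example large_df.join(broadcast(small_df), "key").

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash lookup from the small side, stream the large side."""
    # This dict plays the role of the broadcast copy held by each executor.
    lookup = {row[key]: row for row in small_rows}
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Merge the two rows; on column-name clashes the large side wins.
            yield {**match, **row}

countries = [
    {"code": "IN", "country": "India"},
    {"code": "US", "country": "United States"},
]
events = [
    {"event": "click", "code": "US"},
    {"event": "view", "code": "IN"},
    {"event": "click", "code": "ZZ"},  # no match, dropped by the inner join
]
joined = list(broadcast_hash_join(events, countries, "code"))
print(joined)
```

This is worth using only when one side comfortably fits in each worker's memory; otherwise a shuffle-based join is the safer choice.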


🔐 

7. Data Governance & Quality

  • How do you ensure data quality in pipelines?

  • What tools have you used for data cataloging and lineage?

  • How do you handle schema evolution?

  • What is GDPR and how does it affect data engineering?
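
For the data quality question, it helps to show what a check looks like in code. This is a minimal sketch with a hypothetical check_quality function covering two common rules, required fields and key uniqueness. Dedicated tools offer far richer checks; this only illustrates the principle of validating a batch before loading it.

```python
def check_quality(rows, required_fields, unique_key):
    """Return a list of human-readable issues found in a batch of dict rows."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and not None.
        for field in required_fields:
            if row.get(field) is None:
                issues.append(f"row {index}: missing {field}")
        # Uniqueness: the key must not repeat within the batch.
        key = row.get(unique_key)
        if key is not None:
            if key in seen:
                issues.append(f"row {index}: duplicate {unique_key}={key}")
            seen.add(key)
    return issues

batch = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": None},
    {"id": None, "amount": 5},
]
for issue in check_quality(batch, required_fields=["id", "amount"], unique_key="id"):
    print(issue)
```

A pipeline could fail the run, or route the offending rows to a quarantine table, whenever the returned list is not empty.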


📊 

8. Scenario-Based Questions

  • How would you migrate an on-prem ETL pipeline to the cloud?

  • Design a data warehouse for a ride-sharing app (like Uber).

  • You need to deduplicate billions of rows from a clickstream – how would you approach it?

  • How do you handle schema changes in production pipelines?
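
For the clickstream deduplication scenario, one common approach is to partition events by a hash of the key so that all copies of an event land in the same partition, then keep the latest copy within each partition. The single-process sketch below illustrates that idea only; the field names event_id and ts are assumptions, and at billions of rows the same logic would run on a distributed engine with each partition processed independently.

```python
import zlib
from collections import defaultdict

def partition_of(event_id, num_partitions):
    """Stable hash so the same event_id always maps to the same partition."""
    return zlib.crc32(event_id.encode("utf-8")) % num_partitions

def deduplicate(events, num_partitions=4):
    """Keep the latest event (highest ts) per event_id."""
    # Step 1: route events to partitions by key hash.
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_of(event["event_id"], num_partitions)].append(event)

    # Step 2: deduplicate inside each partition; partitions never share keys.
    result = []
    for bucket in partitions.values():
        latest = {}
        for event in bucket:
            current = latest.get(event["event_id"])
            if current is None or event["ts"] > current["ts"]:
                latest[event["event_id"]] = event
        result.extend(latest.values())
    return result

clicks = [
    {"event_id": "a", "ts": 1, "page": "/home"},
    {"event_id": "b", "ts": 2, "page": "/cart"},
    {"event_id": "a", "ts": 3, "page": "/home"},  # later duplicate of "a"
]
unique = sorted(deduplicate(clicks), key=lambda e: e["event_id"])
print(unique)
```

Because each partition only needs to hold its own keys in memory, the number of partitions can be raised until every partition fits on one worker.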
