1. SQL & Data Modeling
What is normalization and denormalization?
How do you handle slowly changing dimensions (SCD Type 1/2/3)?
Write a SQL query to find the second highest salary from an employee table.
What is a CTE (Common Table Expression)? When would you use it?
What is the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?
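For the second-highest-salary question, one possible answer is sketched below using Python's built-in sqlite3 module so it can be run anywhere. The table layout (`employee` with `name` and `salary` columns) and the sample rows are assumptions made for illustration; the questions above do not define a schema. Two variants are shown: a plain subquery and a CTE.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [("Ana", 90000), ("Ben", 120000), ("Cy", 120000), ("Dee", 75000)],
)

# Variant 1: highest salary that is strictly below the overall maximum.
SECOND_HIGHEST_SUBQUERY = """
SELECT MAX(salary)
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee)
"""

# Variant 2: a CTE that removes ties first, then skips the top value.
SECOND_HIGHEST_CTE = """
WITH distinct_salaries AS (
    SELECT DISTINCT salary FROM employee
)
SELECT salary
FROM distinct_salaries
ORDER BY salary DESC
LIMIT 1 OFFSET 1
"""


def second_highest_subquery(connection):
    # MAX() over an empty set yields NULL, which sqlite3 maps to None.
    return connection.execute(SECOND_HIGHEST_SUBQUERY).fetchone()[0]


def second_highest_cte(connection):
    row = connection.execute(SECOND_HIGHEST_CTE).fetchone()
    return row[0] if row is not None else None


print(second_highest_subquery(conn))
print(second_highest_cte(conn))
```

Both variants treat tied top salaries as a single value, so the two employees earning 120000 do not count as first and second. It is worth stating that assumption out loud in an interview, since the question leaves it open.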
2. Data Warehousing
What is the difference between a star schema and a snowflake schema?
Explain ETL vs ELT – which one do you prefer and why?
What are fact tables and dimension tables?
How do you handle late arriving facts?
What are OLTP and OLAP?
3. Big Data & Distributed Systems
What is Hadoop? How does HDFS work?
Explain the MapReduce paradigm.
What is partitioning and bucketing in Hive?
What is the difference between Hive and Presto (or Athena)?
How do you handle large datasets that don’t fit in memory?
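For the question about datasets that do not fit in memory, one common answer is to stream the data in fixed-size chunks and keep only a small running aggregate. The sketch below shows that idea with the standard library; the column names (`user_id`, `event`), the helper names, and the tiny chunk size are assumptions chosen for illustration.

```python
import csv
import io
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_chunks(rows: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size rows without loading everything."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_events_by_user(lines: Iterable[str], chunk_size: int = 2) -> Counter:
    """Aggregate per-user event counts one chunk at a time."""
    reader = csv.DictReader(lines)
    totals: Counter = Counter()
    for chunk in read_in_chunks(reader, chunk_size):
        # Only the current chunk and the running totals are held in memory.
        totals.update(row["user_id"] for row in chunk)
    return totals


# Hypothetical sample input; in practice this would be an open file handle.
sample = io.StringIO(
    "user_id,event\nu1,click\nu2,view\nu1,view\nu3,click\nu1,click\n"
)
counts = count_events_by_user(sample)
print(counts)
```

The same pattern scales out in distributed engines, where each worker processes its own partition and partial aggregates are merged afterwards.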
4. Data Pipelines & Orchestration
What tools have you used for data orchestration? (e.g., Airflow, AWS Glue, Step Functions)
How would you design a pipeline to process real-time data?
How do you handle failures in ETL pipelines?
How do you monitor pipeline health and performance?
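For the failure-handling question, a typical talking point is retrying transient errors with exponential backoff before marking a task as failed. The sketch below is a minimal, framework-free illustration; `run_with_retries` and its parameters are hypothetical names, not the API of Airflow, Glue, or Step Functions, each of which has its own retry settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the original error to the caller.
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)
    raise RuntimeError("unreachable")


# Hypothetical flaky step that fails twice, then succeeds.
calls = {"count": 0}


def flaky_extract() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "extracted"


recorded_delays = []
result = run_with_retries(flaky_extract, sleep=recorded_delays.append)
print(result, recorded_delays)
```

The `sleep` parameter is injected only so the example runs instantly. A real pipeline would also restrict retries to errors known to be transient and make each step idempotent so a retry cannot double-load data.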
5. Cloud Services (AWS/Azure/GCP)
What is the difference between S3 and EBS?
How does AWS Glue work? What is the difference between a Glue job and a Glue crawler?
How would you implement CI/CD for AWS Glue or Redshift?
What is Redshift Spectrum?
Compare Amazon Redshift and Snowflake.
6. Programming & Data Transformation
What’s the difference between Pandas and PySpark?
How do you handle null values in Spark?
What is the difference between RDDs and DataFrames in PySpark?
How do you optimize PySpark jobs?
Explain broadcast joins and when to use them.
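The idea behind a broadcast join can be illustrated without a cluster: the small table is copied to every worker as a hash map, and each partition of the large table probes it locally, so the large table never has to be shuffled. The sketch below is plain Python, not PySpark; the function name and the sample data are assumptions for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, str]


def broadcast_join(
    large_partitions: Iterable[List[Row]],
    small_table: Iterable[Row],
) -> List[List[Tuple[Hashable, str, str]]]:
    """Inner-join each partition of the large side against a shared lookup."""
    # The "broadcast" step: build one in-memory lookup from the small table.
    lookup: Dict[Hashable, str] = dict(small_table)
    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined independently; no data moves between them.
        joined_partitions.append(
            [(key, value, lookup[key]) for key, value in partition if key in lookup]
        )
    return joined_partitions


# Hypothetical data: orders are the large side, countries the small side.
orders = [
    [("US", "order-1"), ("DE", "order-2")],
    [("FR", "order-3"), ("US", "order-4")],
]
countries = [("US", "United States"), ("DE", "Germany")]

joined = broadcast_join(orders, countries)
print(joined)
```

In PySpark itself the equivalent hint is `pyspark.sql.functions.broadcast` applied to the smaller DataFrame. The technique only pays off when the small side comfortably fits in each executor's memory.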
7. Data Governance & Quality
How do you ensure data quality in pipelines?
What tools have you used for data cataloging and lineage?
How do you handle schema evolution?
What is GDPR and how does it affect data engineering?
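For the data quality question, it can help to describe concrete checks such as required fields and key uniqueness. The sketch below shows a minimal rule-based validator in plain Python; the function name, column names, and sample rows are hypothetical, and dedicated data quality tools offer far richer rule sets.

```python
from typing import Dict, Iterable, List, Optional


def check_quality(
    rows: Iterable[Dict[str, Optional[object]]],
    required: List[str],
    unique_key: str,
) -> List[str]:
    """Return a list of human-readable issues found in rows."""
    issues: List[str] = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required column must be present and non-null.
        for column in required:
            if row.get(column) is None:
                issues.append(f"row {index}: missing {column}")
        # Uniqueness: the key must not repeat across rows.
        key = row.get(unique_key)
        if key is not None:
            if key in seen:
                issues.append(f"row {index}: duplicate {unique_key}={key}")
            seen.add(key)
    return issues


# Hypothetical batch with one null value and one duplicate key.
batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 1, "amount": 7.5},
]
problems = check_quality(batch, required=["order_id", "amount"], unique_key="order_id")
print(problems)
```

A pipeline could fail the run, quarantine the offending rows, or just emit a metric depending on how severe each issue is.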
8. Scenario-Based Questions
How would you migrate an on-prem ETL pipeline to the cloud?
Design a data warehouse for a ride-sharing app (like Uber).
You need to deduplicate billions of rows from a clickstream – how would you approach it?
How do you handle schema changes in production pipelines?
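For the clickstream deduplication scenario, one approach is to hash-partition events by their key so that all duplicates land in the same partition, then keep the latest record per key within each partition. The sketch below shows that logic in plain Python on a tiny sample; the field names (`event_id`, `ts`) and the partition count are assumptions, and at real scale each partition would be processed by a separate worker in an engine such as Spark.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

Event = Dict[str, object]


def partition_for(event_id: str, num_partitions: int) -> int:
    """Stable hash so the same key always maps to the same partition."""
    return zlib.crc32(event_id.encode("utf-8")) % num_partitions


def deduplicate(events: Iterable[Event], num_partitions: int = 4) -> List[Event]:
    """Keep the event with the largest ts for each event_id."""
    partitions: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        partitions[partition_for(str(event["event_id"]), num_partitions)].append(event)

    unique: List[Event] = []
    for bucket in partitions.values():
        # Duplicates of a key can only exist inside a single bucket.
        latest: Dict[object, Event] = {}
        for event in bucket:
            current = latest.get(event["event_id"])
            if current is None or event["ts"] > current["ts"]:
                latest[event["event_id"]] = event
        unique.extend(latest.values())
    return unique


# Hypothetical clickstream sample with repeated event IDs.
clicks = [
    {"event_id": "a", "ts": 1, "page": "/home"},
    {"event_id": "b", "ts": 2, "page": "/cart"},
    {"event_id": "a", "ts": 5, "page": "/home"},
    {"event_id": "c", "ts": 3, "page": "/pay"},
    {"event_id": "b", "ts": 1, "page": "/cart"},
]
deduped = sorted(deduplicate(clicks), key=lambda e: e["event_id"])
print(deduped)
```

Because each partition is independent, memory use is bounded by the largest partition rather than the whole dataset.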