Whats going on HADOOP with BIGDATA?

This blog focuses mainly on discussions of Hadoop technology and the various tools in the Hadoop ecosystem. Hadoop experts and beginners post and share their views and experiences. Newcomers to Hadoop post their questions here for experts to clarify. I was motivated to create this blog to help beginners with the expensive Hadoop projects in the market. I also do my best to collect and share genuine posts from Hadoop discussions around the internet.

Wednesday, August 13, 2025

Data Engineering Role: Scenario-Based Questions and Responsibilities

1. SQL & Data Modeling

What is normalization and denormalization?
How do you handle slowly changing dimensions (SCD Type 1/2/3)?
Write a SQL query to find the second highest salary from an employee table.
What is a CTE (Common Table Expression)? When would you use it?
What is the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?
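
As a starting point for the second-highest-salary and CTE questions above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (employee with name and salary columns) and the sample rows are assumptions made up for illustration, and this is only one of several accepted answers (DENSE_RANK or LIMIT/OFFSET are common alternatives).

```python
import sqlite3

# In-memory database with a hypothetical employee table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [("Asha", 90000), ("Ben", 120000), ("Chen", 120000), ("Dara", 75000)],
)

# The CTE collapses ties first, so two employees sharing the top salary
# still count as a single "highest" value.
query = """
WITH distinct_salaries AS (
    SELECT DISTINCT salary FROM employee
)
SELECT MAX(salary)
FROM distinct_salaries
WHERE salary < (SELECT MAX(salary) FROM distinct_salaries)
"""

# Yields None when the table has fewer than two distinct salaries.
second_highest = conn.execute(query).fetchone()[0]
print(second_highest)
```

The point worth raising in an interview is how ties and missing values are handled: the subquery form returns NULL rather than an error when no second-highest salary exists.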

2. Data Warehousing

What is the difference between a star schema and a snowflake schema?
Explain ETL vs ELT – which one do you prefer and why?
What are fact tables and dimension tables?
How do you handle late-arriving facts?
What are OLTP and OLAP?

3. Big Data & Distributed Systems

What is Hadoop? How does HDFS work?
Explain the MapReduce paradigm.
What is partitioning and bucketing in Hive?
What is the difference between Hive and Presto (or Athena)?
How do you handle large datasets that don’t fit in memory?

4. Data Pipelines & Orchestration

What tools have you used for data orchestration? (e.g., Airflow, AWS Glue, Step Functions)
How would you design a pipeline to process real-time data?
How do you handle failures in ETL pipelines?
How do you monitor pipeline health and performance?
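
For the question on handling failures in ETL pipelines, one common answer is to retry transient errors with exponential backoff and re-raise once the attempts run out, so the orchestrator can mark the task as failed. The sketch below is plain Python and is not tied to Airflow, Glue, or Step Functions; the function name run_with_retries and its parameters are hypothetical, chosen only for illustration.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ... seconds
    between attempts. The sleep function is injectable so the behaviour can
    be tested without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            # Out of attempts: surface the original error to the caller.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In a real pipeline you would usually retry only errors known to be transient (timeouts, throttling) and make each step idempotent so that a retry cannot load the same data twice.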

5. Cloud Services (AWS/Azure/GCP)

What is the difference between S3 and EBS?
How does AWS Glue work? What is the difference between a Glue job and a Glue crawler?
How would you implement CI/CD for AWS Glue or Redshift?
What is Redshift Spectrum?
Compare Amazon Redshift and Snowflake.

6. Programming & Data Transformation

What’s the difference between Pandas and PySpark?
How do you handle null values in Spark?
What is the difference between RDDs and DataFrames in PySpark?
How do you optimize PySpark jobs?
Explain broadcast joins and when to use them.

7. Data Governance & Quality

How do you ensure data quality in pipelines?
What tools have you used for data cataloging and lineage?
How do you handle schema evolution?
What is GDPR and how does it affect data engineering?

8. Scenario-Based Questions

How would you migrate an on-prem ETL pipeline to the cloud?
Design a data warehouse for a ride-sharing app (like Uber).
You need to deduplicate billions of rows from a clickstream – how would you approach it?
How do you handle schema changes in production pipelines?
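
For the clickstream deduplication question, a typical approach is to hash-partition rows by the deduplication key so that all copies of a row land in the same partition, then deduplicate each partition independently. The sketch below shows the idea in plain Python on a small in-memory list; at real scale the same logic would run in a distributed engine such as Spark. The field names event_id and ts, and the rule of keeping the record with the latest timestamp, are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a key to a stable partition number using a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def deduplicate(events, num_partitions=4):
    """Keep one event per event_id: the one with the largest ts."""
    # Step 1: route every event to a partition by its key, so duplicates
    # of the same event_id always end up together.
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_of(event["event_id"], num_partitions)].append(event)

    # Step 2: deduplicate each partition on its own; only one partition's
    # keys need to be held in memory at a time.
    result = []
    for pid in sorted(partitions):
        latest = {}
        for event in partitions[pid]:
            current = latest.get(event["event_id"])
            if current is None or event["ts"] > current["ts"]:
                latest[event["event_id"]] = event
        result.extend(latest.values())
    return result
```

A good answer also covers what counts as a duplicate (exact row versus same key), how to pick the surviving record, and how to bound the work with a time window when duplicates only arrive close together.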
