Spark: Key Topics for Data Engineering Interviews Part - 2
5 min read · Apr 29, 2025
In this continuation of Key Topics for Data Engineering Interviews Part 1, I will explore some more crucial concepts that not only illuminate the inner workings of Spark but also serve as key markers in Spark interviews.
Whether you’re gearing up for a technical discussion or simply looking to deepen your understanding, this exploration promises to be a rewarding endeavor into the core of Spark’s essence.
Let's get started:
1️⃣1️⃣ Window Functions vs. Group By
groupBy:
- Functionality: Aggregates multiple rows into a single result per group.
- When to Use: Use GROUP BY when you need pure aggregation, such as totals, averages, or counts, without needing details about individual rows.
- Performance: Efficient for simple aggregations over grouped datasets. However, a shuffle still happens, especially with large groups.
- Example:

```python
avg_salary_df = df.groupBy("dept_id").agg(avg("salary").alias("avg_salary"))
```
Window functions:
- Functionality: Adds aggregated information to each row without collapsing rows.