PySpark: Interview Questions (Coding) — Part 1
Navigating through a PySpark interview can be a challenging endeavor, especially when faced with scenario-based questions that test not only one’s theoretical knowledge but also practical problem-solving skills.
In this article, I'll share several questions I personally encountered during interviews, along with other important ones. I believe these questions will give you a sense of the types of questions asked, whether in online assessments or face-to-face interviews.
Q1. ClickStream
Given a clickstream of user activity data, find the relevant user session for each click event.
click_time | user_id
2018-01-01 11:00:00 | u1
2018-01-01 12:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 14:00:00 | u1
2018-01-01 15:00:00 | u1
2018-01-01 11:00:00 | u2
2018-01-02 11:00:00 | u2
Session definition:
1. A session expires after 30 minutes of inactivity (no clickstream events are generated during inactivity).
2. A session remains active for a maximum total of 2 hours.
Sol -
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, lag, when, lit, concat, sum
from pyspark.sql.window import Window
# Create a SparkSession
spark = SparkSession.builder \
.appName("ClickStreamSession") \
.getOrCreate()
# Define the schema for the clickstream data
schema = "click_time STRING, user_id STRING"
# Sample clickstream data
data = [
("2018-01-01 11:00:00", "u1"),
("2018-01-01 12:00:00", "u1"),
("2018-01-01 13:00:00", "u1"),
("2018-01-01 13:00:00", "u1"),
("2018-01-01 14:00:00", "u1"),
("2018-01-01 15:00:00", "u1"),
("2018-01-01 11:00:00", "u2"),
("2018-01-02 11:00:00", "u2")
]
# Create a DataFrame from the given sample data
clickstream_df = spark.createDataFrame(data, schema=schema)
# Convert click_time to Unix timestamp for easier calculations
clickstream_df = clickstream_df.withColumn("click_timestamp", unix_timestamp("click_time"))
session_window = Window.partitionBy("user_id").orderBy("click_timestamp")
# Previous click time for each user using lag
clickstream_df = clickstream_df.withColumn("prev_click_timestamp", lag("click_timestamp", 1).over(session_window))
# Difference from the previous click, converted to minutes
clickstream_df = clickstream_df.withColumn("timestamp_diff", (col("click_timestamp")-col("prev_click_timestamp"))/60)
# Replace null (a user's first click) with 0
clickstream_df = clickstream_df.withColumn("timestamp_diff", when(col("timestamp_diff").isNull(), 0).otherwise(col("timestamp_diff")))
# Flag a new session when inactivity exceeds 30 minutes
clickstream_df = clickstream_df.withColumn("session_new", when(col("timestamp_diff") > 30, 1).otherwise(0))
# Session name per user: running sum of the new-session flags
clickstream_df = clickstream_df.withColumn("session_new_name", concat(col("user_id"), lit("--S"), sum(col("session_new")).over(session_window)))
clickstream_df.show()
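Note: this solution implements the 30-minute inactivity rule; enforcing the 2-hour cap would need an additional check on total session duration. As an alternative sketch (assuming Spark 3.2+), the built-in session_window function can derive the same 30-minute sessions directly; it is referenced as F.session_window here to avoid clashing with the session_window WindowSpec defined above:
# Alternative sketch (Spark 3.2+): built-in session windows with a 30-minute gap
from pyspark.sql import functions as F
sessions_df = clickstream_df \
    .groupBy("user_id", F.session_window(F.col("click_time").cast("timestamp"), "30 minutes")) \
    .agg(F.count("*").alias("clicks"))
sessions_df.show(truncate=False)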
Q2. Max Salary
The task is to find the job titles of the highest-paid employees. The output should include the highest-paid title, or multiple titles if they share the same top salary.
Input —
worker:
|worker_id|first_name|last_name|salary|joining_date| department|
| 1| John| Doe| 10000| 2023-01-01|Engineering|
| 2| Jane| Smith| 12000| 2022-12-01| Marketing|
| 3| Alice| Johnson| 12000| 2022-11-01|Engineering|
title:
|worker_ref_id|worker_title|affected_from|
| 1| Engineer| 2022-01-01|
| 2| Manager| 2022-01-01|
| 3| Engineer| 2022-01-01|
Output —
|worker_id|first_name|last_name|best_paid_title|salary|
| 3| Alice| Johnson| Engineer| 12000|
| 2| Jane| Smith| Manager| 12000|
Sol -
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank
spark = SparkSession.builder \
.appName("HighestPaidJobTitles") \
.getOrCreate()
worker_data = [(1, 'John', 'Doe', 10000, '2023-01-01', 'Engineering'),
(2, 'Jane', 'Smith', 12000, '2022-12-01', 'Marketing'),
(3, 'Alice', 'Johnson', 12000, '2022-11-01', 'Engineering')]
columns = ['worker_id', 'first_name', 'last_name', 'salary', 'joining_date', 'department']
worker = spark.createDataFrame(worker_data, columns)
title_data = [(1, 'Engineer', '2022-01-01'),
(2, 'Manager', '2022-01-01'),
(3, 'Engineer', '2022-01-01')]
columns = ['worker_ref_id', 'worker_title', 'affected_from']
title = spark.createDataFrame(title_data, columns)
joined_df = worker.join(title, worker.worker_id == title.worker_ref_id)
ranked_df = joined_df.withColumn("salary_rank", rank().over(Window.orderBy(joined_df["salary"].desc())))
highest_paid_df = ranked_df.filter(ranked_df["salary_rank"] == 1)
result_df = highest_paid_df.select("worker_id", "first_name", "last_name", "worker_title", "salary").withColumnRenamed('worker_title', 'best_paid_title')
result_df.show()
spark.stop()
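For reference, a rough Spark SQL equivalent (a sketch; it assumes the worker and title DataFrames above are registered as temp views and that it runs before spark.stop() is called):
# Spark SQL sketch of the same logic
worker.createOrReplaceTempView("worker")
title.createOrReplaceTempView("title")
spark.sql("""
    SELECT w.worker_id, w.first_name, w.last_name,
           t.worker_title AS best_paid_title, w.salary
    FROM worker w
    JOIN title t ON w.worker_id = t.worker_ref_id
    WHERE w.salary = (SELECT MAX(salary) FROM worker)
""").show()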
Q3. Highest and Lowest Salary
You have been asked to find the employees with the highest and lowest salary from the below sample data for the worker and title tables. The output includes a column salary_type that categorizes the rows as follows:
'Highest Salary' represents the highest salary
'Lowest Salary' represents the lowest salary
worker:
|worker_id|first_name|last_name|salary|joining_date| department|
| 1| John| Doe| 5000| 2023-01-01|Engineering|
| 2| Jane| Smith| 6000| 2023-01-15| Marketing|
| 3| Alice| Johnson| 4500| 2023-02-05|Engineering|
title:
|worker_ref_id|worker_title|affected_from|
| 1| Engineer| 2022-01-01|
| 2| Manager| 2022-01-01|
| 3| Engineer| 2022-01-01|
Sol -
from pyspark.sql import SparkSession
from pyspark.sql.functions import max, min, when, col
from pyspark.sql.window import Window
spark = SparkSession.builder \
.appName("HighestLowestSalaryEmployees") \
.getOrCreate()
worker_data = [
(1, 'John', 'Doe', 5000, '2023-01-01', 'Engineering'),
(2, 'Jane', 'Smith', 6000, '2023-01-15', 'Marketing'),
(3, 'Alice', 'Johnson', 4500, '2023-02-05', 'Engineering')
]
title_data = [
(1, 'Engineer', '2022-01-01'),
(2, 'Manager', '2022-01-01'),
(3, 'Engineer', '2022-01-01')
]
worker_columns = ['worker_id', 'first_name', 'last_name', 'salary', 'joining_date', 'department']
title_columns = ['worker_ref_id', 'worker_title', 'affected_from']
worker_df = spark.createDataFrame(worker_data, worker_columns)
title_df = spark.createDataFrame(title_data, title_columns)
worker_df.show()
title_df.show()
joined_df = worker_df.join(title_df, worker_df.worker_id == title_df.worker_ref_id, "inner")
# Overall highest and lowest salary across all employees (window over the full dataset)
salary_window = Window.partitionBy()
result_df = joined_df \
    .withColumn("max_salary", max("salary").over(salary_window)) \
    .withColumn("min_salary", min("salary").over(salary_window))
result_df = result_df.withColumn("salary_type",
    when(result_df["salary"] == result_df["max_salary"], "Highest Salary")
    .when(result_df["salary"] == result_df["min_salary"], "Lowest Salary")
    .otherwise(None)) \
    .filter(col("salary_type").isNotNull()) \
    .select("worker_id", "first_name", "last_name", "department", "salary", "salary_type")
result_df.show()
spark.stop()
Q4. Pivot the Column into Rows
Given Input -
StudentID, StudentName , AScore, BScore,CScore
123, A, 30, 31, 32
124, B, 40, 41, 42
Get the output in below format -
StudentID, StudentName , Subject , Score
123, A, AScore, 30
123, A, BScore, 31
123, A, CScore, 32
124, B, AScore, 40
124, B, BScore, 41
124, B, CScore, 42
Sol -
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName("Transform Data") \
.getOrCreate()
data = [
(123, "A", 30, 31, 32),
(124, "B", 40, 41, 42),
(125, "B", 50, 51, 52)
]
df = spark.createDataFrame(data, ["StudentID", "StudentName", "AScore", "BScore", "CScore"])
pivot_df = df.selectExpr(
"StudentID",
"StudentName",
"stack(3, 'AScore', AScore, 'BScore', BScore, 'CScore', CScore) as (Subject, Score)"
)
pivot_df.show()
Note: "stack(3, 'AScore', AScore, 'BScore', BScore, 'CScore', CScore)"
: This part uses the stack
function to pivot the DataFrame. The first argument 3
indicates the number of key-value pairs to stack. The subsequent arguments are key-value pairs where the first argument is the key (in this case, column name) and the second argument is the value (the column value). So, for each row, three key-value pairs are created: ('AScore', AScore)
, ('BScore', BScore)
, and ('CScore', CScore)
. The stack
function stacks these key-value pairs into rows
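As an alternative (a sketch, assuming Spark 3.4+), the DataFrame.unpivot method produces the same output without writing a stack expression:
# Unpivot sketch (Spark 3.4+)
pivot_df = df.unpivot(
    ["StudentID", "StudentName"],   # id columns kept as-is
    ["AScore", "BScore", "CScore"], # columns to turn into rows
    "Subject",                      # name of the new key column
    "Score"                         # name of the new value column
)
pivot_df.show()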
Q5. Repeat IDs
Given Input —
id
1
2
3
Output —
id
1
2
2
3
3
3
Sol -
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Repeat ID") \
.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
output_df = df.selectExpr("explode(sequence(1, id)) as id")
output_df.show()
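A variant of the same idea (a sketch) repeats each id value id times with array_repeat and then explodes the array; the cast keeps the repeat count an integer as the function expects:
# array_repeat sketch: id repeated id times, then exploded into rows
output_df = df.selectExpr("explode(array_repeat(id, cast(id as int))) as id")
output_df.show()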
Q6. Group Rows
Input —
col1|col2|col3
alpha| aa| 1
alpha| aa| 2
beta| bb| 3
beta| bb| 4
beta| bb| 5
Output —
col1|col2|col3_list
alpha| aa| [1, 2]
beta| bb|[3, 4, 5]
Sol -
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName("HighestLowestSalaryEmployees") \
.getOrCreate()
data = [("alpha", "aa", 1),
("alpha", "aa", 2),
("beta", "bb", 3),
("beta", "bb", 5),
("beta", "bb", 4)]
schema = ["col1", "col2", "col3"]
df = spark.createDataFrame(data, schema=schema)
df.show()
df_grouped = df.groupBy("col1", "col2").agg(collect_list("col3").alias("col3_list"))
df_grouped.show()
spark.stop()
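One caveat: collect_list does not guarantee the order of the collected values. To reliably match the sorted lists in the expected output, the aggregation can be wrapped in sort_array (run before spark.stop()):
# Sorted variant: deterministic ordering inside each list
df_grouped = df.groupBy("col1", "col2").agg(sort_array(collect_list("col3")).alias("col3_list"))
df_grouped.show()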
Q7. JSON Data
Read the below JSON file —
[
{
"dept_id": 102,
"e_id": [
10201,
10202
]
},
{
"dept_id": 101,
"e_id": [
10101,
10102,
10103
]
}
]
Output —
dept_id | e_id
101 | 10101
101 | 10102
101 | 10103
102 | 10201
102 | 10202
Sol -
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("ReadJSON") \
.getOrCreate()
df = spark.read.option("multiline", "true").json('sample_data.json')
df_exploded = df.selectExpr("dept_id", "explode(e_id) as e_id")
# df_exploded = df.select("dept_id", explode('e_id').alias('e_id'))
df_exploded.show()
Note:
You need to use option("multiline", "true"); otherwise, you will get the below exception:
AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default).
This is because, by default, PySpark treats every record in a JSON file as a fully qualified record on a single line. The PySpark JSON data source provides the multiline option to read records that span multiple lines.
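Optionally, an explicit schema can be supplied so Spark does not have to infer it from the file (a sketch; the field types are assumed from the sample data):
# Explicit schema sketch for the multiline JSON read
from pyspark.sql.types import StructType, StructField, LongType, ArrayType
json_schema = StructType([
    StructField("dept_id", LongType(), True),
    StructField("e_id", ArrayType(LongType()), True)
])
df = spark.read.option("multiline", "true").schema(json_schema).json("sample_data.json")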
Q8. Employee Total Hours Inside
From the below data, find the total number of hours each employee was inside the office.
Input —
emp_id| punch_time|flag
11114|1900-01-01 08:30:00| I
11114|1900-01-01 10:30:00| O
11114|1900-01-01 11:30:00| I
11114|1900-01-01 15:30:00| O
11115|1900-01-01 09:30:00| I
11115|1900-01-01 17:30:00| O
Sol —
import datetime
from pyspark.sql.types import StructType, StructField, TimestampType, LongType, StringType
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName("TotalInTime") \
.getOrCreate()
_data = [
(11114, datetime.datetime.strptime('08:30:00.00', "%H:%M:%S.%f"), "I"),
(11114, datetime.datetime.strptime('10:30:00.00', "%H:%M:%S.%f"), 'O'),
(11114, datetime.datetime.strptime('11:30:00.00', "%H:%M:%S.%f"), 'I'),
(11114, datetime.datetime.strptime('15:30:00.00', "%H:%M:%S.%f"), 'O'),
(11115, datetime.datetime.strptime('09:30:00.00', "%H:%M:%S.%f"), 'I'),
(11115, datetime.datetime.strptime('17:30:00.00', "%H:%M:%S.%f"), 'O')
]
# Schema
_schema = StructType([
StructField('emp_id', LongType(), True),
StructField('punch_time', TimestampType(), True),
StructField('flag', StringType(), True)
])
df = spark.createDataFrame(data=_data, schema=_schema)
window_agg = Window.partitionBy('emp_id').orderBy(col('punch_time'))
df = df.withColumn('prev_time', lag(col('punch_time')).over(window_agg))
df = df.withColumn('time_diff', (col('punch_time').cast('long') - col('prev_time').cast('long'))/3600)
df = df.groupBy('emp_id').agg(sum(when(col('flag') == 'O', col('time_diff')).otherwise(0)).alias('total_time'))
df.show()
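For the sample data, this gives a total_time of 6.0 hours for employee 11114 (2 hours plus 4 hours inside) and 8.0 hours for employee 11115.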
Q9. Employee with Manager
From the given dataset, fetch each manager along with the count of employees reporting to them.
Input —
employee_id|first_name|manager_id
4529| Nancy| 4125
4238| John| 4329
4329| Martina| 4125
4009| Klaus| 4329
4125| Mafalda| NULL
4500| Jakub| 4529
4118| Moira| 4952
4012| Jon| 4952
4952| Sandra| 4529
4444| Seamus| 4329
Output —
manager_id|manager_name|count
4125| Mafalda| 2
4952| Sandra| 2
4329| Martina| 3
4529| Nancy| 2
Sol —
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName("EmployeesWithManagers") \
.getOrCreate()
data = [('4529', 'Nancy', '4125'),
('4238','John', '4329'),
('4329', 'Martina', '4125'),
('4009', 'Klaus', '4329'),
('4125', 'Mafalda', 'NULL'),
('4500', 'Jakub', '4529'),
('4118', 'Moira', '4952'),
('4012', 'Jon', '4952'),
('4952', 'Sandra', '4529'),
('4444', 'Seamus', '4329')]
schema = ['employee_id', 'first_name', 'manager_id']
df = spark.createDataFrame(data=data, schema=schema)
# Self-join the DataFrame to get manager names
result_df = df.alias("e").join(df.alias("m"), col("e.manager_id") == col("m.employee_id"), "inner") \
.select(col("e.employee_id"), col("e.first_name"),
col("e.manager_id"), col("m.first_name").alias("manager_name"))
result_df.groupBy("manager_id", "manager_name").count().show()
spark.stop()
Q10. Count the Frequency of each Word
Write a word count function using PySpark that covers the following steps —
1. read the data from the input.txt file
2. lower the words
3. remove any punctuations
4. Remove the None, blank strings and digits
5. sort the words in descending order of count
6. write the result to a CSV file
Test Input —
test_data = [
"Hello world! This is a test.",
"Spark is awesome, isn't it?",
"Test test test."
"Test 0 21."
]
Output —
word|count
test| 4
is| 2
a| 1
awesome| 1
spark| 1
this| 1
world| 1
hello| 1
isn| 1
it| 1
Sol —
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
def word_count(output_file_path):
    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("WordCount") \
        .getOrCreate()
    # Test data (note: there is no comma between the last two strings, so Python
    # concatenates them into a single line, matching the test input above)
    test_data = [
        "Hello world! This is a test.",
        "Spark is awesome, isn't it?",
        "Test test test."
        "Test 0 21."
    ]
    # Create DataFrame from test data
    input_df = spark.createDataFrame(test_data, StringType())
    input_df = input_df.select(split(col("value"), " ").alias("line"))
    input_df = input_df.select(explode(col("line")).alias("value"))
    # Lower-case the words
    input_df = input_df.withColumn("value", lower(col("value")))
    # Extract the first run of letters from each token, dropping punctuation and digits
    words_df = input_df.select(regexp_extract(col("value"), "[a-z]+", 0).alias("word"))
    # Remove null or blank strings (digits are already stripped by the regex above)
    words_df = words_df.filter(col("word").isNotNull() & (col("word") != ""))
    # Perform word count
    word_count_df = words_df.groupBy("word").count()
    # Sort the words in descending order of count
    word_count_df = word_count_df.orderBy(col("count").desc())
    word_count_df.show()
    # Write the word count results to a CSV file
    word_count_df.coalesce(1).write.csv(output_file_path, header=True)
    # Stop SparkSession
    spark.stop()
# Usage example:
output_file_path = "output_word_count_result.csv"
word_count(output_file_path)
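For comparison, here is a minimal RDD-style sketch of the same word count (it assumes test_data and an active SparkSession named spark are in scope, e.g. placed inside the function before spark.stop()):
# RDD-style word count sketch
import re

def first_word(token):
    # Mirrors regexp_extract(col, "[a-z]+", 0): first run of letters, or "" if none
    m = re.search(r"[a-z]+", token)
    return m.group(0) if m else ""

counts = (spark.sparkContext.parallelize(test_data)
          .flatMap(lambda line: line.lower().split(" "))
          .map(first_word)
          .filter(lambda w: w != "")
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b)
          .sortBy(lambda kv: kv[1], ascending=False))
print(counts.collect())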
Winding Up
I hope these questions give you an idea of the types of questions asked, along with the confidence to navigate PySpark interviews and demonstrate your expertise in handling big data processing tasks. All of these questions involve different PySpark functions, and it is important to know and understand their usage when you are asked to generate the required output. And yes, there are multiple ways to answer them; I have only shared my approach.
I will share more questions in subsequent parts of the PySpark Interview Questions articles. I hope this encourages you to delve deeper into PySpark's functionality and to engage in hands-on practice.