
PySpark: Interview Questions (Coding) — Part 1

Journey through Interview Scenarios

Pravash
8 min read · Jul 17, 2024

Navigating a PySpark interview can be challenging, especially when faced with scenario-based questions that test not only theoretical knowledge but also practical problem-solving skills.

In this article, I'll share several questions I personally encountered during interviews, along with other important ones. I believe they will give you a good sense of the kinds of questions asked, whether in online assessments or face-to-face interviews.

Q1. ClickStream

Given a clickstream of user activity data, find the relevant user session for each click event.

click_time          | user_id
2018-01-01 11:00:00 | u1
2018-01-01 12:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 14:00:00 | u1
2018-01-01 15:00:00 | u1
2018-01-01 11:00:00 | u2
2018-01-02 11:00:00 | u2

Session definition:
1. A session expires after 30 minutes of inactivity (during inactivity, no clickstream events are generated).
2. A session remains active for a maximum of 2 hours in total.

Solution:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, lag, when, lit, concat, sum, monotonically_increasing_id
from pyspark.sql.window import Window

# Create a SparkSession
spark = SparkSession.builder \
    .appName("ClickStreamSession") \
    .getOrCreate()

# Define the schema for…
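
The published solution is truncated at this point. What follows is a minimal sketch of one way to finish the sessionization for the 30-minute inactivity rule; the sample data comes from the question, but the column and variable names beyond the snippet above are my own assumptions, not necessarily the author's. The 2-hour cap is discussed in the responses below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, lag, when, lit, concat, sum as _sum
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ClickStreamSession").getOrCreate()

# Sample clickstream data from the question
data = [
    ("2018-01-01 11:00:00", "u1"), ("2018-01-01 12:00:00", "u1"),
    ("2018-01-01 13:00:00", "u1"), ("2018-01-01 13:00:00", "u1"),
    ("2018-01-01 14:00:00", "u1"), ("2018-01-01 15:00:00", "u1"),
    ("2018-01-01 11:00:00", "u2"), ("2018-01-02 11:00:00", "u2"),
]
clickstream_df = spark.createDataFrame(data, ["click_time", "user_id"]) \
    .withColumn("click_ts", unix_timestamp(col("click_time")))

# Compare each click with the user's previous click and flag a new session
# whenever there is no previous click or the gap exceeds 30 minutes (1800 s).
user_window = Window.partitionBy("user_id").orderBy("click_ts")
clickstream_df = clickstream_df \
    .withColumn("prev_ts", lag("click_ts").over(user_window)) \
    .withColumn("new_session",
                when(col("prev_ts").isNull() |
                     ((col("click_ts") - col("prev_ts")) > 1800), 1).otherwise(0))

# A running sum of the flags numbers the sessions per user; combining it with
# user_id gives a session identifier for every click event.
clickstream_df = clickstream_df \
    .withColumn("session_num", _sum("new_session").over(user_window)) \
    .withColumn("session_id",
                concat(col("user_id"), lit("_s"), col("session_num").cast("string")))

clickstream_df.select("click_time", "user_id", "session_id").show(truncate=False)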



Written by Pravash

I am a passionate Data Engineer and Technology Enthusiast. Here I am using this platform to share my knowledge and experience on tech stacks.

Responses (6)


For Q5: your solution is not working.
df = spark_session.createDataFrame([1, 2, 3], IntegerType()).toDF("id")
df.show()
result_df = df.withColumn("duplicator", sequence(lit(1), col("id"))) \
    .select("id", explode(col("duplicator"))).drop("col")
result_df.show()...
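
For anyone who wants to try the commenter's idea end to end, a self-contained version of the same sequence-plus-explode pattern might look like the sketch below; Q5 itself is not visible in this excerpt, so the single-column DataFrame is only an assumption.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lit, sequence
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("DuplicateRowsById").getOrCreate()

# A single "id" column with values 1, 2, 3
df = spark.createDataFrame([1, 2, 3], IntegerType()).toDF("id")

# sequence(1, id) builds an array [1, ..., id]; exploding it produces one row
# per array element, so each id is repeated id times. explode's default output
# column is named "col", which is dropped afterwards.
result_df = (
    df.withColumn("duplicator", sequence(lit(1), col("id")))
      .select("id", explode(col("duplicator")))
      .drop("col")
)
result_df.show()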


For the 2nd and 3rd:
salary_window = Window.orderBy(desc(col("salary")))
df = worker.withColumn("rank", dense_rank().over(salary_window))
lowest_salary_rank = 1
highest_salary_rank = df.select(max("rank")).first()[0]
result = …
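
Q2 and Q3 are not shown in this excerpt, but the dense_rank pattern the commenter sketches can be run against a hypothetical worker table with a salary column, for example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, dense_rank, max as _max
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("SalaryRanks").getOrCreate()

# Hypothetical worker data; the real schema is not visible in this excerpt.
worker = spark.createDataFrame(
    [("a", 1000), ("b", 3000), ("c", 3000), ("d", 2000), ("e", 500)],
    ["name", "salary"],
)

# dense_rank over salary sorted descending: rank 1 is the highest salary and
# the maximum rank is the lowest salary.
salary_window = Window.orderBy(desc(col("salary")))
ranked = worker.withColumn("rank", dense_rank().over(salary_window))

max_rank = ranked.select(_max("rank")).first()[0]
ranked.filter(col("rank") == 1).show()          # highest salary
ranked.filter(col("rank") == max_rank).show()   # lowest salary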


In Q1, the code only addresses the 30-minute inactivity rule but doesn't enforce the 2-hour session limit.
I guess you can add this:
# Update session ID to account for the 2-hour rule
clickstream_df = clickstream_df.withColumn("final_session_id", sum("break_due_to_duration").over(session_window) + col("session_id"))
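
The break_due_to_duration column and session_window referenced above come from parts of the article not shown here. One possible way to enforce the 2-hour cap, building on the inactivity sketch under Q1 (and reusing its clickstream_df with click_ts and session_id), is to take the first click of each inactivity-based session as its start and cut a new sub-session every full 2 hours after it:

from pyspark.sql.functions import col, concat, first, floor, lit
from pyspark.sql.window import Window

# Start time of each inactivity-based session, then a 2-hour (7200 s) bucket
# counted from that start; each bucket becomes its own capped session.
session_window = Window.partitionBy("session_id").orderBy("click_ts")
capped_df = (
    clickstream_df
    .withColumn("session_start_ts", first("click_ts").over(session_window))
    .withColumn("two_hour_bucket",
                floor((col("click_ts") - col("session_start_ts")) / 7200))
    .withColumn("final_session_id",
                concat(col("session_id"), lit("_"),
                       col("two_hour_bucket").cast("string")))
)
capped_df.select("click_time", "user_id", "final_session_id").show(truncate=False)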
