PySpark: Interview Questions (Coding) — Part 1

Journey through Interview Scenarios

Pravash
8 min read · Jul 17, 2024


Navigating a PySpark interview can be challenging, especially when faced with scenario-based questions that test not only theoretical knowledge but also practical problem-solving skills.

In this article, I'll share several questions I personally encountered during interviews, along with a few other important ones. I believe they will give you a good sense of the kinds of problems you may face, whether in online assessments or face-to-face interviews.

Q1. ClickStream

Given a clickstream of user activity data, find the relevant user session for each click event.

click_time | user_id
2018-01-01 11:00:00 | u1
2018-01-01 12:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 13:00:00 | u1
2018-01-01 14:00:00 | u1
2018-01-01 15:00:00 | u1
2018-01-01 11:00:00 | u2
2018-01-02 11:00:00 | u2

Session definition:
1. A session expires after 30 minutes of inactivity (no click events are generated while the user is inactive).
2. A session remains active for a maximum of 2 hours in total.

Solution:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, lag, when, lit, concat, sum, monotonically_increasing_id
from pyspark.sql.window import Window

# Create a SparkSession
spark = SparkSession.builder \
    .appName("ClickStreamSession") \
    .getOrCreate()

# Define the schema for…
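
The published solution is cut off at this point, so here is a minimal, self-contained sketch of one way to implement the session rules above. It is my own reconstruction, not the author's original code: the column names (new_session, inactivity_session, two_hour_bucket) and the choice to enforce the 2-hour cap by splitting each inactivity-based session into fixed 2-hour buckets counted from the session's first click are assumptions on my part.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ClickStreamSession").getOrCreate()

# Sample data from the problem statement
data = [
    ("2018-01-01 11:00:00", "u1"),
    ("2018-01-01 12:00:00", "u1"),
    ("2018-01-01 13:00:00", "u1"),
    ("2018-01-01 13:00:00", "u1"),
    ("2018-01-01 14:00:00", "u1"),
    ("2018-01-01 15:00:00", "u1"),
    ("2018-01-01 11:00:00", "u2"),
    ("2018-01-02 11:00:00", "u2"),
]
df = (
    spark.createDataFrame(data, ["click_time", "user_id"])
    .withColumn("click_time", F.to_timestamp("click_time"))
)

user_window = Window.partitionBy("user_id").orderBy("click_time")

# Rule 1: a gap of more than 30 minutes since the previous click starts a new session.
# A cumulative sum of the "new session" flag assigns each click an inactivity-based session number.
df = (
    df.withColumn("prev_time", F.lag("click_time").over(user_window))
    .withColumn(
        "new_session",
        F.when(
            F.col("prev_time").isNull()
            | ((F.unix_timestamp("click_time") - F.unix_timestamp("prev_time")) > 30 * 60),
            F.lit(1),
        ).otherwise(F.lit(0)),
    )
    .withColumn("inactivity_session", F.sum("new_session").over(user_window))
)

# Rule 2: cap each inactivity-based session at 2 hours by splitting it into
# 2-hour buckets measured from the session's first click.
session_window = Window.partitionBy("user_id", "inactivity_session").orderBy("click_time")
df = (
    df.withColumn("session_start", F.first("click_time").over(session_window))
    .withColumn(
        "two_hour_bucket",
        F.floor(
            (F.unix_timestamp("click_time") - F.unix_timestamp("session_start")) / (2 * 60 * 60)
        ),
    )
    .withColumn(
        "session_id",
        F.concat_ws(
            "_",
            F.col("user_id"),
            F.col("inactivity_session").cast("string"),
            F.col("two_hour_bucket").cast("string"),
        ),
    )
)

df.select("user_id", "click_time", "session_id") \
    .orderBy("user_id", "click_time") \
    .show(truncate=False)

With the sample data above, every click for u1 is more than 30 minutes after the previous one (except the duplicate 13:00 event), so most clicks fall into their own session; the 2-hour bucket only matters when a user keeps clicking frequently for longer than 2 hours.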
