(2) Apache Spark Interview
1. What is a DAG in Spark, and what is its purpose?
A DAG (Directed Acyclic Graph) is a sequence of computation stages, where each stage consists of a set of tasks that can run in parallel.
The DAG Scheduler splits a job into stages of tasks based on shuffle boundaries and executes those stages in the correct order to compute the final result.
The purpose of the DAG is to optimize the execution plan for performance and fault tolerance.
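A minimal PySpark sketch of this idea, assuming an illustrative dataset and app name: the reduceByKey below introduces a shuffle boundary, which is exactly where the DAG scheduler splits the job into two stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
pairs = words.map(lambda w: (w, 1))               # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda x, y: x + y)    # wide: shuffle => new stage

# The lineage shows where the shuffle boundary sits; that boundary is where
# the DAG scheduler cuts the job into stages.
lineage = counts.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

counts.collect()   # the action triggers execution of the whole DAG
spark.stop()
```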
2. If data can be spilled to disk, why do we encounter OOM errors?
- Executor memory limit: the intermediate RDDs are too large to fit in memory.
- Memory leaks: caused by inefficient code or bugs.
- Improperly tuned memory settings: the executors or the driver do not have enough memory.
- Skewed data: some partitions are significantly larger than others, so memory usage is uneven.
- Complex transformations: large amounts of intermediate data are kept in memory at the same time.
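A sketch of the memory-related settings that are commonly tuned when these OOM situations show up; the concrete values below are illustrative examples, not recommendations, and in a real cluster they would usually be passed through spark-submit or cluster configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-demo")
    .master("local[*]")                             # local master only so the sketch runs standalone
    .config("spark.executor.memory", "8g")          # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap / overhead per executor
    .config("spark.driver.memory", "4g")            # driver heap (collect() results land here)
    .config("spark.memory.fraction", "0.6")         # share of heap for execution + storage
    .config("spark.sql.shuffle.partitions", "400")  # more shuffle partitions => smaller per-task data
    .getOrCreate()
)
```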
3. How does Spark work internally?
Job Submission: The user submits a Spark job via a SparkContext.
DAG Creation: The SparkContext creates a logical plan in the form of a DAG.
Job Scheduling: The DAG scheduler splits the job into stages based on shuffle boundaries.
Task Assignment: The stages are further divided into tasks and assigned to executor nodes.
Task Execution: Executors run the tasks, processing data and storing intermediate results in memory or disk.
Shuffle Operations: Data is shuffled between executors as required by operations like joins.
Result Collection: The final results are collected back to the driver node or written to storage.
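A minimal sketch that walks through this lifecycle end to end (the data, app name, and partition count are illustrative):

```python
from pyspark.sql import SparkSession

# 1) Job submission context: the driver creates a SparkSession / SparkContext.
spark = SparkSession.builder.appName("lifecycle-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# 2) DAG creation: transformations are lazy, so nothing executes yet.
rdd = sc.parallelize(range(1, 1001), numSlices=8)
doubled = rdd.map(lambda x: x * 2)
summed = doubled.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)  # shuffle boundary

# 3-6) Scheduling, task assignment, execution, and the shuffle all happen when
#      an action is called; 7) the results are collected back to the driver.
result = summed.collect()
print(sorted(result))

spark.stop()
```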
4. What are the different phases of the SQL engine?
The different phases of the SQL engine in Spark include:
Parsing: Converts the SQL query into an unresolved logical plan.
Analysis: Resolves the logical plan by determining the correct attributes and data types using the catalog.
Optimization: Optimizes the logical plan using the Catalyst optimizer.
Physical Planning: Converts the optimized logical plan into a physical plan with specific execution strategies.
Execution: Executes the physical plan and returns the result.
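A small sketch showing how to inspect the output of these phases from PySpark: explain(True) prints the parsed and analyzed logical plans, the Catalyst-optimized logical plan, and the physical plan (the DataFrame itself is illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-phases-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
agg = df.filter(F.col("id") > 1).groupBy("key").count()

# Prints parsed, analyzed, and optimized logical plans plus the physical plan.
agg.explain(True)

spark.stop()
```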
5. Explain in detail the different types of transformations in Spark.
Transformations in Spark are of two types:
- Narrow Transformations: These transformations do not require shuffling of data across partitions and can be executed without data movement. Examples include:
– map(): Applies a function to each element.
– filter(): Selects elements that satisfy a condition.
– flatMap(): Similar to map but can return multiple elements for each input element.
- Wide Transformations: These transformations require shuffling of data across partitions and involve data movement. Examples include:
– reduceByKey(): Aggregates data across keys and requires shuffling.
– groupByKey(): Groups data by key, resulting in shuffling of data.
– join(): Joins two RDDs based on a key, involving shuffling.
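A side-by-side sketch of narrow and wide transformations on an illustrative RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is distributed"])

# Narrow: each output partition depends on one input partition (no shuffle).
words = lines.flatMap(lambda line: line.split(" "))
longish = words.filter(lambda w: len(w) > 2)
pairs = longish.map(lambda w: (w, 1))

# Wide: output partitions depend on many input partitions (shuffle required).
counts = pairs.reduceByKey(lambda a, b: a + b)               # aggregates per key
grouped = pairs.groupByKey()                                 # groups all values per key
joined = counts.join(sc.parallelize([("spark", "engine")]))  # shuffles both sides by key

print(counts.collect())
spark.stop()
```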
6. Spark Jobs: Stage, Shuffle, Tasks, Slots
- Spark creates one job for each action.
- This job may contain a series of multiple transformations.
- The Spark engine will optimize those transformations and create a logical plan for the job.
- Then Spark will break the logical plan at the end of every wide dependency and create two or more stages.
- If you do not have a wide dependency, your plan will be a single-stage plan.
- But if you have N wide dependencies, your plan will have N + 1 stages.
- Data from one stage to another stage is shared using the shuffle/sort operation.
- Now each stage may be executed as one or more parallel tasks.
- The number of tasks in the stage is equal to the number of input partitions.
- The task is the most important concept in a Spark job: it is the smallest unit of work in the job.
- The Spark driver assigns these tasks to the executors and asks them to execute them.
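A small sketch tying these concepts together, assuming illustrative data and partition counts: one action triggers one job, the single wide dependency splits it into two stages, and the task counts follow the partition counts.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)    # 4 input partitions
pairs = rdd.map(lambda x: (x % 5, x))            # narrow: same stage
summed = pairs.reduceByKey(lambda a, b: a + b)   # 1 wide dependency => 2 stages

# One action => one job. The first stage runs 4 tasks (one per input partition);
# the second stage runs one task per post-shuffle partition.
summed.count()

print("partitions before shuffle:", rdd.getNumPartitions())
print("partitions after shuffle:", summed.getNumPartitions())

spark.stop()
```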