Keywords: Apache Spark | Standalone Cluster | Worker Process | Executor Process | Core Resource Management | Distributed Computing Architecture | Task Scheduling | Fault Tolerance Mechanism
Abstract: This article provides an in-depth exploration of the core components in Apache Spark standalone cluster architecture—Worker, Executor, and core resource coordination mechanisms. By analyzing Spark's Master/Slave architecture model, it details the communication flow and resource management between Driver, Worker, and Executor. The article systematically addresses key issues including Executor quantity control, task parallelism configuration, and the relationship between Worker and Executor, demonstrating resource allocation logic through specific configuration examples. Additionally, combined with Spark's fault tolerance mechanism, it explains task scheduling and failure recovery strategies in distributed computing environments, offering theoretical guidance for Spark cluster optimization.
Overview of Apache Spark Standalone Cluster Architecture
Apache Spark employs a Master/Slave architecture design. In standalone cluster mode, the system consists of a central coordinator (Driver) and multiple distributed worker nodes (Worker). The Driver serves as the main entry point for applications, responsible for converting user programs into tasks and scheduling them to various Executors for execution. Each Worker node can run one or more Executor processes, which are the entities that actually perform computational tasks.
Core Component Function Analysis
The Driver process is the control center of Spark applications. When users submit applications via spark-submit, a SparkContext object is first instantiated, at which point the application officially becomes the Driver. The Driver's main responsibilities include: converting user program logic into a Directed Acyclic Graph (DAG), decomposing the DAG into multiple Stages, dividing Stages into specific Tasks, and finally scheduling these Tasks to appropriate Executors for execution.
The Worker process is a JVM process running on cluster nodes, started by executing the sbin/start-slave.sh script (renamed sbin/start-worker.sh in Spark 3.1). The Worker's primary function is to report node resource availability (such as CPU cores and memory size) to the Master and to start and manage Executor processes based on the Driver's resource requests. The Worker itself does not directly participate in task computation but acts as a resource container and Executor lifecycle manager.
The Executor process is a JVM process launched on worker nodes for each Spark application, responsible for executing specific tasks assigned by the Driver. Executors are created when the application starts and typically run throughout the entire application lifecycle. Each Executor includes the following key functions: executing Task computation logic; providing in-memory caching for RDDs through Block Manager; returning computation results to the Driver; and maintaining task execution status.
Application Execution Flow
Spark application execution follows this standardized process:
- User submits the application; the Driver process initializes and creates a SparkContext
- Driver requests resources from the cluster manager to launch Executors
- Cluster manager launches Executor processes on Worker nodes
- Executors register with the Driver after startup, establishing direct communication channels
- Driver generates Tasks based on program logic and distributes them to registered Executors
- Executors execute Tasks, cache intermediate data via the Block Manager, and return results to the Driver
- If a Worker or Executor fails, the Driver reschedules affected tasks to other available Executors
- After the application completes, the Driver calls SparkContext.stop() to release all resources
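The flow above can be sketched as a toy simulation in plain Python (no Spark dependency; the ToyDriver/ToyExecutor classes and their methods are illustrative stand-ins, not Spark's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

class ToyExecutor:
    """Stands in for a Spark Executor: runs tasks on a thread pool."""
    def __init__(self, executor_id, cores):
        self.executor_id = executor_id
        self.pool = ThreadPoolExecutor(max_workers=cores)  # one thread per core

    def submit(self, task, partition):
        return self.pool.submit(task, partition)

class ToyDriver:
    """Stands in for the Driver: splits work into tasks, collects results."""
    def __init__(self, executors):
        self.executors = executors  # Executors that have "registered"

    def run_job(self, task, partitions):
        # One task per partition, distributed round-robin across executors.
        futures = [self.executors[i % len(self.executors)].submit(task, p)
                   for i, p in enumerate(partitions)]
        return [f.result() for f in futures]  # results return to the Driver

# Usage: two "executors" with 2 cores each, summing 4 partitions.
driver = ToyDriver([ToyExecutor(0, 2), ToyExecutor(1, 2)])
results = driver.run_job(sum, [[1, 2], [3, 4], [5, 6], [7, 8]])
print(results)  # [3, 7, 11, 15]
```

The round-robin distribution mirrors the idea that the Driver, not the Workers, decides which Executor runs which Task.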
Key Configuration Parameters and Resource Management
Executor Quantity Control: In Spark standalone clusters, by default each Worker node starts one Executor per application. Users can finely control Executor allocation through configuration parameters:
- --executor-cores: specifies the number of CPU cores each Executor can use
- --total-executor-cores: limits the total number of cores used by the application
When Worker nodes have sufficient cores, a single Worker can host multiple Executors. For example, if a Worker has 8 cores and --executor-cores=4 is configured, that Worker can start 2 Executors for the same application.
Task Parallelism Configuration: Executors achieve task parallel execution internally through multithreading. The number of cores per Executor determines its ability to concurrently execute Tasks. Configuring --executor-cores=N means each Executor can run N Task threads simultaneously. This design enables Spark to fully utilize multi-core CPU computing power, improving data processing throughput.
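Under these rules, the number of Executors a Worker can host, and the concurrent task slots it contributes, follow simple integer arithmetic. A minimal sketch (the helper function is illustrative, not part of Spark):

```python
def executors_per_worker(worker_cores: int, executor_cores: int) -> int:
    """How many Executors of `executor_cores` cores each fit on one Worker."""
    return worker_cores // executor_cores

# The example from the text: an 8-core Worker with --executor-cores=4.
print(executors_per_worker(8, 4))   # 2 Executors on that Worker
# Each Executor runs 4 Task threads concurrently, so the Worker
# contributes 2 * 4 = 8 concurrent task slots in total.
print(executors_per_worker(8, 10))  # 0: per-Executor demand exceeds the Worker
```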
Configuration Example Analysis
Consider a cluster with 5 Worker nodes, each equipped with 8 CPU cores. The following analyzes resource allocation under different configurations:
Example 1: Default Configuration
Spark greedily acquires all resources offered by the scheduler, ultimately obtaining 5 Executors (one per Worker), each using 8 cores, for a total of 40 cores.
Example 2: --executor-cores 10 --total-executor-cores 10
Since each Worker has only 8 cores, it cannot meet the requirement of 10 cores per Executor, so no Executors can be launched.
Example 3: --executor-cores 10 --total-executor-cores 50
Similarly, because the core requirement per Executor (10) exceeds the Worker's maximum capacity (8), resources cannot be allocated.
Example 4: --executor-cores 50 --total-executor-cores 50
Executor core requirement far exceeds single Worker capacity, resource allocation fails.
Example 5: --executor-cores 50 --total-executor-cores 10
Although the total core limit is low, the per-Executor core requirement still cannot be met, preventing Executor launch.
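The five outcomes above can be reproduced with a small model of the standalone scheduler's greedy allocation. This is a deliberate simplification (Spark's real scheduler also spreads Executors across Workers under spark.deploy.spreadOut; here we simply pack each Worker in turn), and the function name is illustrative:

```python
def allocate(num_workers, worker_cores, executor_cores, total_executor_cores):
    """Simplified greedy model of standalone-mode Executor allocation.

    Returns (executors_launched, cores_used).
    """
    executors, cores_used = 0, 0
    for _ in range(num_workers):
        free = worker_cores
        # Launch Executors on this Worker while it has enough free cores
        # and the application's total-core budget is not exhausted.
        while (free >= executor_cores
               and cores_used + executor_cores <= total_executor_cores):
            free -= executor_cores
            cores_used += executor_cores
            executors += 1
    return executors, cores_used

# Example 1: defaults behave like executor_cores = worker_cores, cap = all cores.
print(allocate(5, 8, 8, 40))    # (5, 40): one 8-core Executor per Worker
# Examples 2-5: per-Executor demand (10 or 50) exceeds a Worker's 8 cores.
print(allocate(5, 8, 10, 10))   # (0, 0)
print(allocate(5, 8, 10, 50))   # (0, 0)
print(allocate(5, 8, 50, 50))   # (0, 0)
print(allocate(5, 8, 50, 10))   # (0, 0)
```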
Relationship Between Worker and Executor
There is a clear hierarchical relationship between Worker and Executor: Worker is the resource management process on physical nodes, while Executor is the computation process created by applications on Worker resources. The Driver communicates directly with Executors for task scheduling and data exchange, while Workers are primarily responsible for resource monitoring and Executor lifecycle management. This separation design enhances system flexibility and scalability.
Regarding running multiple Workers per node, this is generally not recommended in practice. Multiple Workers mean running multiple JVM processes on a single machine, which increases memory overhead and process management complexity without providing significant performance benefits. Spark's design optimizes scenarios where a single Worker manages multiple Executors, resulting in higher resource utilization.
Fault Tolerance Mechanism
Spark possesses robust fault tolerance capabilities, automatically handling node failures. When a Worker or Executor fails, the Driver detects heartbeat loss and reschedules incomplete tasks from that node to other healthy Executors. Spark also supports speculative execution: when a task executes abnormally slowly, the Driver can launch duplicate copies of the same task on other nodes and use the first completed result. This mechanism ensures computational reliability in distributed environments.
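The speculative-execution idea, launching duplicate copies of a slow task and taking whichever finishes first, can be illustrated with plain Python concurrency. This is a conceptual sketch, not Spark's scheduler (in Spark it is enabled via the spark.speculation setting):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_speculation(task, arg, attempts=2):
    """Launch `attempts` copies of the same task; return the first result."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        futures = [pool.submit(task, arg) for _ in range(attempts)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort: abandon the straggler copies
        return next(iter(done)).result()

def square(x):
    # Stand-in for a task whose copies may run at different speeds.
    return x * x

print(run_with_speculation(square, 7))  # 49
```

Because both copies compute the same deterministic result, it does not matter which one wins the race; Spark relies on the same property of its tasks.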
Best Practice Recommendations
Based on the above analysis, the following Spark cluster configuration recommendations are proposed:
- Set --executor-cores reasonably based on data scale and computing requirements, typically 1/2 to 3/4 of the Worker's cores, reserving resources for system operations and other processes
- Use --total-executor-cores to cap an application's total resource usage, preventing a single application from monopolizing the cluster
- Monitor Executor memory usage and configure spark.executor.memory appropriately to avoid frequent garbage collection
- In standalone clusters, prefer a single Worker managing multiple Executors over running multiple Workers per node
- Optimize task scheduling based on data locality to reduce network data transfer overhead
By deeply understanding the coordination mechanisms between Worker, Executor, and core resources, developers can more effectively configure and optimize Spark clusters, fully unleashing the performance potential of distributed computing frameworks.