Keywords: Apache Spark | JAR File Management | ClassPath Configuration | spark-submit | File Distribution
Abstract: This article provides an in-depth exploration of the various methods for adding JAR files to Apache Spark jobs, detailing the differences and appropriate use cases for the --jars option, the SparkContext.addJar/addFile methods, and classpath configurations. It covers key concepts including file distribution mechanisms, supported URI types, and the impact of deployment mode, and demonstrates proper configuration through practical code examples. Special emphasis is placed on the file distribution differences between client and cluster modes, along with the priority rules for the different configuration options, offering Spark developers a complete dependency management solution.
ClassPath Configuration Details
When adding JAR files to Spark jobs, classpath configuration is a core consideration. Using spark.driver.extraClassPath, or its alias --driver-class-path, sets additional classpath entries on the node running the Driver. Similarly, spark.executor.extraClassPath sets additional classpath entries on the worker nodes. If a JAR file needs to be visible to both the Driver and the Executors, it must be specified explicitly in both configuration options.
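As a sketch of what this looks like in practice (the JAR path and class name here are placeholders, not from the original article):

```shell
# Hypothetical dependency path; the same JAR is put on BOTH classpaths.
spark-submit \
  --conf "spark.driver.extraClassPath=/opt/libs/mydep.jar" \
  --conf "spark.executor.extraClassPath=/opt/libs/mydep.jar" \
  --class MyClass main-application.jar
```

Note that extraClassPath only modifies the classpath; it does not ship the file, so the JAR must already exist at that path on every node (or be distributed separately, e.g. via --jars).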
Path Separator Specifications
The choice of path separators follows standard JVM conventions. On Linux systems, use colon : as the separator, for example: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar". On Windows systems, use semicolon ; as the separator, for example: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar".
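The separator convention can be sketched with a small POSIX shell helper that joins JAR paths into a classpath string (the paths are the ones from the examples above; SEP would be ";" on Windows):

```shell
# Join a whitespace-separated list of JAR paths into a classpath string.
JARS="/opt/prog/hadoop-aws-2.7.1.jar /opt/prog/aws-java-sdk-1.10.50.jar"
SEP=":"   # Linux/macOS JVM separator; use ";" on Windows
CP=""
for jar in $JARS; do
  if [ -z "$CP" ]; then
    CP="$jar"
  else
    CP="$CP$SEP$jar"
  fi
done
echo "$CP"
# prints /opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar
```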
File Distribution Mechanisms
File distribution behavior depends on the job's execution mode. In client mode, Spark starts an HTTP file server on the driver that serves files to each worker node at job startup. In the logs you can observe: 16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b, indicating the HTTP file server has started successfully.
In cluster mode, Spark runs the Driver process on one of the worker nodes. In this scenario the job does not run from the machine that submitted it, and Spark will not set up an HTTP file server. Users must therefore make JAR files available to all worker nodes themselves, via HDFS, S3, or another source accessible to every node.
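A cluster-mode submission along these lines might look as follows (a sketch; the HDFS paths and class name are hypothetical, and every node must be able to reach the HDFS cluster):

```shell
# Cluster mode: dependencies are fetched from HDFS by each node,
# since no driver-side HTTP file server is available.
spark-submit \
  --deploy-mode cluster \
  --jars hdfs:///libs/additional1.jar,hdfs:///libs/additional2.jar \
  --class MyClass hdfs:///apps/main-application.jar
```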
Supported URI Types
Spark supports multiple URI schemes for file distribution:
- file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
- hdfs:, http:, https:, ftp: - These protocols pull down files and JARs from the URI as expected.
- local: - A URI starting with local:/ is expected to already exist as a local file on each worker node. This means no network IO is incurred, which works well for large files/JARs that are pushed to each worker or shared via NFS, GlusterFS, etc.
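The local: scheme is worth a concrete sketch, since it is the only one that avoids network IO entirely (the path and class name below are placeholders; the JAR must have been pre-placed at that path on every worker, e.g. via a shared mount):

```shell
# local: URI - each worker reads the JAR from its own filesystem;
# nothing is transferred over the network.
spark-submit \
  --jars local:/opt/shared/big-dependency.jar \
  --class MyClass main-application.jar
```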
JAR files and regular files are copied to the working directory for each SparkContext on executor nodes. Typically, this directory is located under /var/run/spark/work, where you can see directory structures like app-20160515061614-0027. Inside these directories, you can find all deployed JAR files.
Configuration Option Priority
Understanding configuration option priority is crucial. If any property is set through code, it takes precedence over any option specified via spark-submit. The specific priority order is: properties set directly on SparkConf have the highest precedence, followed by flags passed to spark-submit or spark-shell, and finally options in the spark-defaults.conf file.
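The precedence order can be summarized as follows (an annotated sketch; the property and values are illustrative, not from the original article):

```shell
# Precedence, highest first, when the same property is set in several places:
# 1) Set in code:           new SparkConf().set("spark.executor.memory", "4g")   <- wins
# 2) spark-submit flag:     --conf spark.executor.memory=2g
# 3) spark-defaults.conf:   spark.executor.memory   1g
```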
Detailed Analysis of Each Option
--jars and SparkContext.addJar are essentially the same functionality, differing only in setup method: one is set via the Spark submit script, the other via code. The choice depends on the specific use case. It's particularly important to note that using these options does not automatically add JAR files to the Driver/Executor classpath; they must be explicitly added using extraClassPath configuration on both.
The difference between SparkContext.addJar and SparkContext.addFile is: the former is for dependencies that need to be used with your code, while the latter is for simply passing arbitrary files to worker nodes that aren't runtime dependencies in your code.
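The same distinction exists on the spark-submit side, where --jars parallels addJar and --files parallels addFile (a sketch; the file names are hypothetical):

```shell
# --jars:  code dependencies, shipped so they can be put on the classpath.
# --files: arbitrary files (data, config) copied to each executor's
#          working directory, but not treated as code.
spark-submit \
  --jars mylib.jar \
  --files lookup-table.csv \
  --class MyClass main-application.jar
```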
--conf spark.driver.extraClassPath=... and --driver-class-path are aliases; either form can be used. --conf spark.driver.extraLibraryPath=... and --driver-library-path are likewise aliases.
--conf spark.executor.extraClassPath=... is used for dependencies that cannot be included in an über JAR (e.g., due to compile-time conflicts between library versions) and need to be loaded at runtime.
--conf spark.executor.extraLibraryPath=... is passed as the JVM's java.library.path option, used when a library path visible to the JVM is needed.
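For native libraries, a configuration along these lines would point the JVM's java.library.path at a directory of shared objects (the directory is a placeholder and must exist on every node):

```shell
# Hypothetical directory containing native libraries (e.g. libcustom.so)
# that JNI code loaded by the job needs to find via java.library.path.
spark-submit \
  --conf "spark.driver.extraLibraryPath=/opt/native" \
  --conf "spark.executor.extraLibraryPath=/opt/native" \
  --class MyClass main-application.jar
```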
Practical Configuration Examples
In client mode, it is safe to combine several of these options to add additional application JAR files. Be careful, however, not to mix up the library path and the classpath: passing JAR files to --driver-library-path is useless, because that option sets the native library path, not the classpath. If you want JARs on the classpath, you must pass them to extraClassPath (or --driver-class-path).
A correct configuration example is:
spark-submit --jars additional1.jar,additional2.jar \
  --driver-class-path additional1.jar:additional2.jar \
  --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar
Ultimately, when deploying external JAR files on both driver and worker nodes, ensure both file distribution and classpath configuration are correctly set to guarantee that Spark jobs can properly access all necessary dependencies.