Understanding how spark-shell is launched


혜성 Hyesung 2021. 1. 22. 00:46

apache spark official github repo: github.com/apache/spark

1. The starting point is the bin/spark-shell script, which is what we run to use the spark-shell REPL

# Shell script for starting the Spark Shell REPL

cygwin=false
case "$(uname)" in
  CYGWIN*) cygwin=true;;
esac

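* Cygwin: an open-source project that provides a Linux-like terminal environment on Windows.

cygwin becomes true only when uname reports a name starting with CYGWIN, i.e. when the script runs under Cygwin on Windows. On Linux or macOS it stays false.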
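The check above can be tried in isolation. The following is a minimal sketch, not part of the Spark script: is_cygwin is a hypothetical helper that takes the uname string as an argument, so the pattern match can be exercised on any platform.

```shell
#!/usr/bin/env bash
# is_cygwin is a hypothetical helper (not in bin/spark-shell). It applies the
# same CYGWIN* pattern to a string passed in, instead of calling uname itself.
is_cygwin() {
  case "$1" in
    CYGWIN*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_cygwin "$(uname)"; then
  echo "running under Cygwin"
else
  echo "not Cygwin: $(uname)"
fi
```

On Linux or macOS the second branch runs, which corresponds to cygwin=false in the script.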

function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

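When cygwin is false, i.e. in any non-Cygwin environment such as Linux or macOS, running bin/spark-shell actually runs bin/spark-submit and passes org.apache.spark.repl.Main as the main class through the --class option. The Cygwin branch runs the same command, wrapped in terminal-setting workarounds.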

/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"

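* Difference between "$@" and "$*": "$@" expands to each positional parameter as its own separate word, so an argument that contains spaces survives intact. "$*" joins all parameters into a single word, separated by the first character of IFS (a space by default).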
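The difference is easy to see by counting arguments. This is a small sketch; count_args, forward_at and forward_star are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

forward_at()   { count_args "$@"; }   # each parameter stays a separate word
forward_star() { count_args "$*"; }   # all parameters are joined into one word

forward_at   --name "Spark shell" --master local    # prints 4
forward_star --name "Spark shell" --master local    # prints 1
```

This is why the Spark scripts forward arguments with "$@": options such as --name "Spark shell" reach the next script exactly as the user typed them.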

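2. Digging into the spark-submit script

Running spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@" in turn executes the line below. That is, it runs the spark-class script, passing org.apache.spark.deploy.SparkSubmit as the first argument, followed by the original arguments (--class org.apache.spark.repl.Main and so on).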

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

3. Digging into the spark-class script

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

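Running Spark therefore requires Java to be installed: the script uses ${JAVA_HOME}/bin/java when JAVA_HOME is set, falls back to a java found on the PATH, and exits with an error otherwise.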

RUNNER="${JAVA_HOME}/bin/java"

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

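Building the project produces jar files, and what we download is a compressed distribution of them that we unpack and use. In other words, when we download Spark from the internet, a jars directory exists under SPARK_HOME, and it contains the jar files that are placed on the launcher's classpath.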

SPARK_JARS_DIR="${SPARK_HOME}/jars"

build_command() {
  "$RUNNER" -Xmx128m $SPARK_LAUNCHER_OPTS -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

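With LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*", every jar in the jars directory is on the classpath, and java runs org.apache.spark.launcher.Main with the arguments handed down so far (spark-shell -> spark-submit -> spark-class). build_command then prints the launcher's exit status followed by a NUL character. As the launcher's Javadoc in the next section describes, the launcher prints the final command and the spark-class script executes it.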
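To make the NUL-separated hand-off concrete, here is a minimal sketch of how a script can read such output back into an argument array. It is not the code from spark-class: fake_build_command is a hypothetical stand-in for the launcher JVM, and /opt/spark is an assumed install path.

```shell
#!/usr/bin/env bash
# fake_build_command is a hypothetical stand-in for the launcher JVM: it prints
# each argument followed by a NUL byte, then an exit code followed by a NUL byte.
fake_build_command() {
  printf '%s\0' java -cp "/opt/spark/jars/*" org.apache.spark.deploy.SparkSubmit --name "Spark shell"
  printf '%d\0' 0
}

# Read the NUL-separated fields into an array, one element per field.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(fake_build_command)

# The last element is the exit code; everything before it is the command.
LAST=$(( ${#CMD[@]} - 1 ))
LAUNCHER_EXIT_CODE="${CMD[$LAST]}"
CMD=("${CMD[@]:0:$LAST}")

echo "exit code: $LAUNCHER_EXIT_CODE"
printf 'arg: %s\n' "${CMD[@]}"
```

Using NUL as the separator means an argument such as "Spark shell" comes back as a single array element, which a space- or newline-separated format could not guarantee.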

4. Digging into the classes passed as arguments

1. org.apache.spark.repl.Main

- Main function called: createSparkSession

- Returns: SparkSession

=> This is why a SparkSession already exists when spark-shell starts.

2. org.apache.spark.deploy.SparkSubmit

: At this stage it is not invoked directly. It is just an argument passed to org.apache.spark.launcher.Main.

3. org.apache.spark.launcher.Main

The comment below explains it. In short: if the class argument is "org.apache.spark.deploy.SparkSubmit", the SparkLauncher class is used to launch a Spark application; if another class is provided, an internal Spark class is run.

* Usage: Main [class] [class args]
* <p>
* This CLI works in two different modes:
* <ul>
* <li>"spark-submit": if <i>class</i> is "org.apache.spark.deploy.SparkSubmit", the
* {@link SparkLauncher} class is used to launch a Spark application.</li>
* <li>"spark-class": if another class is provided, an internal Spark class is run.</li>
* </ul>
* This class works in tandem with the "bin/spark-class" script on Unix-like systems, and
* "bin/spark-class2.cmd" batch script on Windows to execute the final command.
* <p>
* On Unix-like systems, the output is a list of command arguments, separated by the NULL
* character. On Windows, the output is a command line suitable for direct execution from the
* script.

if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
  try {
    AbstractCommandBuilder builder = new SparkSubmitCommandBuilder(args);
    cmd = buildCommand(builder, env, printLaunchCommand);
  } catch (IllegalArgumentException e) {
    printLaunchCommand = false;
    System.err.println("Error: " + e.getMessage());
    System.err.println();

    MainClassOptionParser parser = new MainClassOptionParser();
    try {
      parser.parse(args);
    } catch (Exception ignored) {
      // Ignore parsing exceptions.
    }

- - - - - - - - (rest of the code omitted) - - - - -

Conclusion

spark-shell and spark-submit are both just executable script files. The Scala and Java classes passed to them as arguments are what actually run on the JVM.

1. spark-shell

2. spark-submit --class org.apache.spark.repl.Main

3. spark-class org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main

4. java -Xmx128m $SPARK_LAUNCHER_OPTS -cp "${SPARK_HOME}/jars/*" org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main

5. The REPL starts with a SparkSession already created.
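The chain can be mimicked with three small functions. This is only an illustrative sketch: spark_shell, spark_submit and spark_class are hypothetical stand-ins that prepend what each real script adds and print the resulting launcher command instead of running it. /opt/spark is an assumed default install path, and $SPARK_LAUNCHER_OPTS is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the three scripts. Each one only prepends what the
# real script adds and forwards the remaining arguments unchanged with "$@".
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed install location

spark_class()  { echo java -Xmx128m -cp "${SPARK_HOME}/jars/*" org.apache.spark.launcher.Main "$@"; }
spark_submit() { spark_class org.apache.spark.deploy.SparkSubmit "$@"; }
spark_shell()  { spark_submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"; }

# Prints the launcher command line that the chain would build.
spark_shell --master "local[2]"
```

Because echo prints its arguments separated by spaces, the quoting around "Spark shell" is not visible in the output. The point of the sketch is only the order in which the class names and options accumulate.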
