Spark on Kubernetes
Objectives
- Create a Kubernetes Pod - a set of containers running inside Kubernetes - here containing Apache Spark, which we use to submit jobs against Kubernetes
- Submit Apache Spark jobs to Kubernetes
Kubernetes is a container orchestrator which allows scheduling millions of "Docker" containers on huge compute clusters containing thousands of compute nodes. Originally invented and open-sourced by Google, Kubernetes became the de facto standard for cloud-native application development and deployment inside and outside IBM. With Red Hat OpenShift, IBM is the leader in hybrid cloud Kubernetes and among the top three companies contributing to Kubernetes' open source code base.
Setup K8s
I'll be using the IBM Cloud Services IDE; other cloud IDEs will be similar.
- Open a new terminal
- Download the code for k8s
- Change to the downloaded code directory
- Add an alias k for kubectl - this lets you type k instead of kubectl every time
- Save the current namespace in an environment variable
# Download the code
/home/project$ git clone https://github.com/ibm-developer-skills-network/fgskh-new_horizons.git
# Move to the downloaded directory
/home/project$ cd fgskh-new_horizons
/home/project/fgskh-new_horizons$
# Add an alias for kubectl
$ alias k='kubectl'
# Save the current namespace in an environment variable
$ my_namespace=$(kubectl config view --minify -o jsonpath='{..namespace}')
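As an optional sanity check (my addition, not part of the original steps), you can confirm in the same shell that the alias and the namespace variable are in place:
# Show what the alias k expands to
$ type k
# Print the namespace we just captured (it may be empty if the current context has no namespace set)
$ echo "${my_namespace}"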
Deploy Spark K8s Pod
- Install the Spark Pod
- Check the status of the Pod
- If you have issues and the Pod is not running, delete it with k delete po spark and start over again
- Note that this Pod is called spark and contains two containers (2/2), both in status Running. Please also note that Kubernetes automatically RESTARTS failed Pods - this hasn't happened here so far, most probably because the AGE of this Pod is still low.
# Install Spark POD
$ k apply -f spark/pod_spark.yaml
pod/spark created
# Check status of POD
$ k get po
NAME    READY   STATUS    RESTARTS   AGE
spark   2/2     Running   0          19s
# If you see this, it means you need to wait a bit, then run k get po till you get the output above
NAME    READY   STATUS              RESTARTS   AGE
spark   0/2     ContainerCreating   0          29s
# To delete the POD
$ k delete po spark
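A couple of optional inspection commands (my addition, standard kubectl rather than part of the original steps) in case you want to see which two containers make up the spark Pod, or why it is stuck in ContainerCreating:
# List the names of the containers inside the spark Pod
$ k get po spark -o jsonpath='{.spec.containers[*].name}'
# Show detailed status and recent events for the Pod
$ k describe po spark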
Submit Job to K8s
- Now it is time to run a command inside the spark container of this Pod.
- The exec command is told to provide access to the container called spark (via -c). With -- we pass the command to execute; in this example we just echo a message.
- You just ran a command in the spark container residing in the spark Pod inside Kubernetes. We will use this container to submit Spark applications to the Kubernetes cluster. This container is based on an image with the Apache Spark distribution and the kubectl command pre-installed.
- Inside the container you can use the spark-submit command which makes use of the new native Kubernetes scheduler that has been added to Spark recently.
$ k exec spark -c spark -- echo "Hello from inside the container"
# OUTPUT
Hello from inside the container
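As another optional check (my addition, not part of the original steps), you can confirm that the Apache Spark distribution is available inside the container before submitting a job:
# Print the Spark version bundled in the container image
$ k exec spark -c spark -- ./bin/spark-submit --version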
Submit SparkPi
- ./bin/spark-submit is the command to submit applications to an Apache Spark cluster
- --master k8s://http://127.0.0.1:8001 is the address of the Kubernetes API server - the way kubectl, but also the Apache Spark native Kubernetes scheduler, interacts with the Kubernetes cluster
- --name spark-pi provides a name for the job, and the subsequent Pods created by the Apache Spark native Kubernetes scheduler are prefixed with that name
- --class org.apache.spark.examples.SparkPi provides the canonical name of the Spark application to run (Java package and class name)
- --conf spark.executor.instances=1 tells the Apache Spark native Kubernetes scheduler how many Pods it has to create to parallelize the application. Note that on this single-node development Kubernetes cluster increasing this number doesn't make any sense (besides adding overhead for parallelization)
- --conf spark.kubernetes.container.image=romeokienzler/spark-py:3.1.2 tells the Apache Spark native Kubernetes scheduler which container image it should use for creating the driver and executor Pods. This image can be custom-built using the Dockerfiles provided under kubernetes/dockerfiles/spark/ and bin/docker-image-tool.sh in the Apache Spark distribution
- --conf spark.kubernetes.executor.limit.cores=0.3 tells the Apache Spark native Kubernetes scheduler to set the CPU core limit to only 0.3 core per executor Pod
- --conf spark.kubernetes.driver.limit.cores=0.3 tells the Apache Spark native Kubernetes scheduler to set the CPU core limit to only 0.3 core for the driver Pod
- --conf spark.driver.memory=512m tells the Apache Spark native Kubernetes scheduler to set the memory limit to only 512 MB for the driver Pod
- --conf spark.kubernetes.namespace=${my_namespace} tells the Apache Spark native Kubernetes scheduler to use the namespace stored in the my_namespace environment variable that we set before
- local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar indicates the jar file the application is contained in. Note that the local:// prefix addresses a path within the container image provided by the spark.kubernetes.container.image option. Since we're using a jar provided by the Apache Spark distribution this is not a problem; otherwise the spark.kubernetes.file.upload.path option has to be set and an appropriate storage subsystem has to be configured, as described in the documentation
- 10 tells the application to run for 10 iterations, then output the computed value of Pi
# Here is the command to submit SparkPi
$ k exec spark -c spark -- ./bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=1 \
    --conf spark.kubernetes.container.image=romeokienzler/spark-py:3.1.2 \
    --conf spark.kubernetes.executor.request.cores=0.2 \
    --conf spark.kubernetes.executor.limit.cores=0.3 \
    --conf spark.kubernetes.driver.request.cores=0.2 \
    --conf spark.kubernetes.driver.limit.cores=0.3 \
    --conf spark.driver.memory=512m \
    --conf spark.kubernetes.namespace=${my_namespace} \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar \
    10
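Once the driver Pod is up (see the next section), you can optionally verify that the CPU and memory settings above were actually applied to it. This is my addition, and the driver Pod name below is just the example ID used later in this document - replace it with the one from your run:
# Show the resource requests and limits of the driver Pod
$ k get po spark-pi-e9208f92f3e36dc9-driver -o jsonpath='{.spec.containers[0].resources}'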
Monitor Spark
Once the above command is running, open a second terminal and run kubectl get po (or k get po).
This will show you the additional Pods being created by the Apache Spark native Kubernetes scheduler - one driver and at least one executor. Note that with only one executor the driver may run the executor within its own Pod.
You can see that Pod spark-pi-Y-driver is in status Completed, from a single-executor run a few minutes ago, and that there is one driver and one executor actually running for job spark-pi-X- ... (see the example output below).
To check the job's elapsed time, just execute the command below and look at the AGE of the completed driver Pod (the Pod name will of course differ on your system):
$ kubectl get po
NAME                               READY   STATUS      RESTARTS   AGE
spark-pi-e9208f92f3e36dc9-driver   0/1     Completed   0          6m24s
- Here's an example when using one executor running separately from the driver Pod (exact IDs replaced by X and Y for readability):
NAME                READY   STATUS      RESTARTS   AGE
spark               2/2     Running     0          28m
spark-pi-X-exec-1   1/1     Running     0          33s
spark-pi-X-driver   1/1     Running     0          44s
spark-pi-Y-driver   0/1     Completed   0          12m
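The driver log is another place to read the elapsed time: when a Spark job finishes, the driver logs a line containing "took ... s". This is an optional check I'm adding here, the exact log wording can vary between Spark versions, and you need to replace the Pod name with the completed driver Pod on your system:
# Grep the driver log for the job completion line
$ kubectl logs spark-pi-e9208f92f3e36dc9-driver | grep -i "took"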
Check Value
In case you want to check the value of Pi computed by this run, grep the driver Pod's logs (again, replace the Pod name with the one on your system):
$ kubectl logs spark-pi-e9208f92f3e36dc9-driver | grep "Pi is roughly "
# OUTPUT
Pi is roughly 3.13994713994714
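When you're done, here is an optional cleanup sketch (my addition): the Apache Spark native Kubernetes scheduler labels the Pods it creates with spark-role, so finished driver Pods can be removed in one go, and the spark Pod itself can be deleted as shown earlier.
# Remove driver Pods left over from completed runs
$ kubectl delete po -l spark-role=driver
# Remove the spark Pod created during setup
$ kubectl delete po spark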