1. Environment setup
The Charmed Apache Spark solution is based on the spark-client snap that can run Spark jobs on a Kubernetes cluster. In this step, we will prepare a lightweight K8s environment, the spark-client snap, and some additional components required later in the tutorial. We are going to use Multipass to create a virtual environment and set up the following software:
- MicroK8s — a lightweight Kubernetes that can run locally
- Spark-client snap — a snap that bundles client-side scripts to manage, configure, and run Apache Spark jobs on a Kubernetes cluster
- MinIO — S3-compatible object storage
- Juju — Canonical’s orchestration system
Minimum system requirements
Before we start, make sure your machine meets the following minimum requirements:
- Ubuntu 22.04 (jammy) or later (the tutorial has been prepared and tested to work on 24.04)
- 10 GB of RAM
- 5 CPU cores
- At least 50 GB of available storage
- Access to the internet for downloading the required snaps and charms
Virtual machine
Use Multipass to start a new Ubuntu virtual machine with the following recommended parameters:
multipass launch --cpus 4 --memory 8G --disk 50G --name spark-tutorial 24.04
See also:
- The How to create an instance guide from the Multipass documentation
- The multipass launch command reference
Check the status of the provisioned virtual machine:
multipass list
Connect to the virtual machine:
multipass shell spark-tutorial
From now on, unless stated otherwise, we will work inside this virtual machine’s environment. For clarity, we will refer to it as VM, while the host machine that runs Multipass will be called Host.
MicroK8s
Charmed Apache Spark is developed to be run on top of a Kubernetes cluster. For the purpose of this tutorial we will be using a lightweight Kubernetes: MicroK8s.
Installing MicroK8s is as simple as running the following command:
sudo snap install microk8s --channel=1.32-strict/stable
Make sure to install the 1.32-strict/stable version of MicroK8s, which has been tested to work with all the components of this tutorial.
Configuration
Let’s configure MicroK8s so that the currently logged-in user has admin rights to the cluster.
First, set an alias kubectl that can be used instead of microk8s.kubectl:
sudo snap alias microk8s.kubectl kubectl
Then, add the current user to the snap_microk8s group:
sudo usermod -a -G snap_microk8s ${USER}
Create and provide ownership of the ~/.kube directory to the current user:
mkdir -p ~/.kube
sudo chown -f -R ${USER} ~/.kube
Put the group membership changes into effect:
newgrp snap_microk8s
Check the status of MicroK8s:
microk8s status --wait-ready
When the MicroK8s cluster is running and ready, you should see an output similar to the following:
microk8s is running
high-availability: no
...
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
  disabled:
    cert-manager         # (core) Cloud native certificate management
...
Let’s generate a Kubernetes configuration file using MicroK8s and write it to ~/.kube/config. This is where kubectl looks for the kubeconfig file by default.
microk8s config | tee ~/.kube/config
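As a quick optional check (not part of the original setup steps, but harmless), you can confirm that kubectl can now reach the MicroK8s cluster; the single node should eventually report a Ready status:
kubectl get nodes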
Now let’s enable a few addons to use features like role-based access control (RBAC), local volumes for storage, and load balancing. MetalLB, the load balancer, needs an IP address range, so below we install jq, detect the VM’s primary IP address, and pass it as a single-address range.
sudo microk8s enable rbac
sudo microk8s enable storage hostpath-storage
sudo apt install -y jq
IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc')
sudo microk8s enable metallb:$IPADDR-$IPADDR
Wait for the commands to finish running and check the list of enabled addons:
microk8s status --wait-ready
The output of the command should look similar to the following:
microk8s is running
...
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    storage              # (core) Alias to hostpath-storage add-on, deprecated
...
The MicroK8s setup is complete.
The spark-client snap
For Apache Spark jobs to run on top of Kubernetes, a set of resources (a service account, associated roles, role bindings, etc.) needs to be created and configured.
To simplify this task, the Charmed Apache Spark solution offers the spark-client snap. Install the snap:
sudo snap install spark-client --channel 3.4/edge
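To confirm the snap was installed correctly, you can optionally list the commands it provides; you should see, among others, spark-client.pyspark and spark-client.service-account-registry, which we will use below:
snap info spark-client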
Let’s create a Kubernetes namespace for us to use as a playground in this tutorial.
kubectl create namespace spark
We will now create a Kubernetes service account that will be used to run the Spark jobs. The service account can be created using the spark-client snap, which also sets up the necessary roles, role bindings, and other required configurations:
spark-client.service-account-registry create \
--username spark --namespace spark
This command does a number of things in the background. First, it creates a service account named spark in the spark namespace. Then it creates a role named spark-role with all the required RBAC permissions and binds it to the service account by creating a role binding.
These resources can be viewed with kubectl get commands as follows:
kubectl get serviceaccounts -n spark
kubectl get roles -n spark
kubectl get rolebindings -n spark
Output example

kubectl get serviceaccounts -n spark
NAME      SECRETS   AGE
default   0         5m41s
spark     0         2m49s

kubectl get roles -n spark
NAME         CREATED AT
spark-role   2025-04-16T09:13:00Z

kubectl get rolebindings -n spark
NAME                 ROLE              AGE
spark-role-binding   Role/spark-role   2m48s
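If you are curious about exactly which RBAC permissions were granted, you can optionally inspect the role in more detail (the precise list of rules may vary between spark-client versions):
kubectl describe role spark-role -n spark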
Now, launch a PySpark shell using the service account you created earlier to verify that it works:
spark-client.pyspark \
--username spark --namespace spark
The resulting output should include a welcome screen from PySpark:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/
Using Python version 3.10.12 (main, Jan 17 2025 14:35:34)
Spark context Web UI available at http://10.181.60.136:4040
Spark context available as 'sc' (master = k8s://https://10.181.60.136:16443, app id = spark-627fb48be4da4315b3716e71a7613baf).
SparkSession available as 'spark'.
>>>
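As an optional sanity check (our own minimal example, not part of the original steps), you can run a small computation at the >>> prompt; it should trigger executor pods in the spark namespace and print 499500, the sum of the integers from 0 to 999:
spark.range(1000).selectExpr("sum(id)").show()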
Press CTRL + D to exit the PySpark shell.
The basic Apache Spark setup is now complete: you have a Kubernetes environment and a configured spark-client snap ready to use it. We will get back to using them in later steps of the tutorial.
Next, we’ll set up the additional software needed for the Spark History Server and object storage, which are also used during this tutorial.
Juju
Juju is an Operator Lifecycle Manager (OLM) for clouds, bare metal, LXD or Kubernetes.
We’ll use juju to deploy and manage the Spark History Server and a number of other applications that will later be integrated with Apache Spark.
To install and configure a juju client using a snap:
sudo snap install juju
mkdir -p ~/.local/share
Juju automatically detects all available clouds on our local machine (VM) without the need for additional setup or configuration.
You can verify this by running the juju clouds command, which should produce an output similar to the following:
Only clouds with registered credentials are shown.
There are more clouds, use --all to see them.
You can bootstrap a new controller using one of these clouds...
Clouds available on the client:
Cloud      Regions  Default    Type  Credentials  Source    Description
localhost  1        localhost  lxd   0            built-in  LXD Container Hypervisor
microk8s   1        localhost  k8s   0            built-in  A Kubernetes Cluster
As you can see, Juju has detected both the LXD and the K8s installations on the system.
For us to be able to deploy Kubernetes charms, let’s bootstrap a Juju controller in the microk8s cloud:
juju bootstrap microk8s spark-tutorial
The creation of the new controller can be verified with the juju controllers command. The output should be similar to:
Use --refresh option with this command to see the latest information.
Controller       Model  User   Access     Cloud/Region        Models  Nodes  HA  Version
spark-tutorial*  -      admin  superuser  microk8s/localhost  1       1      -   3.6.5
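Under the hood, the controller itself runs as pods on MicroK8s in a dedicated namespace, typically named controller-<controller-name>. If you are curious, you can optionally peek at it (the namespace name below is assumed from the controller name we chose):
kubectl get pods -n controller-spark-tutorial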
The Juju setup is complete.
MinIO
Apache Spark can be configured to use S3 for object storage. However, for this tutorial, instead of AWS S3, we’ll use MinIO: a lightweight S3-compatible object storage. It is available as a MicroK8s add-on by default, allowing us to create a local S3 bucket, which is more convenient for our local tests.
Let’s enable the MinIO addon for MicroK8s.
sudo microk8s enable minio
Authentication with MinIO is managed with an access key and a secret key. These credentials are generated and stored as a Kubernetes secret when the MinIO add-on is enabled.
Let’s fetch the credentials and export them as environment variables so we can use them later:
export ACCESS_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_ACCESS_KEY}' | base64 -d)
export SECRET_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_SECRET_KEY}' | base64 -d)
export S3_ENDPOINT=$(kubectl get service minio -n minio-operator -o jsonpath='{.spec.clusterIP}')
export S3_BUCKET="spark-tutorial"
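To make sure the variables were actually populated, you can optionally print their lengths (lengths only, to avoid echoing the secrets to the terminal):
echo "ACCESS_KEY length: ${#ACCESS_KEY}, SECRET_KEY length: ${#SECRET_KEY}, S3_ENDPOINT: $S3_ENDPOINT"
If any of them comes back empty, wait a little while for the MinIO add-on to finish initialising and re-run the export commands above.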
The MinIO add-on offers access to a built-in Web UI which can be used to interact with the local S3 object storage. But for this tutorial, we will use CLI commands.
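If you do want to take a look at the Web UI, its console is exposed as a Kubernetes service in the minio-operator namespace; listing the services there will show its address (exact service names may differ between MicroK8s versions):
kubectl get services -n minio-operator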
To set up the AWS CLI, run the following commands:
sudo snap install aws-cli --classic
aws configure set aws_access_key_id $ACCESS_KEY
aws configure set aws_secret_access_key $SECRET_KEY
aws configure set region "us-west-2"
aws configure set endpoint_url "http://$S3_ENDPOINT"
Check the tool by listing all S3 buckets:
aws s3 ls
If you see an error message such as "Could not connect to the endpoint URL", wait a minute before trying the command above again.
The list of buckets in our S3 storage is empty for now, so the command returns no output. That’s because we have not created any buckets yet. Let’s proceed to create one.
To create the spark-tutorial bucket using the AWS CLI, run:
aws s3 mb s3://spark-tutorial
We now have an S3 bucket available locally on our system! See for yourself by running the same command to list all buckets:
aws s3 ls
With the access key, secret key, and endpoint properly configured, you should see the spark-tutorial bucket listed in the output.
Credentials setup
For Apache Spark to be able to access and use our local S3 bucket, we need to provide a few configuration options including the bucket endpoint, access key and secret key.
In the Charmed Apache Spark solution, these configurations are stored in a Kubernetes secret and bound to a Kubernetes service account. When Spark jobs are executed using that service account, all associated configurations are automatically retrieved and supplied to Apache Spark.
The S3 configurations can be added to the existing spark service account with the following command:
spark-client.service-account-registry add-config \
--username spark --namespace spark \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY \
--conf spark.hadoop.fs.s3a.endpoint=$S3_ENDPOINT \
--conf spark.hadoop.fs.s3a.secret.key=$SECRET_KEY
Now check the list of configurations bound to the service account:
spark-client.service-account-registry get-config \
--username spark --namespace spark
You should see the following list of configurations in the output:
spark.hadoop.fs.s3a.access.key=<access_key>
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.endpoint=<s3_endpoint>
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.secret.key=<secret_key>
spark.kubernetes.authenticate.driver.serviceAccountName=spark
spark.kubernetes.namespace=spark
You can also see the configuration stored in a Kubernetes secret:
kubectl get secret -n spark -o yaml
Output example
apiVersion: v1
items:
- apiVersion: v1
  data:
    spark.hadoop.fs.s3a.access.key: M0MyTHBKd1duSHd4SDZVQ2d3cVQ=
    spark.hadoop.fs.s3a.aws.credentials.provider: b3JnLmFwYWNoZS5oYWRvb3AuZnMuczNhLlNpbXBsZUFXU0NyZWRlbnRpYWxzUHJvdmlkZXI=
    spark.hadoop.fs.s3a.connection.ssl.enabled: ZmFsc2U=
    spark.hadoop.fs.s3a.endpoint: MTAuMTUyLjE4My4xMDU=
    spark.hadoop.fs.s3a.path.style.access: dHJ1ZQ==
    spark.hadoop.fs.s3a.secret.key: MTlBaVdWZENxMWZ1dHBYeUM0bmRSTlJ0M3Fid3ZydXFHdGZNNjl4ZA==
  kind: Secret
  metadata:
    creationTimestamp: "2025-04-16T09:13:00Z"
    name: spark8t-sa-conf-spark
    namespace: spark
    resourceVersion: "5555"
    uid: ddc76bf0-729f-4a01-9e0e-e1668b658036
  type: Opaque
kind: List
metadata:
  resourceVersion: ""
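Before wrapping up, you can optionally double-check that the bucket is writable from the VM by uploading and listing a small test file (the file name here is only an example):
echo "hello spark" > /tmp/hello.txt
aws s3 cp /tmp/hello.txt s3://spark-tutorial/hello.txt
aws s3 ls s3://spark-tutorial/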
With that, the tutorial’s environment setup is complete!