Samuel Cozannet
on 15 February 2017
GPUs and Kubernetes for deep learning — Part 1/3
A few weeks ago I shared a side project about Building a DIY GPU cluster for k8s, to play with Kubernetes with a proper ROI vs. AWS g2 instances.
This was spectacularly interesting when AWS was lagging behind with old nVidia K20s cards (which are not supported anymore on the latest drivers). But with the addition of the P series (p2.xlarge, 8xlarge and 16xlarge) the new cards are K80s with 12GB RAM, outrageously more powerful than the previous ones.
Baidu just released a post on the Kubernetes blog about the PaddlePaddle setup, but they only focused on CPUs. I thought it would be interesting to look at a setup of Kubernetes on AWS with some GPU nodes added, then exercise a Deep Learning framework on it. The docs say it is possible…
This post is the first in a series of three: setting up the GPU cluster (this post), adding storage to a Kubernetes cluster (right afterwards), and finally running a Deep Learning training job on the cluster (working on it, coming up post MWC…).
The plan
In this blog, we will:
- Deploy k8s on AWS in a development mode (no HA, colocating etcd, the control plane and PKI)
- Deploy 2x nodes with GPUs (p2.xlarge and p2.8xlarge instances)
- Deploy 3x nodes with CPU only (m4.2xlarge)
- Validate GPU availability
Requirements
For what follows, it is important that:
- You understand Kubernetes 101
- You have admin credentials for AWS
- If you followed the other posts, you know we’ll be using the Canonical Distribution of Kubernetes, hence some knowledge about Ubuntu, Juju and the rest of Canonical’s ecosystem will help.
Foreplay
- Make sure you have Juju installed.
On Ubuntu,
sudo apt-add-repository ppa:juju/stable
sudo apt update
sudo apt install -yqq juju
For other OSes, look up the official docs.
Then to connect to the AWS cloud with your credentials, read this page
- Finally copy this repo to have access to all the sources
git clone https://github.com/madeden/blogposts
cd blogposts/k8s-gpu-cloud
OK! Let’s start GPU-izing the world!
Deploying the cluster
Bootstrap
As usual, start with the bootstrap sequence. Just be careful: p2 instances are only available in us-west-2, us-east-1 and eu-west-1, as well as the us-gov regions. I have experienced issues running p2 instances on the EU side, hence I recommend using a US region.
juju bootstrap aws/us-east-1 --credential canonical --constraints "cores=4 mem=16G root-disk=64G"
# Creating Juju controller "aws-us-east-1" on aws/us-east-1
# Looking for packaged Juju agent version 2.1-rc1 for amd64
# Launching controller instance(s) on aws/us-east-1…
# - i-0d48b2c872d579818 (arch=amd64 mem=16G cores=4)
# Fetching Juju GUI 2.3.0
# Waiting for address
# Attempting to connect to 54.174.129.155:22
# Attempting to connect to 172.31.15.3:22
# Logging to /var/log/cloud-init-output.log on the bootstrap machine
# Running apt-get update
# Running apt-get upgrade
# Installing curl, cpu-checker, bridge-utils, cloud-utils, tmux
# Fetching Juju agent version 2.1-rc1 for amd64
# Installing Juju machine agent
# Starting Juju machine agent (service jujud-machine-0)
# Bootstrap agent now started
# Contacting Juju controller at 172.31.15.3 to verify accessibility…
# Bootstrap complete, "aws-us-east-1" controller now available.
# Controller machines are in the "controller" model.
# Initial model "default" added.
Deploying instances
Once the controller is ready we can start deploying services. In my previous posts, I used bundles which are shortcuts to deploy complex apps.
If you are already familiar with Juju you can run juju deploy src/k8s-gpu.yaml
and jump to the end of this section. For the others interested in getting into the details, this time we will deploy manually, and go through the logic of the deployment.
Kubernetes is made up of 5 individual applications: Master, Worker, Flannel (network), etcd (cluster state storage DB) and easyRSA (PKI to encrypt communication and provide x509 certs).
In Juju, each app is modeled by a charm, which is a recipe for how to deploy it.
At deployment time, you can give constraints to Juju, either very specific (instance type) or loose (number of cores). With the latter, Juju will pick the cheapest instance matching your constraints on the target cloud.
The first thing to do is deploy the applications:
juju deploy cs:~containers/kubernetes-master-11 --constraints "cores=4 mem=8G root-disk=32G"
# Located charm "cs:~containers/kubernetes-master-11".
# Deploying charm "cs:~containers/kubernetes-master-11".
juju deploy cs:~containers/etcd-23 --to 0
# Located charm "cs:~containers/etcd-23".
# Deploying charm "cs:~containers/etcd-23".
juju deploy cs:~containers/easyrsa-6 --to lxd:0
# Located charm "cs:~containers/easyrsa-6".
# Deploying charm "cs:~containers/easyrsa-6".
juju deploy cs:~containers/flannel-10
# Located charm "cs:~containers/flannel-10".
# Deploying charm "cs:~containers/flannel-10".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.xlarge" kubernetes-worker-gpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.8xlarge" kubernetes-worker-gpu8
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=m4.2xlarge" -n3 kubernetes-worker-cpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
Here you can see an interesting property of Juju that we have not covered before: naming the services you deploy. We deployed the same kubernetes-worker charm three times, twice with GPUs and once without. This gives us a way to group instances of a certain type, at the cost of duplicating some commands.
Also note the revision numbers in the charms we deploy. Revisions are not directly tied to versions of the software they deploy. If you omit them, Juju will pick the latest revision, like Docker would do with images.
Adding the relations & exposing software
Now that the applications are deployed, we need to tell Juju how they are related. For example, the Kubernetes master needs certificates to secure its API. Therefore, there is a relation between the kubernetes-master:certificates and easyrsa:client.
This relation means that once the 2 applications are connected, some scripts will run to query the EasyRSA API to create the required certificates, then copy them in the right location on the k8s master.
These relations then create statuses in the cluster, to which charms can react.
Essentially, at a very high level, think of Juju as a pub-sub implementation of application deployment. Every action inside or outside of the cluster posts a message to a common bus, and charms can react to these messages and perform additional actions, modifying the overall state… and so on until equilibrium is reached.
Let’s add the relations:
juju add-relation kubernetes-master:certificates easyrsa:client
juju add-relation etcd:certificates easyrsa:client
juju add-relation kubernetes-master:etcd etcd:db
juju add-relation flannel:etcd etcd:db
juju add-relation flannel:cni kubernetes-master:cni
for TYPE in cpu gpu gpu8
do
juju add-relation kubernetes-worker-${TYPE}:kube-api-endpoint kubernetes-master:kube-api-endpoint
juju add-relation kubernetes-master:cluster-dns kubernetes-worker-${TYPE}:kube-dns
juju add-relation kubernetes-worker-${TYPE}:certificates easyrsa:client
juju add-relation flannel:cni kubernetes-worker-${TYPE}:cni
juju expose kubernetes-worker-${TYPE}
done
juju expose kubernetes-master
Note the expose commands at the end.
These are instructions for Juju to open up a firewall in the cloud for specific ports of the instances. Some are predefined in charms (Kubernetes Master API is 6443, Workers open up 80 and 443 for ingresses) but you can also force them if you need (for example, when you manually add stuff in the instances post deployment).
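Since the relation loop sends quite a few commands to the controller, it can help to preview them first. The following is a minimal sketch and not part of the original repository: the `run` helper, the `relate_workers` function, the `DRY_RUN` variable and the preview file name are hypothetical names introduced here, and only the juju subcommands already shown above are used.

```shell
# Hypothetical helper: print each command, and only execute it when DRY_RUN is not 1.
# DRY_RUN defaults to 1, so nothing is sent to Juju unless you opt in.
DRY_RUN=${DRY_RUN:-1}

run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Relate and expose one worker application per argument (cpu, gpu, gpu8, ...)
relate_workers() {
  for TYPE in "$@"; do
    run juju add-relation "kubernetes-worker-${TYPE}:kube-api-endpoint" kubernetes-master:kube-api-endpoint
    run juju add-relation kubernetes-master:cluster-dns "kubernetes-worker-${TYPE}:kube-dns"
    run juju add-relation "kubernetes-worker-${TYPE}:certificates" easyrsa:client
    run juju add-relation flannel:cni "kubernetes-worker-${TYPE}:cni"
    run juju expose "kubernetes-worker-${TYPE}"
  done
}

# Preview the commands for the three worker groups and keep a copy in a file
relate_workers cpu gpu gpu8 | tee relations-preview.txt
```

Running the same script with `DRY_RUN=0` in the environment would execute the commands for real instead of only printing them.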
Adding CUDA
CUDA does not have an official charm yet (coming up very soon!!), but there is my demoware implementation which you can find on GitHub. It has been updated for this post to CUDA 8.0.61 and drivers 375.26.
Make sure you have the charm tools available, clone and build the CUDA charm:
sudo apt install charm charm-tools
# Exporting the ENV
mkdir -p ~/charms ~/charms/layers ~/charms/interfaces
export JUJU_REPOSITORY=${HOME}/charms
export LAYER_PATH=${JUJU_REPOSITORY}/layers
export INTERFACE_PATH=${JUJU_REPOSITORY}/interfaces
# Build the charm
cd ${LAYER_PATH}
git clone https://github.com/SaMnCo/layer-nvidia-cuda cuda
charm build cuda
This will create a new folder called builds in JUJU_REPOSITORY, and another called cuda in there.
Now you can deploy the charm
juju deploy --series xenial $HOME/charms/builds/cuda
juju add-relation cuda kubernetes-worker-gpu
juju add-relation cuda kubernetes-worker-gpu8
This will take a fair amount of time, as CUDA takes a long time to install (CDK takes about 10min and CUDA alone probably 15min).
Nevertheless, at the end the status should show:
juju status
Model Controller Cloud/Region Version
default aws-us-east-1 aws/us-east-1 2.1-rc1
App Version Status Scale Charm Store Rev OS Notes
cuda active 2 cuda local 2 ubuntu
easyrsa 3.0.1 active 1 easyrsa jujucharms 6 ubuntu
etcd 2.2.5 active 1 etcd jujucharms 23 ubuntu
flannel 0.7.0 active 6 flannel jujucharms 10 ubuntu
kubernetes-master 1.5.2 active 1 kubernetes-master jujucharms 11 ubuntu exposed
kubernetes-worker-cpu 1.5.2 active 3 kubernetes-worker jujucharms 13 ubuntu exposed
kubernetes-worker-gpu 1.5.2 active 1 kubernetes-worker jujucharms 13 ubuntu exposed
kubernetes-worker-gpu8 1.5.2 active 1 kubernetes-worker jujucharms 13 ubuntu exposed
Unit Workload Agent Machine Public address Ports Message
easyrsa/0* active idle 0/lxd/0 10.0.0.122 Certificate Authority connected.
etcd/0* active idle 0 54.242.44.224 2379/tcp Healthy with 1 known peers.
kubernetes-master/0* active idle 0 54.242.44.224 6443/tcp Kubernetes master running.
flannel/0* active idle 54.242.44.224 Flannel subnet 10.1.76.1/24
kubernetes-worker-cpu/0 active idle 4 52.86.161.22 80/tcp,443/tcp Kubernetes worker running.
flannel/4 active idle 52.86.161.22 Flannel subnet 10.1.79.1/24
kubernetes-worker-cpu/1* active idle 5 52.70.5.49 80/tcp,443/tcp Kubernetes worker running.
flannel/2 active idle 52.70.5.49 Flannel subnet 10.1.63.1/24
kubernetes-worker-cpu/2 active idle 6 174.129.164.95 80/tcp,443/tcp Kubernetes worker running.
flannel/3 active idle 174.129.164.95 Flannel subnet 10.1.22.1/24
kubernetes-worker-gpu8/0* active idle 3 52.90.163.167 80/tcp,443/tcp Kubernetes worker running.
cuda/1 active idle 52.90.163.167 CUDA installed and available
flannel/5 active idle 52.90.163.167 Flannel subnet 10.1.35.1/24
kubernetes-worker-gpu/0* active idle 1 52.90.29.98 80/tcp,443/tcp Kubernetes worker running.
cuda/0* active idle 52.90.29.98 CUDA installed and available
flannel/1 active idle 52.90.29.98 Flannel subnet 10.1.58.1/24
Machine State DNS Inst id Series AZ
0 started 54.242.44.224 i-09ea4f951f651687f xenial us-east-1a
0/lxd/0 started 10.0.0.122 juju-65a910-0-lxd-0 xenial
1 started 52.90.29.98 i-03c3e35c2e8595491 xenial us-east-1c
3 started 52.90.163.167 i-0ca0716985645d3f2 xenial us-east-1d
4 started 52.86.161.22 i-02de3aa8efcd52366 xenial us-east-1e
5 started 52.70.5.49 i-092ac5367e31188bb xenial us-east-1a
6 started 174.129.164.95 i-0a0718343068a5c94 xenial us-east-1c
Relation Provides Consumes Type
juju-info cuda kubernetes-worker-gpu regular
juju-info cuda kubernetes-worker-gpu8 regular
certificates easyrsa etcd regular
certificates easyrsa kubernetes-master regular
certificates easyrsa kubernetes-worker-cpu regular
certificates easyrsa kubernetes-worker-gpu regular
certificates easyrsa kubernetes-worker-gpu8 regular
cluster etcd etcd peer
etcd etcd flannel regular
etcd etcd kubernetes-master regular
cni flannel kubernetes-master regular
cni flannel kubernetes-worker-cpu regular
cni flannel kubernetes-worker-gpu regular
cni flannel kubernetes-worker-gpu8 regular
cni kubernetes-master flannel subordinate
kube-dns kubernetes-master kubernetes-worker-cpu regular
kube-dns kubernetes-master kubernetes-worker-gpu regular
kube-dns kubernetes-master kubernetes-worker-gpu8 regular
cni kubernetes-worker-cpu flannel subordinate
juju-info kubernetes-worker-gpu cuda subordinate
cni kubernetes-worker-gpu flannel subordinate
juju-info kubernetes-worker-gpu8 cuda subordinate
cni kubernetes-worker-gpu8 flannel subordinate
Let us see what nvidia-smi gives us:
juju ssh kubernetes-worker-gpu/0 sudo nvidia-smi
Tue Feb 14 13:28:42 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:00:1E.0 Off | 0 |
| N/A 33C P0 81W / 149W | 0MiB / 11439MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
On the more powerful 8xlarge,
juju ssh kubernetes-worker-gpu8/0 sudo nvidia-smi
Tue Feb 14 13:59:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:00:17.0 Off | 0 |
| N/A 41C P8 31W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:00:18.0 Off | 0 |
| N/A 36C P0 70W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:00:19.0 Off | 0 |
| N/A 44C P0 57W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:00:1A.0 Off | 0 |
| N/A 38C P0 70W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 0000:00:1B.0 Off | 0 |
| N/A 43C P0 57W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 0000:00:1C.0 Off | 0 |
| N/A 38C P0 69W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 0000:00:1D.0 Off | 0 |
| N/A 44C P0 58W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 0000:00:1E.0 Off | 0 |
| N/A 38C P0 71W / 149W | 0MiB / 11439MiB | 39% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Aaaand yes!! We have our 8 GPUs as expected so 8x 12GB = 96GB Video RAM!
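As a sanity check on that figure: 12GB is the nominal size per card, while nvidia-smi reports 11439MiB per GPU in the table above. Here is a small sketch of summing what the driver reports. The `sum_mib` function name and the output file are made up for this example, and the input is typed in from the table above rather than captured live.

```shell
# Sum one memory value (in MiB) per line from standard input
sum_mib() {
  awk '{ total += $1 } END { print total }'
}

# Eight GPUs at 11439 MiB each, as shown in the nvidia-smi table above
for gpu in 0 1 2 3 4 5 6 7; do
  echo 11439
done | sum_mib > gpu-memory-total.txt
cat gpu-memory-total.txt

# On a live node the same function could be fed from the driver, for example:
#   juju ssh kubernetes-worker-gpu8/0 \
#     "nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits" | sum_mib
```

The total comes to 91512MiB, or a little under 90GiB actually reported by the driver.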
At this stage, we only have them enabled on the hosts. Now let us add GPU support in Kubernetes.
Adding GPU support in Kubernetes
By default, CDK will not activate GPUs when starting the API server and the Kubelets. We need to do that manually (for now).
Master update
On the master node, update /etc/default/kube-apiserver to add:
# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"
before restarting the API Server. This can be done programmatically with:
juju show-status kubernetes-master --format json | \
jq --raw-output '.applications."kubernetes-master".units | keys[]' | \
xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context \nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kube-apiserver && sudo systemctl restart kube-apiserver.service"
So now the Kube API will accept requests to run privileged containers, which are required for GPU workloads.
Worker nodes
On every worker, update /etc/default/kubelet to add the GPU flag, so it looks like:
# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"
# Add your own!
KUBELET_ARGS="--experimental-nvidia-gpus=1 --require-kubeconfig --kubeconfig=/srv/kubernetes/config --cluster-dns=10.1.0.10 --cluster-domain=cluster.local"
before restarting the service.
This can be done with
for WORKER_TYPE in gpu gpu8
do
juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context \nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kubelet"
juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
xargs -I UNIT juju ssh UNIT "sudo sed -i 's/KUBELET_ARGS=\"/KUBELET_ARGS=\"--experimental-nvidia-gpus=1\ /' /etc/default/kubelet && sudo systemctl restart kubelet.service"
done
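One caveat with the commands above: both the `tee -a` append and the `sed` substitution add their text every time they run, so running the loop twice leaves duplicate entries. Below is a hedged sketch of an idempotent variant, exercised on a local sample file rather than on a real node. The file name `kubelet.sample` and both function names are hypothetical, and the starting content is assumed from the example shown earlier. It relies on GNU `sed -i`, which is what Ubuntu ships.

```shell
# Local stand-in for /etc/default/kubelet (content assumed from the example above)
cat > kubelet.sample <<'EOF'
KUBELET_ARGS="--require-kubeconfig --kubeconfig=/srv/kubernetes/config --cluster-dns=10.1.0.10 --cluster-domain=cluster.local"
EOF

# Append the privileged setting only if it is not there yet
add_allow_priv() {
  grep -q '^KUBE_ALLOW_PRIV=' "$1" ||
    printf '\n# Security Context\nKUBE_ALLOW_PRIV="--allow-privileged=true"\n' >> "$1"
}

# Prepend the GPU flag to KUBELET_ARGS only if it is not there yet
add_gpu_flag() {
  grep -q -e '--experimental-nvidia-gpus=' "$1" ||
    sed -i 's/^KUBELET_ARGS="/KUBELET_ARGS="--experimental-nvidia-gpus=1 /' "$1"
}

# Applying the changes twice still leaves a single copy of each setting
for pass in 1 2; do
  add_allow_priv kubelet.sample
  add_gpu_flag kubelet.sample
done
cat kubelet.sample
```

The same two guarded commands could replace the unguarded ones inside the `juju ssh` calls, at the cost of slightly longer one-liners.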
Testing our setup
Now we want to know if the cluster actually has GPU enabled. To validate, run a job with an nvidia-smi pod:
kubectl create -f src/nvidia-smi.yaml
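How long the pod takes to finish depends on image pull time, so instead of guessing, a small polling helper can be used. This is only a sketch: `wait_for` and `pod_succeeded` are hypothetical helper names, and the label selector `name=nvidia-smi` is taken from the log command used in this post.

```shell
# wait_for <timeout_seconds> <command...>
# Re-run the command once per second until it succeeds or the timeout expires.
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Succeeds once the pod labelled name=nvidia-smi reports the Succeeded phase
pod_succeeded() {
  phase=$(kubectl get pods -a -l name=nvidia-smi -o jsonpath='{.items[0].status.phase}' 2>/dev/null)
  [ "$phase" = "Succeeded" ]
}

# Example (needs kubectl configured against the cluster):
#   wait_for 120 pod_succeeded && echo "job finished"
```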
Then wait a little bit and run the log command:
kubectl logs $(kubectl get pods -l name=nvidia-smi -o=name -a)
Tue Feb 14 14:14:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 47C P0 56W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 39C P0 70W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 48C P0 57W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 41C P0 70W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 47C P0 58W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 40C P0 69W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 48C P0 59W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 41C P0 72W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
What is interesting here is that the pod sees all the cards, even though we only shared the /dev/nvidia0 char device. At runtime, though, we would run into problems.
If you want to run multi GPU containers, you need to share all char devices like we do in the second yaml file (nvidia-smi-8.yaml)
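To illustrate what sharing all char devices means in practice, here is a rough sketch that generates the repetitive `volumeMounts` and `volumes` entries for a given number of GPU devices. It is not the content of nvidia-smi-8.yaml from the repository: the function name, the volume names and the output file are invented for this example, and a real manifest typically also needs the control devices such as /dev/nvidiactl and /dev/nvidia-uvm.

```shell
# Print YAML fragments exposing /dev/nvidia0 .. /dev/nvidia<N-1> through hostPath volumes
gpu_device_yaml() {
  count=$1

  # Entries for the container spec
  echo "volumeMounts:"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s\n' "- name: nvidia$i" "  mountPath: /dev/nvidia$i"
    i=$((i + 1))
  done

  # Matching entries for the pod spec
  echo "volumes:"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s\n' "- name: nvidia$i" "  hostPath:" "    path: /dev/nvidia$i"
    i=$((i + 1))
  done
}

# Eight devices for the p2.8xlarge node
gpu_device_yaml 8 > nvidia-devices.yaml
cat nvidia-devices.yaml
```

The two fragments belong at different levels of the manifest (`volumeMounts` under the container, `volumes` under the pod spec), so they would need to be pasted and indented accordingly.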
Conclusion
We reached the first milestone of our 3 part journey: the cluster is up & running, GPUs are activated, and Kubernetes will now welcome GPU workloads.
If you are a data scientist or running Kubernetes workloads that could benefit from GPUs, this already gives you an elegant and very fast way of managing your setups. But usually in this context, you also need to have storage available between the instances, whether it is to share the dataset or to exchange results.
Kubernetes offers many options to connect storage. In the second part of the blog, we will see how to automate adding EFS storage to our instances, then put it to good use with some datasets!
In the meantime, feel free to contact me if you have a specific use case in the cloud for this and want to discuss operational details. I would be happy to help you set up your own GPU cluster and get you started for the science!
Tearing down
Whenever you feel like it, you can tear down this cluster. These instances can be pricey, hence powering them down when you do not use them is not a bad idea.
juju kill-controller aws-us-east-1
This will ask for confirmation then destroy everything… But now, you are just a few commands and a coffee away from rebuilding it, so that is not a problem.