K8s is a container orchestration tool. Microservices architectures are everywhere these days: with a container per service, managing all those containers by hand gets hard fast. K8s takes care of scalability, security, persistence and load balancing for you.
When k8s is triggered to create a container, it will delegate it to the container runtime engine via a CRI (container runtime interface).
There are two types of nodes (these can be VMs, bare-metal machines, whatever you call a computer):
Whenever you run `kubectl something`, you are talking to this API.
These have different components to do their job:
There are some components that are shared by the nodes, whether they are control or workers.
K8s has api resources which are the building blocks of the cluster. These are your pods, deployments, services, and so on.
Every k8s primitive follows a general structure:
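A minimal sketch of that structure, using a Pod as an example (the fields under spec vary per resource kind):
apiVersion: v1          # which API group/version the resource belongs to
kind: Pod               # the type of resource
metadata:               # identifying data: name, namespace, labels, annotations
  name: nginx
  labels:
    run: nginx
spec:                   # the desired state; its fields depend on the kind
  containers:
  - name: nginx
    image: nginx:latest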
Run `k api-versions` to see the API versions compatible with your cluster.
This is how we talk to the api. Usually you do:
k <verb> <resource> <name>
Keep in mind we usually have different stuff in different namespaces, so we are always appending `-n <some namespace>` to the command.
The name of an object has to be unique across all objects of the same resource within a namespace.
There are two ways to manage objects: the imperative or the declarative way.
The imperative is where you use commands to make stuff happen in the cluster. Say you want to create an nginx pod you would do:
k run --image=nginx:latest nginx --port=80
This would create the pod in the cluster when you hit enter. In my professional experience, you hardly ever create stuff like that. The only time I use it is to create temporary pods to test something.
There are other verbs which you might use a bit more. `edit` brings up the raw config of the resource and you can change it on the fly, although I would recommend doing this only for testing things. Hopefully your team has the manifests under a version control system; if you edit stuff like this, the live state will drift from what is versioned.
There is also `patch`, which I have never used, but it... "Update fields of a resource using strategic merge patch, a JSON merge patch, or a JSON patch."
There is also `delete`, which -- as you probably guessed already -- deletes the resource. Usually the object gets a 30 sec grace period to die; if it does not, the kubelet will try to kill it forcefully.
If you do:
k delete pod nginx --now
It will ignore the grace period.
This is where you have a bunch of YAML files which are your definitions of resources. The cool thing about this is that you can version control them. Say you have a `nginx-deploy.yaml`. You can create it in the cluster with:
k apply -f nginx-deploy.yaml
This gives you more flexibility on what you are doing, since you can just go to the file, change stuff and apply it again.
Usually I use a hybrid approach. Most of the imperative commands have this `--dry-run=client -o yaml` flag that you can append to the command, and it will render the YAML manifest instead of creating anything. You can redirect that to a file and start working on it: open the YAML with your favourite text editor, and then mount volumes and stuff like that.
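For example, to scaffold that nginx pod into a file (the file name is just my choice):
k run nginx --image=nginx:latest --port=80 --dry-run=client -o yaml > nginx-pod.yaml
From there you edit nginx-pod.yaml and `k apply -f` it when you are happy with it.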
There are more ways to manage the resources: for example you can use kustomize to render different values based on the same manifest, or helm to bring up complete apps/releases to the cluster. We will probably go over them later in the book.
There are a million ways of doing this. I used terraform to create some droplets in DigitalOcean, and packer with ansible to build an image that would leave everything ready for me to run the `kubeadm` commands. `kubeadm` is the tool to create a cluster.
Here is a non-comprehensive list of what is needed before running `kubeadm` stuff:
Open ports needed for k8s to work
Disable swap; otherwise kubelet is going to fail to start
Install a container runtime, like containerd
Install kubeadm
There are some things k8s does not have by default. You need to install these extensions as needed:
Container Network Interface (CNI)
Container Runtime Interface (CRI)
Container Storage Interface (CSI)
Once you have `kubeadm` on your system, everything else is pretty straightforward. You just ssh to your control plane and run:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
This runs some preflight checks to see if everything is working properly; if not, it will likely print a message telling you what is wrong. In my case it complained about `/proc/sys/net/ipv4/ip_forward` being disabled, but I was able to fix it by just doing `echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward`.
Where does the `cidr` come from? I had exactly the same question. It seems that it will depend on the CNI you install, but do not quote me on that.
Once the command runs successfully, it will print next steps:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join ip.control.panel.and:port --token some-cool-token \
--discovery-token-ca-cert-hash sha256:some-cool-hash
Just follow those steps. You ssh into the workers and join them with that command. If you lose the tokens for some reason you can reprint them with:
kubeadm token create --print-join-command
Now, before joining the workers, you need to install the CNI; you can pick any of the ones on the k8s add-ons docs.
Installing them is nothing fancy, you literally just run a `k apply -f some-manifest` and be done with it. I went with calico for no particular reason.
The control plane is the most important part of the cluster: if it fails, you are not even going to be able to talk to the API to do stuff. We can add redundancy to improve this, and this is where HA architectures come into play.
There are two topologies:
Stacked etcd topology
External etcd topology
In the stacked topology you have at least three control plane nodes, each with its own etcd running on the same node. All the nodes run at the same time and the workers talk to them through a load balancer; if one dies, we still have the others.
In the external topology, per control plane we have two nodes: one that runs etcd and one that runs the actual control plane stuff. Each etcd member communicates with the `kube-apiserver` of each control plane node. This topology requires more nodes, which means a bit more management overhead.
It is recommended to upgrade from a minor version to the next higher one, say, `1.18.0` to `1.19.0`, or from a patch version to a higher one, `1.18.0` to `1.18.3`.
The high level plan is this:
Upgrade a primary control plane node
In case of HA, upgrade additional control planes
Upgrade worker nodes
One last thing before going to the steps. You are going to see that when we `drain` a node we use the `--ignore-daemonsets` flag. Which begs the question, what is a daemonset? A daemonset defines pods needed for node-local stuff; say you want a daemon on each node that collects logs, you can deploy a daemonset for it. When we drain a node to upgrade, we tell it not to kick the daemonset pods out, since we might actually need those for the node to operate properly.
`ssh` into the node
`k get nodes` to check the current version
Use your package manager (`apt`/`dnf`) to upgrade `kubeadm`
Check which `kubeadm` versions are available to upgrade to:
$ sudo kubeadm upgrade plan
...
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.18.20
[upgrade/versions] kubeadm version: v1.19.0
I0708 17:32:53.037895 17430 version.go:252] remote version is much newer: \
v1.21.2; falling back to: stable-1.19
[upgrade/versions] Latest stable version: v1.19.12
[upgrade/versions] Latest version in the v1.18 series: v1.18.20
...
You can now apply the upgrade by executing the following command:
kubeadm upgrade apply v1.19.12
Note: Before you can perform this upgrade, you have to update kubeadm to v1.19.12.
...
Upgrade it with `kubeadm upgrade apply v1.19.12`
Then we need to drain the node, which means we mark the node as unschedulable and new pods won't arrive.
kubectl drain kube-control-plane --ignore-daemonsets
Use your package manager to upgrade both `kubelet` and `kubectl` to the same version
Restart and reload the `kubelet` daemon with `systemctl`
Mark the node as schedulable again: `k uncordon kube-control-plane`
`k get nodes` should show the new version
`ssh` into the node
`k get nodes` to check the current version
Use your package manager (`apt`/`dnf`) to upgrade `kubeadm`
Do `kubeadm upgrade node` to upgrade the `kubelet` configuration
Drain the node as we did with the control plane
kubectl drain worker-node --ignore-daemonsets
Use your package manager to upgrade both `kubelet` and `kubectl` to the same version
Restart and reload the `kubelet` daemon with `systemctl`
Mark the node as schedulable again: `k uncordon worker-node`
`k get nodes` should show the new version
etcd is a key-value store used as the k8s backing store for all cluster information. It is a standalone project with its own docs. Since it holds all the cluster data, we need to know how to use it in order to back up and restore the cluster.
There are two CLIs we will be working with: `etcdctl` and `etcdutl`.
`etcdctl`: the primary way to interact with etcd over the network.
`etcdutl`: designed to operate on etcd data files directly, not over the network.
`kubeadm` will set up etcd as pods managed directly by the kubelet daemon (known as static pods). You can actually see them by running `k get pods -n kube-system`.
All k8s data is stored in etcd, and this includes sensitive data; keep that in mind when handling snapshots, since they are not encrypted by default.
In order to talk to etcd we can `ssh` into the control plane, then do `etcdctl version` to verify it is installed.
If you went with `kubeadm` as your installation method, you can see that there is a pod in the `kube-system` namespace that runs etcd. If you `describe` it you will see some information relevant to connect to etcd.
k describe pod etcd-cka-control-plane -n kube-system | grep '\-\-'
--listen-client-urls=https://10.2.0.9:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--key-file=/etc/kubernetes/pki/etcd/server.key
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
If we want to talk to etcd from outside the control plane node, we will need the `--listen-client-urls` addresses; if you are inside the node, you can skip that. We are also going to need the paths to all the TLS material. A simple command to test if you have everything right is the following:
ETCDCTL_API=3 etcdctl --endpoints 10.2.0.9:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
member list
bbf4baa696b33a2e, started, control-plane, https://10.2.0.9:2380, https://10.2.0.9:2379
Since the certificates are inside a path which your user probably does not have access to, you will have to `sudo` it.
Then you can create a snapshot by running the `snapshot save /path/to/new/snapshot` command:
ETCDCTL_API=3 etcdctl --endpoints https://162.243.29.89:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
snapshot save snapshot.db
2025-08-09 17:04:29.201951 I | clientv3: opened snapshot stream; downloading
2025-08-09 17:04:29.241278 I | clientv3: completed snapshot read; closing
Snapshot saved at snapshot.db
We will use `etcdutl` to restore a snapshot.
etcdutl --data-dir /path/to/be/restored/to snapshot restore snapshot.db
We also need to point the etcd pod to this new path we have restored the data to. You can find the manifest for the etcd pod under `/etc/kubernetes/manifests/etcd.yaml`. There is a volume called `etcd-data`; point it to the new path, and restart the pod.
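As a rough sketch (paths may differ in your cluster), the relevant bit of that manifest looks something like this; you would point the hostPath to the directory you restored into:
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd            # change this to the restored data dir
      type: DirectoryOrCreate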
The way anyone talks with k8s is through the API; it does not matter if you are a human or a service account, you all talk to the k8s HTTP API. When a request gets to the server it goes through some stages, shown in the docs diagram which I copy pasted here:
All the requests go through TLS. By default the API will run on `0.0.0.0:6443`, but this can be changed with the `--secure-port` and `--bind-address` flags.
When you `kubeadm init` your cluster, k8s will create its own Certificate Authority (CA) and its key (`/etc/kubernetes/pki/ca.crt` and `/etc/kubernetes/pki/ca.key` respectively). It will use this to sign the certificates used by the API server.
Inside your `.kube/config` file you will need a copy of that CA certificate; this verifies that the API's certificate is authentic and was signed with the cluster's CA.
Once we have TLS, we can continue with authentication. The cluster admin may set up different authentication modules; if so, they will be tried sequentially to see if any of them suffices.
K8s may use the whole HTTP request to authenticate, although most modules only use the headers.
If all the modules fail, a `401` will be returned. If it is successful, the user is authenticated as a specific `username`.
Once the request has passed the authentication stage, it is time to see if it can in fact do the action it is trying to accomplish. The request must include the username, a requested action, and the resource affected by the action. The request will then be authorized if there is an existing policy that declares the user has permission to perform the intended action.
There are different authorization modules; the administrator can set up many in one cluster. They will be tried one by one, and if all of them fail a `403` will be returned.
If the authorization is successful, then we jump to admission controllers. They are basically pieces of code that check the data arriving in a request that modifies a resource. They do not control requests to read resources, only those that modify them. They usually just validate stuff. The thing is, if one fails the request is rejected; it is not like the other stages where we try one by one until one succeeds.
Finally, auditing generates a chronological set of records documenting everything that is happening.
Role-based access control is a way of controlling access to network resources based on the roles an individual has. The `rbac.authorization.k8s.io` API group allows you to set them up dynamically in the k8s cluster.
RBAC introduces 4 new object types to the cluster: `Role`, `ClusterRole`, `RoleBinding`, `ClusterRoleBinding`.
Role and ClusterRole
These represent a set of permissions. The only difference between the two is that `Role` defines the permissions for a namespace, and `ClusterRole` is not limited to a namespace.
Role
Here is the command for creating a role to get and watch all the pods in the nginx namespace.
k create role --dry-run=client -o yaml pod-reader --resource=pod --verb=get,watch -n nginx
To be honest you might be better off just going to the docs and copying the manifest from there, since it can get a bit long to write all the verbs and resources in one command.
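For reference, the command above renders roughly this manifest:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: nginx
rules:
- apiGroups: [""]            # "" means the core API group
  resources: ["pods"]
  verbs: ["get", "watch"]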
ClusterRole
Since these are not bound to a namespace, you can also use them to set permissions on cluster-scoped things like nodes and persistent volumes.
The command is super similar, we just do not specify a namespace:
k create clusterrole secret-watcher --resource=secret --verb=get,list --dry-run=client -o yaml
Another thing specific to `ClusterRole`s is that you can aggregate them. When you create one you can add a label to it; then you can create another one that selects that label.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-aggregate
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.example.com/aggregate-to-monitoring: "true"
rules: []
In this example we are aggregating all `ClusterRole`s that have the label `rbac.example.com/aggregate-to-monitoring: "true"`.
RoleBinding and ClusterRoleBinding
Once you have created your `Role` object you can bind it to a user or service account. This makes `Role`s reusable. For example, you can create a pod-read-only role and bind it to many subjects (users, groups or service accounts).
A `RoleBinding` may bind any `Role` in the same namespace, buuut you can also use them to bind `ClusterRole`s to a single namespace.
You can create them as you would expect
k create rolebinding pod-reader --dry-run=client -o yaml --role=pod-readonly --user=jose
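That renders roughly the following; roleRef points to the role and subjects lists who gets it:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-readonly
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: jose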
Remember you can always `--help` stuff, or copy an example from the wiki.
You cannot patch/edit an existing rolebinding to change the role it refers to; you have to delete it and create it again. There is `kubectl auth reconcile` which will do that for you.
One last tip, you can always do
k auth can-i get pod/logs --as="some-subject" -n "ns" # can-i verb resource
To check if the role is working as expected.
A service account is a non-human account that provides an identity in a k8s cluster. Pods can use them to make requests against the k8s API, or to authenticate against an image registry.
They are represented in the k8s cluster with the `ServiceAccount` object. They are namespaced, lightweight and portable.
There is also a `default` service account created in every namespace. If you try to delete it, the control plane replaces it. This account is assigned to all pods if you do not manually assign one, and has API discovery permissions.
You can use RBAC to add roles to it; it is just another subject you can include in the role manifests.
To use one you just have to:
Create the service account, in a declarative or imperative way
Give it roles with RBAC
Assign it to a pod during its creation
k create token "sa-name" -n test-token
To assign one to a pod just add the `spec.serviceAccountName` field.
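A minimal sketch, assuming a service account called sa-name already exists in the namespace:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  serviceAccountName: sa-name      # the identity this pod will use against the API
  containers:
  - name: my-app
    image: nginx:latest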
A resource is an endpoint in the k8s API that manages objects of the same type. One example is the `pods` resource; it is an endpoint of the API and you use it to create, destroy and list pod objects.
A custom resource, then, is an extension of the k8s native API. You can create your own resources for your own needs. Custom resources can be created and destroyed dynamically on a running cluster, and once installed you can use `kubectl` to manage them as you would manage any other resource.
Say, you might create a database custom resource to represent and manage a database inside your cluster.
A custom resource by itself only represents some structured data. To make it work in a truly declarative way you also need to add a controller.
In an imperative API you tell the server to do something and it does it. In a declarative API, like k8s', you tell it the state you want to reach, in this case using the custom resource endpoints, and there will be a controller that makes sure that state becomes true.
It is outside of the scope of the exam, but there are two ways of creating a custom resource:
Using the `CustomResourceDefinition` API resource
Using API server aggregation
The operator pattern is creating a custom controller to manage a custom resource.
One example would be deploying an application on demand. This would look something like: we have a new custom resource called `ApplicationDeployment` where the developer specifies the application they want to deploy. Now when they `k apply -f` it, there is a controller that takes care of the whole deployment of the app.
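An instance of that hypothetical custom resource could look something like this (the group, version and fields are all made up for illustration):
apiVersion: example.com/v1alpha1             # hypothetical API group/version
kind: ApplicationDeployment
metadata:
  name: my-app
spec:
  image: registry.example.com/my-app:1.0.0   # hypothetical fields the controller would act on
  replicas: 3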
There are many operators already created by the community; you can find several in the OperatorHub. A popular one is ArgoCD, which defines custom resources such as `Application` where you can point to a git repository, and it will make sure the cluster stays in sync with that repository, among other things. Popular in organizations using GitOps.
Helm is a package manager for k8s. This means that you can use it, similar to `apt` or `dnf`, to install full working packages in the k8s cluster.
Usually a deployment of a full service in a k8s cluster would involve multiple resources: `services`, `pods`, `configmaps`. It would be a bit complicated to deploy all of them using `kubectl`.
With helm you can deploy full working solutions with just a few commands.
Say you want to deploy jenkins in your cluster. You could just look in the ArtifactHub for jenkins, and follow the instructions for installing the chart. It typically looks something like the following.
We first need to add the repo for helm to keep track of it.
helm repo add jenkins https://charts.jenkins.io
helm repo update
Then you can just install it, specifying a name for the release. Do not forget that you are using your `kubeconfig` configuration, so the namespace and cluster you are pointing to will be the target of this operation.
helm install my-jenkins jenkins/jenkins --version 5.8.25 # helm install [RELEASE_NAME] jenkins/jenkins [flags]
It will create all the k8s resources needed for it to work. The cool thing about this is that you can customize it a bit by passing values to certain variables of the package. Say you want to change the admin user; it varies per package of course, but here you can do something like:
helm install my-jenkins jenkinsci/jenkins --version 4.6.4 \
--set controller.adminUser=boss --set controller.adminPassword=password \
-n jenkins --create-namespace
You can discover a list of all the values too.
helm show values jenkinsci/jenkins
helm list
helm repo update; # so we have the most up-to-date version
helm upgrade my-jenkins jenkinsci/jenkins --version 5.8.26
helm uninstall my-jenkins
Kustomize allows you to manage multiple k8s manifests in an easy way. It has different capabilities:
You can build `configmaps` and other resources out of files.
You can patch different values, say the DNS for an application based on different overlays/environments.
Just a few quick things: the heart of this is the `kustomization.yaml` file; there you will list all the resources kustomize will use to render the templates.
You can also render how the manifests would look, without applying them, with:
kustomize build /path/to/kustomization.yaml # or
k kustomize /path/to/kustomization.yaml
Here is a short example of how you can start using this, say to add the same namespace to two different manifests.
% tail -n +1 kustomization.yaml pod.yaml configmap.yaml
== kustomization.yaml ==
namespace: kustom
resources:
- pod.yaml
- configmap.yaml
== pod.yaml ==
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: nginx
name: nginx
spec:
containers:
- image: nginx:1.21.1
name: nginx
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
== configmap.yaml ==
apiVersion: v1
data:
dir: /etc/logs/traffic-log.txt #/etc/logs/traffic.log
kind: ConfigMap
metadata:
creationTimestamp: null
name: logs-config
A workload is an application running inside a k8s cluster. Whether your application has different components running or just one, you will run it inside a set of Pods. A Pod is nothing more than a set of containers.
Pods have a defined life-cycle, meaning if you kill one or it dies due to some issue, it is not going to respawn by itself or anything.
To make life easier, k8s has a set of different controllers that will help manage these pods. Say, always keeping 3 of them alive: even if one is killed, spin up another one to take its place. You can use workload resources to make this happen. The workload resources will configure these controllers depending on what you want to do. We will go more in depth on each, but here is a brief intro to each of them.
Deployment and ReplicaSet: these are good for managing workloads where pods are replaceable/interchangeable, i.e. stateless applications.
StatefulSet: this will help you run applications where pods do keep track of state. Useful when mounting Persistent Volumes to different pods, so they stay consistent.
DaemonSet: pods that provide some functionality to Nodes, maybe for networking, or to manage the node. These are like daemons that will be assigned to each Node.
Job and CronJob: define tasks that run until completion and then stop.
A pod is like a set of containers with shared namespaces and shared file systems. You can run just one or multiple containers in one Pod.
Pods are considered ephemeral: pods are created, assigned a unique ID (UID), and scheduled to run on nodes where they will live until their termination. If a node dies, the pods that lived there, or were scheduled to live there, are marked for deletion.
While a pod is running, the `kubelet` can restart its containers to handle some kinds of faults.
Pods are only scheduled once in their lifetime; assigning a pod to a node is called binding, and the process of selecting which node the pod should go to is known as scheduling. Once a pod is scheduled to a node they are bound until either of them dies.
A pod is never "re-scheduled", it is simply killed and replaced by maybe a super similar one but the UID will be different.
There are several pod phases:
Phase | Description |
---|---|
Pending | The Pod has been accepted by k8s, but one or more containers are not ready to run. This means it might be waiting for scheduling or downloading an image from a registry. |
Running | The Pod has been bound to a Node, all the containers have been created. At least one of them is running, or in the process of starting/restarting. |
Succeeded | All containers in the Pod have been terminated in success. |
Failed | All containers in the Pod have been terminated, and at least one terminated in failure. |
Unknown | We could not get the state of the pod, usually due to an error communicating with the Node the pod is running on. |
CrashLoopBackOff and Terminating are not actually phases of a pod. Make sure not to confuse status with phase.
Similar to every living thing on this green Earth, a Pod will be presented with issues along its time in this world filled with thorns and thistles. Maybe, as us, even its own life will depend on how well it is able to solve them. This unnecessary biblical de-tour begs the question, how does it handle problems with containers?
The pod's `spec` has a `restartPolicy`. This will determine how k8s reacts to containers exiting due to errors.
Initial crash: k8s immediately tries to restart it based on the `restartPolicy`
Repeated crashes: if it keeps failing, it will add an exponential backoff delay to the next restarts
CrashLoopBackOff state: this status indicates the backoff delay mechanism is in effect
Backoff reset: if a container manages to stay alive for a certain duration of time, the backoff delay is reset
Troubleshooting is its own separate section, but here are some reasons a Pod might be `CrashLoopBackOff`ing:
Application errors are causing the container to exit
Configuration errors, missing files, or env vars
Resources, the container may not have enough memory or cpu to start
Healthchecks are failing if the application doesn't start serving in time.
How to debug this? Check the `logs` and `events`, ensure the configuration is set up properly, check resource limits, debug the application. Maybe even run the image locally and see if it works fine.
A `restartPolicy` can be `Never`, `Always`, or `OnFailure`.
A probe is a diagnostic periodically performed by the kubelet. There are three types: the `livenessProbe`, `readinessProbe` and `startupProbe`.
Pretty self explanatory; maybe the only thing to clarify is that the `startupProbe` indicates whether the app inside a container has started. All the other probes are disabled until this one succeeds. Usually this one is used for containers that take a long time to start.
And the `readinessProbe` indicates whether the container is ready to respond to requests.
There are 4 check mechanisms (a short example combining two of them follows this list):
`exec`: execute a command inside the container; successful if it returns `0`.
`grpc`: performs a remote procedure call using gRPC.
`httpGet`: makes an HTTP GET request against the pod IP on a given endpoint.
`tcpSocket`: performs a TCP check; considered successful if the port is open.
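A small sketch combining two of them; the /healthz path is just an assumption about the app:
spec:
  containers:
  - name: my-app
    image: nginx:latest
    livenessProbe:
      httpGet:
        path: /healthz             # hypothetical health endpoint
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 80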
An init container is one (or an array) of containers that will run before your main application containers. They run until completion, meaning they cannot live side by side with your main containers. Those are sidecars, which we will talk about later.
They run sequentially, and if one fails the `kubelet` will restart that init container until it succeeds. But if the pod's `restartPolicy` is set to `Never`, when an init container fails the whole pod is treated as failed.
They have all the fields and features of regular containers, they just do not have probes.
They are useful to setup different stuff in your application. Like set up things in volumes and stuff like that. Maybe download a file or something.
Here is an example from the docs where the init container waits for a svc in k8s to be up and running before starting this pod's containers.
apiVersion: v1
kind: Pod
metadata:
name: myapp-pod
labels:
app.kubernetes.io/name: MyApp
spec:
containers:
- name: myapp-container
image: busybox:1.28
command: ['sh', '-c', 'echo The app is running! && sleep 3600']
initContainers:
- name: init-myservice
image: busybox:1.28
command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
Sidecar Containers are containers that run side by side with the application container; they can be used for logging, data sync, monitoring, and so on. Typically you only have one application container per pod. For example, if you have a web app that requires a web server, you would have your web app in the app container and use a sidecar container as the web server.
The implementation is super simple: just add a `restartPolicy` to an init container and there you have it. The cool thing is that they will still be started sequentially, but the one with the `restartPolicy` will keep running.
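A minimal sketch, with a made-up log shipper as the sidecar:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app                   # the main application container
    image: nginx:latest
  initContainers:
  - name: log-shipper              # hypothetical sidecar
    image: busybox:1.28
    command: ['sh', '-c', 'while true; do echo shipping logs...; sleep 10; done']
    restartPolicy: Always          # this is what makes it a sidecar instead of a one-shot init container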
You can also use probes on these containers, as opposed to regular init containers.
When a pod dies, first the app containers are deleted, then the sidecar containers in the opposite order to which they were spawned.
You can have multiple containers as the main containers of the pod, so when to use this? The app containers are meant for executing primary application logic; that is why usually you just have one and use sidecars for anything else.
Your applications will run inside pods, and the k8s API offers different resources to help you manage them. Say a pod dies; you would not like to have to get up in the middle of the night and start it again. We can use k8s objects that will help us manage them.
A replicaset's purpose is to keep a stable set of replica pods running at any given time. You tell it how many you want and what the pod looks like, and it will remove/create pods to maintain this state.
In a replicaset's fields you specify a selector, which tells the replicaset how to identify pods it can acquire, and a number specifying how many pods it should be maintaining.
You are not likely to create a replicaset by itself; you usually create higher level resources like a Deployment, which then uses ReplicaSets.
A deployment manages a set of pods that run an application. You describe the desired state for the application, say, how many pods need to be available? and the controller will maintain that state.
There are some fields worth mentioning in a deployment's manifest. First, the `spec.replicas` field specifies the number of replicas, of course. We also have `spec.selector`, which tells the replicaset how to find the pods to manage; usually this one will match the labels set in the `spec.template` pod template.
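A minimal example showing how those fields relate (image and labels are arbitrary):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx                 # must match the labels in the pod template below
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21.1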
Once you `k apply` the deployment, there are some commands that will come in handy:
`k get deploy` will give you overall information on the deployments; how many replicas have been created, how many are available, the age.
`k rollout status deployment/some-deployment` will print a message telling you how many replicas have been rolled out, and similar stuff.
`k get rs` will print the status of the replicasets.
If you update an image from the deployment (`k set image deployment/some-deploy nginx=nginx:latest`) and do `k get rs`, you will see that there are two replicasets now: the one with the previous image, which now marks its pods as 0, and the new one with the pods marked as `spec.replicas` says.
The deployment controller ensures that only a specific number of pods are down while being updated. By default it makes sure that at least 3/4 of the desired number of pods are up, meaning only 1/4 can be unavailable.
When updating, the controller will look for existing replicasets that match the `spec.selector` labels but whose pods do not match the current `spec.template`, and scale those down, while a new replicaset with the new `spec.template` is scaled up.
If you update a deployment but it is not going the way you wanted, you can easily go back to a previous version of your deployment. First you need to check the rollout history to choose which revision you will roll back to.
kubectl rollout history deployment/nginx-deployment
You can see more details on a revision by using the same command with the `--revision=n` flag.
If you decide to roll back you can do:
kubectl rollout undo deployment/nginx-deployment --to-revision=n
You can scale the replicas in a deployment with
kubectl scale deployment/nginx-deployment --replicas=10
If a Horizontal Pod Autoscaler is set up, you can scale based on CPU/memory usage:
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
The `spec.strategy` field tells you the type of strategy used to replace old pods with new pods. It can be either Recreate or RollingUpdate. The latter is the default value.
If Recreate, all pods are killed before new ones are created.
If RollingUpdate, one replicaset is scaled down while a new one is scaled up. You can specify `maxUnavailable` and `maxSurge` to control this.
Max Unavailable: tells how many pods can be unavailable during the update process. If set to 30%, the deployment will scale down the old replicaset to 70% of its capacity, and will not scale it further down until the new pods in the new replicaset are ready, making sure that at least 70% of the pods are always available.
Max Surge: this specifies how much the number of pods can go over the desired count in the deployment. Say you have 10 pods running and set this to 3; when the update starts, the controller can scale the total number of pods up to 13. The higher this number, the faster the update, at the expense of using more resources.
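In the manifest this lives under spec.strategy; a sketch using the numbers from the examples above:
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%        # at least 70% of the desired pods stay available
      maxSurge: 3                # up to 13 pods may exist during the update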
These are similar to deployments, but they maintain a sticky identity for each of the pods they create.
You will use a StatefulSet if (see the sketch after this list):
You need stable network identities, meaning your pods keep the same name after (re)scheduling, as opposed to a random hash at the end of their name.
You want to specify a PVC per pod, and not have them fight for one pre-defined volume.
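A sketch of both ideas, with a hypothetical database image; volumeClaimTemplates gives each pod its own PVC (data-db-0, data-db-1, ...):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                # headless service that gives the pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:16       # hypothetical image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi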
In a pod you can specify how much of a resource (RAM and CPU) a container needs.
You can set a request for a certain amount of resources, and the scheduler uses this information to decide which node to put the pod on. You can also set a limit on how many resources a container can use, and the `kubelet` will make sure the running container does not exceed those.
A pod may use more resources than it requested, as long as the node has enough of them there will be no issue.
Limits work differently though: they are enforced by the Linux kernel. For `cpu` they are hard limits; the kernel will restrict access to the CPU based on the limit by throttling it. For `memory` the kernel uses out-of-memory (OOM) kills. This does not mean that as soon as the container exceeds its memory limit it is killed; the kernel will only kill it if it detects memory pressure.
You specify `cpu` and `memory` requests/limits using specific units: Kubernetes CPU units and bytes respectively.
Usually you specify them at the container level (see the example after these field lists):
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
But since `v1.32` you can also set them at the pod level:
spec.resources.limits.cpu
spec.resources.limits.memory
spec.resources.requests.cpu
spec.resources.requests.memory
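A container-level sketch using both kinds of units (m for millicores, Mi for mebibytes):
spec:
  containers:
  - name: my-app
    image: nginx:latest
    resources:
      requests:
        cpu: 250m               # a quarter of a CPU core
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi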
You can limit the resources per namespace using a k8s resource called `ResourceQuota`. These are not limited to only `memory` and `cpu`; you can also limit the number of objects that can be created in a namespace, say only allow 10 pods or something like that.
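A sketch of one limiting compute and object counts in a hypothetical dev namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev                # hypothetical namespace
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "10"                  # object count limit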
Users need to specify the resource limits or requests on their workloads, otherwise the API may not give permission to create them. This can be a bit painful for developers, so you can define a `LimitRange` to set defaults on pods that do not explicitly set the requirements.
Important note on all these: they do not apply to running pods, only to new pods. So if you have some deployment and then set a LimitRange expecting the existing pods from that deployment to pick it up, you are wrong.
Here is an example of how they look:
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-resource-constraint
spec:
limits:
- default: # this section defines default limits
cpu: 500m
defaultRequest: # this section defines default requests
cpu: 500m
max: # max and min define the limit range
cpu: "1"
min:
cpu: 100m
type: Container
One last thing: the LimitRange won't check if your limits make sense. If you specify a limit lower than your request, it will let you fail.
The k8s network model has several pieces:
Each pod has its own unique cluster-wide IP.
A pod has its own private network which is shared by all the containers running in the pod; they can talk to each other using `localhost`.
The pod network handles communication between pods. It makes sure pods can communicate with each other regardless of the node they are on. This also allows node daemons to talk to the pods living on the same node.
The Service api, provides a long lived IP address/hostname for a service implemented by pods. The pods can be replaced but the service will stay the same.
There is another object called `EndpointSlice` which provides information about the pods currently working for a service.
The Gateway API (or its predecessor, Ingress) allows you to make a `svc` accessible to clients outside the cluster.
NetworkPolicy allows you to control traffic between pods
You can specify how a pod is allowed to communicate with different entities over the network using Network Policies. They depend on the network plugin you use, but you can usually specify which namespaces, which pods, or which IP blocks are allowed to send and receive traffic to/from a pod.
If you go for the namespaces/pods NetworkPolicy, you will use a selector to tell what traffic is allowed.
If you go with the IP blocks you will define CIDR ranges.
Two things worth mentioning: a pod always allows traffic between itself and the node it runs on, and a pod cannot block access to itself.
There are two types of pod isolation, `egress` and `ingress`. They are declared independently.
`egress` tells us who the pod is allowed to send traffic to; meaning who it can speak to.
`ingress` tells us who the pod is allowed to receive traffic from; meaning who it can listen to.
By default a pod allows all outbound (egress) and inbound (ingress) connections. You can create NetworkPolicy resources whose selector matches a pod, and their rules apply to it; they are cumulative.
Here is an example from the docs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: test-network-policy
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Ingress
- Egress
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
- namespaceSelector:
matchLabels:
project: myproject
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 6379
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 5978
A few fields worth mentioning:
`spec.podSelector`: selects the group of pods to which the policy will apply. An empty one will select all pods in the namespace.
`spec.policyTypes`: this may include Ingress, Egress, or both.
`spec.egress`: list of allowed egress rules. Has `to` and `ports` sections.
`spec.ingress`: list of allowed ingress rules. Has `from` and `ports` sections.
Be careful, these two configs are different.
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
podSelector:
matchLabels:
role: client
Here we are accepting traffic from pods that are in a namespace with the label `user: alice` and that also have the label `role: client`.
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
- podSelector:
matchLabels:
role: client
Here you are accepting traffic from pods that are in the namespace with that
label or from pods that have that label.
For `ipBlock`, the IP blocks you select must be cluster-external IPs, since Pod IPs are ephemeral.
One last thing worth mentioning: to target a namespace by name you will have to use the immutable label `kubernetes.io/metadata.name`.
Say you do a deployment to your cluster that serves as a backend for an application you want to access over the network. It can get tricky, because the pods of a replicaset are ephemeral and their IPs will change all the time; your front end cannot be expected to update the address every time something happens.
This is why we have services: they allow you to select a group of pods using label selectors (so if one is killed and respawned it will still be picked up) and assign an IP that won't change within your cluster.
This way, pods can be killed and respawned but the frontend only has to keep track of 1 address.
In the service definition you specify the selector and the ports you want to target.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app.kubernetes.io/name: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
The pods selected by that label will form a resource called `EndpointSlices`, which basically does the mapping; the controller for that service will update the endpoints if the pod IPs change.
You can define names for the ports inside pods, which you can then use as a reference in the service (see the example below).
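A sketch reusing the example above, with the container port named and the service referring to it by name:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
  - name: my-app
    image: nginx:latest
    ports:
    - name: http-web            # the named port
      containerPort: 9376
---
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: http-web        # reference the port by name instead of number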
You can also create the service for a deployment with the `k expose` command.
`ClusterIP`: makes the service reachable from within the cluster.
`NodePort`: maps the service to a port on the node, which gives it outside access.
`LoadBalancer`: exposes the service externally using a load balancer. k8s does not offer a load balancing component; you will have to use a cloud provider or something.
`ExternalName`: maps the service to the `externalName` field, say `api.foo.example`; this sets up the cluster's DNS server to return a CNAME record with that hostname value.
The Gateway API is a set of k8s resources that provide traffic routing and make network services available. They are role-oriented, meaning each level of resource is supposed to be managed by a different persona: infra engineer, cluster admin, and developer. Here is a list of the 3 levels.
`GatewayClass`: these are managed by the infra engineer; they are similar to a `StorageClass` in that they are not limited to namespaces, they are cluster-scoped, and usually given by the cloud provider. This is how the cloud provider handles requests from the outside world.
They are as simple as this:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: example-class
spec:
controllerName: example.com/gateway-controller
`Gateway`: these describe how traffic can be translated to Services within the cluster.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: example-gateway
spec:
gatewayClassName: example-class
listeners:
- name: http
protocol: HTTP
port: 80
This is basically saying: create a gateway using the class specified there, and listen on port `80`.
`HTTPRoute`: defines the behaviour of HTTP requests coming from the gateway listener.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: example-httproute
spec:
parentRefs:
- name: example-gateway
hostnames:
- "www.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /login
backendRefs:
- name: example-svc
port: 8080
Here we are telling the gateway we just created that if in the `Host:` header of the request you find `www.example.com` and the path is `/login`, then you should use `example-svc` on port `8080`.
An Ingress is an object that manages external access to a `svc`. Here you can define a hostname, TLS, among other things.
To make them work in your cluster you need to first have an Ingress Class; there are a few you can choose from, like the ingress-nginx controller.
The imperative way of creating one is actually a good way to understand them too. Look at the command:
k create ing website-api --rule='website.com/api=my-svc:8080'
The rule part tells the cluster that if a request comes in where the host is `website.com` and the path is `/api`, it should map it to the `svc` called `my-svc` on port `8080`.
Basically `host/path=service:port`.
In that example we are not specifying `tls`, but you can do it by pointing to a secret of that type.
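A sketch of the same rule with TLS added, assuming a kubernetes.io/tls secret called website-tls already exists:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: website-api
spec:
  tls:
  - hosts:
    - website.com
    secretName: website-tls     # secret holding the certificate and key
  rules:
  - host: website.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: my-svc
            port:
              number: 8080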
There are some path types, like whether the `/api` you specified should be an exact match (`Exact`) or can be a prefix (`Prefix`), and stuff like that.
You can talk with pods and services within the cluster using its DNS. It is as simple as following this structure (where type is svc for services or pod for pods):
name.namespace.type.cluster.local
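A quick way to test this, assuming a service called my-service exists in the default namespace:
k run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup my-service.default.svc.cluster.local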
In order to do this, k8s runs a DNS server implementation called CoreDNS. If you get the pods from `kube-system` you will be able to see the pod that is running it. The config is in a `cm` called coredns in the same namespace.