Demystifying the Kubernetes Networking Model

Anupam Mahapatra
18 min read · Jan 25, 2021

This guide provides a foundation for understanding the Kubernetes networking model and how it enables common networking tasks. The field of networking is both broad and deep and it’s impossible to cover everything here. This guide should give you a starting point to dive into the topics you are interested in and want to know more about. Whenever you are stumped, leverage the Kubernetes documentation and the Kubernetes community to help you find your way.

Overview of the internal components

API Server and etcd datastore

In Kubernetes, everything is an API call served by the Kubernetes API server (kube-apiserver). The API server is a gateway to an etcd datastore that maintains the desired state of your application cluster. To update the state of a Kubernetes cluster, you make API calls to the API server describing your desired state. Both the API server and the etcd datastore reside on the master (control-plane) nodes.
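For example, you can talk to the API server directly. A minimal sketch, assuming `kubectl` is already configured against your cluster:

```bash
# List Pods in the default namespace by calling the API server directly;
# kubectl proxies the request and handles authentication for us.
kubectl get --raw /api/v1/namespaces/default/pods | head -c 400

# Equivalently, everyday commands are just wrappers over the same API:
kubectl get pods --namespace default
```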

Controllers

Controllers are the core abstraction used to build Kubernetes. Once you’ve declared the desired state of your cluster using the API server, controllers ensure that the cluster’s current state matches that desired state. Each controller runs a simple loop: watch the API server for changes, compare the cluster’s current state against the desired state, and react to close any gap.
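As an illustration only, here is a toy reconciliation loop written as a shell script. Real controllers use the watch API rather than polling and act on the drift they detect; the `web` Deployment name is just a placeholder:

```bash
# Naive reconciliation loop: compare desired vs. observed replicas of a
# hypothetical "web" Deployment and report any drift.
while true; do
  desired=$(kubectl get deployment web -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get deployment web -o jsonpath='{.status.readyReplicas}')
  if [ "${ready:-0}" != "${desired}" ]; then
    echo "drift: want ${desired} ready replicas, have ${ready:-0}"
    # a real controller would now act (create/delete Pods) to converge
  fi
  sleep 5
done
```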

Kubelet

The kubelet is a controller service responsible for launching (by communicating with the Kube API server on the master) and maintaining a set of Pods.
The kubelet is also responsible for registering a Node with the Kubernetes cluster, sending events and Pod status, and reporting resource utilization.
The kubelet doesn’t manage containers that were not created by Kubernetes.

For example, when you create a new Pod using the **API server**, the **Kubernetes scheduler** (a controller) notices the change and makes a decision about where to place the Pod in the cluster. It then writes that state change using the API server (backed by etcd). The **kubelet (a controller)** then notices that new change and sets up the required networking functionality to make the Pod reachable within the cluster. Here, two separate controllers react to two separate state changes to make the reality of the cluster match the intention of the user.

Kube-proxy

kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system’s packet filtering layer — iptables.
kube-proxy feeds the details of Services and Pods into iptables, which acts as a local firewall and routes traffic. Whenever a new Pod is launched, kube-proxy updates the iptables rules to make sure the Pod is reachable within the cluster.
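You can inspect what kube-proxy has programmed on a Node (this assumes the default iptables mode; the exact rules vary by cluster):

```bash
# The KUBE-SERVICES chain is the entry point for Service traffic;
# kube-proxy keeps it in sync with the Services and Endpoints in the API.
sudo iptables -t nat -L KUBE-SERVICES -n | head -20
```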

Service-LB

This is the external load balancer — an ELB in the case of AWS — that is exposed to the outside world.
The ELB’s target group contains the addresses of all the Nodes. It routes traffic to a Node, where the iptables rules (kept in sync across the cluster) determine which Node actually hosts the target Pod and redirect the traffic there.

Pods

A Pod is the atom of Kubernetes — the smallest deployable object for building applications. A single Pod represents a running workload in your cluster and encapsulates one or more Docker containers, any required storage, and a unique IP address. Containers that make up a pod are designed to be **co-located and scheduled on the same machine.**

  • Green boxes are the Pods; within them, the orange boxes are the containers.
  • Containers within a Pod can communicate with each other over the Pod’s local network (localhost:PORT) — see the sketch after this list.
  • Pods within a cluster communicate over the network and use service discovery (discussed later).
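A minimal sketch of the first two points: a Pod with two containers, where the sidecar reaches the nginx container over localhost. The names and image tags below are illustrative:

```bash
# Two containers in one Pod share a network namespace, so the sidecar can
# reach nginx at localhost:80.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: localhost-demo
spec:
  containers:
  - name: web
    image: nginx:1.25
  - name: sidecar
    image: curlimages/curl:8.5.0
    command: ["sleep", "infinity"]
EOF

# From the sidecar, the web container is reachable on localhost:
kubectl exec localhost-demo -c sidecar -- curl -s http://localhost:80 | head
```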

Nodes

Nodes are the machines running the Kubernetes cluster. These can be bare metal, virtual machines, or anything else. The word hosts is often used interchangeably with Nodes. I will try to use the term Nodes consistently but will sometimes use the word Virtual Machine to refer to Nodes, depending on context.

The Kubernetes Networking Model

Kubernetes makes opinionated choices about how containers are networked. In particular, Kubernetes dictates the following requirements on any networking implementation:

All Pods can communicate with all other Pods without using network address translation (NAT).

All Nodes can communicate with all Pods without NAT.

The IP that a Pod sees itself as is the same IP that others see it as.

Given these constraints, we are left with four distinct networking problems to solve:

1. Container-to-Container networking

2. Pod-to-Pod networking

3. Pod-to-Service networking

4. Internet-to-Service networking

1. Container-to-Container networking

In Linux, each running process communicates within a network namespace that provides a logical networking stack with its own routes, firewall rules, and network devices. In essence, a network namespace provides a brand new network stack for all the processes within the namespace. By default, Linux assigns every process to the root network namespace to provide access to the external world.

NOTE: The term ‘namespace’ here is used in the networking context and has no resemblance to the Kubernetes namespace abstraction on the application side.
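A quick way to see a network namespace in action on any Linux box (requires root; the namespace name is arbitrary):

```bash
# Create a fresh network namespace and look at its (empty) networking stack.
sudo ip netns add demo
sudo ip netns exec demo ip addr    # only a down loopback device
sudo ip netns exec demo ip route   # no routes yet
sudo ip netns del demo
```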

Fig 1: container to container communication on the same node

In terms of Docker constructs, a Pod is modeled as a group of Docker containers that share a network namespace. Containers within a Pod all have the same IP address and port space assigned through the network namespace assigned to the Pod, and can find each other via localhost since they reside in the same namespace. We can create a network namespace for each Pod on a virtual machine. This is implemented, using Docker, as a “Pod container” (the pause container) which holds the network namespace open while “app containers” (the things the user specified) join that namespace with Docker’s **--net=container:** function.
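A rough sketch of the same trick with plain Docker, assuming a local Docker daemon (the pause image tag and container names are illustrative):

```bash
# The "pause" container does nothing except keep the network namespace alive.
docker run -d --name pod-sandbox registry.k8s.io/pause:3.9

# The "app" container joins the sandbox's network namespace instead of
# getting its own, mirroring how containers inside a Pod share one namespace.
docker run -d --name app --net=container:pod-sandbox nginx:1.25

# Verify that "app" has no network of its own -- it runs inside the sandbox's.
docker inspect -f '{{.HostConfig.NetworkMode}}' app
```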

The diagram shows each Pod is made up of multiple Docker containers (ctr*) within a shared namespace. Applications within a Pod also have access to shared volumes, which are defined as part of a Pod and are made available to be mounted into each application’s filesystem.

2. Pod-to-Pod Networking

In Kubernetes, every Pod has a real IP address and each Pod communicates with other Pods using that IP address. The task at hand is to understand how Kubernetes enables Pod-to-Pod communication using real IPs, whether the Pod is deployed on the same physical Node or different Node in the cluster.

2.1 Pods that reside on the same Nodes

We start this discussion by considering Pods that reside on the same machine to avoid the complications of going over the internal network to communicate across Nodes.

Fig 2: Pod to Pod communication on the same machine

From the Pod’s perspective, it exists in its own network namespace that needs to communicate with other network namespaces on the same Node. Thankfully, namespaces can be connected using a Linux Virtual Ethernet Device or veth pair consisting of two virtual interfaces that can be spread over multiple namespaces. To connect Pod namespaces, we can assign one side of the veth pair to the root network namespace and the other side to the Pod’s network namespace. Each veth pair works like a patch cable, connecting the two sides and allowing traffic to flow between them. This setup can be replicated for as many Pods as we have on the machine. The figure above shows veth pairs connecting each Pod on a VM to the root namespace.

At this point, we’ve set up the Pods to each have their own network namespace so that they believe they have their own **virtual Ethernet device (eth0) and IP address**, and they are connected to the root namespace for the Node. Now, we want the Pods to talk to each other through the root namespace, and for this we use a **network bridge.**

**A Linux Ethernet bridge is a virtual Layer 2 networking device (cbr0)** used to unite two or more network segments, working transparently to connect two networks together. The bridge operates by maintaining a forwarding table between sources and destinations: it examines the destination of the data packets that travel through it and decides whether or not to pass the packets to other network segments connected to the bridge. The bridging code decides whether to bridge data or to drop it by looking at the MAC address unique to each Ethernet device in the network.
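A hand-rolled sketch of this wiring using iproute2 (names and addresses are arbitrary, and a real CNI plugin does considerably more):

```bash
# A stand-in "Pod" network namespace.
sudo ip netns add pod1

# A veth pair: veth1 stays in the root namespace, veth1-pod moves into the Pod.
sudo ip link add veth1 type veth peer name veth1-pod
sudo ip link set veth1-pod netns pod1

# The bridge that all root-side veth ends plug into.
sudo ip link add cbr0 type bridge
sudo ip link set veth1 master cbr0
sudo ip link set cbr0 up
sudo ip link set veth1 up

# Give the "Pod" an address and bring its side of the pair up.
sudo ip -n pod1 addr add 10.244.1.2/24 dev veth1-pod
sudo ip -n pod1 link set veth1-pod up
sudo ip -n pod1 link set lo up
```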

Fig 3: Life of packet in Pod to Pod communication on the same machine

**Bridges implement the ARP protocol** to discover the link-layer MAC address associated with a given IP address. When a data frame is received at the bridge, the bridge broadcasts the frame out to all connected devices (except the original sender), and the device that responds to the frame is stored in a lookup table. Future traffic to the same IP address uses the lookup table to discover the correct MAC address to forward the packet to.
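You can inspect both tables involved — the bridge’s learned MAC table and a namespace’s ARP cache — continuing the sketch above:

```bash
# MAC addresses the bridge has learned, per attached port.
bridge fdb show br cbr0

# IP-to-MAC mappings the "Pod" has resolved via ARP.
sudo ip -n pod1 neigh show
```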

Summary:

Given the network namespaces that isolate each Pod to their **own networking stack**, **virtual Ethernet devices** that connect each namespace to the root namespace, and a **bridge** that connects namespaces together, we are finally ready to send traffic between Pods on the same Node. A Pod always sends a packet to its own Ethernet device **eth0**, which is available as the default device for the Pod. eth0 is connected via a virtual Ethernet device in the root namespace, **veth0**, which forwards the packet to the bridge **cbr0**. Once the packet reaches the bridge, the bridge resolves the correct network segment to send the packet to — **veth1** — using the ARP protocol. **veth1** forwards the packet directly to Pod 2’s namespace and the eth0 device within that namespace. Throughout this traffic flow, each Pod is communicating only with eth0 on localhost and the traffic is routed to the correct Pod. This complies with our initial requirement:

Kubernetes’ networking model dictates that Pods must be reachable by their IP address across Nodes. That is, the IP address of a Pod is always visible to other Pods in the network, and each Pod views its own IP address as the same as how other Pods see it.

2.2 Pod-to-Pod, across Nodes

The Kubernetes networking model requires that Pod IPs are reachable across the network. In practice this is network specific. Generally, every Node in your cluster is assigned a CIDR block specifying the IP addresses available to Pods running on that Node. Once traffic destined for the CIDR block reaches the Node it is the Node’s responsibility to forward traffic to the correct Pod.
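You can inspect the per-Node Pod CIDR allocation directly. Note that some CNI plugins (the AWS VPC CNI, for example) allocate Pod IPs out of the VPC instead, so `podCIDR` may be empty:

```bash
# Print each Node's name together with the Pod CIDR assigned to it.
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR
```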

Let’s take AWS EKS as an example.
Amazon maintains a container networking plugin for Kubernetes that allows Node-to-Node networking to operate within an Amazon VPC environment using a Container Networking Interface (CNI) plugin. The Container Networking Interface (CNI) provides a common API for connecting containers to the outside network. As developers, we want to know that a Pod can communicate with the network using IP addresses, and we want the mechanism for this action to be transparent. The CNI plugin developed by AWS tries to meet these needs while providing a secure and manageable environment through the existing VPC, IAM, and Security Group functionality provided by AWS. The solution to this is elastic network interfaces.

In EC2, each instance is bound to an **elastic network interface (ENI)** and all ENIs are connected within a VPC — ENIs are able to reach each other without additional effort. By default, each EC2 instance is deployed with a single ENI, but you are free to create multiple ENIs and deploy them to EC2 instances as you see fit. The **AWS CNI plugin** for Kubernetes leverages this flexibility by creating a new ENI for each Pod deployed to a Node. Because ENIs within a VPC are already connected within the existing AWS infrastructure, this allows each Pod’s IP address to be natively addressable within the VPC. When the CNI plugin is deployed to the cluster, each Node (EC2 instance) creates multiple elastic network interfaces and allocates IP addresses to those interfaces, forming a **pool of addresses (effectively a CIDR block) for each Node**. When Pods are deployed, a small binary deployed to the Kubernetes cluster as a DaemonSet receives a request to add a Pod to the network from the Node’s local **kubelet** process. This binary picks an available IP address from the Node’s available pool of ENIs and assigns it to the Pod by wiring up the virtual Ethernet device and bridge within the Linux kernel, as described when networking Pods within the same Node. With this in place, Pod traffic is routable across Nodes within the cluster.
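On an EKS cluster you can see the pieces involved — the CNI DaemonSet and the ENIs with their private IPs attached to a worker instance (the instance ID below is a placeholder):

```bash
# The AWS VPC CNI runs as a DaemonSet called "aws-node" on every Node.
kubectl -n kube-system get daemonset aws-node

# ENIs and the private IP addresses attached to one of the worker instances.
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress'
```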

3. Pod-to-Service Networking

We’ve shown how to route traffic between Pods and their associated IP addresses. This works great until we need to deal with change. Pod IP addresses are not durable and will appear and disappear in response to scaling up or down, application crashes, or Node reboots. Each of these events can make the Pod IP address change without warning. Services were built into Kubernetes to address this problem.

A **Kubernetes Service** manages the state of a set of Pods, allowing you to track a set of Pod IP addresses that are dynamically changing over time. Services act as an abstraction over Pods and assign a single virtual IP address called the **Cluster IP** to a group of Pod IP addresses. Any traffic addressed to the Cluster IP of the Service will be routed to the set of Pods that are associated with the Cluster IP. This allows the set of Pods associated with a Service to change at any time — clients only need to know the Service’s Cluster IP, which does not change.

When creating a new Kubernetes Service, a new Cluster IP is created on your behalf. Anywhere within the cluster, traffic addressed to the Cluster IP will be load-balanced to the set of backing Pods associated with the Service. In effect, Kubernetes automatically creates and maintains a distributed in-cluster load balancer that distributes traffic to a Service’s associated healthy Pods. Let’s take a closer look at how this works.
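A minimal Service sketch; the names, selector, and ports are placeholders for whatever your Pods actually use:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app        # matches Pods labeled app=my-app
  ports:
  - port: 80           # the Cluster IP listens here
    targetPort: 8080   # traffic is forwarded to this port on the Pods
EOF

# The allocated Cluster IP is stable for the life of the Service.
kubectl get service my-service -o wide
```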

Netfilter and iptables:

To perform load balancing within the cluster, Kubernetes relies on the networking framework built in to Linux — **netfilter.**

Netfilter is a framework provided by Linux that enables various networking-related operations — packet filtering, network address translation, and port translation — which provide the functionality required for directing packets through a network. In a broad sense, it fits the definition of SDN (Software Defined Networking).

**iptables** is a user-space program providing a table-based system for defining rules for manipulating and transforming packets using the netfilter framework. In Kubernetes, iptables rules are configured by the **kube-proxy controller** that watches the Kubernetes API server for changes. When a change to a Service or Pod updates the Cluster IP address of the Service or the IP address of a Pod, iptables rules are updated to correctly route traffic directed at a Service to a backing Pod. The iptables rules watch for traffic destined for a Service’s virtual IP and, on a match, a **random Pod IP** address is selected from the set of available Pods and the iptables rule changes the packet’s destination IP address from the Service’s Cluster IP to the IP of the selected Pod. As Pods come up or go down, the iptables ruleset is updated to reflect the changing state of the cluster. **Put another way, iptables has done load-balancing on the machine to take traffic directed to a Service’s IP to an actual Pod’s IP.** On the return path, the source IP of the packet is the destination Pod’s IP. In this case iptables again rewrites the IP header to replace the Pod IP with the Service’s Cluster IP, so that the **Pod believes it has been communicating solely with the Service’s Cluster IP the entire time.**
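You can trace these rules on any Node. Each Service gets its own chain whose name contains a hash, so the suffix below is illustrative; the chain name can be found by searching the KUBE-SERVICES chain shown earlier:

```bash
# One rule per backing Pod, chosen via the "statistic" module for random
# load balancing, each ending in a DNAT to that Pod's IP address.
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n
```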

Fig 4: Life of a packet, Pod to Service.

When routing a packet between a Pod and Service, the journey begins in the same way as before. The packet first leaves the Pod through the eth0 interface attached to the Pod’s network namespace (1). Then it travels through the virtual Ethernet device to the bridge (2). The ARP protocol running on the bridge does not know about the Service and so it transfers the packet out through the default route — eth0 (3). Here, something different happens. Before being accepted at eth0, the packet is filtered through iptables. After receiving the packet, iptables uses the rules installed on the Node by kube-proxy in response to Service or Pod events to rewrite the destination of the packet from the Service IP to a specific Pod IP (4). The packet is now destined to reach Pod 4 rather than the Service’s virtual IP. In essence, iptables has done in-cluster load balancing directly on the Node. Traffic then flows to the Pod using the Pod-to-Pod routing we’ve already examined.

Fig 5: Life of a packet, Service to pod.

The Pod that receives this packet will respond, identifying the source IP as its own and the destination IP as the Pod that originally sent the packet (1). Upon entry into the Node, the packet flows through iptables, which remembers the choice it previously made and rewrites the source of the packet to be the Service’s IP instead of the Pod’s IP (2). From here, the packet flows through the bridge to the virtual Ethernet device paired with the Pod’s namespace (3), and on to the Pod’s Ethernet device as we’ve seen before (4).
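The “memory” here lives in the kernel’s connection-tracking table, which you can inspect with the conntrack tool (the Cluster IP below is a placeholder):

```bash
# Show tracked connections whose original destination was the Service IP;
# the reply direction reveals the Pod IP that was actually chosen.
sudo conntrack -L -d 10.96.50.1
```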

KubeDNS and CoreDNS:

Kubernetes can optionally use DNS to avoid having to hard-code a Service’s cluster IP address into your application. Kubernetes DNS runs as a regular Kubernetes Service that is scheduled on the cluster. It configures the kubelets running on each Node so that containers use the DNS Service’s IP to resolve DNS names. Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. DNS records resolve DNS names to the cluster IP of the Service or the IP of a Pod, depending on your needs.
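You can see this resolution from inside the cluster with a throwaway Pod; the Service name and namespace below are placeholders:

```bash
# Resolve a Service name from within the cluster; the answer is its Cluster IP.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.default.svc.cluster.local
```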

A DNS Pod consists of three separate containers:

- **kubedns:** watches the Kubernetes master for changes in Services and Endpoints, and maintains in-memory lookup structures to serve DNS requests.

- **dnsmasq:** adds DNS caching to improve performance.

- **sidecar:** provides a single health check endpoint to perform healthchecks for dnsmasq and kubedns.

The DNS Pod itself is exposed as a Kubernetes Service with a static cluster IP that is passed to each running container at startup so that each container can resolve DNS entries. DNS entries are resolved through the kubedns system that maintains in-memory DNS representations. etcd is the backend storage system for cluster state, and kubedns uses a library that converts etcd key-value stores to DNS entries to rebuild the state of the in-memory DNS lookup structure when necessary.

CoreDNS works similarly to kubedns but is built with a plugin architecture that makes it more flexible. As of Kubernetes 1.11, CoreDNS is the default DNS implementation for Kubernetes.

4. Internet-to-Service Networking

So far we have looked at how traffic is routed within a Kubernetes cluster. This is all fine and good, but unfortunately your hello-world application isn’t going to say hello to anyone outside the cluster — at some point you will want to expose your Service to external traffic. This need highlights two related concerns: (1) getting traffic from a Kubernetes Service out to the Internet, and (2) getting traffic from the Internet to your Kubernetes Service. This section deals with each of these concerns in turn.

4.1 Egress — Routing traffic to the Internet

In AWS, a Kubernetes cluster runs within a VPC, where every Node is assigned a private IP address that is accessible from within the Kubernetes cluster. To make traffic accessible from outside the cluster, you attach an Internet gateway to your VPC. The Internet gateway serves two purposes: providing a target in your VPC route tables for traffic that can be routed to the Internet, and performing network address translation (NAT) for any instances that have been assigned public IP addresses. The NAT translation is responsible for changing the Node’s internal IP address that is private to the cluster to an external IP address that is available in the public Internet.

With an Internet gateway in place, VMs are free to route traffic to the Internet. Unfortunately, there is a small problem. Pods have their own IP address that is not the same as the IP address of the Node that hosts the Pod, and the NAT translation at the Internet gateway only works with VM IP addresses because it does not have any knowledge about what Pods are running on which VMs — the gateway is not container aware. Let’s look at how Kubernetes solves this problem using iptables (again).

4.1.1 Life of a packet: Node to Internet

Fig 6: Life of a packet, Node to Internet

The packet originates at the Pod’s namespace (1) and travels through the veth pair connected to the root namespace (2). Once in the root namespace, the packet moves from the bridge to the default device since the IP on the packet does not match any network segment connected to the bridge. Before reaching the root namespace’s Ethernet device, the packet is mangled by iptables (3). In this case, the source IP address of the packet is a Pod, and if we keep the source as a Pod the Internet gateway will reject it because the gateway NAT only understands IP addresses that are connected to VMs. The solution is to have iptables perform a source NAT — changing the packet source — so that the packet appears to be coming from the VM and not the Pod. With the correct source IP in place, the packet can now leave the VM (4) and reach the Internet gateway (5). The Internet gateway will do another NAT, rewriting the source IP from a VM-internal IP to an external IP. Finally, the packet will reach the public Internet (6). On the way back, the packet follows the same path and any source IP mangling is undone so that each layer of the system receives the IP address that it understands: VM-internal at the Node or VM level, and a Pod IP within a Pod’s namespace.
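The source NAT in step (3) is an ordinary rule in the nat table’s POSTROUTING chain; you can look for it on a Node. Chain layout differs between CNI plugins (the AWS VPC CNI, for instance, keeps its SNAT rules in its own chains), so the exact output varies:

```bash
# Rules that rewrite the source IP of Pod traffic leaving the Node.
sudo iptables -t nat -L POSTROUTING -n -v | grep -iE 'masquerade|snat'
```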

4.2 Ingress — Routing Internet traffic to Kubernetes

Ingress is divided into two solutions that work on different parts of the network stack: (1) a Service LoadBalancer and (2) an Ingress controller.

4.2.1 Layer 4 Ingress: LoadBalancer

When you create a Kubernetes Service (such as one for ingress-nginx) you can optionally specify a LoadBalancer to go with it. The implementation of the LoadBalancer is provided by a cloud controller that knows how to create a load balancer for your service. Once your Service is created, it will advertise the IP address for the load balancer. As an end user, you can start directing traffic to the load balancer to begin communicating with your Service.
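A sketch of a LoadBalancer Service; the names and selector are placeholders, and on AWS this provisions an ELB behind the scenes:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service-public
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
EOF

# Once the cloud controller has provisioned the load balancer, its address
# appears under EXTERNAL-IP.
kubectl get service my-service-public
```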

With AWS, load balancers are aware of Nodes within their Target Group and will balance traffic throughout all of the Nodes in the cluster. Once traffic reaches a Node, the iptables rules previously installed throughout the cluster for your Service will ensure that traffic reaches the Pods for the Service you are interested in.

4.2.2 Life of a packet: LoadBalancer to Service

Fig 7: Life of a packet: Internet to service

The above diagram shows a network load balancer in front of three VMs that host your Pods. Incoming traffic (1) is directed at the load balancer for your Service. Once the load balancer receives the packet (2) it picks a VM at random. In this case, we’ve pathologically chosen the VM with no Pod running: VM 2 (3). Here, the iptables rules running on the VM will direct the packet to the correct Pod using the internal load-balancing rules installed into the cluster by kube-proxy. iptables does the correct NAT and forwards the packet on to the correct Pod (4).

4.2.3 Layer 7 Ingress: Ingress Controller

Layer 7 network Ingress operates on the HTTP/HTTPS protocol range of the network stack and is built on top of Services. The first step to understanding Ingress is to understand a port on your Service called the **NodePort** . If you set the Service’s type field to NodePort, the Kubernetes master will allocate a port from a range you specify, and each Node will proxy that port (the same port number on every Node) into your Service. That is, any traffic directed to the Node’s port will be forwarded on to the service using iptables rules. This Service to Pod routing follows the same internal cluster load-balancing pattern we’ve already discussed when routing traffic from Services to Pods.
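A quick way to see a NodePort in action, assuming the Service’s type is NodePort or LoadBalancer (the Service name, Node address, and port below are placeholders):

```bash
# The node port allocated for the Service (default range is 30000-32767).
kubectl get service my-service -o jsonpath='{.spec.ports[0].nodePort}'

# Any Node proxies that port to the Service's Pods.
NODE_IP=203.0.113.10   # placeholder: any Node's address
NODE_PORT=30080        # placeholder: the allocated node port
curl "http://${NODE_IP}:${NODE_PORT}/"
```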

**To expose a Node’s port to the Internet you use an Ingress object.** An Ingress is a higher-level HTTP load balancer that maps HTTP requests to Kubernetes Services. HTTP load balancers, like Layer 4 network load balancers, only understand Node IPs (not Pod IPs) so traffic routing similarly leverages the internal load-balancing provided by the **iptables rules installed on each Node by kube-proxy.**

The life of a packet flowing through an Ingress is very similar to that of a LoadBalancer. The key differences are that an Ingress is aware of the URL’s path (allowing it to route traffic to Services based on their path), and that the initial connection between the Ingress and the Node is through the port exposed on the Node for each service.

Let’s look at how this works in practice. Once you deploy your Ingress, a new Ingress load balancer will be created for you by the cloud provider you are working with (1). Because the load balancer is not container aware, once traffic reaches the load balancer it is distributed throughout the VMs that make up your cluster (2) through the advertised port for your service. iptables rules on each VM will direct incoming traffic from the load balancer to the correct Pod (3) — as we have seen before. The response from the Pod to the client will return with the Pod’s IP, but the client needs to see the load balancer’s IP address. iptables and conntrack are used to rewrite the IPs correctly on the return path, as we saw earlier. One benefit of Layer 7 load balancers is that they are HTTP aware, so they know about URLs and paths. This allows you to segment your Service traffic by URL path.
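A sketch of path-based routing with an Ingress; the Service names and the ingress class are placeholders for whatever controller you run:

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /api          # requests under /api go to the API Service
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /web          # requests under /web go to the web Service
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
EOF
```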
