[MUSIC] In this lesson, we will discuss networking in the context of server virtualization: how do we network the large number of VMs that might reside on a single physical server? Cloud computing depends heavily on server virtualization for several reasons. Virtual machines allow multiplexing of hardware, with tens to hundreds of VMs residing on the same physical server. They also allow rapid deployment of new services: spinning up a virtual machine might take only seconds, compared to deploying an app on physical hardware, which can take much longer. Further, if a workload requires migration, for example because a physical server requires maintenance, this can be done quite quickly with virtual machines, which can be migrated to other servers, in many instances without interrupting service. Due to these advantages, today more endpoints on the network are virtual than physical.

So let's look at how virtualization works. The physical hardware is managed by a hypervisor. This could be Xen, KVM, VMware's ESXi, or one of multiple other alternatives. On top of the hypervisor run several virtual machines in user space. The hypervisor provides an emulated view of the hardware to the virtual machines, which the virtual machines treat as their substrate to run a guest operating system on. Among other hardware resources, the network interface card is also virtualized in this manner: the hypervisor manages the physical NIC and exposes virtual NICs to the VMs. The physical NIC also connects the server to the rest of the network. So how are these VMs networked inside the hypervisor? The hypervisor runs a virtual switch. This can be a simple layer-2 switching device operating in software inside the hypervisor. It's connected to all the virtual NICs as well as the physical NIC, and it moves packets between the VMs and the external network.

Before we examine the details of this, it's worth noting that there are other ways of doing virtualization. One, using Docker, is becoming quite popular. In the model we just discussed, each virtual machine runs its own entire guest operating system, and the application runs as a process inside this guest OS. This means that even running a small application requires the overhead of running an entire guest operating system. An alternate approach is to use Linux containers for virtualization, as Docker does. In this setting, an application together with its dependencies is packaged into a Linux container, which runs using the host machine's Linux stack and any shared resources. Docker is simply a container manager for multiple such containers. Applications are isolated from each other by the use of separate namespaces: resources in one application cannot be addressed by other applications. This yields isolation quite similar to virtual machines, but with a smaller footprint. The container packages can be much smaller than the multiple gigabytes needed for a guest operating system, and they do not need to run any redundant guest OS processes. This enables much higher density: a larger number of applications can be supported on the same physical machine than in the VM scenario. Further, containers can be brought up much faster than virtual machines, in hundreds of milliseconds as opposed to seconds or even tens of seconds for virtual machines. Recent work comparing performance across these two approaches, one using containers and the other using virtual machines, shows that performance is quite similar with both.
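Before we turn to how these virtual endpoints are networked in more detail, here is a minimal sketch, in Python with hypothetical names, of the layer-2, MAC-learning forwarding that a software virtual switch, like the one inside the hypervisor described above, performs among the virtual NICs and the physical NIC. It illustrates the idea only; it is not any real hypervisor's switch.

```python
# Minimal sketch of the layer-2 forwarding a hypervisor's virtual switch performs.
# Names (VirtualSwitch, Port, send) are hypothetical, for illustration only.

class Port:
    """A switch port: either a VM's virtual NIC or the physical NIC (uplink)."""
    def __init__(self, name):
        self.name = name

    def send(self, frame):
        print(f"{self.name} <- frame {frame['src']} -> {frame['dst']}")

class VirtualSwitch:
    def __init__(self, ports):
        self.ports = ports          # all virtual NICs plus the physical NIC
        self.mac_table = {}         # learned source MAC address -> port

    def receive(self, frame, in_port):
        # Learn the source MAC so future traffic to it can be sent out one port.
        self.mac_table[frame['src']] = in_port
        out_port = self.mac_table.get(frame['dst'])
        if out_port is not None:
            out_port.send(frame)    # known destination: forward on that port only
        else:
            # Unknown destination: flood to every port except the one it came in on.
            for port in self.ports:
                if port is not in_port:
                    port.send(frame)

# Example: two VM vNICs and the physical uplink attached to one software switch.
vm1, vm2, phys = Port("vm1-vnic"), Port("vm2-vnic"), Port("physical-nic")
vswitch = VirtualSwitch([vm1, vm2, phys])
vswitch.receive({'src': 'aa:aa', 'dst': 'bb:bb'}, in_port=vm1)  # dst unknown: flooded
vswitch.receive({'src': 'bb:bb', 'dst': 'aa:aa'}, in_port=vm2)  # learned: unicast to vm1
```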
How does networking work with such virtualization using Docker? Each container is assigned a virtual interface. Docker creates a virtual Ethernet bridge connecting these multiple virtual interfaces and the physical NIC. Docker's configuration and environment variables decide what connectivity is provided: which containers can talk to each other, which can talk to the external network, and so on. External network connectivity is provided through a NAT, that is, a network address translator. However, multiple projects are working on extending this network functionality in various ways, and ultimately we can expect networking to look quite similar in this scenario to what it looks like for virtual machines. In our discussion here, we've ignored certain details about virtualization, for instance the distinction between type 1 and type 2 hypervisors, but the hypervisor model as described here will suffice for our discussion.

So let's look back at it. As we saw earlier, the hypervisor runs a virtual switch to network the VMs. But who's doing the work of moving all these packets? The CPU. Now, packet processing on CPUs can be quite flexible, because you can have general-purpose forwarding logic: packet filters on arbitrary fields, multiple packet filters if necessary, and so on. But if done naively, this can also be very CPU-expensive and slow. Let's see why. What does forwarding packets entail? At a 10 Gbps line rate with the smallest packets, that's 84 bytes, we only have an interval of 67 ns before the next packet comes in, on which we need to make a forwarding decision. Note that minimum-size Ethernet frames are 64 bytes, but together with the preamble, which tells the receiver that a packet is coming, and the necessary gap between packets, the envelope becomes 84 bytes. For context, a CPU-to-memory access takes tens of nanoseconds, so 67 nanoseconds is really quite small, and we're trying to accomplish a lot in this time. For one, we need time for packet I/O, that is, moving packets from the NIC buffers to the OS buffers, which requires CPU interrupts. Until recently, a single x86 core couldn't even saturate a 10 Gbps link, and this is without any switching required; this was just moving packets from the NIC to the OS. After significant engineering effort, packet I/O is now doable at these line rates. However, for a software switch we need more. If any of the switching logic is in userspace, one incurs the overhead of switching between userspace and kernel space. Further, for switching, we need to match packets against the rules in a forwarding table. All of this takes CPU time. Also keep in mind that forwarding packets is not the main goal for the CPU; the CPU is there to be doing useful computation.

Next, we'll discuss two starkly different approaches to addressing the problem of networking virtual machines: one using specialized hardware, and the other using an all-software approach. The main idea behind the hardware approach is that CPUs are not designed to forward packets, but the NIC is. The naive solution would be to just give the VMs direct access to the NIC, but then problems arise: how do you share the NIC's resources? How do you isolate the various virtual machines? SR-IOV, single-root I/O virtualization, provides one solution to this problem. The single-root part will not be essential to this discussion, and we'll ignore it. With SR-IOV, the physical NIC itself supports virtualization in hardware.
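As a quick aside, the 67 ns figure quoted above can be checked with some back-of-the-envelope arithmetic. This is a small illustrative Python snippet, not something from the lecture's own materials:

```python
# Per-packet time budget at 10 Gbps with minimum-size frames.
# A minimum-size Ethernet frame is 64 bytes, but on the wire it occupies
# 64 + 20 = 84 bytes once the 7-byte preamble, 1-byte start-of-frame
# delimiter, and 12-byte inter-packet gap are included.

LINK_RATE_BPS = 10e9          # 10 Gbps
WIRE_BYTES_PER_PACKET = 84    # minimum-size frame plus preamble/SFD and gap

bits_per_packet = WIRE_BYTES_PER_PACKET * 8
time_per_packet_ns = bits_per_packet / LINK_RATE_BPS * 1e9
packets_per_second = LINK_RATE_BPS / bits_per_packet

print(f"{time_per_packet_ns:.1f} ns per packet")    # ~67.2 ns
print(f"{packets_per_second / 1e6:.2f} Mpps")       # ~14.88 million packets/s
```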
So let's peek inside this SR-IOV-enabled network interface card. The NIC provides a physical function, which is just a standard Ethernet port. In addition, it also provides several virtual functions, which are simple queues with transmit and receive functionality. Each VM is mapped to one of these virtual functions, so the VMs themselves get NIC hardware resources. On the NIC there also resides a simple layer-2 switch, which classifies traffic into the queues corresponding to these virtual functions. Further, packets are moved directly from the NIC's virtual function to the corresponding VM's memory using DMA, that is, direct memory access. This allows us to bypass the hypervisor entirely: the hypervisor is only involved in the assignment of virtual functions to virtual machines and the management of the physical function, but not in the data path for packets. The upshot of this is higher throughput, lower latency, and lower CPU utilization, giving us close to native performance.

There are downsides to this approach, though. For one, live VM migration becomes trickier, because you've now tied the virtual machine to physical resources on that machine: the forwarding state for that virtual machine resides in the layer-2 switch inside the NIC. Second, forwarding is no longer as flexible. We're relying on a layer-2 switch that is built into the hardware of the NIC, so we cannot have general-purpose rules, and we cannot change this logic very often; it's built into the hardware. In contrast, software-defined networking allows a much more flexible forwarding approach.

Next, we look at a software-based approach that addresses these two criticisms. This discussion will be based on work by Pfaff et al. We only see a very broad overview here, but I highly recommend that you check out the entire paper in detail, because it will not only tell you the final design choices, but also why those choices were made. It's a very interesting paper that walks you through the design process. So let's look at the Open vSwitch design. Open vSwitch's design goals are flexible and fast forwarding. This necessitates a division between userspace and kernel-space tasks. One cannot work entirely in the kernel because of development difficulties: it's hard to push changes to kernel-level code, and it's desirable to keep the logic that resides in the kernel as simple as possible. So the "smarts" of this approach, that is, the switching decisions, lie in user space. This is where one decides what rules or filters apply to packets of a certain type, perhaps based on network updates from other, possibly virtual, switches in the network. This behavior can also be programmed using OpenFlow, which we'll cover later in our SDN section. So this part is optimized for processing network updates, and not necessarily for wire-speed packet forwarding. Packet forwarding, on the other hand, is handled largely in the kernel. Broadly, Open vSwitch's approach is to optimize the common case, as opposed to the worst-case line-rate requirements, and as we'll see, caching will be the answer to that need. Let's look at how packets flow through this switch architecture. The first packet of a flow goes to userspace, where several different packet classifiers may be consulted: some actions may be based on MAC addresses, while others might depend on TCP ports, and so on. The highest-priority matching action across these different classifiers will be used to forward the packet.
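To make this concrete, here is a minimal sketch of that userspace slow path: several classifiers are consulted and the highest-priority matching action wins. The rule and packet structures here are hypothetical simplifications for illustration, not Open vSwitch's actual data structures.

```python
# Minimal sketch of the userspace slow path: consult several classifiers and
# take the highest-priority matching action. Structures are hypothetical.

def matches(rule, packet):
    # A rule matches if every field it specifies equals the packet's value.
    return all(packet.get(field) == value for field, value in rule['match'].items())

def classify(classifiers, packet):
    best = None
    for table in classifiers:              # e.g. a MAC table, an ACL keyed on TCP ports, ...
        for rule in table:
            if matches(rule, packet) and (best is None or rule['priority'] > best['priority']):
                best = rule
    return best['action'] if best else 'drop'    # no rule matched: default action

mac_table = [{'match': {'dst_mac': 'bb:bb'}, 'priority': 10,  'action': 'output:vm2'}]
acl_table = [{'match': {'tcp_dst': 22},      'priority': 100, 'action': 'drop'}]

pkt = {'dst_mac': 'bb:bb', 'tcp_dst': 22}
print(classify([mac_table, acl_table], pkt))     # 'drop': the ACL rule has higher priority
```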
Once a packet is forwarded, a collapsed rule, the one used to forward that packet, is installed in the kernel. This is a simple classifier with no priorities. The following packets of this flow never enter userspace; they see only the kernel-level classifier. The problem, though, is that we're still running a packet classifier in the kernel in software. What this means is that for every packet that comes in, you're searching this table for the right matching entry and using that entry to forward the packet. This can be quite slow. The way Open vSwitch solves this problem is to add a simple hash-table-based cache in front of the classifier. So instead of looking through this entire table to find the right rule, you hash the fields used to match the packet, and the hash key now points to the action that needs to be taken. These hash keys and their actions can be cached. So for subsequent packets of a flow, you only look up the hash in the cache and use that to point to the table entry. You're no longer running an entire packet classifier, no longer searching through this entire table; you're just doing a constant-time hash-table lookup and using that to forward the packet (a small sketch of this idea appears at the end of this lesson). This works quite well for real workloads. The paper reports results from a large real deployment showing high cache hit rates, over 97%, and low CPU usage at the end hosts.

Let's look at one of the results in the paper in a bit more detail. The performance data was gathered from 24 hours of operation of a large multi-tenant data center. Each point on the scatter plot represents one of more than 1,000 hypervisors in this deployment. On the x-axis is the number of kernel misses per second, averaged over 24 hours; that is, the number of times a packet didn't hit the cache or the kernel classifier and had to go to the userspace tables. As the number of kernel misses increases, CPU load increases as well, but it's almost always below 20%, and in the vast majority of cases it is below 10%. This might not be clear from the scatter plot, but the data in the paper says over 80% of the hypervisors average 5% or less CPU load. A minor note here: the measured CPU load can sometimes exceed 100% due to multi-threading; some of the top-right points in the scatter plot demonstrate that. Interestingly, the authors found that all six instances of this happening in the top-right corner were due to previously unknown implementation bugs.

To summarize briefly, the hardware approach that SR-IOV takes sacrifices flexibility in forwarding logic for line-rate performance in all scenarios, virtually hitting native performance. With Open vSwitch, the compromise made is to avoid targeting worst-case performance and focus instead on forwarding flexibility. This brings to an end our brief discussion of virtualization in this part of the course, but we'll return to it when we cover software-defined networking, where we'll see a lot more on virtualization. [MUSIC]
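For reference, the caching fast path described in this lesson can be sketched as follows. This is a minimal illustration with hypothetical names; the real Open vSwitch kernel datapath and its caches are considerably more sophisticated.

```python
# Minimal sketch of the flow-cache idea: the first packet of a flow takes the
# slow path (a full classifier lookup), and the resulting action is cached in
# a hash table keyed by the packet's header fields. Subsequent packets of the
# same flow hit the cache and skip the classifier entirely. All names here are
# hypothetical illustrations, not Open vSwitch's real interfaces.

class FlowCachingSwitch:
    def __init__(self, slow_path_classify):
        # slow_path_classify: e.g. a function like classify() from the earlier sketch
        self.slow_path_classify = slow_path_classify
        self.flow_cache = {}                     # header-field tuple -> action

    def forward(self, packet):
        key = (packet.get('dst_mac'), packet.get('ip_dst'), packet.get('tcp_dst'))
        action = self.flow_cache.get(key)        # constant-time hash lookup
        if action is None:                       # cache miss: consult the classifiers
            action = self.slow_path_classify(packet)
            self.flow_cache[key] = action        # install the collapsed rule
        return action

# The first packet of a flow misses and populates the cache; later packets of
# the same flow are forwarded with a single dictionary lookup.
switch = FlowCachingSwitch(lambda pkt: 'output:vm2')
print(switch.forward({'dst_mac': 'bb:bb', 'ip_dst': '10.0.0.2', 'tcp_dst': 80}))  # slow path
print(switch.forward({'dst_mac': 'bb:bb', 'ip_dst': '10.0.0.2', 'tcp_dst': 80}))  # cache hit
```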