In an age of zero-trust security, enterprises are looking to secure individual virtual machines (VMs) in their on-premises data centres, cloud or hybrid environments to prevent increasingly sophisticated attacks. The problem is that firewalling individual VMs with tools such as software appliance firewalls or connection tracking (Conntrack) is operationally hard to manage. It also delivers poor performance, restricts VM mobility and consumes many CPU cycles on servers, leaving less capacity for processing applications and workloads.
As the need for VM security grows, IT managers end up spending on more and more servers, most of which are tied up with security processing rather than application processing. In this article, we will look at zero-trust security and how best to implement it in data centres.
About zero-trust security
Forrester Research first introduced the zero-trust model for cyber security in 2010, and subsequently detailed it in its 2013 paper for NIST, “Developing a Framework to Improve Critical Infrastructure Cybersecurity.” In this model, all network traffic is untrusted, whether it comes from within the enterprise network or from outside it. Before this model there was the concept of a trusted network (usually the data centre network or enterprise LAN) and an untrusted network (essentially everything outside the data centre or enterprise LAN). Typically, the trust was enforced by a perimeter security mechanism (Figure 1a).
Zero-trust advocated that (a) all resources be accessed securely irrespective of location, (b) least-privilege and role-based access be adopted and enforced, and (c) all traffic be inspected and logged. In traditional enterprise networks, these principles were implemented primarily by two mechanisms:
- Segmentation – mostly network segmentation using VLANs; however, VLANs only provide segmentation, not security (see the sketch after this list)
- Perimeter security at the edge of the segments
This is depicted in Figure 1b.
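To make the segmentation half of this model concrete, here is a minimal sketch of carving a Linux host’s interface into VLAN-based segments. The interface name eth0, the VLAN IDs and the addresses are all hypothetical; note that any actual security policy between segments still has to be enforced at a perimeter firewall or router.

```
# Hypothetical example: split eth0 into two VLAN segments (IDs 10 and 20).
# VLANs isolate broadcast domains but apply no security policy by themselves.
ip link add link eth0 name eth0.10 type vlan id 10
ip link add link eth0 name eth0.20 type vlan id 20
ip addr add 10.0.10.1/24 dev eth0.10
ip addr add 10.0.20.1/24 dev eth0.20
ip link set eth0.10 up
ip link set eth0.20 up
# Traffic crossing segments must be routed through a perimeter firewall,
# which is where the actual security policy lives.
```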
Zero-trust in data centres
Large-scale data centres deploy a wide variety of services. A single user request can spawn many services within a data centre, generating both east-west traffic within the data centre and north-south traffic between the data centre and the Internet. For example, consider ordering something on Amazon: a front-end web server shows the product, but separate services are then required to accept and validate credit card information, issue a confirmation and send a fulfilment request. This means the zero-trust model must be applied within the data centre as well.
There are three reasons why the appliance-based zero-trust model shown in Figure 1b cannot simply be deployed inside data centres.
First, it is operationally extremely cumbersome. The traffic from each server has to be backhauled to a security appliance, and every appliance must be configured correctly. This leads to manual errors, and to the ongoing operational challenge of keeping the appliances up to date as service requirements or service deployments change.
Second, it does not scale well and delivers inferior performance. Most security appliances today can handle traffic on the order of 200Gb/s. As servers are upgraded to, and start saturating, 10Gb/s and faster network interfaces, a new security appliance must be deployed and provisioned for every 10-20 servers; in practice, a pair of appliances is needed for redundancy. And because the security appliances become choke points, the performance of the services suffers as well.
Third, this creates silos within the data centre, making it hard to fully utilise the data centre infrastructure.
Zero-trust in virtualised or cloud-scale data centres
The challenges of appliance-based zero-trust security are amplified in a virtualised data centre as the number of VMs per server increases. Securing VMs poses an additional operational challenge: a VM can be shut down and brought back up on a different server (sometimes in a different data centre), or even be live-migrated. The policies associated with a VM must therefore move with it; otherwise, every policy has to be programmed on every security appliance.
As a result, we have to think of a different deployment mechanism for zero-trust in data centres and, in particular, virtualised data centres. Security can be distributed to each server in three ways: by running virtual security appliances alongside the VMs, by implementing security at the host/hypervisor level using Linux iptables, or at the vSwitch level using Open vSwitch (OVS) Conntrack.
Distributed security using virtual appliances: This method (Figure 2a) presents the same scalability, performance and operational problems as the standard security appliance model. The virtual security appliance becomes the bottleneck, and managing policies one appliance at a time is difficult. When VMs move, moving the policies with them is extremely challenging. In addition, the virtual security appliance now consumes valuable server resources (CPU, memory and disk space) that should be used to run VMs and deliver revenue-generating services.
Distributed security using Linux Bridge and iptables: This method (Figure 2b) solves some of the scale challenges because iptables is available on every Linux host. However, adding another layer of bridging between OVS and the VMs hurts performance badly. Programming the taps, and then the policies for each Linux bridge, is also a massive operational burden, and VM live migration or movement remains extremely challenging because the bridges, taps and policies all have to be programmed manually. A minimal sketch of this per-tap filtering appears below.
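To make that burden concrete, here is a rough sketch of per-VM filtering in this model, assuming a VM attached to a Linux bridge through a hypothetical tap device tap0 and a hypothetical HTTPS-only policy. Every rule below has to be repeated, per tap, on every host, and recreated whenever the VM moves.

```
# Assumption: the VM's vNIC is plugged into a Linux bridge via tap0.
# Make bridged traffic traverse iptables (requires the br_netfilter module).
modprobe br_netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=1

# Track connections: allow replies to established flows into the VM.
iptables -A FORWARD -m physdev --physdev-out tap0 \
         -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Permit new inbound HTTPS connections only (hypothetical policy).
iptables -A FORWARD -m physdev --physdev-out tap0 -p tcp --dport 443 -j ACCEPT
# Default-deny everything else destined for this VM.
iptables -A FORWARD -m physdev --physdev-out tap0 -j DROP
```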
Distributed security using OVS Conntrack: The basic solution to the operational challenges is to add OVS Conntrack to OVS networking (Figure 2c). OVS has well-defined APIs for integrating with data centre management stacks such as OpenStack; for example, OpenStack security groups are mapped to OVS Conntrack rules. This significantly reduces the operational complexity of deploying distributed security. It also removes the additional bridging layer, giving somewhat better performance than Linux iptables. However, this approach still does not address performance and scale: running OVS with Conntrack in software results in very high CPU usage for that function alone. A sketch of stateful OVS Conntrack rules follows below.
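The following is a minimal sketch of stateful filtering with OVS Conntrack, loosely following the upstream OVS conntrack tutorial. The bridge name br0 and the port numbers are hypothetical; in practice, an orchestration layer such as OpenStack generates rules like these from security groups rather than an operator typing them by hand.

```
# Hypothetical bridge br0; VM on OpenFlow port 1, uplink on port 2.
# Send untracked TCP packets through the connection tracker first.
ovs-ofctl add-flow br0 "table=0,priority=50,ct_state=-trk,tcp,actions=ct(table=0)"
# Commit new connections initiated by the VM and forward them upstream.
ovs-ofctl add-flow br0 "table=0,priority=50,ct_state=+trk+new,tcp,in_port=1,actions=ct(commit),2"
# Allow packets belonging to established connections, in both directions.
ovs-ofctl add-flow br0 "table=0,priority=50,ct_state=+trk+est,tcp,in_port=1,actions=2"
ovs-ofctl add-flow br0 "table=0,priority=50,ct_state=+trk+est,tcp,in_port=2,actions=1"
# Drop unsolicited new connections arriving from the uplink.
ovs-ofctl add-flow br0 "table=0,priority=50,ct_state=+trk+new,tcp,in_port=2,actions=drop"
```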
To address these performance and scale issues, data centre operators must find a way to offload OVS and Conntrack from the CPU cores. This allows them to provide a very high-performance distributed firewall on each server, close to the VMs, that can enforce fine-grained, per-service policies while setting up and tracking a very large number of connections.
Offloading OVS Conntrack with a SmartNIC
The most efficient way to offload OVS and Conntrack is to use a SmartNIC and appropriate software. A SmartNIC is a network interface card that incorporates a programmable network processor capable of running application software. By running the Conntrack software on the SmartNIC’s processor, this chore is offloaded from the server’s CPU cores.
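As a rough illustration of how this looks on a Linux host, the sketch below enables OVS hardware offload through the kernel’s TC interface and then checks which datapath flows have actually been pushed to the NIC. It assumes a SmartNIC whose driver supports TC flower offload and a hypothetical uplink interface named eth0; the exact mechanism and service name vary by vendor and distribution.

```
# Assumption: the SmartNIC driver supports TC flower hardware offload.
ethtool -K eth0 hw-tc-offload on

# Tell OVS to push datapath flows (including conntrack state) to hardware.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch-switch   # service name differs per distro

# Verify: flows handled by the NIC are reported as offloaded.
ovs-appctl dpctl/dump-flows type=offloaded
```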
Offloading OVS Conntrack from the server CPU cores leads to far higher performance and scalability. Figure 3 (above) compares some representative performance metrics for the server CPU-based and SmartNIC-based implementations.
As can be seen in Figure 3, the SmartNIC-based implementation delivers 4X the performance of the software-only, CPU-based implementation, while consuming less than 3 percent of the CPU even with a large number of flows.
Current software-only, CPU-based Conntrack implementations start consuming more than 40 percent of the CPU at 100-500 unique flows, and can reach 51 percent CPU utilisation on a modern 48-core server. Clearly, using more than half a server to provide security is not feasible when the central function of the server is to host VMs running the applications and services that generate revenue.
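For readers who want to observe this kind of load themselves, here is a rough sketch of where to look on a host running software OVS. The figures quoted above come from the vendor’s testing; these commands merely expose the tracked-connection counts and per-thread CPU usage, and assume the standard conntrack and sysstat tooling is installed.

```
# Count connections currently tracked by the OVS datapath conntrack table.
ovs-appctl dpctl/dump-conntrack | wc -l

# For the kernel conntrack used by iptables, check the global counter.
cat /proc/sys/net/netfilter/nf_conntrack_count

# Per-thread CPU usage of the OVS userspace daemon, sampled every second.
pidstat -t -p $(pidof ovs-vswitchd) 1
```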
Essentially, offloading OVS and Conntrack to a SmartNIC makes per-VM or per-container security feasible by removing the server usage penalty and expense, solving the scalability and performance issues, and leaving the server’s CPU cores free to handle application traffic as intended.