Introduction
Cloud computing is now the mainstream form of information technology service delivery. Its architecture mostly comprises computing and networking portions. The computing portions of infrastructure-as-a-service systems, a kind of cloud computing system, provide virtual machines (VMs) for cloud users. The networking portions connect VMs to each other and to cloud users. They are mostly implemented as software on servers instead of hardware-based network equipment. The performance required of cloud networks is increasing every year and is at present 100 Gbps or higher. 100 Gigabit Ethernet (GbE) is already used as the network interface of servers in cloud systems, and 400 GbE has started to be commercialized [1]. However, it is difficult to achieve such performance using software only. Thus, hardware offload solutions are attracting attention [2]. However, conventional offload solutions have not sufficiently discussed table structures and memory locations. To maintain compatibility with the software virtual router (vRouter), the table structures must retain their original configuration. Because vRouters have a variety of tables, high-performance random memory access is the key technology. Therefore, we propose a new accelerator architecture that uses high bandwidth memories (HBMs) [3] for high memory-access performance and an accelerator engine with a multi-pipeline architecture for high-performance packet processing. We prototyped an accelerator based on the proposed architecture as a peripheral component interconnect express (PCIe) card. In this accelerator, a field-programmable gate array (FPGA)-based accelerator engine and HBMs were fabricated as a multi-chip module. Using this accelerator card, we offloaded a vRouter in an OpenStack-based cloud system [4]. The offloaded vRouter achieved a packet processing performance of 320 Mpps and a throughput of 250 Gbps.
The remainder of this paper is organized as follows: Section II describes the background, including the cloud computing system architecture and why distributed vRouters are now used. Section III compares our study with related works. Section IV describes the proposed accelerator architecture. Section V describes the implementation of the prototyped accelerator. Section VI shows the measurement results. Finally, Section VII summarizes this study.
Background
A. Network Performance Required for Cloud Systems
An important function of the networks in cloud systems is the isolation of networks for each user or tenant. To create virtually separated networks, virtual extensible local area network (VXLAN), multi-protocol label switching (MPLS), and various other tunnel protocols are used [5], [6], [7].
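As a minimal illustration of such tunneling (a sketch only, not the data path of any particular cloud system), the following Python snippet builds the 8-byte VXLAN header defined in RFC 7348 and prepends it to an inner Ethernet frame; the VNI value and the dummy inner frame are arbitrary placeholders, and the outer UDP/IP/Ethernet headers of a real VXLAN packet are omitted.

```python
import struct

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348) to an inner Ethernet frame."""
    flags = 0x08                                         # I flag: VNI field is valid
    header = struct.pack("!I", flags << 24)              # flags (8 bits) + reserved (24 bits)
    header += struct.pack("!I", (vni & 0xFFFFFF) << 8)   # VNI (24 bits) + reserved (8 bits)
    return header + inner_frame

# Arbitrary example: isolate one tenant's traffic under VNI 5001.
frame = vxlan_encapsulate(inner_frame=b"\x00" * 64, vni=5001)
assert len(frame) == 8 + 64
```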
To communicate between isolated networks that belong to different users, tenants, or subnets (even if they belong to the same user), Internet protocol (IP) routing is required. In early implementations of OpenStack, the IP-routing function was consolidated on a server called a network node, which is separate from the computing nodes. However, as cloud systems expanded and the required bandwidth for communication paths between VMs increased, network nodes became bottlenecks.
In particular, the problem was that all communication between different subnets was concentrated at the network node. This means that even communication between VMs in the same computing node had to go via the network node, as shown in Fig. 1(a).
Taking the situation described above into consideration, the OpenStack community developed the distributed virtual router (DVR) [8]. Here, vRouters are deployed on each computing node and perform IP routing, as shown in Fig. 1(b). These vRouters are implemented in software.
Because the IP traffic in cloud systems has been increasing continuously, it is difficult to achieve the currently required network performance using the DVR architecture alone.
The speed of network interface cards (NICs) attached to servers has become increasingly faster. 100 GbE is already widely used, and 400 GbE has started to be commercialized. Hence, vRouters also need to achieve throughputs of several hundred Gbps. These throughputs cannot be achieved with conventional technologies. Accordingly, we need to consider network acceleration technologies.
B. Network Acceleration Technologies
The candidate technologies to improve the vRouter performance are as follows:
Software acceleration technologies
Hardware offload technologies
The Data Plane Development Kit (DPDK) [9] is the most famous and widely used software acceleration technology. However, it only accelerates packet transfers. For IP routing, packet parsing is also a key function, analyzing packet headers and extracting information (e.g., IP addresses and port numbers). The performance of software vRouters is limited by packet parsing. In general, the software packet-parsing performance with one CPU core is under 7 Mpps [10]. On the other hand, a paper [11] reported that 300 Mpps was achieved by software alone on one server. In that work, dedicated address tables were the key to improving the packet-parsing performance. Moreover, using DPDK, layer 3 (L3) forwarding achieved 233 Gbps (347 Mpps). To achieve this performance, 12 CPU cores were allocated to each 100 GbE port, and a total of 42 CPU cores were used. This approach achieved a super-high performance; however, it consumed many CPU cores. If this method were applied to vRouters in computing nodes, the number of CPU cores that could be allocated to users' VMs would be significantly reduced. Therefore, this approach is not appropriate for vRouter acceleration.
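To put these figures in perspective, the following back-of-the-envelope calculation divides the reported aggregate rate by the number of cores; the 64-core node used for comparison is an assumed example, not a figure from [11].

```python
# Figures reported for the software-only router in [11].
total_rate_mpps = 347
cores_used = 42

print(f"~{total_rate_mpps / cores_used:.1f} Mpps per core")   # ~8.3 Mpps per core

# Assumed example: on a 64-core computing node, dedicating 42 cores to
# packet processing would leave only 22 cores for users' VMs.
node_cores = 64
print(f"{node_cores - cores_used} cores left for VMs")
```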
Comparing the software vRouter and the hardware offload, the hardware offload can reduce CPU utilization; however, packets between VMs in the same node must be sent to the accelerator via the PCIe bus, whereas the software vRouter can process them in memory. The hardware offload becomes an appropriate solution as the bandwidth of the PCIe bus is expanded.
The PCIe card contains one or more accelerator engines. The choices for the engine are network processors, application-specific integrated circuits (ASICs), and FPGAs [12]. Network processors process instructions and packets using software. In contrast, ASICs and FPGAs process packets using hardware logic circuits that form pipelines, as shown in Fig. 2. The pipelines are designed to process packet flows in predetermined orders, similar to the ones specified in OpenFlow [13]. Generally, pipelines for IP routing look up tables to determine the actions and destinations for incoming packets. The performance of the engine is proportional to the product of the clock frequency and the number of pipelines. In the post-Moore era, the clock frequency increases only to a limited extent, and increasing the clock frequency of FPGAs is difficult. Therefore, we must improve the performance by increasing the number of parallel pipelines. However, if we increase the number of parallel pipelines, the number of simultaneous lookup-table accesses also increases because each pipeline accesses the tables at the same time. Hence, in hardware offload, memory accesses to lookup tables become an issue regardless of the type of accelerator engine. Therefore, solving this issue is the key to achieving high performance.
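As a rough illustration of this scaling relation (not a model of any particular engine), the following sketch treats the packet rate as the product of clock frequency and pipeline count, assuming each pipeline accepts one packet per clock cycle.

```python
def engine_rate_mpps(clock_mhz: float, pipelines: int) -> float:
    """Ideal packet rate when each pipeline accepts one packet per clock cycle."""
    return clock_mhz * pipelines       # MHz x pipelines = Mpps

# With the clock frequency essentially fixed, throughput scales only with
# the number of parallel pipelines.
for n in (1, 2, 4):
    print(f"{n} pipeline(s): {engine_rate_mpps(250, n):.0f} Mpps")
```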
Another important point is where the memory containing the routing tables is located. As shown in Fig. 3, there are two candidate architectures. In the first candidate architecture (Fig. 3(a)), the system memory is used. This architecture is the same as that of the software vRouter. With this architecture, sufficient memory capacity is ensured. However, accesses to the lookup tables conflict on the PCIe bus with the packet transfers between the accelerator card and the VMs. Therefore, the PCIe bus becomes a bottleneck.
In the second candidate architecture, dynamic random-access memories (DRAMs) containing the routing tables are located on the accelerator card. Because the accelerator engine can access the DRAMs directly, there is no PCIe bus bottleneck in this architecture. However, the accesses from the engine to the DRAMs become a bottleneck. Conventional DRAMs have insufficient access bandwidth for the multiple simultaneous accesses required by the multi-pipeline architecture described previously. Therefore, a new memory and usage architecture are needed to achieve a throughput of several hundred Gbps.
Related Works
As described in Section II, Ohara et al. [11] achieved a packet processing performance of 300 Mpps for a router using optimized software. However, this approach is not suitable for computing nodes because it consumes many CPU cores for the router process. Yamazaki et al. [14] implemented partial offloading, in which some parts of the router's processing were offloaded to an FPGA. The offloaded processes, which included packet-header parsing and hash calculation, reduced CPU loads. Although the processing of incoming packets from the physical interface can be offloaded, this technology cannot offload the processing of incoming packets from VMs. Korikawa et al. [15] developed a method to speed up table searches using hybrid memory cubes (HMCs) as a memory cache. This method implemented the cache logic in an FPGA and stored the cache data in an HMC; the HMC is based on the same technology as the HBM. In that study, network processing (e.g., header parsing and packet transfers) was not offloaded.
Singha et al. [16] used HBMs for a system that calculates a meteorological forecast model. High performance and power saving were achieved using HBMs, which provided high memory-access performance even with irregular addresses. This key feature is shared with our present study. However, the application is not IP-packet routing; therefore, that study does not include IP-routing offload. Attig and Brebner [17] offloaded the packet-parser process that classifies packet headers onto an FPGA. This work achieved a 400-Gbps performance. However, other processes (e.g., routing-table lookups and packet transfers) were not offloaded. NVIDIA announced that the total switching capacity of ASAP2 in ConnectX-6, a DPU based on an ASIC switch, is over 200 Gbps [18]. The processing capacity keeps increasing, but ASIC accelerators may not be the best solution for all users, as described in the next section. Ethernity Networks announced the ACE-NIC100, a SmartNIC that offloads the NFV infrastructure [19]. This SmartNIC is one of the existing FPGA-based solutions, with a packet processing capacity of 100 Gbps. Napatech released the NT50B01, which offloads Open vSwitch using an FPGA [20]. Its switching capacity is 50 Gbps, and it also offloads functionality such as packet capture.
As described above, no previous work has offloaded the entire vRouter process and reported a performance of 300 Mpps or higher.
Proposed vRouter Acceleration Architecture
Based on the background described above, we propose an architecture that offloads almost all vRouter processes to accelerator hardware. The accelerator is built as a PCIe card mounted on a general-purpose server, as shown in Fig. 4. The VMs and the accelerator send and receive packets via the PCIe bus. VMs on a computing node do not need to be aware of the offload because the VMs and the offloaded vRouter use the same interface (virtio) [21] as the software vRouter. Thus, the proposed architecture is compatible with existing user VMs, which share the packet buffer in the system memory.
The control plane is an important part of an orchestrator or SDN controller, for example, OpenStack, Open vSwitch, or Tungsten Fabric. Because the control plane manages the routing tables through its own APIs, a conversion layer is necessary to translate the original APIs into the dedicated commands of each accelerator. Keeping the original table configuration makes this conversion layer simple. In some cases, an ASIC accelerator achieves better packet processing performance. However, because of the differences between the tables of the control plane and the tables implemented in the ASIC, some tables need to be merged or converted into exact-match tables. These modifications often make the conversion layer complex [22]. The flexibility of the table configuration is therefore a merit of the FPGA solution.
Packets coming from the VMs or the physical port are parsed in the accelerator engine's pipelines. Then, the routing tables in the HBMs are looked up, and the destination address is resolved. Finally, the incoming packets go to their destination via either a VM or the physical port. The physical port of the card can be directly connected to the underlay network via a quad small form-factor pluggable (QSFP) module. Hence, the accelerator card also functions as a NIC.
Accelerator Hardware Design
A. Architecture of the Accelerator Card
Fig. 5 shows our developed accelerator card. The accelerator engine is implemented in an FPGA. The card consists of a multi-chip module, which integrates the FPGA and two HBMs, and a QSFP port. The card provides 16 lanes of the PCIe Gen. 3 interface.
B. HBM2
The HBM is a type of memory specified by the Joint Electron Device Engineering Council Solid State Technology Association as JESD235D [23].
In our design, two HBMs and one FPGA are mounted in an LSI package as a multi-chip module. Each HBM is connected to the FPGA via an interposer. The capacity of each HBM is 8 GB, and 16 memory interfaces to the FPGA allow the FPGA to access each HBM at multiple addresses simultaneously.
Incidentally, conventional CPUs used for servers have only six double data rate (DDR) interfaces. This number comes from the pin-count limitation of LSI packages. Conventional LSI packages for CPUs have ~1000 pins [24]. These pins are used for the CPU interconnect, the PCIe bus, the power supply, and the DDR interfaces. A DDR interface consumes ~100 pins (64 for data, 10 for address, and some for control). Therefore, six DDR interfaces use ~600 pins out of 1000, which reaches the limit of conventional LSI packages.
To break this limitation, the HBM was designed to be connected to logic devices (e.g., ASICs and FPGAs) with chip-to-chip direct connections. In particular, the HBMs and the logic device are mounted on the same LSI package as a multi-chip module. Using the multi-chip module and the stacked structure of DRAMs, the FPGA can have 32 memory interfaces to the HBMs.
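The pin-budget argument can be summarized with a few lines of arithmetic; the figures below are the approximate values quoted above, not exact pin counts of any specific device.

```python
# Approximate figures from the discussion above.
package_pins = 1000       # pins of a conventional CPU LSI package
pins_per_ddr = 100        # ~64 data + 10 address + some control pins per DDR interface
ddr_interfaces = 6

print(f"DDR uses ~{ddr_interfaces * pins_per_ddr} of ~{package_pins} package pins")

# With the HBMs in the same multi-chip module, the wide memory interfaces run
# over the interposer rather than package pins, allowing 2 x 16 = 32 channels.
print(f"HBM channels available: {2 * 16}")
```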
From the usage viewpoint, the memory-access pattern is as follows.
The routing tables of the vRouter stored in the HBMs are looked up with memory-access addresses derived from hash values of the packets' 5-tuples (source/destination IP addresses, source/destination port numbers, and the protocol number). Because there is no correlation between the 5-tuples of different packets, the memory-access addresses are random. The access addresses of the other tables are also basically non-deterministic. Accessing DRAMs with random addresses means accessing them with frequently changing ROW addresses. When the ROW address changes, the DRAM must close the current row and activate the new one, and the following accesses are paused during this operation. This penalty degrades the memory-access performance.
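The following sketch illustrates, in simplified form, how a 5-tuple is hashed into a table address and why consecutive packets land on unrelated addresses; the hash function, table size, and entry width are illustrative assumptions and do not reproduce the exact scheme used in our hardware.

```python
import zlib

TABLE_ENTRIES = 1 << 20          # assumed table size (1M entries)
ENTRY_BYTES = 64                 # assumed entry width

def flow_table_address(src_ip, dst_ip, sport, dport, proto) -> int:
    """Hash a 5-tuple into a flow-table byte address (illustrative only)."""
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}|{proto}".encode()
    index = zlib.crc32(key) % TABLE_ENTRIES
    return index * ENTRY_BYTES

# Two flows that differ only in the source port map to unrelated addresses,
# so back-to-back lookups tend to hit different DRAM rows.
print(hex(flow_table_address("10.0.0.1", "10.0.1.2", 40000, 80, 6)))
print(hex(flow_table_address("10.0.0.1", "10.0.1.2", 40001, 80, 6)))
```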
The HBM used in our design assigns an access channel to every 500-MB address block. Even when a penalty occurs in a certain channel, accesses to other channels are not affected by it. Thus, we can reduce the probability of penalty occurrence; this feature is the advantage of the HBM. Other wideband memory interfaces, such as DDR5 and quad data rate, increase the data transfer rate, but this kind of penalty remains unsolved. As described above, the HBM is suitable for storing the routing tables of vRouters, where many lookups with random addresses occur simultaneously from multiple pipelines, as shown in Fig. 2.
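Continuing the sketch above, the address-to-channel mapping can be pictured as follows; the 500-MB block size and the 16 channels per HBM come from the description above, whereas the concrete addresses are arbitrary examples.

```python
BLOCK_BYTES = 500 * 1024 * 1024   # one HBM channel per 500-MB address block
CHANNELS = 16                     # channels of one HBM stack in our card

def channel_of(address: int) -> int:
    """Map a byte address to its HBM channel (simplified model)."""
    return (address // BLOCK_BYTES) % CHANNELS

# Random lookup addresses scatter over many channels, so a row-change
# penalty in one channel rarely stalls the lookups issued to the others.
addresses = [0x0000_0040, 0x2000_0000, 0x5400_0000, 0x7C00_0040]
print([channel_of(a) for a in addresses])
```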
In the design phase, we estimated the performance difference between DDR4 and HBM2 using RTL simulations. The results showed that the performance of one-channel DDR4 was 50 Mpps, whereas that of four-channel HBM2 was 200 Mpps. These simulations used a simple model in which one table was looked up with random addresses at 250 MHz.
C. Accelerator Engine Design
Fig. 6 shows the block diagram of the internal FPGA circuit designed in this study. The vRouter we used [25] employs different algorithms for the upstream and downstream directions. Therefore, these processes were implemented in separate, dedicated pipelines. Accordingly, this is a multi-pipeline architecture in which the pipelines operate in parallel. Here, the downstream is the packet stream from the VMs, and the upstream is the packet stream from the physical port to the VMs.
The downstream pipeline performs the processes shown in Fig. 7: first, the packet parser extracts 5-tuples and MAC addresses from incoming IP packets. Second, the flow table is looked up with the extracted 5-tuple, and an action is determined. The forward table consists of a routing table and a bridge table. The bridge table is looked up with the extracted MAC address; if the MAC address is not resolved, the routing table is looked up with the extracted IP address. Through these lookups, a nexthop ID is determined. Then, using this nexthop ID, the nexthop table is looked up, and a destination is determined. Depending on the destination, the Xbar (cross-bar) switch forwards the incoming packet to a VM or the physical port. When a packet goes to the physical port, the nexthop table provides the information required for encapsulation (MPLS over UDP/GRE) for the underlay network transfer, and the encap (encapsulation) module performs the MPLS over UDP/GRE encapsulation.
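A minimal software model of this downstream lookup chain is sketched below; the table contents, field names, and the fallback from the bridge table to the routing table are simplified illustrative assumptions and do not reproduce the exact Tungsten Fabric vRouter semantics.

```python
import ipaddress

# Simplified table contents; in the hardware these tables reside in the HBMs.
flow_table    = {("10.0.0.1", "10.0.1.2", 40000, 80, 6): "forward"}
bridge_table  = {"52:54:00:aa:bb:cc": 7}        # dst MAC    -> nexthop ID
routing_table = {"10.0.1.0/24": 9}              # dst prefix -> nexthop ID
nexthop_table = {7: ("vm", 3, None),            # nexthop ID -> (port kind, port no, encap)
                 9: ("phy", 0, "mpls_over_udp")}

def lookup_route(dst_ip):
    """Longest-prefix match over the (tiny) routing table."""
    best = None
    for prefix, nh_id in routing_table.items():
        net = ipaddress.ip_network(prefix)
        if ipaddress.ip_address(dst_ip) in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, nh_id)
    return best[1] if best else None

def downstream(five_tuple, dst_mac):
    """Model of the downstream pipeline: parse -> flow -> forward -> nexthop."""
    action = flow_table.get(five_tuple, "drop")
    if action != "forward":
        return action
    # Forward table: bridge table first; fall back to the routing table.
    nh_id = bridge_table.get(dst_mac)
    if nh_id is None:
        nh_id = lookup_route(five_tuple[1])
    port_kind, port_no, encap = nexthop_table[nh_id]
    return port_kind, port_no, encap   # encapsulation applies only toward the physical port

print(downstream(("10.0.0.1", "10.0.1.2", 40000, 80, 6), "00:11:22:33:44:55"))
# -> ('phy', 0, 'mpls_over_udp'): unresolved MAC, routed to the underlay with encapsulation
```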
The upstream pipeline first removes the capsule header and extracts the MPLS label from it. Using the extracted MPLS label, the MPLS table is looked up, and a nexthop ID is determined. The flow table determines the action for the packet using the 5-tuple extracted from the decapsulated packet. Finally, the nexthop table resolves the output port using the action and the nexthop ID.
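For reference, the 32-bit MPLS shim header (RFC 3032) carries the label in its upper 20 bits; the short sketch below extracts it, with the example header value chosen arbitrarily.

```python
import struct

def mpls_label(shim: bytes) -> int:
    """Extract the 20-bit label from a 4-byte MPLS shim header (RFC 3032)."""
    (word,) = struct.unpack("!I", shim[:4])
    return word >> 12               # layout: label (20) | TC (3) | S (1) | TTL (8)

# Arbitrary example: label 5001, TC 0, bottom-of-stack set, TTL 64.
shim = struct.pack("!I", (5001 << 12) | (1 << 8) | 64)
assert mpls_label(shim) == 5001
```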
The lookup tables (e.g., the flow and routing tables) are stored in the HBMs. They were assigned a number of access channels, as shown in Table 1, considering the frequency of lookups and the number of entries. Because the flow table is looked up from both the upstream and downstream pipelines and requires many entries based on combinations of 5-tuples, all channels of the top HBM are assigned to the flow table.
The bottom HBM is divided between the forward table and the nexthop table, and each table has four channels, considering the performance and the number of entries. The forward table consists of a bridge table and a routing table.
The nexthop table has a small number of entries, but it is looked up from both the upstream and downstream pipelines. To improve the performance, two duplicated copies of the table are placed in the HBM, and each pipeline accesses one of them. Because the MPLS table requires only a small number of entries and the data width of its entries is narrow, it was placed in the block RAM of the FPGA.
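The resulting table placement can be summarized as a small configuration structure; the sketch below follows the channel assignments described above and in Table 1, while the dictionary form itself is merely an illustrative convention.

```python
# Placement of the lookup tables described above; channel counts follow Table 1.
table_placement = {
    "flow":    {"memory": "top HBM",    "channels": 16},               # both pipelines, many entries
    "forward": {"memory": "bottom HBM", "channels": 4},                # bridge + routing tables
    "nexthop": {"memory": "bottom HBM", "channels": 4, "copies": 2},   # one copy per pipeline
    "mpls":    {"memory": "FPGA block RAM"},                           # few, narrow entries
}
```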
The logic utilization of our design is shown in Table 2. Each block name corresponds to a block in Fig. 6. The HBM memory controller is the logic circuit that gathers the memory accesses from the table lookups and distributes them to the HBM memory channels. The Common portion includes the Xbar switch, the software management circuit, and other common logic. The Other portion includes the PLL, reset, and other circuits provided by the FPGA vendor.
Measurement Setup and Results
For the measurement, we set up two computing nodes, as shown in Fig. 8; their configurations are shown in Table 3. The device under test (DUT) node was equipped with the developed accelerator. The measurement node was set up with software only to send and receive packet streams; this node was used to measure the inter-node packet transfer performance in conjunction with the DUT node. The intra-node packet transfer performance was measured between the VMs in the DUT node. The packets were generated and received by the DPDK packet generator on the VMs.
The vRouter software offloaded to the FPGA was based on Tungsten Fabric [25]. It was modified to add functions for port creation and deletion corresponding to the FPGA host interfaces. Moreover, it was extended to set various parameters of the FPGA.
The software vRouter on the measurement node was implemented with DPDK. It occupied 14 CPU cores that were isolated from the kernel and other VMs' processes.
Fig. 9 shows the packet throughput between the nodes (the measurement node and the DUT node). The theoretical limit of the throughput of each pipeline is 128 Gbps because the data width inside the FPGA is 512 bits and the operating clock is 250 MHz. On the downstream, the throughput almost reached the theoretical limit when the packet length was 128 bytes or more. On the upstream, the throughput did not reach the theoretical limit because the encapsulation header became a large overhead at small packet lengths. Moreover, for 64-byte packets, the packet throughput was 191 Mpps on the downstream and 130 Mpps on the upstream. These results correspond to the simulation mentioned previously and validate the pipeline design.
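The theoretical limit quoted above follows directly from the datapath parameters; the quick check below ignores inter-packet gaps and header overheads, which is one reason the measured 191 Mpps stays below the idealized 250 Mpps ceiling for 64-byte packets.

```python
DATAPATH_BITS = 512
CLOCK_HZ = 250e6

limit_bps = DATAPATH_BITS * CLOCK_HZ
print(f"per-pipeline limit: {limit_bps / 1e9:.0f} Gbps")    # 128 Gbps

# A 64-byte packet fits in one 512-bit word, so the ideal packet rate equals
# the clock rate; larger packets occupy several cycles.
for size in (64, 128, 256):
    cycles = -(-size * 8 // DATAPATH_BITS)                  # ceiling division
    print(f"{size}-byte packets: up to {CLOCK_HZ / cycles / 1e6:.1f} Mpps")
```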
The processing performance and the total throughput (upstream plus downstream) are shown in Figs. 10 and 11, respectively. For 64-byte packets, the vRouter with our accelerator achieved a very high processing performance of over 320 Mpps. This result shows that the packet processing performance was improved by parallelizing the pipelines and that the HBMs eliminated the lookup-table bottleneck. For comparison, the performance and throughput of a vRouter with a SmartNIC [27] and of a software vRouter with DPDK were also measured. Here, the SmartNIC was developed and dedicated for the Tungsten Fabric vRouter, and the Tungsten Fabric community announced that it is an officially supported offload device. The software vRouter with DPDK occupied 13 CPU cores. Clearly, the vRouter with our accelerator achieved a higher performance than the conventional ones. Furthermore, our accelerator used only one CPU core, for management.
We measured the latencies of packet transfers between VMs in a node with our accelerator and with the DPDK vRouter; the results are shown in Fig. 12. When there was no background traffic, the latency was ~20 µs for both vRouters. However, when there was background traffic, the latency of the DPDK vRouter increased to several hundred µs. Here, we set up four VMs to generate the background traffic; each VM generated 100 flows, for a total of 10 Gbps. We consider that this was because the CPU cores were shared between the latency-measurement flow and the background traffic. By contrast, our accelerated vRouter showed very stable latencies of ~18 µs, unaffected by the packet length and the background traffic.
Table 4 shows the numerical data of the average latencies and standard deviations for 512-byte packets. The DPDK vRouter showed very large variations, whereas our accelerator showed an almost normal distribution regardless of the presence of background traffic.
Finally, we mention the power dissipation of our prototype. Because it was difficult to measure the actual power of the entire PCIe card, we estimated the power consumption of the FPGA using a tool provided by the FPGA vendor [28]. The power consumption of our accelerator was estimated to be 78 W under a standard routing condition. In terms of total power consumption, the DPDK vRouter consumed almost 105 W across 14 CPU cores to process packets at 40 Mpps, whereas the FPGA offload consumed 78 W at 320 Mpps. Thus, the FPGA offload drastically reduces the power dissipated per packet.
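Normalizing these figures to energy per packet makes the comparison concrete; the calculation below uses only the numbers quoted above.

```python
# Power and packet-rate figures quoted above.
dpdk_w, dpdk_mpps = 105, 40
fpga_w, fpga_mpps = 78, 320

dpdk_nj = dpdk_w / (dpdk_mpps * 1e6) * 1e9   # nanojoules per packet
fpga_nj = fpga_w / (fpga_mpps * 1e6) * 1e9

print(f"DPDK vRouter : {dpdk_nj:.0f} nJ/packet")    # ~2625 nJ/packet
print(f"FPGA offload : {fpga_nj:.0f} nJ/packet")    # ~244 nJ/packet
print(f"reduction    : ~{dpdk_nj / fpga_nj:.0f}x")  # roughly an order of magnitude
```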
Conclusion
vRouters in cloud systems require hardware offload to meet the demand for ever-increasing network bandwidth. Furthermore, the offload hardware needs a new architecture to achieve a throughput of several hundred Gbps, because the memories that store the routing tables become bottlenecks when accessed from multiple pipelines. Therefore, in this study, we proposed a new hardware architecture using an accelerator engine with multiple pipelines and HBMs to store the routing tables. We prototyped this architecture using an FPGA for the accelerator engine together with HBMs. Using the engine with multiple pipelines, we improved the processing performance and throughput. Using the HBMs, we achieved a high memory-access performance even though they are accessed from multiple pipelines simultaneously. We demonstrated very high packet processing performances of 320 Mpps and 250 Gbps. Moreover, we showed that the accelerated vRouter provides low and stable packet-transfer latency that is not affected by background traffic.