Run-to-completion model in Data Plane processing
Multi-core processing is a vital part of modern computing architectures. It is present in large server clusters, personal computers, and smartphones. Parallel processing allows applications and services to use computing resources in an energy-efficient manner while executing without interruption and independently of processes running on other cores.
The key advantage of multi-core architecture is its provisioning of central processing unit (CPU) resources, which is especially important in cloud computing. CPU resources can be shared by different processes according to their demand. These capabilities are provided at a lower cost than that of a single core powerful enough to handle all the processing in one process.
Multi-core architecture of general purpose processors has also been proposed as an answer to the increasing demand for computing resources in mobile networks and mobile devices, for example, in the form of mobile edge computing (MEC). In MEC, control plane (C-plane) and user plane (U-plane) processing is moved to a computing cloud located in close proximity to the source of data.
The processing required by the data plane can be implemented on a multi-core general purpose chip, although the data plane has specific computing requirements. In general, the data plane handles the traffic arriving at the current network node. The traffic is processed and sent back to the network. From the outside, such a network node looks like a single super-core which processes packet by packet while being constrained by the packet rate.
Due to this constraint, especially now that high data throughput is a primary demand, packet processing must be deterministic and straightforward. There is a need to identify and optimize the critical path taken by most of the packets inside the network node.
This article presents specific aspects of data plane applications running in a multi-core environment. In the following sections, the article elaborates in detail on the typical aspects of packet processing and how different demands are addressed by the run-to-completion model, which can be utilized for the design of data plane applications.
1. The optimization problem and the supporting model
1.1 System modeling
Let us consider a network composed of a data source, a computing network node, and a data sink (terminal), with a queue between the source and the node. The network node hosts a data plane application in a multi-core environment. Each core executes the same set of instructions on packets read from the same input stream and written to the same output stream. The processing of packets from an input queue is distributed across the cores inside the node according to the scheduling and CPU load balancing policy. However, we deliberately do not limit this analysis to a specific network architecture; the number of data sources, terminals, and I/O queues can differ. The only rule is that the packet flow arriving at the input must be forwarded to the output.
A typical server consists of numbered cores in a homogeneous architecture, so each core has the same computing power. Each transmission path has an associated queue or buffer for packets, since the processing of each packet takes time (see Figure 1). Consequently, a number of packets is serviced in the nth time interval. In Figure 1, a(n) is a stochastic process describing the transmission, with some average rate of incoming packets.
1.2 CPU load
Depending on the current packet arrival rate, scheduling, and CPU load balancing policy, packets can be processed on one of the cores at a certain rate. Such processing takes time and therefore causes CPU load. Hence, there is a maximum load which can be reached in each time interval of the transmission, and, consequently, a maximum packet rate which can be achieved by causing such load.
1.3 Queuing model
In every time interval indexed by n, a(n) packets are transmitted by the data source. Let q(n) be the queue length at the data source. As the transmission proceeds, the number of packets in the buffer changes: arriving packets are queued and stored in the buffer, while at the same time packets are dequeued from the buffer and serviced by the CPUs at a specific rate. This process should keep the buffer size balanced; otherwise the transmission would have to be interrupted, and packets would be dropped and eventually lost.
At each process running on a core, an arriving packet can be stored in another buffer before it is processed. In Figure 1, q1(n), q2(n), …, q|C|(n) denote the queue lengths of the cores. The number of packets in a queue depends on the current state of that queue, the arrival rate of the packets, and the processing rate of the core.
The total service rate at the network node is the sum of the processing rates of the cores in use.
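The queue evolution described above can be sketched as the recurrence q(n+1) = max(q(n) + a(n) − s(n), 0), where s(n) is the total service rate. The following minimal C sketch (function name and types are illustrative, not from any particular framework) computes one step of this recurrence:

```c
#include <stddef.h>

/* One step of the queue recurrence from the model above:
 * q(n+1) = max(q(n) + a(n) - s(n), 0), where s(n) is the total
 * service rate, i.e. the sum of the per-core processing rates. */
long queue_step(long q, long arrivals, const long *core_rates, size_t ncores)
{
    long service = 0;
    for (size_t i = 0; i < ncores; i++)
        service += core_rates[i];          /* total service rate s(n) */

    long next = q + arrivals - service;
    return next > 0 ? next : 0;            /* queue length cannot go negative */
}
```

Iterating this step over the intervals n = 0, 1, 2, … reproduces the buffer dynamics: whenever arrivals persistently exceed the summed core rates, the queue grows without bound, which is exactly the situation the load balancing policy must prevent.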
1.4 Problem formulation
There is a need to decide how much of the overall transmission will be handled by each available CPU in the server. The control action at each core includes a selection of the CPU load. This allocation is subject to the CPU load balancing policy; for example, the scheduler in the network node may allocate a different fraction of CPU time at a given moment. The state of the network node which should be considered when deciding about the load balancing policy includes the data rate from the source and the current number of packets in the buffers. In our case, it reduces to the decision about the number of packets processed by each core in the current time interval.
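One of the simplest policies matching this formulation is to dispatch each arriving packet to the core with the shortest backlog. The sketch below is a hypothetical minimal policy, not a prescription from the model:

```c
#include <stddef.h>

/* Hypothetical join-the-shortest-queue policy: given the per-core
 * backlogs q1(n) ... q|C|(n), return the index of the core that
 * should receive the next packet. */
size_t pick_core(const long *backlog, size_t ncores)
{
    size_t best = 0;
    for (size_t i = 1; i < ncores; i++)
        if (backlog[i] < backlog[best])    /* strictly shorter queue wins */
            best = i;
    return best;
}
```

Real schedulers take more state into account (arrival rate, per-flow constraints), but this illustrates that the control action reduces to choosing how many packets each core services in the current interval.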
The presented model was defined to simplify the analysis of the optimization problem which we face while designing a packet processing application in a multi-core environment. It clearly shows that the main actors are data packets and cores [1, 2]. Thus, highly optimized data plane software should be built around the abstraction of these two entities.
2. The run-to-completion model
Run-to-completion is a scheduling model in which each task runs until it either finishes or explicitly yields control back to the scheduler, for example, to use a hardware accelerator. In a multi-core environment, run-to-completion is considered a variant of the cluster model. It combines several cores into a cluster which executes a non-fragmented, single-stage, non-preemptive program. Such clusterization is often referred to as partitioning [1, 2].
As presented in the system model from Section 1, one of the ways to achieve higher overall performance is to focus on the efficiency with which each packet is processed entirely by one core [1, 2]. Since the run-to-completion model already defines the entity of a single-stage execution, it can be easily mapped to packet processing.
2.1 Processor affinity
To get the notion of a processor in our model, the popular technique of processor affinity (also known as CPU pinning) can be applied. It binds a specific physical core or cores to a specific process. In that way, process data always remains in cache, thus avoiding cache misses. Depending on the environment, processor affinity can be used to achieve better performance.
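On Linux, CPU pinning can be requested with the `sched_setaffinity` system call. A minimal sketch (the wrapper name `pin_to_core` is our own; the call itself is the standard glibc interface):

```c
#define _GNU_SOURCE             /* required for sched_setaffinity / cpu_set_t */
#include <sched.h>

/* Bind the calling process to a single physical core so its working set
 * stays warm in that core's cache. Returns 0 on success, -1 on error. */
int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);     /* allow exactly one core in the mask */
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling process */
}
```

A data plane worker would typically call `pin_to_core(i)` once at startup, with a distinct core index per worker, before entering its packet processing loop.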
2.2 Poll mode
As the packet arrival rate increases, the CPU load must also increase. If the cores serve interrupts, the high load leads to context switching overhead and results in performance degradation. Instead of using interrupt service routines for dequeuing packets from buffers, polling mode is often used to achieve extreme efficiency. A poll mode driver is implemented as a process pinned to a CPU core; it periodically checks the receive queues for arriving packets, which are then dispatched and serviced by a single routine that runs uninterruptedly until it finishes. Because the polling process constantly checks the queues, the CPU is constantly loaded. In addition, packet processing is executed in the same context as polling, so no context switching overhead occurs.
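The structure of such a loop can be sketched as follows. The `rx_burst` and `process_packet` hooks are placeholders for driver and application code (in DPDK, for instance, `rte_eth_rx_burst()` plays the receive role); the stub driver at the bottom exists only so the sketch is self-contained:

```c
#include <stdbool.h>
#include <stddef.h>

typedef size_t (*rx_burst_fn)(void **pkts, size_t max);
typedef void (*process_fn)(void *pkt);

/* Poll-mode service loop: the pinned process spins on the receive queue
 * and services each packet to completion in the same context, with no
 * interrupts and no context switches involved. */
static void poll_loop(rx_burst_fn rx_burst, process_fn process_packet,
                      volatile bool *running)
{
    void *pkts[32];
    while (*running) {
        size_t n = rx_burst(pkts, 32);   /* non-blocking poll of the rx queue */
        for (size_t i = 0; i < n; i++)
            process_packet(pkts[i]);     /* run to completion, same context */
    }
}

/* Stub driver for illustration: delivers 3 dummy packets, then stops. */
static int stub_left = 3, stub_served = 0;
static bool stub_running = true;

static size_t stub_rx(void **pkts, size_t max)
{
    size_t n = 0;
    while (stub_left > 0 && n < max) { pkts[n++] = &stub_left; stub_left--; }
    if (n == 0)
        stub_running = false;            /* end the demo once drained */
    return n;
}
static void stub_process(void *pkt) { (void)pkt; stub_served++; }

int poll_demo(void)
{
    poll_loop(stub_rx, stub_process, &stub_running);
    return stub_served;                  /* packets serviced by the loop */
}
```

Note that in a real driver the loop never terminates by itself; the stop flag here exists only to make the sketch runnable.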
2.3 Data structures for packet processing
Several types of data structures can be identified independently for the data plane model:
- Data structures per packet – packet descriptors
- Data structures per flow/connection
- Data structures per protocol – current context, state for the processing state machine, statistics counters
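Illustrative layouts for these three kinds of structures might look as follows. All field names are assumptions made for the sketch; they are not taken from any particular framework:

```c
#include <stdint.h>

struct pkt_descriptor {        /* per packet: owned by exactly one core */
    uint8_t  *data;            /* pointer to the packet buffer */
    uint32_t  len;             /* payload length in bytes */
    uint32_t  flow_hash;       /* precomputed flow classification key */
    uint64_t  seq;             /* sequence number assigned at ingress */
};

struct flow_ctx {              /* per flow/connection */
    uint64_t  pkts_seen;
    uint64_t  bytes_seen;
    uint32_t  state;           /* position in the protocol state machine */
};

struct proto_ctx {             /* per protocol: shared, access must be guarded */
    uint64_t  rx_total;
    uint64_t  drop_total;
};
```

The distinction matters because, as discussed below, each category has a different sharing pattern between cores and therefore different protection requirements.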
In general, the access to data shared between the cores must be protected against race conditions.
Since the whole packet is processed by a single core in the run-to-completion model, packet descriptors are not shared. Hence, they do not require restricted access.
In an ideal situation, it is desirable to let all processes run concurrently. Touching the same memory inherently adds invisible contention. Thus, in applications with very high performance requirements, it is acceptable to duplicate the same context data in different places in memory. Access to each copy is then restricted to one core. This is a convenient and theoretically simple solution, especially assuming that processes only read the common data and do not modify it. Otherwise, a different solution should be considered to avoid handling asynchronous events coming from different cores.
There is extensive research on queuing and shared data structures. Generally, algorithms for concurrent data structures, including FIFO queues, fall into two categories: blocking and non-blocking. Blocking algorithms allow a slow or delayed process to prevent faster processes from completing operations on the shared data structure. Non-blocking algorithms guarantee that if there are one or more active processes trying to perform operations on a shared data structure, some operation will complete within a finite number of time steps. On asynchronous (especially multi-programmed) multiprocessor systems, blocking algorithms suffer from significant performance degradation when a process is halted or delayed at an inopportune moment. Possible sources of delay include processor scheduling preemption, page faults, and cache misses. Non-blocking algorithms are more robust in the face of these events, hence they are strongly preferred in data plane applications.
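As a concrete taste of the non-blocking category, here is a minimal single-producer/single-consumer lock-free ring built on C11 atomics. It is a sketch of the easy SPSC case only; multi-producer queues such as the Michael and Scott algorithm cited in the references are considerably more involved:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8            /* must be a power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic size_t head;       /* advanced only by the consumer */
    _Atomic size_t tail;       /* advanced only by the producer */
};

bool ring_enqueue(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;                          /* full: caller retries later */
    r->slot[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

bool ring_dequeue(struct spsc_ring *r, void **item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                          /* empty */
    *item = r->slot[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Single-threaded smoke test of the ring operations. */
int ring_demo(void)
{
    static struct spsc_ring r;                 /* zero-initialized */
    int x = 42;
    void *out = NULL;
    if (!ring_enqueue(&r, &x))  return -1;
    if (!ring_dequeue(&r, &out)) return -2;
    if (out != &x)               return -3;
    if (ring_dequeue(&r, &out))  return -4;    /* must be empty again */
    return 0;
}
```

Neither side ever waits on the other: a full or empty ring is reported immediately, so a delayed peer can never block progress, which is exactly the non-blocking property argued for above.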
Some of the algorithms proposed in the research literature assume the existence of non-delayed processes attempting to perform operations on shared data structures. These operations must be assumed to complete in a finite time. This directly shows how vital it is to design straightforward and deterministic packet processing in the run-to-completion model.
Regardless of the particular solution analyzed, it is worth recognizing that some of them utilize the idea of a procedure which is inherently non-preemptable. Such an operation is called atomic. Executing an atomic operation is virtually like executing a single processor instruction: it will always complete without interruption.
In extreme situations, it can be assumed that the whole packet processing operation is atomic. Hence, the processing of the next packet originating from the same flow cannot start before the previous one is completed, no matter whether on the same or another core. In consequence, concurrent access to the flow data can be avoided. In addition, such an approach supports another need, explored in the next section: packets belonging to the same flow which arrive at the node must go back to the network in the same order.
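One way to sketch this per-flow exclusion with C11 atomics is a try-acquire flag per flow: a core may enter a flow only if no other core is inside it, and instead of blocking, the loser defers the packet (for example, re-enqueues it). The names below are our own illustration, not an established API:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct flow_slot { atomic_flag busy; };

/* Returns true if the calling core now owns the flow exclusively. */
bool flow_try_enter(struct flow_slot *f)
{
    return !atomic_flag_test_and_set_explicit(&f->busy, memory_order_acquire);
}

/* Release the flow after the packet has been fully processed. */
void flow_leave(struct flow_slot *f)
{
    atomic_flag_clear_explicit(&f->busy, memory_order_release);
}

int flow_demo(void)
{
    static struct flow_slot f = { ATOMIC_FLAG_INIT };
    if (!flow_try_enter(&f)) return -1;   /* first entry must succeed */
    if (flow_try_enter(&f))  return -2;   /* a second entry must be refused */
    flow_leave(&f);
    if (!flow_try_enter(&f)) return -3;   /* free again after leaving */
    flow_leave(&f);
    return 0;
}
```

Because the losing core never spins or sleeps but simply moves on to another packet, this preserves the run-to-completion character of the processing loop.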
2.4 Parallel packets processing from a single flow
The atomic queue allows the packets originating from a single flow to be processed one at a time. This does not mean that packets from other flows cannot be processed concurrently with them. It is then a matter of the CPU load balancing policy to allocate processing resources efficiently under this additional restriction.
It is easy to notice that this solution works inefficiently, especially when we consider only one existing flow. Hence, supported by the various techniques of avoiding concurrent access to shared data, it is worth exploring more sophisticated techniques of keeping packets in order.
The most widely used method to achieve this is to assign a unique sequence number to each packet when it enters the processor, allow the packets to be processed out of order, and have them reordered based on their sequence numbers just before they return to the network. This means allowing concurrent processing of packets and then spending additional CPU time on reordering. Unfortunately, it violates the principle of the presented run-to-completion model, which tries to keep only the notion of the packet and the processor.
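The reorder stage described above can be sketched as a small sliding window indexed by sequence number: out-of-order completions are parked in the window, and packets are released to the wire strictly in sequence. All names here are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

#define WIN 16                 /* reorder window size, a power of two */

struct reorder {
    void  *slot[WIN];
    bool   ready[WIN];
    size_t next;               /* sequence number expected next on the wire */
};

/* A worker finished the packet carrying 'seq'. Park it in the window and
 * drain every in-order packet through tx(). Returns how many packets were
 * released, or 0 if this packet is still held back. */
size_t reorder_complete(struct reorder *ro, size_t seq, void *pkt,
                        void (*tx)(void *))
{
    ro->slot[seq & (WIN - 1)] = pkt;
    ro->ready[seq & (WIN - 1)] = true;

    size_t released = 0;
    while (ro->ready[ro->next & (WIN - 1)]) {
        size_t i = ro->next & (WIN - 1);
        ro->ready[i] = false;
        tx(ro->slot[i]);       /* emit strictly in sequence order */
        ro->next++;
        released++;
    }
    return released;
}

static int sent;
static void count_tx(void *p) { (void)p; sent++; }

/* Demo: packet 1 finishes before packet 0; both leave in order. */
int reorder_demo(void)
{
    static struct reorder ro;  /* zero-initialized: next = 0 */
    int a, b;
    if (reorder_complete(&ro, 1, &b, count_tx) != 0) return -1; /* held */
    if (reorder_complete(&ro, 0, &a, count_tx) != 2) return -2; /* 0, then 1 */
    return sent;
}
```

A production implementation must additionally bound how far workers may run ahead of `next` (here at most WIN packets) and handle lost packets, which this sketch omits.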
This section has shown only how much can be done to efficiently utilize the general purpose architecture of processors for data plane applications optimized for very high throughput. In the next section, the article presents a comparison between the run-to-completion model and a model which uses general purpose fair scheduling.
3. An alternative approach
As seen so far, by introducing the run-to-completion model, we maximize the overall packet rate in a data plane application running on a multi-core machine. In the long term, however, the goal is to maximize a joint utility of data processing throughput and power consumption: maximum throughput at minimum power. The economical usage of power and computing resources is not addressed by the run-to-completion model.
In other words, cores affiliated with packet processing processes are not used efficiently when the packet rate is much lower than the maximum value. CPU time is then consumed by the polling process, which checks the queue state for packets to be serviced. This creates artificial CPU load and, when it lasts too long, may simply be unprofitable, because the occupied cores can neither be used for other tasks nor be disabled.
In addition, processor affinity (Section 2.1) requires pinning the process to a physical CPU, which disables hyper-threading (Intel's proprietary solution). This is done to avoid sharing the internal resources of the physical core. This optimization technique should be considered only when it is confirmed that processor affinity improves the data plane application performance in the particular case.
Constraints applied due to the presence of data shared between processing stages can cause additional inefficiency in some conditions and leave processors loaded yet effectively unused, for example, the atomic queue and the single flow problem discussed in Section 2.4.
The prohibition of preemption, introduced mainly to avoid context switching overhead and to make the data forwarding process more predictable, seems to be the root cause of the constant load on a fixed number of CPUs in the presented solution.
The main goal of data plane processing is to avoid packet loss. If the packet rate is sufficiently low and there are no real-time constraints, preemption and fair scheduling can be considered.
Preemption and fair scheduling should allow the data plane application and other programs running on the same core to use the available CPUs efficiently. Nevertheless, mixing data plane and control plane applications associated with the same network brings additional risks and should be avoided. Preemption caused by a higher priority control plane process can lead to longer packet queues, longer delays, and congestion. Preemption of a control plane process by the data plane delays the analysis of asynchronous hardware and network events, and results in analyzing them when they are already obsolete; for example, the link that was previously reported down might be up by now.
The use of standard tools and loose coupling with the hardware has obvious advantages in software engineering. The decision to introduce high performance data plane optimizations should therefore also take the software development phase into account.
4. Conclusion
The highly optimized data plane application design has many specific aspects which are not present in the control plane and other types of applications. Nevertheless, due to the rise of multi-core processing utilizing general purpose architectures, it is necessary to face these problems and define appropriate solutions. The tight coupling of the data plane with the hardware platform, that is, physical CPUs and accelerators, does not make it easier, as virtualization of the application obviously brings additional processing overhead. The scalability of data plane applications is also an open question.
The run-to-completion model of data plane applications tries to address this variety of issues. This does not mean that it solves them completely or must be strictly applied when it comes to packet processing. However, it reveals the potential to be a basis for an efficient and, to some extent, scalable solution deployed on increasingly popular clouds.
References
[1] M. Kangas, "Stability Analysis of New Paradigms in Wireless Networks", University of Oulu, 2017
[2] C. F. Dumitrescu, "Design Patterns for Packet Processing Applications on Multi-core Intel® Architecture Processors", Intel Corporation, December 2008
[3] J. Donovan, K. Prabhu, "Building the Network of the Future: Getting Smarter, Faster, and More Flexible with a Software Centric Approach", CRC Press, 2017
[4] M. Michael, M. L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms", University of Rochester, 1996
[5] Nokia Siemens Networks, "Open Event Machine", 2013. [Online]. Available at: https://github.