Datacenters provide cost-effective and flexible access to scalable compute and storage resources necessary for today’s cloud computing needs. A typical datacenter is made up of thousands of servers connected with a large network and usually managed by one operator. To provide quality access to the variety of applications and services hosted on datacenters and maximize performance, it deems necessary to use datacenter networks effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements. This includes user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, various topologies proposed for them, their traffic properties, general traffic control challenges in datacenters and general traffic control objectives. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters and not to survey all existing solutions (as it is virtually impossible due to massive body of existing research). We hope to provide readers with a wide range of options and factors while considering a variety of traffic control mechanisms. We discuss various characteristics of datacenter traffic control including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. At the end of this paper, we briefly review inter-datacenter networks that connect geographically dispersed datacenters which have been receiving increasing attention recently and pose interesting and novel research problems. To measure the performance of datacenter networks, different performance metrics have been used such as flow completion times, deadline miss rate, throughput and fairness. Depending on the application and user requirements, some metrics may need more attention. While investigating different traffic control techniques, we point out the tradeoffs involved in terms of costs, complexity and performance. We find that a combination of different traffic control techniques may be necessary at particular entities and layers in the network to improve the variety of performance metrics. We also find that despite significant research efforts, there are still open problems that demand further attention from the research community.
Datacenters provide an infrastructure for many online services such as on-demand video delivery, storage and file sharing, web search, social networks, cloud computing, financial services, recommendation systems, and interactive online tools. Such services may dynamically scale across a datacenter according to demands enabling cost-savings for service providers. Moreover, considering some degree of statistical multiplexing, better resource utilization can be achieved by allowing many services and applications to share datacenter infrastructure. To reduce costs of building and maintaining datacenters, numerous customers rely on infrastructure provided by large cloud companies [1, 2, 3] with datacenters consisting of hundreds of thousands of servers.
Figure 1 shows the structure of a typical datacenter cluster network with many racks. A datacenter often hosts multiple such clusters with thousands of machines per cluster. A cluster is usually made up of up to hundreds of racks [4, 5, 6]. A rack is essentially a group of machines which can communicate at line rate with minimum number of hops. All the machines in a rack are connected to a Top of Rack (ToR) switch which provides non-blocking connectivity among them. Rack size is typically limited by maximum number of ports that ToR switches provide and the ratio of downlink to uplink bandwidth. There is usually about tens of machines per rack [4, 5, 6]. ToR switches are then connected via a large interconnection allowing machines to communicate across racks. An ideal network should act as a huge non-blocking switch to which all servers are directly connected allowing them to simultaneously communicate with maximum rate.
Datacenter network topology plays a significant role in determining the level of failure resiliency, ease of incremental expansion, communication bandwidth and latency. The aim is to build a robust network that provides low latency, typically up to hundreds of microseconds [7, 8, 9], and high bandwidth across servers. Many network designs have been proposed for datacenters [10, 5, 11, 12, 13, 14, 15, 16]. These networks often come with a large degree of path redundancy which allows for increased fault tolerance. Also, to reduce deployment costs, some topologies scale into large networks by connecting many inexpensive switches to achieve the desired aggregate capacity and number of machines [17, 4]. Although the majority of these topologies are symmetrical, in practice, datacenter networks turn out to be often asymmetrical due to frequent failures of network elements (switches, links, ports, etc.) [18, 19, 20]. In contrast to fixed networks, reconfigurable topologies involve optical circuit switching, wireless or a combination of both to adapt to traffic demands [21, 22, 23, 24]. These topologies rely on fast algorithms that take into account the reconfiguration latency.
Many applications may need to span over multiple racks to access required volume of resources (storage, compute, etc.). This increases the overall volume of traffic across racks. A datacenter network with full bisection bandwidth allows for flexible operation and placement of applications across clusters and improves overall resource utilization and on-demand scale out for applications [4, 10, 5, 15]. This allows resources of any machine to be used by any application which is essential for hyper-scale cloud providers [1, 2, 3].
Designing networks for full bisection bandwidth is costly and unnecessary for smaller companies or enterprises. As a result, many datacenters may be over-subscribed, meaning the total inter-rack network capacity may be less than sum of intra-rack capacities across all racks. The underlying assumption is that applications are mostly rack local. Increasing the over-subscription ratio affects performance of different topologies differently. For instance, over-subscribed Fat-Trees provide less flexible communication across machines compared to over-subscribed Jellyfish networks .
There is growing demand for datacenter network bandwidth. This is driven by faster storage devices, rising volume of user and application data, reduced cost of cloud services and ease of access to cloud services. Google reports 100% increase in their datacenter networking demands every 12 to 15 months . Cisco forecasts a 400% increase in global datacenter IP traffic and 2.6 times growth in global datacenter workloads from 2015 to 2020 .
Applications running on datacenters determine their traffic characteristics and the communication patterns across machines. Some popular applications include web services, cache followers, file stores, key-value stores, data mining, search indexing and web search. Many datacenters, especially cloud providers, run a variety of applications that results in a spectrum of workloads. Some applications generate lots of internal datacenter traffic, such as scatter-gather (also known as partition-aggregate) [27, 28, 29, 30] and batch computing tasks [31, 32]. As a result, the total traffic volume within a datacenter is often much more than that of entering or leaving it. Cisco reports this ratio to be greater than 3 which is expected to increase further by 2020 . Traffic control is necessary to highly utilize network bandwidth, keep latency low, offer quality of service, and fairly share resources among many users and applications by managing flow of traffic across the datacenter. There is a significant body of work on traffic control for datacenters. In this tutorial paper, we aim to review concepts in design and operation of traffic control schemes.
The rest of this paper is organized as follows. In §2, we present some related work. In §3, we review a variety of datacenter topologies, provide an overview of datacenter traffic patterns, set forth the challenges of traffic control for datacenter networks, and the objectives of datacenter traffic control. In §4, we review management schemes that can be applied to datacenters and point to some research work that use different management schemes. Next, in §5, we present a variety of traffic control techniques, discuss them in detail and explore the benefits and drawbacks associated with them. In §6, we discuss some general open traffic control problems. In §7, we point to rising paradigms related to datacenters. In §8, we introduce a new research area that is a result of global distribution of datacenters. Finally, in §9, we conclude the paper.
In this section, we briefly present some survey articles related to datacenter traffic control. In , authors provide a short survey of low latency datacenter networking by reviewing approaches taken to achieve low latency, namely reducing queuing delay, accelerating retransmissions, prioritizing mice flows and utilizing multi-path. In , authors survey the methods used to address the transport control protocol (TCP) incast problem (please see §3.3.6). In , authors survey bandwidth sharing in multi-tenant datacenters using techniques of static reservation, minimum guarantees and no guarantees (resources obtained in a best effort fashion). In , authors point out datacenter transport problems namely TCP incast, latency, challenges in virtualized environments, bandwidth sharing in multi-tenant environments, under-utilization of bandwidth, and TCP in lossless Ethernet environments also known as Converged Enhanced Ethernet (CEE). In , authors discuss TCP issues in datacenters pointing to TCP incast, queue buildup and buffer pressure. In , authors provide a comprehensive overview of datacenter networking for cloud computing discussing cloud computing network architectures, communication technologies used, topologies, routing and virtualization approaches. In , authors discuss various congestion notifications for datacenter networks. Finally, in , authors survey transport protocols proposed for datacenter networks and briefly explain each one of the variety of research efforts made in addressing the incast problem, outcast problem and in reducing latency for datacenter networks.
In this tutorial paper, we merely focus on traffic control concepts that can be applied to a variety of transport protocols including TCP. We also point to research efforts that use different techniques as examples so that readers can elevate their knowledge in case they are interested in further details. We try to uncover the (not obvious) tradeoffs in terms of complexity, performance and costs. This paper is different from prior work in that it covers a variety of aspects in traffic control in a conceptual way while not focusing on any specific transport, network or data link layer protocol. In addition, we provide an overview of datacenter traffic properties, topologies as well as traffic control challenges, objectives and techniques and their relationships which has not been done in prior work to our knowledge. Finally, at the end of this paper, we point to a recent research direction that involves inter-datacenter networks and offer three areas that demand further attention from the research community.
In this section, we dive deeper into specifics of datacenter networks. We review a variety of topologies proposed and general traffic properties of datacenter networks, point to traffic control challenges in datacenters and explain several traffic control objectives sought by various parties (i.e., operators, tenants, etc.) involved in using datacenter networks.
We shortly review popular physical datacenter topologies proposed and used in the literature either in testbed or simulation. Figure 2 shows examples of datacenter network topologies reviewed in the following (notice the network connecting ToR switches).
Fat-Tree , shown in Figure 2(a), is a multi-rooted tree topology where every root is called a core switch. It can be considered as a folded Clos  network . By increasing the number of roots, one can reduce the over subscription ratio (considering fixed capacity per link). This topology allows for high bisection bandwidth using a large number of less expensive switches allowing support for a large number of hosts at much less cost. There is an aggregate layer between the core and edge (ToR switches). The number of hops between any two servers attached to the same ToR switch is 2, to the same aggregate switch is 4 and otherwise is 6. A Fat-Tree topology built with -port switches supports up to physical servers (assuming one physical port per server) and paths between any source and destination pair. As a result, it is possible to scale to huge clusters by interconnecting many inexpensive switches.
To effectively use a Fat-Tree, complex routing configurations may need to be done on switches to avoid creation of loops while using available paths for load balancing. For example, Portland  is a custom routing and forwarding protocol which works out of Layer 2 and improves on fault tolerance (link failures), scalability, and ease of management (e.g. moving VMs across physical servers). Portland uses a Fabric Manager that holds information on address mappings and a fault matrix that maintains per link health status.
Leaf-Spine (or Spine-and-Leaf) [44, 13], shown in Figure 2(g), is a two tier network topology where leaf (i.e., ToR) switches are attached to servers and every spine switch is directly connected to all leaf switches similar to a bipartite graph. The links connected between the servers and leaf switches may have a different capacity from the ones connecting leaf switches to the spine. Leaf-Spine makes it easier to expand on capacity and ports (by adding more spine or leaf switches) and also allows straight-forward usage of Layer 3 routing with load balancing support without creation of loops. As a downside, in case high bisection bandwidth is intended, scaling to more than hundreds of servers in a cluster can lead to increased costs due to need for spine switches with many high capacity ports.
VL2 , shown in Figure 2(c), implements a complete routing and forwarding suite on top of 3-tier folded Clos networks (a multi-rooted tree) but differs from Fat-Tree in that switch-to-switch links are assigned much higher capacity than server-to-switch links. This requires less number of cables for connecting the aggregation and core layer. This topology has an intermediate tier that is connected to the aggregation trier in a bipartite graph topology. Each edge (ToR) switch is connected to two aggregation switches in a way that each aggregation switch gets connected to equal number of edge switches.
JellyFish , shown in Figure 2(e), is a topology where ToR switches are connected in a random setting and according to some rules: first, ports are randomly connected between switches until no more links can be added; next, for any switch S with two or more free ports, an existing link A-B is removed and two links are added between two of S’s free ports and two ends of the removed network link (i.e., A-S and B-S), until no switch has more than one free port. Since ToR switches are connected directly, the average path length (number of hops) is considerably smaller compared to 3-tier folded Clos. In addition, this topology is much easier to expand gradually. Authors show that with full bisection bandwidth, JellyFish supports more servers at the same cost compared to Fat-Tree. Also, with the same number of failed links, JellyFish offers a higher average throughput per server than Fat-Tree. One major problem with this topology is that the existing routing schemes cannot effectively use all paths since the number of hops across parallel paths changes by a large degree. Authors propose usage of -shortest path routing for flows along with multipath TCP (MPTCP) .
DCell , shown in Figure 2(d), is a hierarchical datacenter topology where a higher level DCell is built by putting together multiple lower level DCell structures. It can be incrementally expanded and does not have a single point of failure. DCell uses a custom routing algorithm that takes into account failures (DCell Fault-tolerant Routing) while aiming for near shortest path routing. Authors show that DCell offers higher network capacity compared to conventional tree topologies; however, it can perform much worse than multi-rooted trees such as Fat-Tree . Implementing DCell requires changes to the server networking protocol stack.
BCube , shown in Figure 2(b), is a leveled structure where a higher level is built by recursively attaching lower levels. Servers require multiple network ports to connect to switches and they also act as forwarding elements for other servers. BCube uses source routing and requires changes to the server networking protocol stack which can be done either in hardware (network card) or software. If forwarding is implemented in software, it can add CPU overhead especially at high rates. Authors show that BCube is more resilient to switch failures compared to Fat-Tree and almost as resilient to server failures.
Xpander , shown in Figure 2(f), is a datacenter topology based on expander graphs  which offers all the performance improvements of JellyFish over Fat-Tree topology with higher throughput as network is incrementally expanded. This topology is made by connecting multiple meta nodes based on the following rules: first, each meta node is made up of equal number of ToR switches; second, no two ToRs are connected within the same meta node; third, same number of links is used to connect every pair of meta nodes. Compared to JellyFish, Xpander offers the benefit of being structured rather than random which improves implementation predictability. Authors test Xpander topology with similar routing and forwarding techniques used in JellyFish: -shortest paths routing and MPTCP for multi-path load distribution.
Traffic characteristics of datacenters is highly dependant on applications and determines distribution of flow arrivals, flow sizes, and flow durations. For example, flows generated by web search queries are usually much smaller and shorter than that of batch computing jobs. This variety of applications could cause creation of long lived connections as well as short microbursts on the same network . There are limited publications on datacenter traffic characteristics. We review these works briefly focusing on applications and flow properties.
In , authors collect a traffic dataset from a cluster running query processing applications on large volumes of data that run MapReduce like workloads under the hood to respond to queries. Authors find that more than 50% of flows last less than 100 ms, 80% of flows last less than 10 seconds, and almost all flows last less than 100 seconds. They also find that more than 50% of traffic is carried by flows that last less than 25 seconds, less than 10% by flows that last less than 10 seconds, and almost 1% of traffic by flows that last less than 1 second. In terms of flow arrivals, authors find periodic short-term bursts of flows and periods of long silence.
In another work , authors collect and process datasets from 19 datacenters with a wide range of workloads including web services, instant messaging and MapReduce. This work is focused on packet level traffic analysis and does not offer statistics on a per-flow basis. Authors observed an ON/OFF pattern for packet arrivals at ToR switches they monitored, meaning there was varying periods with no packet arrivals between bursts of packets.
In , authors study 10 datacenters categorized as educational (storage, email, web, video), enterprise (custom applications in addition to storage, web and email) and cloud (instant messaging, web, search, indexing, video). They report the number of active flows less than 10000 per second per rack across all datacenters. More than 40% of flows were less than 1 KB, more than 80% of were less than 10 KB, and almost all flows were less than 10 MB. According to their results, durations are more dependant on the specific datacenter: the smallest median flow duration was about 500 µs in one datacenter while the largest median flow duration was 1 second in a different datacenter. The largest flow duration was between 50 seconds and 200 seconds across all datacenters. This work also confirms the ON/OFF behavior of packet arrivals.
In a recent paper , Facebook shares statistics on their traffic characteristics. They report flow size distributions on a per application basis for three major applications they run. Median and tail flow sizes for Hadoop, Web Server and Cache applications are reported to be between about 100 KB and 100 MB, 3 KB and 10 MB, 1 KB and 10 KB within racks while 1 KB and 1 MB, 5 KB and 500 KB, 30 KB and 3 MB between racks, respectively. Regarding flow durations, Hadoop flows had a median of about 300 ms and tail of less than 1000 seconds, Web Server flows had a median of about 900 ms and a tail of about 200 seconds, and Cache flows had a median of almost 400 seconds and a tail of almost 800 seconds, respectively. Per server, the median inter arrival time of various flow types was between 1000 µs and 10000 µs and the tail was between 10000 µs and 100000 µs. Finally, authors did not observe an ON/OFF packet arrival pattern at the switches which is suggested to be due to a large number of concurrent destinations, since ON/OFF pattern was observed on a per destination basis.
In addition to per-flow properties, since racks constitute a main unit in datacenter networks, one may be interested in how much traffic stays within racks. This could be helpful in deciding the over subscription ratio of cluster interconnection. The ratio of rack local to inter-rack traffic is dependent on applications that run within the datacenter and how instances of such applications are deployed on the servers. As a result, some prior work report a highly rack local traffic pattern  while some find traffic neither rack local nor all to all ,i.e., for some applications (e.g. Hadoop) traffic is mainly rack local while for others (e.g. Web Server and Cache) traffic is not at all rack local.
In summary, traffic characteristics, such as packet sizes, flow size distributions and flow inter-arrival times are highly correlated with applications. In addition, locality of traffic to racks is highly dependant on the way applications are deployed across the network and how such applications communicate with one another. For example, in , authors report that servers of different types are deployed in separate racks and since most communication occurs between Web Servers and Cache Followers (different server types), there is less rack locality for these clusters. Given such strong dependency on applications, it is relatively hard to draw a general conclusion about datacenter traffic. Some common findings include the following. There is usually several orders of magnitude difference between median and maximum flow sizes (the actual difference varies according to applications). In addition, there can be a large number of flow arrivals per server (as many as thousands per second) with many concurrent flows. Finally, distributions of flow sizes and durations may be considerably different due to possibly uneven distribution of traffic across the network (flows may have to compete for bandwidth) and application of techniques like connection pooling which leads to long-lived connections that are not always transmitting.
We present some datacenter traffic control challenges frequently pointed to in the literature. Table 1 provides a summary of these challenges.
Unpredictable Traffic Matrix §3.3.1
A traffic matrix represents the communication volume between pairs of end-points in a computer network. In datacenters, traffic matrix is varying and unpredictable.
Complicates traffic engineering and capacity planning.
Mix of Flow Types and Sizes §3.3.2
Due to variety of applications that share the datacenter infrastructure, a mix of various flow types and sizes are generated. Flows may have deadline constraints or not and may be short latency-sensitive or large throughput-oriented. Also, for specific applications, flow sizes may be unknown.
Complicates flow scheduling to meet requirements of different flows over a shared medium.
Traffic Burstiness §3.3.3
Burstiness has two different aspects. Traffic per flow could be highly bursty and flow arrival itself could be bursty as well. Burstiness intensity may change according to where traffic is measured, i.e., at end-point interfaces, at ToR switch ports, and so on.
Large buffer space at the switches to absorb bursts, careful and responsive traffic control to minimize average buffer space usage and react to bursts quickly.
Packet Reordering §3.3.4
Can be caused while applying some traffic control schemes to improve load balancing or increase throughput via using parallel paths at the same time. At high rates, reordering can exhaust end-point CPU and memory resources for buffering and putting segments back in order.
Efficient methods to put packets in order at receivers with minimal memory and CPU overhead and careful transmission methods at senders to minimize reordering when packets arrive at the receiver.
Performance Isolation §3.3.5
In cloud environments with many tenants where network resources are shared across tenants, mechanisms should be put in place to make sure tenants’ use of resources cannot impact other tenants.
Allocation of bandwidth on a per tenant basis rather than per flow taking into account possibility of selfish or malicious behavior from tenants.
The Incast Problem §3.3.6
A variety of applications, such as search, use the partition-aggregate communication pattern which can lead to a large number of incoming flows to end-points. If not properly handled, this can lead to congestion, dropped packets and increased latency.
Larger available switch buffer space, responsive and careful traffic control to keep switch buffer occupancy low and avoid dropped packets.
The Outcast Problem §3.3.7
Is observed in switches that use DropTail queues and when a disproportionate number of flows arrive at different incoming ports and exit the same switch output port. This problem is caused by synchronous dropped packets under high utilization.
Responsive and careful traffic control to keep switch buffer occupancy low and avoid dropped packets.
A variety of applications usually run on a datacenter creating many flows with different properties. Most datacenter flows are short 1 (a few packets at most) [29, 53, 49, 48, 27, 50, 5, 7] and many short flows may be only one packet long . However, most bytes are delivered by large flows [29, 5, 7].  reports 80% of flows to be less than 10 KB and 99% of flows to be less than 100 MB.
Datacenter applications also determine the flow arrival rates and distributions. The median flow inter-arrival time for a single machine was reported between 2 ms to 10 ms (100 to 500 flows per second) for different servers in Facebook datacenters  (this median is calculated by measuring number of arriving flows per machine per second as samples averaged over number of machines).  reports between 100 to 10000 flow arrivals per switch in a one second bin in different educational and private datacenters. Finally,  finds the median arrival rate of flows in a cluster to be 100 flows per millisecond.
High flow arrival rate majority of which being short can create an unpredictable and fluctuating traffic matrix which makes it hard to perform traffic engineering or capacity planning in longer time scales to improve performance [27, 5, 54, 55].
Traffic in datacenters is a mix of various flow types and sizes [5, 51, 29, 56, 49, 48]. Knowledge of various flow requirements and characteristics can help us design and tune transport protocols to more efficiently use network resources. Size and priority of a flow are usually determined by the application that initiates it. For some applications, flow sizes maybe unknown upon initiation.
Interactive flows which are created as a result of user interactions (for example generated by soft real time applications such as web search) can generate latency-sensitive flows that are usually short and should be delivered with high priority. Examples include queries (2 to 20 KB) and short messages (100 KB to 1 MB) . Size of these flows is usually known apriori . Responsiveness of online services depends on how interactive flows are handled which can impact the number of users for an online service in the long run. In a study by Google, 400 ms increase in delay reduced the number of searches by 0.6% per user per day . Also, 100 ms added latency could reduce Amazon sales by 1% .
Throughput-oriented flows are not as sensitive to delay, but need consistent bandwidth . They may range from moderate transfers (1 MB to 100 MB) such as ones created by data parallel computation jobs (e.g. MapReduce), to background long-running flows that deliver large volumes of data such as transfers that move data across datacenter sites for data warehousing and geo-replication . For these flows, it is still preferred to minimize the transfer time.
Deadline flows have to be completed prior to some deadlines. Their size is either known  or a good estimate can typically be drawn . Both latency sensitive and throughput oriented flows might have deadlines. Deadlines can be either soft or hard which implies how value of its completion drops as time passes . A soft deadline implies that it is still profitable to complete the flow and that the value decreases according to a utility function as we move past and away from the deadline. A hard deadline means zero value once the deadline has passed. For example, in interactive scatter-gather applications, if a query is not replied by its deadline (usually less than 300 ms [29, 61]), the final answer has to be computed without it  (i.e., zero value for that flow), while if a backup process is not completed in time, it is still valuable to finish it, although it might increase the risk of user data loss due to failures.
Several studies find datacenter traffic bursty [63, 49, 50, 29, 9, 48]. Theoretically, bursty traffic has been shown to increase packet loss and queuing delay while deceasing throughput . In bursty environments, packet losses have been found more frequent at the network edge due to higher burstiness . Burstiness can also lead to higher average queue occupancy in the network leading to increased flow completion times (FCT) for many flows  and increased packet drops due to temporary creation of full buffers in the network [65, 29]. In addition, highly bursty traffic can cause buffer space unfairness in shared memory switches if a switch port exhausts shared memory due to receiving long bursts .
Several causes can lead to creation of bursty traffic. Hardware offloading features, such as Large Send Offload (LSO), that reduce CPU load, can lead to higher burstiness. Interrupt Moderation (Coalescing) , which reduces CPU load and increases throughput by processing packets in batches, can lead to bursty traffic at the sender. Transport control features in software can create bursty traffic by scheduling a large window of packets to be sent together such as TCP slow start. Applications can increase burstiness by sending large pieces of data at once .
Out of order arrival of packets can increase memory and CPU utilization and latency at the receiver especially at high rates [67, 68]. Some transport protocol features, such as fast retransmit , may mistake reordering with packet loss. Different hardware and software features are involved in putting packets back in order including Large Receive Offloading (LRO) , Receive Side Scaling (RSS)  and Generic Receive Offloading (GRO) . LRO and GRO are usually implemented as part of driver software. Some NICs provide hardware support for these features. LRO focuses mostly on TCP/IPv4 stack while GRO is general purpose.
To understand the extend to which reordering can affect performance, we point to a prior work on improving the performance of handling out of order packet arrivals . Authors performed experiments with Vanilla Linux kernel and realized that at high rates (e.g. gigabits), significant reordering can increase CPU utilization to 100% and limit server interface link utilization to 70%. Even after applying optimizations at the driver and network stack level, CPU load increased by about 10% with server interface link utilization at 100%.
Performance isolation is necessary in cloud environments where multiple tenants use shared network resources [73, 74, 75]. Isolation prevents selfish or malicious behavior that aims to either unfairly obtain more resources, such as by creating many flows or using custom aggressive transport protocols [76, 77], or to cause disruptions.
Enforcing performance isolation over a shared infrastructure is hard. To effectively isolate the effect of tenants and users, mechanisms need to be put in place in various layers and parts of the network. For example, a queue management scheme will need to divide buffer space according to users and tenants, bandwidth needs to be fairly divided, computational and memory overheads due to network communication needs to be controlled on a per user or per tenant basis, and all of this need to be done according to service level agreements between the operator, users and tenants.
When many end-hosts send traffic to one destination concurrently, the link attached to the destination turns into a bottleneck resulting in queue buildups, large queuing delays, and dropped packets [65, 29, 78, 79, 80, 81]. This problem becomes more serious in high bandwidth low latency networks  and with shallow buffer switches .
For example, the incast problem could appear in clusters running applications such as search and batch computing jobs like MapReduce that follow the partition-aggregate processing model. In search applications, a server may query a large number of other servers the results to which may be returned to that server at the same time creating sudden increase in incoming traffic. In a MapReduce job, a Reduce instance may download outputs of many Map instances for reduction which can cause a similar situation. Both scenarios follow the many-to-one communication pattern.
This problem occurs due to synchronized drops of packets from an input port of a switch which is referred to as port blackout . This eventually leads to unfairness. For such synchronous drops to happen, two predicates have to be present. First, there needs to be contention between a large group of flows coming from one input port and a small group of flows coming from a different input port for access to the same output port. Second, the output port uses queues that follow TailDrop policy (if queue is full, the arriving packet is discarded). Port blackout occurs for the small group of flows and is observed temporarily over short periods of time. When the output queue is full, any arriving packet (during this window) is dropped which leads to consecutive drops for incoming flows. Such consecutive drops affect the smaller set of flows more than they affect the larger set (a smaller number of flows in the larger set are affected). The intensity of this problem increases as the ratio of flows in the larger over smaller group increases. This problem is called the “outcast” problem because some flows are cast aside (they cannot obtain bandwidth).
This could simply occur in tree-based topologies such as Fat-Tree and in partition-aggregate communication scenarios where many flows from different servers could be returning results to one server (many-to-one communication pattern). A disproportionate number of flows from incoming servers may end up on different input ports of the ToR switch attached to the receiver which could lead to flows on some port receiving less average throughput.
Datacenter environments involve operators, tenants, and end-users with different objectives. Operators would like to use available resources as much as possible to provide higher volume of services, accommodate more tenants and eventually increase revenues. In addition, datacenter infrastructure may be shared by several tenants running various types of applications to offer different services. Tenants would like to receive a fair share of resources according to their service level agreement (SLA) which determines the cost of services. Finally, many end-users may rely upon services offered by datacenter tenants. They would like such services to be responsive and highly available. These possibly conflicting objectives are presented in the following. Table 2 provides a summary of traffic control objectives in datacenters.
Minimizing Flow Completion Times (FCT) §3.4.1
Faster completion times reduces the communication delay for distributed applications and improves their end-to-end responsiveness.
Minimizing Deadline Miss Rate or Lateness §3.4.2
For time constrained applications, it is important to meet specific deadline requirements. For some applications, only transactions that complete prior to their deadlines are valuable in which case deadline miss rate is the right performance metric. For some applications transactions completed past the deadlines are still valuable or even necessary in which case lateness (i.e., by how much we miss deadlines), is the right metric.
Maximizing Utilization §3.4.3
To maximize performance, it is desired to use available resources as much as possible.
Resources should be fairly divided across tenants and users while paying attention to their class of service and service level agreements (SLAs).
Flow Completion Time (FCT) is the time from the initiation of a flow to its completion. Depending on applications, FCT can directly or indirectly affect the quality of experience and service offered to end users [85, 29, 62]. Major factors that determine FCT in datacenters include queuing and packet loss [7, 29]. Occasionally, retransmission of lost segments can significantly increase latency due to the time it takes to identify and retransmit the lost data. For some applications, it may be more helpful to focus on minimizing mean or median latency (e.g. static web applications) [85, 29, 86, 87, 80], while for other applications tail latency may be more important (e.g. partition-aggregate) [30, 62].
Many applications require timely delivery of data that can be viewed as flows with deadlines. In such applications, as long as pieces of data are delivered prior to the deadlines, customer SLAs are satisfied. The quality of services decrease as the fraction of missed deadlines increases. For some applications, delivery after the deadlines is still valuable and necessary. As a result, sometimes minimizing the amount by which we miss deadlines is more important which is referred to as “lateness” (e.g. synchronization of search index files).
Increasing resource utilization can reduce provisioning costs and increase revenues. By better utilizing network bandwidth, operators can accommodate more tenants or improve the quality of service for existing ones. Effectively utilizing the network depends partly on network topology and design parameters, and partly on network traffic control scheme.
Many flows share datacenter resources such as link bandwidth and buffer space. In multi-tenant datacenters, it is necessary to make sure tenants receive fair share of network resources according to their service level agreement (SLA). Enforcing fairness also mitigates the starvation problem and prevents malicious behavior.
There are several definitions of fairness in networking context including max-min fairness, proportional fairness, and balanced fairness [88, 89]. Fairness criteria determine how link bandwidth or buffer space is divided among flows. Max-Min Fairness (MMF) , which aims at maximizing the minimum share, is the most widely used. In general, fairness can be considered over multiple dimensions each representing a different kind of resource (e.g., bandwidth, CPU cycles, and memory) [91, 92]. We however focus on network bandwidth fairness in this paper.
Fairness should be considered among proper entities as defined by the fairness policy. For example, fairness may be across groups of flows as opposed to individual flows to prevent tenants from obtaining more resources by creating many flows. Strict fairness across all flows can also lead to increased number of missed deadlines  and sub-optimal FCTs . One approach is to first apply fairness across tenants according to their classes of service and then across flows of each tenant considering flow priorities.
In addition to the traffic control objectives we mentioned, there are other objectives followed by many datacenter operators. An important objective is reducing energy costs by increasing energy efficiency. Since datacenter networks usually connect a huge number of servers, they are made of a large number of network equipment including fiber optics cables and switches. Due to varying amount of computational and storage load, average network load may be considerably less than its peak. As a result, operators may reduce energy costs by turning off a fraction of networking equipment at off-peak times [93, 94, 95] or by dynamically reducing the link bandwidths across certain links according to link utilizations . There are however several challenges doing so. First, the resulting system should be fast enough to increase network capacity as computational or storage load increases to prevent additional communication latency. Second, it may be unclear where to reduce network capacity either by turning equipment off or by reducing link bandwidths (correlation between network load and placement of computation/storage can be considered for additional improvement ). Third, the effectiveness of such systems depends on load/utilization prediction accuracy. Further discussion on reducing datacenter power consumption is beyond the scope of this paper.
To enforce traffic control, some level of coordination is needed across the network elements. In general, traffic control can range from fully distributed to completely centralized. Here we review the three main approaches used in the literature namely distributed, centralized or hybrid. Table 3 provides an overview of traffic control management schemes.
Higher scalability and reliability. Solutions can be completely end-host based or use explicit network feedback. More information in §4.1.1.
Access limited to local view of network status and flow properties. Limited coordination complicates enforcement of network-wide policies. More information in §4.1.2.
Access to global view of network status and flow properties. Central control and management can improve flexibility and ease of enforcing network-wide policies. More information in §4.2.1.
A central controller can become a single point of failure or a network hotspot. Latency overhead of communicating with a central entity and control plane inconsistencies are other concerns. More information in §4.2.2.
Potentially higher reliability, scalability, and performance. More information in §4.3.1.
Higher complexity. Also, the final solution may not be as good as a fully centralized system. More information in §4.3.2.
Most congestion management schemes coordinate in a distributed way as it is more reliable and scalable. A distributed scheme may be implemented as part of the end-hosts, switches, or both. Some recent distributed traffic control schemes include those presented in [86, 78, 98, 99, 8].
Designs that can be fully realized using end-hosts are usually preferred over ones that need changes in the default network functions or demand additional features at the switches such as custom priority queues , in-network rate negotiation and allocation , complex calculations in switches , or per flow state information . End-host implementations are usually more scalable since every server handles its own traffic. Therefore, popular transport protocols rely on this type of implementation such as [102, 29].
As an example, the incast problem §3.3.6, which is a common traffic control challenge, can be effectively addressed using end-host based approaches considering that incast congestion occurs at receiving ends. Some approaches are Server-based Flow Scheduling (SFS) , pHost , NDP  and ExpressPass . SFS uses the generation of ACKs to control the flow of data towards receivers and avoid congestion. The sender looks at the flows it is sending and schedules the higher priority flows first, while the receiver controls the reception by deciding on when to generate ACKs. pHost uses a pull-based approach in which the sender decides on reception schedule based on some policy (preemptive or non-preemptive, fair sharing, etc). A source dispatches a Request To Send (RTS) to a receiver. The receiver then knows all the hosts that want to transmit to it and can issue tokens to allow them to send. NDP limits the aggregate transmission rate of all incast senders by maintaining a PULL queue at the receiver that is loaded with additional PULL requests when new packets arrive from a sender (a PULL request contains a counter which determines number of packets its associated sender is allowed to send). PULL requests are then sent to senders in a paced manner to make sure the overall incoming transmission rate at the receiver is not larger than per interface line rate. ExpressPass manages congestion across the network by controlling the flow of credit packets at the switches and end-hosts according to network bottlenecks (a sender is allowed to send a new data packet when it receives a credit packet).
Shifting more control to the network may allow for better resource management due to ability to control flows from more than a single server and availability of information about flows from multiple end-hosts. For example, flows from many hosts may pass through a ToR switch giving it further knowledge to perform scheduling and allocation optimizations.
Some examples of this approach include RCP , PDQ , CONGA , Expeditus , and RackCC . RCP and PDQ perform in-network rate allocation and assignment by allowing switches and end-hosts to communicate using custom headers, CONGA gets help from switches to perform flowlet based load balancing in leaf-spine topologies, Expeditus performs flow based load balancing by implementing custom Layer 2 headers and localized monitoring of congestion at the switches, and RackCC uses ToR switches as means to share congestion information among many flows between the same source and destination racks to help them converge faster to proper transmission rates. To implement advanced in-network features, changes to the network elements might be necessary and switches may need to do additional computations or support new features.
Higher scalability and reliability. Can be completely implemented using end-hosts. To further improve performance, network (i.e., switches, routers, etc.) can be involved as well. Completely end-host based approaches can operate simply by controlling the transmission rate according to implicit feedbacks from network (e.g. loss, latency). Network can offer explicit feedbacks (e.g. network queues’ occupancy) to improve transmission rate management by allowing senders to make more accurate control decisions. For example, Explicit Congestion Notification (ECN) allows network to communicate high queue occupancy to end-hosts [108, 109] or trimming packet payloads in case of highly occupied network queues (instead of fully discarding them) can help receivers get a complete picture of transmitted packets [110, 103].
Usually just access to local view of network status and flow properties which allows for only locally optimal solutions. For example, while every end-point may strive to achieve maximum throughput for its flows by default (locally optimal), it may lead to a higher network wide utility if some end-points reduce their transmission rate in favor of other end-points with more critical traffic (globally optimal). Using distributed management, it may be harder to enforce new network wide policies (e.g. rate limits in a multi-tenant cloud environment) due to lack of coordination and limited view of network condition and properties.
In centralized schemes a central unit coordinates transmissions in the network to avoid congestion. The central unit has access to a global view of network topology and resources, state information of switches, and end-host demands. These include flow sizes, deadlines and priorities as well as queuing status of switches and link capacities. Scheduler can proactively allocate resources temporally and spatially (several slots of time and different links) and plan transmissions in a way that optimizes the performance and minimizes contentions. To further increase performance, this entity can translate the scheduling problem into an optimization problem with resource constraints the solution to which can be approximated using fast heuristics. For large networks, scheduler effectiveness depends on its computational capacity and communication latency to end-hosts.
TDMA , FastPass  and FlowTune  are examples of a centrally coordinated network. TDMA divides timeline into rounds during which it collects end-host demands. Each round is divided into fixed sized slots during which hosts can communicate in a contention-less manner. All demands are processed at the end of a round and schedules are generated and given to end-hosts. FastPass achieves high utilization and low queuing by carefully scheduling traffic packet by packet considering variation in packet sizes. FlowTune improves on scalability of FastPass using centralized rate-assignment and end-host rate-control.
There are several challenges using a fully centralized approach. The central controller could be a single point of failure since all network transmissions depend on it. In case of a failure, end-hosts may have to fall back to a basic distributed scheme temporarily . There will be scheduling overhead upon flow initiation, that is the time it takes for the scheduler to receive, process the request, and allocate a transmission slot. Since majority of flows are short in datacenters, the scheduling overhead has to be tiny. In addition, this approach may only scale to moderate datacenters due to processing burden of requests and creation of a congestive hot-spot around the controller due to large number of flow arrivals. Bursts in flow arrivals  can congest the controller temporarily and create scheduling delays. It may be possible to apply general techniques of improving scalability for central management of larger networks such as using a hierarchical design.
Can provide higher performance with a global view of network status and flow properties. Such information may include utilization of different network edges, their health status as well as flows’ size, deadline and priority. With this information, one can potentially direct traffic carefully according to network capacity across a variety of paths while allowing flows to transmit according to their priorities to maximize utility. Central management can improve flexibility and ease of managing network policies. For example, new routing/scheduling techniques can be rolled out much faster by only upgrading the central fabric. Centralized schemes also increase ease of admission control in case strict resource management is necessary for guaranteed SLAs.
A central controller or management fabric can be a single point of failure or it may become a network hotspot in case of large networks. There is also latency and computational overhead of collecting network status and flow properties from many end-hosts (controller will have to process and understand incoming messages at high speed and act accordingly). Overhead of network resource allocation and scheduling (if central rate allocation is used). Finally, consistency of network updates can be an issue in large networks. For example, some updates my not be applied correctly at network elements (e.g. software bugs ) or different updates may be applied with varying latency that can lead to transient congestion or packet losses which may hurt performance of sensitive services.
Using a hybrid system could provide the reliability and scalability of distributed control and performance gains obtained from global network management. A general hybrid approach is to have distributed control that is assisted by centrally calculated parameters.
Examples of this approach include OTCP , Fibbing , Hedera  and Mahout . OTCP uses a central controller to monitor and collect measurements on link latencies and their queuing extent using methods provided by Software Defined Networking (SDN)  which we will discuss further in §7.1. For every new flow, the controller calculates parameters such as initial and minimum retransmission timeout and initial and maximum congestion window size considering which path is taken by flows which allows for fast convergence to steady state. Fibbing relies on a central controller to manipulate the costs associated with routes in the network or insert fake nodes into the routing tables of routers to force them to use or avoid some paths to control the distribution of load across network. Hedera and Mahout operate by initially allowing network to route all incoming flows using a distributed approach, then monitoring and detecting large flows that can be moved dynamically to different routes using a central controller with a global view of network status which allows for a more balanced distribution of load. While Hedera uses readily available switch capabilities (flow counters) to detect large flows, Mahout engineers and inserts a shim layer at the end-hosts to monitor and detect large flows to improve scalability.
Reliability, scalability, and higher performance. A central controller can offer coarse grained solutions while fine-grained control is applied using a distributed scheme. Division of work between central and distributed components can be done in a way that minimizes the effects of centralized and distributed schemes’ drawbacks. For example, in multi-tenant cloud datacenters, a hierarchical approach can be used where aggregate bandwidth given per tenant is calculated centrally while transmission control across flows per tenant is performed in a distributed fashion reducing central management overhead and increasing per tenant management scalability.
Complexity may increase. Central controller now has limited control; therefore, the final solution may not be as good as a fully centralized system. Due to presence of centralized control, there still exists a single point of failure however with less impact in case of a failure compared to a fully centralized scheme. Also, the distributed component still operates on a locally optimal basis. For example, in multi-tenant cloud datacenters, if bandwidth per tenant is managed in a distributed fashion, due to limited local information per network element, it may be challenging to apply routing/scheduling policies that maximize utility according to flow properties.
Many design parameters and approaches can be considered in development of any traffic control scheme. Figure 3 provides a high level breakdown of major datacenter traffic control techniques. In the following, we provide a list of these techniques and discuss some research efforts made regarding each. Figure 4 provides an overview of the relationships between challenges, objectives and techniques. Details of each relationship is further explained in Tables 5 and 4. More detailed information is provided in the following sections.
Transmission control is the mechanism that controls the flow of data sent to the
network. There is typically a window of outstanding bytes for each flow at the
sender determining the volume of data that can be transmitted before data
reception is acknowledged. This allows for
Some recent window-based approaches include DCTCP , D2TCP , L2DCT , MCP , DAQ , and PASE . DCTCP uses explicit congestion signals from switches that is piggybacked on ACKs to change the window size according to the extent of congestion. Packets are marked to convey congestion signal according to instantaneous queue length upon their arrival at the queue. D2TCP modulates window size based on flow deadlines: flows with closer deadlines reduce window size less than others in response to network congestion signals. L2DCT modulates window size of a flow based on its estimated size which is dynamically calculated by counting the number of packets it has seen from a flow up to current time. MCP changes window size to approximate the solution to an stochastic optimization problem that minimizes long term mean packet delay using flow information and network state, such as queue length at the switches. DAQ allows ToR switches to calculate initial sender window for end-hosts to converge to proper transmission rates faster. PASE also gets help from network elements to decide on its initial coarse-grained window size, and then performs fine-grained adjustments to the window size using congestion signals received similar to DCTCP.
Although fairly simple, window-based approaches only allow for coarse-grained control of rate which can result in considerable variations and create highly bursty traffic. For example, a sender can release a full window of packets before it has to wait for the receiver to acknowledge. This may be followed by a batch of acknowledgements that move the window forward and allow for another full window of packets to be sent.
To decrease the transmission rate variations at the senders,
Examples of earlier rate-based protocols include TCP Friendly Rate Control (TFRC)  and Rate Control Protocol . Several recent work also use rate-based methods in datacenters including PDQ , D3 , TIMELY , and RACKS . TFRC calculates the allowed sending rate to compete fairly with TCP using an equation which captures various network parameters such as round trip time (RTT), loss rate, retransmission timeout, and segment size. RCP uses an equation based rate control to minimize average FCT with the help of network switches that divide bandwidth according to processor sharing policy. D3 uses flow deadlines as a factor in the rate calculation equation to reduce deadline miss rate (a closer deadline allows for a higher rate). RACS uses a similar approach to RCP but assigns weights to flows according to their priority, to emulate different scheduling algorithms at the switches according to weight assignment. PDQ improves on both RCP and D3 by rate allocation using switches and adding support for preemption. TIMELY calculates the sending rate as a function of network latency.
Explicit rate control can be applied both in hardware (e.g. NIC) and software (e.g. OS Kernel). While the former provides more precise rates, the latter is more flexible and allows for a larger number of flows to be rate controlled as well as more complex policies to be considered. Rate-control in software can result in creation of occasional bursts due to challenges of precise packet scheduling in OS kernel .
Complementary to these two approaches for rate control at the sender, one can
Transmission control at senders determines number of packets given to network driver for transmission. If a bunch of packets are sent to network interface for transmission, bursts may be created.
If transmission is performed over multiple interfaces for multipath delivery, careful transmission control is required to minimize packet reordering. This shall be done according to latencies of different paths to the receiver.
Rate limiting can reduce burstiness by adding a protective layer after transmission control in case a large number of packets are given to the network driver for transmission.
Rate limiting can improve performance isolation across tenants and users by preventing any tenant/flow from taking too much bandwidth starving other tenants/flows (a tenant may initiate many flows).
Packet pacing reduces burstiness by adding time spacing between consecutive packets prior to transmission.
Prioritization helps allocate resources to flows accordingly and can improve the performance while scheduling a mix of flow types/sizes. For example, one could prioritize highly critical deadline flows over other flows.
Prioritization can shield a tenant/application with high service guarantees from other tenants/applications with lower service guarantees. Priorities shall be determined according to tenant/application service level agreements (SLAs) or Quality of Service requirements (QoS).
Datacenter topologies usually come with a large degree of path redundancy to provide low over-subscription ratio for better inter-rack communication. With load balancing, this capacity can be used to a higher extent giving the operators/tenants a better chance to handle any traffic matrix.
Multipathing can increase packet reordering. To reduce reordering while using multiple paths, one can perform careful packet scheduling according to path latencies at transmission control level so that packets arrive in the right order at the receiver (it is nearly impossible to eliminate reordering). Assignment of data segments to sub-flows is also an important factor in how well receivers can handle reordering.
Transmission control, rate limiting, and packet pacing all depend on careful scheduling of packets at senders which can mitigate burstiness.
When a mix of flow types is present, scheduling of packets according to flow priorities, deadlines, sizes and arrival order can help us better meet traffic control objectives.
If multiple paths are used for a flow, scheduling can reduce reordering by determining when packets should be sent at the sender over each path to arrive at the receiver with minimal reordering.
Scheduling can mitigate the incast problem by preventing all the incoming data (from many flows) from arriving at a receiver at the same time. Various techniques can be used such as adding random delays while initiating requests (jittering) and receiver based scheduling using either ACKs or receiver window size.
Effective queuing disciplines which determine how packets in a switch queue are scheduled for transmission, such as Stochastic Fair Queuing (SFQ), can mitigate port blackout §3.3.7.
Transmission control can impact overall end-to-end latency by affecting queue occupancies at the network switches.
Transmission control determines how many packets are sent by each flow which can be done according to flow deadlines to reduce deadline miss rate and/or lateness.
Proper transmission control can maximize network bandwidth utilization while avoiding congestion (which usually leads to dropped packets and wasted bandwidth).
Transmission control plays significant role in how fairly flows share the network. For example, if one of two equally important flows is given higher transmission quota over longer periods of time, this can lead to unfairness.
Rate limiting can prevent congestion and reduce dropped packets. As a result, it helps maximize utilization and minimize wasted bandwidth.
Rate limiting can improve fairness by preventing selfish behavior of bandwidth hungry flows/tenants.
Packet pacing can reduce average queue occupancy in the network (switches/routers) and therefore reduces end-to-end latency.
Packet pacing can increase effective utilization by preventing bursty behavior of flows which can lead to higher buffer occupancy and dropped packets. It stabilizes transmission rates and reduces transmission spikes.
Prioritization can reduce average latency and flow completion times by giving higher priority to shorter flows.
Prioritization according to flow deadlines can improve the overall deadline miss rate. For example, search queries with critical deadlines (e.g. 300 ms after arrival) can be given high priority and can be addressed before long running backup operations in a shared network environment.
Load balancing allows us to make best use of large path diversity in datacenters and maximize utilization over all available paths.
Multipathing can increase utilization by allowing long running flows to use bandwidth of several available paths.
Scheduling can reduce latency by applying scheduling disciplines that mimic Shortest Remaining Processing Time (SRPT).
Deadline miss rate or lateness can be reduced by scheduling flows according to their deadlines such as by allotting more capacity for flows with closer deadlines.
Scheduling can improve utilization by reducing contention within the network for using available bandwidth. For example, by carefully deciding on when packets should be sent by end-hosts, we can avoid sudden arrival of many packets at the switches which can lead to dropped packets.
Scheduling can improve fairness by giving bandwidth to flows based on a fair sharing policy such as Max-Min Fairness (MMF) .
| || || ||
Maximum outstanding window size.
Simplicity, no careful packet scheduling overhead.
Coarse-grained control over transmission rate which can lead to occasional bursts of packets (e.g. when a bunch of ACKs arrive and suddenly the transmission window moves forward).
Desired transmission rate (in bits per second).
More accurate control over rate, reduced variations in rate, a more natural choice for better management of network bandwidth.
Overhead of packet scheduling either in software (less accurate, less costly) or in hardware (more accurate, requires hardware support).
Transmission quota towards a specific receiver.
Receiver enforced quota prevents senders from congesting receivers. Can be applied complementary to window/rate-based control.
A sender needs to first coordinate with the receiver before it embarks on transmission which can add some latency overhead.
We can improve network performance by making sure that it conforms to required profile and policy rules. This can reduce contention while using network resources. For example, traffic shaping can prevent some flows from hogging others. Shaping can be used to provide some level of resource isolation and guarantees in cloud environments with many users. Finally, traffic shaping can be useful in resource scheduling where senders follow rates specified in the schedule. A summary of traffic shaping techniques has been provided in Table 7.
| || ||
Rate Limiting §5.2.1
Limits the transmission rate of outgoing packets from the rate limiter. A token bucket is usually used to limit maximum persistent packet transmission rate to token arrival rate.
Rate limiting has limited accuracy depending on how it is enforced, in software (usually less accurate but cheaper) or hardware (usually more accurate but needs hardware support). Also, various implementation techniques in software lead to different accuracy such as in the operating system kernel or using kernel bypass modules (see §7.4).
Packet Pacing §5.2.2
Inserts time spacing between consecutive packs to spread them uniformly across a window of round trip time (RTT). This reduces traffic burstiness and average network buffer occupancy, therefore improving end-to-end latency.
Packet pacing has limited accuracy and similar to rate-limiting, its accuracy depends on the implementation technique used.
Rate limiting is usually applied by passing traffic through a Token Bucket filter. This ensures that average transmission rate does not exceed the token generation rate. Tokens are generated at a specific rate and an arriving packet can only be sent if there is a token available. Tokens are accumulated if there is no packet to be sent, but there is usually a cap on how many. This cap is to limit traffic burstiness in case the token bucket is idle for a while. Examples of works employing rate-limiting to provide resource isolation and guarantees include [135, 113, 73, 132, 140, 51, 141].
Rate-limiting can be done in OS Kernel, as part of the hypervisor in a virtualized environment, or via NIC. Scheduling of packets in software (Kernel or Hypervisor) is generally less precise and can be computationally intensive in high bandwidths and with a large number of flows. To reduce CPU utilization, OSes usually send packets to NIC in batches which can further reduce the scheduling precision. For example, Classful Queuing Disciplines (Qdisc) offered by Linux allows for coarse grained rate-control by determining the time and count of packets that NIC receives from RAM, however, the actual schedule of packets on the wire is determined by NIC.
To improve software rate-limiting performance one can use userspace packet processing tools some of which we point to in §7.4. For example, Carousel  is a rate-limiter that is implemented as part of a software NIC in userspace. Carousel uses a variety of techniques to reduce CPU and memory usage and improve scalability including deferred completion signalling (rate-limiter only signals completion to applications when data is actually transmitted to offer backpressure and minimize buffer space usage) and single queue shaping (using a timing wheel and by assigning timestamps to packets over the time horizon).
Using NIC to schedule and send packets given rate limits reduces CPU load and traffic burstiness. Effective rate-limiting in hardware demands support for multiple queues and classes of traffic with hardware rate-limiters attached to them.
Hybrid rate-limiting approaches can be used to support a large number of priority classes while reducing hardware complexity and keeping scheduling precision high. NicPic  classifies and stores packets in queues located in RAM and labels them with proper rate limits by host CPU. Packets are then fetched by NIC via DMA and scheduled using hardware rate-limiters. NIC first decides on which flow’s packets to send and then pulls them from RAM.
As the last option, rate-limits can also be applied at the application layer. Applications can do this by limiting the volume of data handed off to transport layer over periods of time. This approach is simple but requires changes to applications. Also, rate-limiting precision will be limited. Finally, this may lead to bursty traffic as incoming application data to transport layer may get buffered prior to transmission on the wire, i.e., applications have no control over how data is eventually sent.
Packet Pacing is the process of adding space between consecutive packets so that they do not arrive at the network back to back which reduces traffic burstiness. Burstiness can degrade network performance in several ways. Long bursts can overload switch buffer and create consecutive packet drops. Average latency of all packets then increases since they have to wait in longer queues. In addition, it creates transmission rate oscillations making it hard to do careful bandwidth allocation [29, 9].
Earlier work  experimenting with pacing in a general network setting has shown that it can considerably reduce queuing delay. Combined with other types of congestion signals, pacing can improve the performance by evenly distributing traffic across the timeline . In addition, pacing should only be applied to long-running and throughput-oriented flows to reduce their impact on short latency-sensitive flows . The benefit of pacing depends on the network bandwidth-delay product, buffer size, and the number of flows. Having so many flows that share a buffer can reduce the effectiveness of pacing due to inter-flow burstiness and creation of synchronized drops .
Pacing can be done in both hardware and software, but hardware pacing can be more effective due to higher scheduling precision, especially at high rates where spacing between packets is tiny . Software pacing may be performed by end-hosts as part of a driver or a kernel module. In cloud environments, due to widespread use of virtualization, packets may be paced at the virtual interfaces in the hypervisor software. Pacing in software may be overridden by NIC offloading features such as LSO, if enabled, since NIC is the actual entity that sends packets out on the wire. In hardware pacing, packets are buffered at the NIC, each assigned a timer, and scheduled to be sent when their timer goes off.
Pacing can also be done at the network edges (e.g. ToR switches) as opposed to end-hosts. For example, Queue Length Based Pacing (QLBP)  uses a pacing controller attached to edge queues to determine when the next packet in the queue is supposed to be sent as a function of queue length.
| || ||
A flow’s priority is fixed once assigned. This approach can be applied when flow properties (e.g. size) are known apriori.
A flow’s priority may change over time according to its behavior, i.e., number of packets sent over time. This approach can be used if flow properties are unknown apriori.
by flow size
Mimics the Shortest Remaining Processing Time (SRPT) scheduling discipline which aims at minimizing mean flow completion times.
by flow deadline
To minimize deadline miss rate or lateness by first satisfying flows with closer deadlines.
by class of service
If an application or a tenant has a higher service level requirement or agreement, flows associated with them can be prioritized accordingly to minimize the effect of other applications/tenants using the same physical network.
at the switches
Switches can keep state information on flows passing through them and determine their priority according to flows’ behavior. For example, in case of prioritization by flow size, switches can estimate flow sizes by counting the number of packets they have sent (mimicking least attained service discipline). Keeping state information at the switches may be costly when there are many flows.
at the end-hosts
End-hosts can mark packets with priority tags allowing switches to simply enforce priorities according to tags. This reduces switch complexity but requires changes to the end-hosts’ software or hardware protocol stack.
at Layer 2
Ethernet standard IEEE 802.1Q priority based forwarding.
at Layer 3
Differentiated Services (DiffServ) can be used at the IP layer.
Can be used by adding support to switches and/or end-hosts, may require changes in software/hardware to switches and/or end-hosts.
A lower priority flow is only sent when there are no packets available from any of the higher priority flows. This minimizes the effect of lower priority flows on higher priority ones but can lead to starvation of lower priority flows.
A lower priority flow can be sent even if there are packets available from higher priority flows. This occurs when a required volume of higher priority flows’ demand is satisfied (e.g. one low priority packet is sent for every K ≥ 1 high priority packets sent) and mitigates the starvation problem of lower priority flows.
Table 8 provides an overview of this section. Classifying flows based on their priorities and treating them accordingly can improve performance. Such prioritization can be done in-network by using multiple queues at the switches and allowing higher priority traffic to go over lower priority traffic [134, 53, 122, 135], and at the senders by performing rate-control according to priorities [51, 28, 86, 119].
Priorities are usually assigned either based on flow size to minimize mean latency (by mimicking SRPT scheduling policy) [134, 53] or based on deadlines to minimize the number of deadline missing flows [122, 80]. Control traffic is naturally prioritized to improve the feedback timeliness and quality (e.g. ACKs in TIMELY  and Trimmed Packets in NDP ) or decrease control loop delay (e.g. RTS in pHost ).
For many applications, flow sizes are either known or can be roughly estimated upon initiation [53, 122, 86, 51] making it easy to assign priorities by size to reduce mean latency. In case flow sizes are unknown apriori, dynamic prioritization can be used where packets of a flow first receive the highest priority, but get demoted to lower priorities as more of them is seen.
For example, dynamic Packet Prioritization (DPP)  uses two queues, an express queue which has higher priority and a normal queue. It reduces the priority of long running flows by counting their packets against a threshold. Having multiple queues allows for more precise classification of flows [134, 122]. However, recent work shows that most benefit in reducing mean FCT can be obtained using up to 8 queues . Finding proper threshold values based on which flows are demoted to lower priorities may require solving an optimization problem that takes into account the flow arrival rates and flow size distribution. In datacenter environments with known traffic characteristics and workloads, such thresholds may be determined offline and applied in real-time . It is also possible to virtually emulate the behavior of having infinite number of queues using only two actual queues per switch port, a high and a low priority queue . This can be achieved by assigning flows with highest priority to high priority queue while the rest of flows to low priority queue and dynamically changing flows assigned to high priority queue when other flows with a higher priority complete.
Prioritization can be performed fully at switches by keeping state on flows passing through and using some priority assignment criteria such as the total number of packets sent. This simplifies end-hosts at the expense of higher computational and memory burden on the switches.
Another approach is for the end-hosts to tag flows with priority tags while switches just process tags and put packets in proper queues [135, 134]. End-host priority tagging can be done at the NIC, OS kernel, hypervisor, or even by applications before packets are sent to the network. In case of virtualization, if end-host VMs cannot be trusted to assign priorities properly, middleboxes can be used (e.g. at the hypervisor) that monitor and tag packets (e.g. using OpenVSwitch ) which applies to both static  and dynamic [122, 134] prioritization.
Priorities can also be assigned in different ways. Ethernet standard IEEE 802.1Q priority based forwarding  that provides 8 levels is supported by many switch vendors and can also be used in datacenters . At the IP layer, Differentiated services (DiffServ)  can be used . Custom queuing techniques and headers can also be used which may require changes to both switches and end-hosts .
Strictly prioritizing flows can lead to starvation where lower priority flows cannot make progress due to large volume of higher priority traffic. A simple solution is to use weighted queuing instead of strict prioritization. For instance, DAQ  uses a weighted round-robin between long and short flows to make sure that throughput-oriented flows keep making progress. Aging can also be used to address starvation while minimally affecting critical flows. An aging-rate can be used to increase the priority of low priority flows as a function of their waiting time .
Datacenter topologies typically provide a large degree of path redundancy. Properly distributing load across these paths reduces contention among flows while increasing overall resource utilization. Without effective load balancing many links may not be utilized while some experiencing congestion [48, 148]. Table 9 provides an overview of general load balancing concepts and their tradeoffs.
| || ||
A new flow is assigned to any of the available paths using some fixed criteria such as by hashing parts of its packets’ header. This approach is simple but inflexible. For example, in case two throughput oriented flows are assigned to the same path, they cannot be moved to other less utilized paths later.
Flows can be moved across any of the available paths according to available bandwidth. Offers a better performance in general but adds the complexity of measuring link utilizations, accounting for flows, and calculating best flow assignments accordingly.
After a flow is assigned to one of the available paths according to some criteria, its assignment will remain fixed. The initial assignment is performed according to network conditions such as available bandwidth. This approach is somewhat between the previous two assignments above in terms of implementation overhead, flexibility and performance.
Finest load balancing but can lead to high reordering.
Coarse load balancing but achieves minimal reordering.
A flowlet’s size dynamically changes according to differences of latencies of candidate paths. At high rates and/or high latency difference between available paths, a flowlet can become significantly large. As a result, this can result in both fine and coarse grained load balancing (it is always somewhere between per packet and per flow). Flowlets have been found effective for load balancing over asymmetric (i.e., with different available bandwidth) paths . As a drawback, flowlets may cause reordering of small flows and hurt their completion times.
A flowcell has a fixed size that is usually about tens of packets. Using flowcells simplifies load balancing compared to flowlets (no need to carefully measure path latencies and schedule accordingly) and reduces possible reordering of small flows. It can however significantly increase reordering for larger flows that will be broken into many flowcells.
Load balancing can be static or dynamic (adaptive). Static approaches use a fixed criteria to assign traffic to available paths such as by hashing specific fields from packet headers. For example, ECMP  is a popular static load balancing technique that only distributes load across equal cost paths. Adaptive load balancing dynamically selects paths for traffic according to distribution of load to minimize hot-spots. Various criteria can be used for path assignment such as per-hop or per-path queue occupancies . After choosing the initial path according to current load, some adaptive approaches keep monitoring the network status and distribution of traffic. They change direction of traffic to eliminate or reduce hot-spots. These approaches are referred to as reactive. If not applied with care, reactive load balancing might lead to oscillations.
Examples of reactive dynamic load balancing techniques include Planck , Hedera , MPTCP , DIBS  and CONGA . Planck uses a controller that monitors traffic and generates congestion events that include the transmission rate of flows passing through the congested link. It then routes traffic away from congested spots. Hedera initially places flows via hashing, and then uses a central controller to monitor the network, detect long running flows and reschedule such flows on a lightly loaded path to balance the load. MPTCP establishes multiple sub-flows from the beginning and then shifts load between them according to the level of congestion observed across each sub-flow. DIBS forwards packets that arrive at a full queue to one of the nearby switches instead of dropping them which will be forwarded towards the destination through another path. CONGA proposes a technique for leaf-spine topologies  based on lazy evaluation. A leaf switch has a table which holds the load seen along its outgoing paths. Such load information is collected by receiving switches and then piggybacked on traffic.
Some proactive adaptive approaches include DeTail , Presto  and Expeditus . DeTail uses a per-hop adaptive method and operates in lossless environments with layer 2 flow control [153, 154, 155]. At every hop, packets are forwarded to the egress port with minimal queuing. Presto breaks flows into small units called cells and sends them across all available paths. This implies that small flows are kept intact as long as they fit into one cell. Expeditus dynamically assigns flows to paths in 3-tier Clos topologies. It uses dedicated packet tags to communicate load information across switches for path selection upon arrival of a new flow. For path election, the upstream switch sends a message to its downstream peer expressing congestion at its egress ports. The receiving switch compares the upstream congestion metrics with the ones for its ingress ports choosing the ports that minimize the maximum congestion along the path.
Load balancing can be done per-packet, per-group of packets that belong to the same flow, and per flow. While considering these options, two important performance metrics are packet reordering and distribution of load across the network. Reordering is known to waste server resources and increase latencies .
Another option is to group several packets of a flow and perform load balancing
In flowlet scheduling, smaller timeout values allow for finer load balancing while larger values reduce reordering. To minimize reordering, one can choose the timeout value to be greater than the difference between latencies of paths with minimum and maximum latencies. Flowlets have been found to effectively balance load while incurring minimal reordering in datacenters [105, 113]. However, dependence of flowlet switching on inter-packet intervals could lead to creation of arbitrarily large flowlets at high rates, which could lead to congestion in case they collide. Another drawback is the possibility that small flows are split into several flowlets which could lead to reordering and increased latency.
In flowcell scheduling, smaller flowcells balance load better while larger ones reduce reordering. Flows shorter than the cell size are guaranteed to be sent on a single path minimizing their latency and reordering. Previous work has used grouping thresholds of tens of kilobytes (10 KB at ToR switches  and 64 KB at the hypervisor layer ) to effectively spread the load across the network and reduce creation of random hot-spots. As a drawback, this approach may lead to higher reordering for long flows compared to flowlets.
Many datacenter applications need to access data stored on multiple servers to perform computations or respond to queries. Therefore, placement of data determines what options are available to access them. Such data could be a value for a key in a distributed key-value store or an object in a replicated or erasure coded store. For example, any of the replicas in a replicated store can be accessed or any of the pieces out of pieces of data would allow data recovery in an ( , ) erasure coded storage system.
Pieces of data can be distributed across racks and servers (depending on topology) to allow wider load balancing options. For example, Ceph  uses Selective Replication that distributes copies of the original data across the cluster according to their popularity. Ceph also distributes contents of large directories and ones with lots of writes across many servers to reduce hot-spots. HDFS  allows for similar features but also considers the network topology while distributing replicas. For example, there is better connectivity and usually higher available bandwidth within a rack.
Placement of tasks (execution) could be as important as placement of data. Task schedulers and resource managers can place computation in accordance with placement of data to reduce network usage, contention for network access and queuing [164, 165, 166, 167, 168, 169]. A task scheduler may consider the flow scheduling policy of the network (FCFS, SRPT, Fair Sharing, etc.) in addition to placement of data to improve overall task completion times .
Switches forward packets according to Forwarding Information Base (FIB) which contains a set of rules that determine outgoing port(s) for incoming packets. FIB rules can be installed proactively or reactively. Proactive installation may result in a larger number of rules as not all of them may be used at all times while reactive installation of rules may incur setup time overhead. Such rules can be installed either directly or by a routing protocol that calculates the rules and installs them such as BGP, IS-IS or OSPF. In case of routers, FIB usually reflects a subset of Routing Information Base (RIB) which is a table of routes learned or calculated by a router. For load balancing, various forwarding techniques can be used to direct traffic across several paths.
Standard distributed protocols can be used for load balancing. As a Layer 2 solution, VLAN based load balancing puts same machines on several virtual networks allowing traffic to be spread across paths via using different VLAN tags. It however provides limited scalability due to creation of large broadcast domains. Layer 3 routing for large networks with support for load balancing can be used in case multiple next hops are available for a destination. For example, Equal Cost Multipathing (ECMP) statically selects the next hop by hashing packet header fields. For fast convergence, IGP routing protocols can be used such as IS-IS or OSPF. Load balancing using these protocols is challenging since path costs need to be exactly equal and the number of supported equal paths is limited. BGP provides higher flexibility for this purpose  and can be customized to converge fast .
Load balancing can be performed using centralized techniques. In Layer 2, scalable forwarding can be built by replacing the default MAC address broadcast and discovery approach with a centralized one [173, 174]. A controller can then setup Layer 2 forwarding that accounts for path diversity . Centrally implemented Layer 3 approaches allow FIBs to be calculated centrally using free routing software stacks such as Quagga , and installed on switches to implement various protocol stacks such as OSPF and IS-IS with ECMP support which provides much higher flexibility [176, 5, 177, 43]. For example, to help BGP converge faster, a central controller can calculate and update BGP tables in routers to achieve desired forwarding behavior . Simpler centralized approaches in which new FIB rules are installed directly by a controller upon arrival of a new flow can also be used. Forwarding can be done in a way that distributes load, either by hashing or adaptively selecting least loaded paths. Using this approach, careful consideration of scalability is necessary.
Previous approaches relied mostly on network for routing. An end-host based approach is Source Routing which simplifies the network by moving the forwarding information to packets, and eliminating the need to disseminate updates of forwarding state . To properly encode paths in packets, end-hosts need to be aware of network topology and paths. In addition, a mechanism is needed to detect and disseminate network status and failure information to the end-hosts. BCube  uses probe packets to measure available bandwidth across paths and assign new flows to least loaded ones. PSSR  encodes a list of outgoing port numbers instead of addresses for next hops to decouple addressing from forwarding.
The scale of datacenter networks has made failures an important concept. Even using high quality and expensive equipment, the probability that some network element fails (e.g. a switch, a port or a link, etc.) can be considerably large at any moment . When some network equipment fails, capacity in the network decreases accordingly; however, since datacenter networks are usually connected with large degrees of redundancy, network failures rarely lead to complete inability to reach parts of the network. However, the capacity loss due to failures in some parts of network can lead to complications in load balancing by affecting the effective capacity of different paths, i.e., creating capacity asymmetries. This may not have a significant impact on topologies that are inherently asymmetrical (e.g. JellyFish, Xpander, etc.); however, more basic load balancing techniques, such as ECMP, which are used in symmetric topologies (e.g. Fat-Tree, Leaf-Spine, VL2, etc.), will have a hard time effectively distributing load in case of failures [18, 105, 149]. This is noteworthy considering that many industry datacenters are based on symmetric topologies.
A variety of solutions have been proposed to perform better load balancing in case of capacity asymmetries across paths examples of which include WCMP , CONGA , HULA , Presto , LetFlow , DRILL  and Hermes . WCMP mitigates asymmetries by extending ECMP and assigning weights to different paths proportional to their capacity referred to as weighted traffic hashing. Using WCMP, weights determine the number of hash entries per outgoing port at the switch which are selected proportional to capacity. CONGA and HULA operate by performing path-wise congestion monitoring and shifting traffic accordingly. As a result, if capacity of a path is reduced due to failures, its congestion metric will increase faster and it will automatically be assigned less traffic. Presto applies a weighted forwarding mechanism in a way similar to WCMP where weights are pushed to the end-host virtual switches. LetFlow is a simple approach where flowlets are used as means to dynamically adapt to varying path capacities. LetFlow relies on natural property of flowlets which allows them to shrink or expand (in size) according to available capacity over paths. DRILL performs load balancing over asymmetric topologies by first breaking them into groups of symmetric components and then performing load balancing on them per switch. Symmetric paths should have equal number of hops and their queues should be shared by same flows at every hop from source to destination. Hermes uses comprehensive sensing at the end-hosts to monitor path conditions and reroute flows affected by failures or congestion caused by asymmetries. Hermes considers monitoring of latency and ECN markings for detection of congestion while looking at frequent timeouts and retransmissions as signs of failures. In addition, to improve visibility into path conditions, Hermes uses active probing by sending small probe packets between end-host pairs periodically.
To improve the overall throughput for a single flow and provide better reliability in case of failures, multi-path techniques can be used [45, 139, 180, 68]. A flow is split into multiple sub-flows each sent over a different path. To receive traffic on multiple paths, the receiver needs to buffer data received from each sub-flow and put them in order. Buffering is proportional to sum of throughput across paths times the latency of the longest path [68, 181]. Although latencies are quite small in datacenters, link bandwidths can be significantly high. Additional sub-flows can generally increase both memory and CPU utilization .
Depending on flow sizes, the applications may decide whether to use multiple sub-flows. Overhead of setup and tear-down for multiple sub-flows may be considerable for short flows. For long-running background flows, using every bit of bandwidth available through multipathing may improve their total average throughput.
Several examples of multipath transports include MPTCP , XMP  and MMPTCP . MPTCP leverages ECMP to route sub-flows over various paths and increase total throughput while balancing load across paths by moving load across sub-flows. XMP approximates the solution to an optimization problem that maximizes utilization. It uses RED [184, 185] with ECN marking to keep the queue occupancies low. XMP achieves most benefit by using two sub-flows. MMPTCP aims to improve FCT for short flows by employing a two phase approach. First phase uses packet scatter by randomizing source port numbers and routing via ECMP to minimize FCT for short flows and second phase uses MPTCP to maximize throughput for long flows.
Although multipathing is generally helpful in increasing utilization, the benefit it offers is limited under various conditions. Under heavy workloads, multipathing may not improve throughput if most paths are already highly utilized . In addition, if paths have different characteristics, such as latency or capacity, multipathing may offer marginal increase (or even decrease) in throughput . This may occur in datacenters in case different communication technologies are used across machines, such as a combination of wireless and wired networks or if available paths have varying number of hops or different capacities. Another complication is caused by overlapping paths where a multihomed sender’s paths to the receiver actually have common edges . This can occur in datacenters as well in case of link failures which reduce available paths or if sub-flows are mistakenly hashed to the the same links. Finally, multipathing may lead to unfairness if multipath flows share a bottleneck with regular flows . Unfairness may also arise when multipath flows with different number of sub-flows compete for bandwidth over a bottleneck.
Given a list of flows with their priorities and demands, the scheduling problem aims to optimize an utility function of several performance metrics such as utilization, fairness or latency. The objective is usually to maximize throughput for bandwidth-hungry flows and minimize FCT for latency-sensitive flows considering fairness among flows in each class of service. Different scheduling techniques can be used to reduce FCT, provide better bandwidth guarantees to long-running flows, and help deadline flows meet their deadlines. In general, scheduling a mix of flow types requires formulation of a complex optimization problem that is generally computationally expensive to solve. Table 10 offers an overview of scheduling techniques presented here.
| || ||
Reserving bandwidth prior to a sender’s transmission minimizes congestion and queuing latency.
Reservation adds the latency overhead of calculating a transmission schedule before a flow is allowed to transmit. In addition, it is challenging to enforce a transmission schedule network wide. Inaccuracies in transmission rates may necessitate continues updates to the schedule which is costly.
A flow can be replicated multiple times to reduce the effect of high tail latency. The fastest reply is obtained and then replicas are terminated.
Effective if there is a considerable gap between tail and median latency. In addition, it is less effective when network is heavily loaded.
Scheduling can be done according to deadlines to minimize deadline miss rate and lateness.
It may not be possible to meet all the deadlines in which case it should be determined whether deadline miss rate is more important (hard deadlines) or lateness (soft deadlines). In presence of deadlines, it is unclear how to effectively schedule traffic if flow sizes are not known apriori or cannot be estimated.
A variety of scheduling disciplines can be applied according to desired traffic control objectives. For example, SRPT minimizes mean latency while Fair Queuing maximizes fairness.
Disciplines can usually optimize for only one performance metric. A mix of scheduling policies can hierarchically optimize for more than one objective. If a utility of objectives is desired, well-known policies may provide solutions far from optimal.
Allows the network to update current schedule (along with already scheduled flows) according to new flow arrivals.
Preemption may offer limited benefit if all flows have similar properties (i.e., size, deadline, etc.).
Prevents a sender from initiating many flows together. For example, this helps mitigate the incast problem §3.3.6.
This approach only offers very coarse grained control over incoming traffic.
ACK Control §5.6.7
By carefully controlling when ACKs are sent to senders, a receiver can control the incoming flow of traffic.
This approach only offers coarse grained control over incoming traffic.
To provide better bandwidth guarantees and prevent creation of congestion spots resources may be first checked for availability and then allocated before an end-host can start transmitting a flow. Requesting resources can be done in different units and such requests might be processed centrally or in a distributed fashion.
In a fully centralized approach, end-hosts can report their demands to a central scheduler and ask for transmission slots or rates. Some examples are TDMA , FastPass , FlowTune  and TAPS . TDMA uses a coarse-grained centralized approach in which end-hosts send their demands to a fabric manager which then allocates them contention-less transmission slots. The scheduling is performed in a round by round basis where each round is several slots during which different hosts can communicate over the fabrics. FastPass allocates slices of time on a per-packet basis to improve utilization and considers variations in packet size. FlowTune performs centralized rate allocation on a per-flowlet basis. TAPS uses a central SDN controller (please refer to §7.1) to receive and process flow requests with known demands, verify whether deadlines can be met on a per-task basis, allocate contention-less slices of time for senders to transmit and install forwarding rules.
Distributed reservation can be done upon connection setup. A sender can request the rate at which it would like to transmit. This rate along with the available bandwidth is considered by network fabrics to determine the rate and path of the flow. Allocated rate can be piggybacked on ACKs. RCP  and PDQ  perform the rate allocation at switches with this approach. Receiver-based reservation can also be used. Receivers view multiple incoming flows and determine their transmission schedule. Senders can also communicate their demands to the receiver which calculates the transmission schedule . In addition, token-based techniques can be used where a receiver sends back tokens to allow senders to transmit a unit of data (e.g. packet) per token. The token-based approach has been found very effective in addressing the incast problem which is achieved by controlling the rate at which tokens are sent back to senders from the receiver to ensure that data arriving at the receiver conforms with available capacity [80, 103, 104].
Tail latency is an important quality metric for many latency-sensitive applications, such as search, as it determines quality of user experience. Late flows can take much longer than median latency to finish [29, 62]. An effective approach is to replicate flows, use the fastest responding replica and terminate the rest. For simplicity, creation of replicates may be performed completely at the application layer. The probability of more than one replica being late is usually significantly small. Replicated flows can be scheduled on different paths and can even target different servers to reduce correlation.
One approach is to replicate every flow and then take the one whose handshaking is finished earlier and terminate the rest . Another approach is to only replicate slow requests by reissuing them . It is necessary to judiciously decide on the number of redundant requests to balance resource usage and response time. Since only a tiny portion of all flows are usually laggards, additional resources needed to allow large improvements may be small .
For many applications, criticality of flows can be captured as deadlines. Scheduling techniques are expected to minimize deadline miss rate. In case of hard deadlines, in which case delivery after deadline is pointless, flows can be terminated early if their deadlines cannot be met. D2TCP , D3 , PDQ , and MCP  are examples of deadline-aware scheduling schemes that reduce deadline miss rate by performing rate control according to flow deadlines, either by explicit rate allocation (D3, PDQ) or implicitly by adapting sender’s outstanding window size (D2TCP, MCP). Tempus  formulates a complex optimization problem to maximize the fraction of flows completed prior to their deadlines while considering fairness among flows. Amoeba  aims to guarantee deadlines by performing initial admission control via formulating an optimization problem. RCD  and DCRoute  aim to quickly determine whether deadlines can be met by performing close to deadline scheduling and guarantee deadlines via bandwidth reservation. In addition, considering task dependencies in meeting deadlines can help reduce deadline miss rate of tasks [193, 188].
While it is important to meet deadlines, finishing deadline flows earlier than necessary can hurt the FCT of latency-sensitive traffic . In general, we can form an optimization problem for scheduling flows . A study of how well-known scheduling policies perform under mix flow scenarios with varying fraction of deadline traffic can be found in .
The following are some well-known scheduling disciplines. First Come First Serve (FCFS) is a simple policy where a task has to be completed before the next task can begin. FCFS provides bounded lateness when scheduling flows with deadlines  and is close to optimal for minimizing tail completion times given light-tailed flow size distributions . Processor Sharing (PS) divides available resources equally among flows by giving them access to resources in tiny time scales. Fair Queuing (FQ) approximates PS within transmission time of a packet and can be used to enforce max-min fairness. Earliest Deadline First (EDF) minimizes deadline miss rate for deadline flows. Shortest Job First (SJF) minimizes mean flow completion time in offline systems where all flows and their demands are known apriori. For online systems where requests can be submitted at any time, Shortest Remaining Processing Time (SRPT) which is preemptive, minimizes mean FCT . SRPT also offers close to optimal tail completion times while scheduling flows with heavy-tailed size distributions . Least Attained Service (LAS) , which prioritizes less demanding flows, can be used to approximate SJF without apriori knowledge of flow demands .
Many works use policies that approximate or implement well-known
scheduling disciplines. RCP  performs explicit rate control to enforce
processor sharing across flows sharing the same links. PDQ  combines EDF
and SJF giving higher priority to EDF to first minimize deadline miss rate
and then minimize FCT as much as possible. FastPass , implements
Aside from research, one may be interested in what disciplines can be enforced using industry solutions. Switches by default provide support for FIFO queues per outgoing port which enforce FCFS scheduling policy. Some switches provide support for multiple levels of priority at the outgoing ports (multiple FIFO queues with different priorities). Using these queues, it is possible to mimic the SRPT scheduling policy by putting smaller flows into higher priority queues. For example, Cisco offers switches that support this feature using dynamic packet prioritization (DPP)  which tracks flows as their packets arrive (estimating a flow’s size according to LAS policy) and assigns them to priority queues according to their sizes. Weighted Fair Queuing (WFQ) is also supported by many switch vendors per “class of service” or per flow where weights determine the proportional importance of flows, i.e., a flow is assigned bandwidth proportional to its weight.
Many practical systems have to address arrival of requests in an online manner where requests can arrive at any time and have to be addressed upon arrival. Order of arrivals can impact the performance of scheduling algorithms due to race conditions which can lead to priority inversion. For example, in an online scenario with the objective of minimizing mean FCT, SJF might perform poorly if many short flows arrive shortly after a large flow. Preemptive scheduling policies (SRPT in this case) can be used to address this problem .
A server issuing many fetch requests can use jittering to desynchronize arrival of traffic from various flows and reduce peaks in traffic volume. Such peaks can lead to temporary congestion, dropped packets and increased latency. Jittering can be applied by adding random delays at the application layer when initiating multiple requests at the same time .
ACKs may be used as part of the network traffic scheduling process since they determine how a sender advances its outstanding window of bytes. They can be thought of as permits for the sender to transmit more data. A receiver can stall a sender by intentionally holding back on ACKs or limit sender’s rate by delaying them. For example, a receiver can pause ACKs for low priority flows upon arrival of higher priority traffic and can generate ACKs to control which flows are assigned more bandwidth . In addition, by reporting a small receiver window in ACKs, a receiver may limit the transmission rate of a sender [78, 65]. This approach has considerable similarities with the token-based transmission control applied in schemes such as pHost , NDP  and ExpressPass .
In this section, we point to a few open challenges with regards to traffic control. To find an optimal solution, these problems may be modeled as complex optimization scenarios that are computationally expensive to solve (large number of variables and constraints, presence of integer variables and/or non-linear constraints, complex objective functions) and practically hard to enforce (lack of hardware support, slow response time of software implementations, presence of failures and errors). Current approaches apply a variety of heuristics and simplifying assumptions to come up with solutions that are practical and attractive to industry. In addition, such optimization scenarios may be infeasible due to presence of contradictory constraints meaning it may not be possible to optimize for all objectives given available resources. Therefore, it becomes necessary to relax some requirements. In the following, we do not provide any optimization models, rather we point to cases where such complex models may appear. Table 11 offers an overview of open challenges in this section.
Handling Mix Workloads
Datacenter environments house a variety of applications that generate flows with different properties and requirements. Effectively handling the mix of traffic workload requires clearly defining a utility function of performance metric variables (delay, utilization, deadline miss rate, lateness, fairness) and formulating an optimization problem.
Load Balancing vs. Packet Reordering
To minimize packet reordering, a sender needs to carefully schedule packets over all available paths while paying attention to other traffic workload. Enforcing a no reordering constraint can result in lower network utilization. As a result, to increase utilization while imposing acceptable level of reordering, one can consider a utility function of these factors and formulate an optimization problem solving which provides a desirable transmission schedule.
Achieving High Throughput and Low Latency
In distributed traffic control approaches, a feedback from network is provided to senders to adapt their transmission rate. While transmitting at high rates, the network may not provide feedbacks fast enough leading to network overload and congestion. Therefore, limited network responsiveness may increase average queuing delay and latency at high throughput.
Due to resource constraints, it may not be possible to come up with a transmission schedule where all performance objectives are optimal. It then becomes necessary to define a utility of performance metric variables and aim to maximize utility by formulating an optimization scenario.
We review a few networking paradigms that have affected the design and operation of datacenter networks. Essentially, networks have become more flexible and controllable giving operators more room for performance optimizations.
Programmable forwarding planes offer significant room for efficient control and management of datacenter networks. They make it possible to develop and implement custom policies and algorithms for network control and management. Forwarding plane can be programmed centrally using a controller that takes into account policies and resource constraints through some interface provided by forwarding elements. This is considered as part of Software Defined Networking (SDN) . A comprehensive survey of SDN architecture and applications can be found in .
The dominant framework in this realm is OpenFlow [118, 203] where forwarding elements, such as switches, can be managed via an open interface. An important benefit of an open interface is that switches built by different vendors can be operated in the same way allowing for cost effective expansion. Rolling out new updates to the network also becomes much easier as only controllers need to be patched. In general, there could be multiple controllers each managing a part of the network while coordinating together which is most applicable to large networks. There is two-way communication between switches and controllers: a controller can register for specific events at the switches and perform changes to the switches’ forwarding table by adding new rules, modifying existing rules or removing them.
The forwarding process begins when a packet enters a switch from any port. The switch tries to match the packet to a forwarding rule according to its header fields and will forward it to the correct outgoing port. It is also possible to modify the packet contents (packet rewrite) before forwarding it. If the packet cannot be matched to any rule, it can be sent to the controller (if forwarding table is configured with the right table-miss entry) for further inspection and if necessary the controller will update the forwarding plane accordingly for forwarding of this packet and the rest of packets from the same flow. If there are multiple matches, the highest priority rule will be executed. More complex operations can be executed using Group Tables which allow for forwarding to multiple outgoing ports, selection of the outgoing port according to some hash of the packet (e.g. for load balancing), and switching to a connected output port for failover. Group Table features have been added since version 1.1 of OpenFlow  and currently version 1.5 has been released .
By centrally managing forwarding rules, one can implement different routing protocol stacks in a centralized manner. For example, BGP protocol stack can be deployed on top of SDN . In addition, one can implement a more sophisticated control plane protocol that understands and communicates with a variety of other protocols, such as legacy BGP routers, while running a custom protocol stack itself .
Generally, data plane operations are implemented using Application-Specific Integrated Circuits (ASICs) at the hardware layer allowing for forwarding at maximum rate but only offering a set of fixed switch functions that can only be changed by replacing the ASICs. This introduces a few issues namely being prone to bugs as well as long and unpredictable time to implement new functions [207, 208]. Programmable data planes (PDPs) allow the packet processing functions to be changed at the forwarding devices, i.e., switches can apply new forwarding functions at the line rate. This is orthogonal to programmable forwarding planes (e.g., SDN) where different forwarding rules can be selectively applied to packets. PDPs make it easier to deploy traffic control schemes that depend on custom in-network processing or feedback. For example, [9, 53, 86] rely on custom packet headers and feedback from switches.
PDPs can be realized using Protocol-Independent Switch Architecture (PISA) using which new features can be introduced to switches or bug fixes can be applied in a short time . There are emerging proposals for hardware designs that allow for PISA [210, 211]. Hardware prototypes (switch chips) have also been built that have made this possible . P4 [212, 213] is a high level language to program PISA switches which is vendor independent, and protocol independent (i.e., operates directly on header bits and can be configured to work with any higher layer protocol). P4 compiler can also compile P4 code to run on a general purpose processor as software switches.
NICs have been providing basic offloading features to OSes, such as segmentation offloading, for many years. Several vendors have been developing NICs with advanced offloading features that perform complex transport tasks and deliver the results without minimal involvement from OS and CPU. These features allow complex operations at high line rates of datacenters (40 Gbps and more) doing which at the OS may incur significant CPU overhead and additional communication latency.
Examples of offloading features include cryptography, quality of service, encapsulation, congestion control, storage acceleration, erasure coding, and network policy enforcement. Examples of such NICs include Mellanox ConnectX  and Microsoft SmartNIC [215, 216] developed as part of Open Compute Project (OCP) . SmartNIC relies on FPGA accelerators and can be used to apply SDN/Networking policies. It also makes low latency transport possible using Lightweight Transport Layer (LTL) that creates end-to-end transport connection between NICs (FPGA does all processing of segmentation, ordering, and ACKs).
By default, packets pass through the Operating System networking stack before they are received by applications. When packets arrive at the NIC, interrupts are generated to invoke the Operating System routines that read and process them. Processing of packets is usually done in batches to reduce CPU utilization at high rates, for example by enabling Interrupt Moderation .
To improve the packet processing performance (decrease packet processing latency and increase throughput), a different approach would be to bypass Operating System’s networking stack and use polling instead of interrupts. This can be realized using kernel bypass modules, such as Netmap [218, 219], Vector Packet Processing (VPP) [220, 221] and Data Plane Development Kit (DPDK) , which have been shown to reduce the number of required cycles to process a packet by up to 20× on average . These modules allow userspace programs to directly access NIC buffers to read incoming packets or write packets for transmission.
Userspace networking stacks have been developed on top of kernel bypass modules. Sandstorm  and mTCP , implement TCP in userspace and rely on Netmap and VPP, respectively. SoftNIC  is built on top of DPDK and allows developers to program custom NIC features in software. RAMCloud  distributed key-value store and FastPass  make use of kernel bypass and polling to speed up Remote Procedure Calls (RPC). NDP  is a datacenter transport protocol that is designed for ultra-low latency and operates on top of DPDK.
TCP has been the dominant transport protocol across datacenters for the majority of applications. Since implemented as part of OS protocol stack, using TCP at high transmission rates can exhaust considerable CPU resources and impose notable amount of communication latency. Remote Direct Memory Access (RDMA) is a transport protocol that allows delivery of data from one machine to another machine without involving the OS networking protocol stack. RDMA operates on the NICs of machines communicating. Compared to TCP, RDMA offers higher bandwidth and lower latency at lower CPU utilization [99, 227, 8]. RDMA can also be used for seamless offloading of large datasets to nearby machines (as opposed to using pagefiles) .
To use RDMA, the underlying network has to be lossless since RDMA does not support recovery from lost data by default. Ethernet which is the favorite choice of transport in datacenters, however, does not support reliability by default. Lossless Ethernet, also known as Converged Enhanced Ethernet (CEE), supports per-hop flow control at Layer 2. Backpressure is used as the mechanism to stop senders in case of a full buffer in the network. PAUSE messages are sent to previous hops, pausing the output ports that are connected to inputs with full buffers, until the ultimate senders receive a PAUSE message. The flow control mechanism used by lossless Ethernet is referred to as Priority Flow Control (PFC) and offers 8 priority levels for various classes of traffic [154, 155].
For Layer 2 networks, RDMA can be deployed using RDMA over Converged Ethernet (RoCE)  which is based on lossless Ethernet and works across a single Layer 2 domain. For larger networks that span across Layer 3, RDMA can be deployed on top of IP and UDP using version 2 of RoCE (RoCEv2) . It can also be deployed on top of IP and TCP using iWARP  which implements a full TCP stack on end-host NIC to provide a lossless end to end transport. iWARP does not require a lossless infrastructure and can work on top of usual Ethernet, but is less performant and has limited capabilities compared to RoCE .
Using lossless Ethernet can lead to a few performance issues in general. Some hindering issues include Layer 2 Head of Line (HOL) Blocking, unfairness (because Layer 2 has no understanding of upper layer notions such as flows), and deadlocks (due to per-port/class PAUSE feature and possible circular dependency of routes). HOL blocking might occur since pausing happens on a per-port/class basis. Therefore, a flow can overflow a port causing it to be blocked stopping other flows going through that port as well. As a result, it is necessary to prevent formation of full buffers. Furthermore, loss-based congestion control approaches are rendered useless since there is no packet loss in case of full buffers.
Quantized Congestion Notification (QCN) , which is fully implemented in Layer 2, can be used to reduce PAUSE messages by signaling senders before buffers are full. It sends notifications back to the sender’s NIC from switches. Packets need to be tagged with a flow ID at the senders which will be used at the switches when notifications are generated to determine which flows should be slowed down. QCN is limited to boundaries of a single Layer 2 domain and therefore is insufficient for datacenters with large networks.
TCP Bolt  and DCQCN  operate across Layer 3 using RoCEv2. Both of these schemes use DCTCP  like ECN marking to reduce buffer occupancy and minimize PAUSE signals. To prevent deadlocks, TCP Bolt creates edge disjoint spanning trees (EDSTs) across the network with different PFC classes to prevent cyclic dependencies as flows are routed in the network. TIMELY  can also be used on lossless networks which uses a delay-based approach to detect increased buffer occupancy and manages its rate accordingly to reduce it.
Cloud companies and content providers, such as Google , Microsoft , Facebook  and Amazon , have built multiple datacenters in different continents and countries. Multiple datacenters offer a variety of benefits for distributed applications with geographically wide range of users such as email, multimedia (e.g., YouTube), social networks (e.g., Facebook, Instagram, and Google Plus) and online storage. These benefits include increased availability and fault-tolerance, global or regional load balancing, reduced latency to customers and reduced global bandwidth usage via caching. For example, to minimize user access latency, data can be placed on local datacenters close to users via replication.
To further improve reliability, load balancing and data availability, large datacenter operators (such as Amazon and Microsoft) operate in two hierarchies of zones and discrete datacenters. Each availability zone is usually made up of a few discrete datacenters that are close enough to communicate with negligible latency (e.g. less than 2 ms for Microsoft Azure, i.e., few tens of miles), while far enough to allow them to operate as distinct failure domains. Datacenters usually have rich connectivity within zones which themselves are connected using long haul fiber optics (usually hundreds of miles) . These links are either owned by datacenter operators or leased from a provider with existing backbone infrastructure. These links that connect multiple datacenters within regions and across them are referred to as inter-datacenter networks. Maintaining and operating inter-datacenter networks requires significant capital investment from datacenter companies which makes it imperative to efficiently use them [177, 59, 237].
In general, we can categorize traffic that goes within and across datacenters into traffic that is a result of direct interaction with users and the business internal traffic that is a result of backend data processing or migration. Recent work points to significant increase in the overall business internal traffic (which includes both intra and inter-datacenter traffic) that is growing at a much faster pace than user generated traffic [177, 4, 238]. Such increase not only demands higher interconnection bandwidth across servers within datacenters, but also higher network capacity across datacenters. Over the past decade, significant attention has been given to intra-datacenter networks to improve their performance and efficiency which was the topic of discussion in previous sections. However, similar attention has not been paid to increasing efficiency and performance of connectivity across datacenters.
To communicate across datacenters, many companies purchase bandwidth from ISP networks with present WAN infrastructure and are billed according to some usage criteria. A widely used pricing scheme calculates traffic costs by looking at 95 percentile of network bandwidth usage over some period of time . A variety of research efforts focus on minimizing inter-datacenter traffic transit costs considering a similar pricing scheme by aiming to not increase the peak bandwidth usage or minimally increase it if necessary [240, 241, 242, 243, 244, 245, 246, 247, 248].
Large datacenter operators take advantage of dedicated inter-datacenter connections over long haul optical networks. Google operates their own backbone network referred to as B4 [177, 20]. Microsoft Global WAN  connects Microsoft datacenters over their private (dark) fiber network. Facebook has also developed their cross datacenter backbone network referred to as Express Backbone . These private dedicated networks offer a unique opportunity for further optimization of network resource management. Considering that all end-points of such networks are managed by one organization, we can improve efficiency by coordinating transmission across such end-points. In addition, despite their huge geographical scale, these networks are usually made up of tens to hundreds of links (that connect distant locations) making such coordination practically feasible. For example, B4 is currently managed centrally by a system called Bandwidth Enforcer , Microsoft Global WAN is managed by SWAN  and Express Backbone is also managed centrally by a traffic engineering controller .
Bandwidth allocation is an effective approach for global traffic engineering over private dedicated networks [250, 237]. To take into account latency overhead of bandwidth allocation, a hybrid approach is usually taken where an aggregate bandwidth is set aside for short latency-sensitive flows (mostly user generated) per link or such flows are assigned strictly higher traffic priority. Large long-running flows which we refer to as transfers will then become the focus of per flow bandwidth allocation. In addition, since the network environment is continuously changing as new transfers arrive or due to varying volume of higher priority traffic (from latency-sensitive flows), a slotted timeline is considered where transmission rates can be updated on a per timeslot basis.
Figure 7 shows an example architecture for this purpose which is based on similar concepts as [250, 237]. One way ro realize this architecture is using SDN. An inter-datacenter network that can be operated using SDN is sometimes referred to as Software Defined WAN (SDWAN). This allows data transmission and routing to be managed centrally according to transfer properties and requirements as well as network status and topology .
A central Traffic Engineering Server (TES) calculates transmission rates and routes for submitted transfers as they arrive at the network. Rates are then dispatched to local agents that keep track of local transfers (initiated within the same datacenter) called site brokers. Senders first communicate with local site broker which in turn forwards transfer requests to TES. The local site broker then receives TES’s response with the transmission rates and forwards it to the sender. Local site brokers add a level of indirection between senders and TES which can help in several ways. It reduces the request response overhead for TES by maintaining a persistent connection with brokers and possibly aggregating several transfer requests before relaying them to the server, which can be useful for hierarchical bandwidth allocation that is done by locally grouping many transfers and presenting them to TES as one (trading some traffic engineering accuracy for scalability). Site broker may also inform TES of network conditions within datacenters for more careful rate allocation and routing. Finally, site broker may modify TES’s response according to varying local network conditions or allow senders to switch to a standby backup TES in case it goes offline. Other tasks in this system setup include the following.
In this section, we provide an overview of several research directions considering the scenario mentioned above.
The majority of inter-datacenter related research is on global optimization of inter-datacenter networks. Metrics similar to ones we discussed in §2 can be considered as objectives. For example, B4  and SWAN  focus on maximizing utilization, Tempus  aims to meet transfer deadlines and maximizes minimal fraction of transfers that complete prior to deadlines in case there is not enough capacity, Amoeba  performs admission control for incoming traffic and only admits new transfers if their deadlines can be guaranteed and DCRoute  performs fast admission control to guarantee deadlines while minimizing packet reordering. The majority of prior work related to this problem either focus on delay tolerant transfers or aim at meeting deadlines while not considering completion times as a metric. For many long-running data transfer operations, completion times are important in increasing overall utility. For example, faster completion of backup operations may reduce the chance of data loss due to failures or speedy replication of data objects can improve average user’s quality of experience. In addition, most prior work formulate complex optimization problems that are computationally expensive and slow especially if need be solved as transfers arrive and can increase scheduling latency. Further research is then necessary in developing global rate computation and routing algorithms that lead to solutions quickly and also consider completion times of transfers.
Many services run across several datacenters close to regional users to offer a better quality of experience by minimizing customer access latency (e.g. CDNs cache objects for local viewers [253, 254, 240, 246, 50]). This approach also reduces overall inter-datacenter bandwidth consumption by keeping a copy of popular objects close to users. There is also need for propagating application state to multiple locations for synchronization (e.g. search index databases ) or making multiple distant data copies for higher reliability and availability . All of these lead to data delivery from one datacenter to multiple datacenters referred to as Point to Multipoint (P2MP) transfers . P2MP data delivery over private dedicated inter-datacenter networks can be considered as a special case of multicasting for which a significant body of work is available including in-network multicasting [256, 257] and using overlay networks [258, 259]. However, there is need for coordinated schemes that improve performance by carefully selecting multicast forwarding trees and assigning transmission slots to transfers. One should also note that centralized multicasting solutions proposed for intra-datacenter networks, such as [260, 261], may not work for inter-datacenter networks due to significant differences in topology (former is usually structured and regular while latter is not). New objectives can be considered in addition to ones proposed in §2 for P2MP transfers, such as maximizing number of receivers per transfer that complete reception in a given period of time.
Given their scale, inter-datacenter networks may be exposed to a variety of physical conditions and natural environments. As a result, different inter-datacenter links may have significantly different link failure probabilities (possibly by as much as three orders of magnitude ). In addition, inter-datacenter traffic in general is made up of a variety of classes with different quality of service requirements and priorities [20, 50]. In general, transfers can be rerouted and rescheduled in reaction to failures. This however leads to possible service disruptions which can more seriously affect services that are sensitive to loss or latency. Given unique properties of private dedicated inter-datacenter networks, steps can be taken proactively as new transfers arrive to mitigate the effect of failures on overall utility. Several objectives can be considered per failure event including minimizing number of affected transfers upon failures considering their class of service, minimizing the amount by which latency increases across a variety of services, or maximizing utilization while leaving off spare capacity to perform quick in-network failover rerouting . The latter could be applied only to a fraction of inter-datacenter traffic (i.e., more important traffic). In addition, one could study these objectives for both one-to-one and one-to-many transfer scenarios (or a mix of them).
Many online services today depend on datacenter infrastructures to provide high availability and reliability along with scalable compute at minimal costs. Datacenter environments may be shared by many tenants and applications offering different services to their consumers (end-users). Traffic control is a necessary task in datacenter networks to efficiently use the resources and fairly share them.
In this tutorial paper, we reviewed several challenges operators, tenants and applications are faced with in datacenter traffic control. We discussed different elements of traffic control explaining the problems, challenges, proposed solutions and their tradeoffs. Despite significant research efforts, the majority of traffic control proposals are in the state of research and far from adoption in the industry considering metrics of complexity, performance and cost altogether. Further research efforts are required to reach solutions that offer higher performance at same or lower costs while maintaining low complexity.
More broadly, at the end of this paper, we also pointed to inter-datacenter communication as an evolving research area that requires further attention. We proposed a centralized architecture that captures the essence of existing solutions and discussed how efficient traffic control can be realized using it. We also pointed to three major active research areas regarding inter-datacenter communication that need further attention from the research community.
We would like to thank the anonymous reviewers of IEEE Communications Surveys and Tutorials for their valuable comments and suggestions that helped us significantly improve the quality of this paper. The first author would also like to especially thank David Oran, Joshua Gahm, Narayan Venkat and Atif Faheem who provided valuable intellectual support during an internship at Cisco Research Center located in Cambridge, Massachusetts, USA.
A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon,
S. Boving, G. Desai, B. Felderman, P. Germano
 A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim,
P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, “VL2: A
Scalable and Flexible Data Center Network,”
 S. Subramanian,
 Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn,
Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang,
“Congestion Control for Large-Scale RDMA Deployments,”
 M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and
M. Yasuda, “Less is more: trading a little bandwidth for ultra-low latency
in the data center,”
 M. Al-Fares, A. Loukissas, and A. Vahdat, “A Scalable, Commodity
Data Center Network Architecture,”
 J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber,
“HyperX: topology, routing, and packaging of efficient large-scale
 C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “Dcell:
A Scalable and Fault-tolerant Network Structure for Data Centers,”
 A. Valadarsky, G. Shahaf, M. Dinitz, and M. Schapira, “Xpander:
Towards Optimal-Performance Datacenters,”
 A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish:
Networking data centers randomly,”
 J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh,
and A. Vahdat, “WCMP: Weighted cost multipathing for improved
fairness in data centers,”
 V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson, “F10:
A Fault-Tolerant Engineered Network,”
 R. Govindan, I. Minei, M. Kallahalla, B. Koley, and A. Vahdat,
“Evolve or Die: High-Availability Design Principles Drawn from Google’s
 G. Porter, R. Strong, N. Farrington, A. Forencich, P. Chen-Sun,
T. Rosing, Y. Fainman, G. Papen, and A. Vahdat, “Integrating
Microsecond Circuit Switching into the Data Center,”
 N. Hamedazimi, Z. Qazi, H. Gupta, V. Sekar, S. R. Das, J. P.
Longtin, H. Shah, and A. Tanwer, “FireFly: a reconfigurable wireless
data center fabric using free-space optics,”
 S. Bojja Venkatakrishnan, M. Alizadeh, and P. Viswanath, “Costly
Schedules and Approximate Caratheodory Theorems,”
 M. Ghobadi, R. Mahajan,
A. Phanishayee, N. Devanur, J. Kulkarni, G. Ranade, P.-A. Blanche,
H. Rastegarfar, M. Glick, and D. Kilper, “Projector: Agile reconfigurable
data center interconnect,”
 S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken,
“The nature of data center traffic: measurements & analysis,”
 M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel,
B. Prabhakar, S. Sengupta, and M. Sridharan, “Data Center
 D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, “DeTail:
reducing the flow completion time tail in datacenter networks,”
 M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad:
Distributed Data-parallel Programs from Sequential Building Blocks,”
 Y. Ren, Y. Zhao, P. Liu, K. Dou, and J. Li, “A survey on TCP
Incast in data center networks,”
 L. Chen, B. Li, and B. Li, “Allocating bandwidth in datacenter
networks: A survey,”
 B. Wang, Z. Qi, R. Ma, H. Guan, and A. V. Vasilakos, “A
survey on data center networking for cloud computing,”
 R. P. Joglekar and P. Game, “Managing Congestion in Data
Center Network using Congestion Notification Algorithms,”
 P. Sreekumari and J.-i. Jung, “Transport protocols for data center
networks: a survey of issues, solutions and challenges,”
 C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian,
Y. Zhang, and S. Lu, “BCube: a high performance, server-centric network
architecture for modular data centers,”
 K. S. Solnushkin,
 R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang,
P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat,
“PortLand: A Scalable Fault-tolerant Layer 2 Data Center
 M. Hashemi,
 A. Roy, H. Zeng, J. Bagga, G. Porter, and A. C.
Snoeren, “Inside the Social Network’s (Datacenter) Network,”
 C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, “Better
Never Than Late: Meeting Deadlines in Datacenter Networks,”
 M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown,
B. Prabhakar, and S. Shenker, “pFabric: Minimal Near-optimal
 G. Linden,
 H. Zhang, K. Chen, W. Bai, D. Han, C. Tian, H. Wang, H. Guan,
and M. Zhang, “Guaranteeing deadlines for inter-datacenter transfers,”
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman,
A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo:
Amazon’s highly available key-value store,”
 V. Jalaparti, P. Bodik, S. Kandula, I. Menache, M. Rybalkin,
and C. Yan, “Speeding Up Distributed Request-response
 R. Kapoor, A. C. Snoeren, G. M. Voelker, and G. Porter, “Bullet
trains: a study of NIC burst behavior at microsecond timescales,”
 Y. Geng, V. Jeyakumar, A. Kabbani, and M. Alizadeh, “Juggler: a
practical reordering resilient network stack for datacenters,”
 C. Raiciu, C. Paasch, S. Barre, A. Ford, M. Honda, F. Duchene,
O. Bonaventure, and M. Handley, “How hard can it be? designing
and implementing a deployable multipath TCP,”
 A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha, “Sharing
the Data Center Network,”
 L. Popa, P. Yalagandula, S. Banerjee, J. C. Mogul, Y. Turner, and
J. R. Santos, “ElasticSwitch: practical work-conserving
bandwidth guarantees for cloud computing,”
 K. He, E. Rozner, K. Agarwal, Y. J. Gu, W. Felter, J. Carter,
and A. Akella, “AC/DC TCP: Virtual congestion control enforcement
for datacenter networks,”
 B. Cronkite-Ratcliff,
A. Bergman, S. Vargaftik, M. Ravi, N. McKeown, I. Abraham, and
I. Keslassy, “Virtualized congestion control,”
 Y. Yang, H. Abe, K. i. Baba, and S. Shimojo, “A Scalable Approach
to Avoid Incast Problem from Application Layer,”
 V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G.
Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, “Safe
and Effective Fine-grained TCP Retransmissions for Datacenter
 K. Zarifis, R. Miao, M. Calder, E. Katz-Bassett, M. Yu, and
J. Padhye, “DIBS: Just-in-time congestion mitigation for data centers,”
 P. Prakash, A. Dixit, Y. C. Hu, and R. Kompella, “The TCP outcast
problem: exposing unfairness in data center networks,”
 T. Bonald, L. Massoulie, A. Proutiere, and J. Virtamo, “A queueing
analysis of max-min fairness, proportional fairness and balanced fairness,”
 D. Nace and M. Pioro, “Max-min fairness and its applications to
routing and load-balancing in communication networks: a tutorial,”
 A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker,
and I. Stoica, “Dominant Resource Fairness: Fair Allocation of
Multiple Resource Types,”
 C. Joe-Wong, S. Sen, T. Lan, and M. Chiang, “Multiresource
Allocation: Fairness-efficiency Tradeoffs in a Unifying Framework,”
 B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis,
P. Sharma, S. Banerjee, and N. McKeown, “ElasticTree: Saving Energy
in Data Center Networks,”
 Y. Shang, D. Li, and M. Xu, “Energy-aware Routing in
Data Center Network,”
 R. Wang, S. Gao, W. Yang, and Z. Jiang, “Energy aware
routing with link disjoint backup paths,”
 D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu,
“Energy Proportional Datacenter Networks,”
 L. Wang, F. Zhang, J. A. Aroca, A. V. Vasilakos, K. Zheng,
C. Hou, D. Li, and Z. Liu, “GreenDCN: A General Framework for
Achieving Energy Efficiency in Data Center Networks,”
 K. He, E. Rozner, K. Agarwal, W. Felter, J. Carter, and A. Akella,
“Presto: Edge-based load balancing for fast datacenter networks,”
 R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel,
M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats,
“TIMELY: RTT-based Congestion Control for the Datacenter,”
 J. Padhye, V. Firoiu, D. F. Towsley, and J. F. Kurose, “Modeling
TCP Reno performance: a simple model and its empirical validation,”
 M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore,
G. Antichi, and M. Wójcik, “Re-architecting Datacenter Networks
and Stacks for Low Latency and High Performance,”
 I. Cho, K. Jang, and D. Han, “Credit-Scheduled Delay-Bounded
Congestion Control for Datacenters,”
 M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan,
K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav,
and G. Varghese, “CONGA: Distributed Congestion-aware
Load Balancing for Datacenters,”
 P. Wang, H. Xu, Z. Niu, D. Han, and Y. Xiong, “Expeditus:
Congestion-aware Load Balancing in Clos Data Center Networks,”
 D. Zhuo, Q. Zhang, V. Liu, A. Krishnamurthy, and T. Anderson,
“Rack-level Congestion Control,”
 W. Bai, K. Chen, L. Chen, C. Kim, and H. Wu, “Enabling ECN
over Generic Packet Scheduling,”
 P. Cheng, F. Ren, R. Shu, and C. Lin, “Catch the whole lot in an
action: Rapid precise packet loss notification in data center,”
 J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal,
“Fastpass: A Centralized "Zero-queue" Datacenter Network,”
 J. Perry, H. Balakrishnan, and D. Shah, “Flowtune:
Flowlet control for datacenter networks,”
 S. Jouet, C. Perkins, and D. Pezaros, “OTCP: SDN-managed
congestion control for data center networks,”
 S. Vissicchio, O. Tilmans, L. Vanbever, and J. Rexford, “Central
Control Over Distributed Routing,”
 N. McKeown, T. Anderson,
H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and
J. Turner, “OpenFlow: enabling innovation in campus networks,”
 A. Munir, G. Baig, S. M. Irteza, I. A. Qazi, A. X. Liu, and
F. R. Dogar, “Friends, Not Foes: Synthesizing Existing Transport
Strategies for Data Center Networks,”
 A. Aggarwal, S. Savage, and T. Anderson, “Understanding the
performance of TCP pacing,”
 J. Cao, R. Xia, P. Yang, C. Guo, G. Lu, L. Yuan, Y. Zheng,
H. Wu, Y. Xiong, and D. Maltz, “Per-packet load-balanced, low-latency
routing for Clos-based data center networks,”
 S. Radhakrishnan, V. Jeyakumar, A. Kabbani, G. Porter, and
A. Vahdat, “NicPic: Scalable and Accurate End-Host Rate Limiting,”
 A. Saeed, N. Dukkipati, V. Valancius, V. The Lam, C. Contavalli,
and A. Vahdat, “Carousel: Scalable Traffic Shaping at End
 K. He, W. Qin, Q. Zhang, W. Wu, J. Yang, T. Pan, C. Hu,
J. Zhang, B. Stephens, A. Akella, and Y. Zhang, “Low Latency
Software Rate Limiters for Cloud Networks,”
 W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and W. Sun, “PIAS:
Practical information-agnostic flow scheduling for data center networks,”
 M. P. Grosvenor, M. Schwarzkopf, I. Gog, R. N. Watson, A. W.
Moore, S. Hand, and J. Crowcroft, “Queues don’t matter when you can
 C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and
Y. Zhang, “Secondnet: a data center network virtualization architecture
with bandwidth guarantees,”
 S. Hu, W. Bai, K. Chen, C. Tian, Y. Zhang, and H. Wu, “Providing
bandwidth guarantees, work conservation and low latency simultaneously
in the cloud,”
 C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and
M. Handley, “Improving datacenter performance and robustness with
 M. Alizadeh, A. Kabbani, B. Atikoglu, and B. Prabhakar, “Stability
analysis of QCN: the averaging principle,”
 C. Tian, A. Munir, A. X. Liu, Y. Liu, Y. Li, J. Sun, F. Zhang,
and G. Zhang, “Multi-tenant multi-objective bandwidth allocation in
datacenters using stacked congestion control,”
 Y. Lu, G. Chen, L. Luo, K. Tan, Y. Xiong, X. Wang, and E. Chen,
“One more queue is enough: Minimizing flow completion time with explicit
 B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme,
J. Gross, A. Wang, J. Stringer, P. Shelar
 C. Guo, H. Wu, Z. Deng, G. Soni,
J. Ye, J. Padhye, and M. Lipshteyn, “RDMA over commodity Ethernet
 E. Vanini, R. Pan, M. Alizadeh, P. Taheri, and T. Edsall,
“Let It Flow: Resilient Asymmetric Load Balancing with Flowlet
 J. Rasley, B. Stephens, C. Dixon,
E. Rozner, W. Felter, K. Agarwal, J. Carter, and R. Fonseca, “Planck:
Millisecond-scale monitoring and control for commodity networks,”
 H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury,
“Resilient Datacenter Load Balancing in the Wild,”
 S. Ghorbani, Z. Yang, P. B. Godfrey, Y. Ganjali, and
A. Firoozshahian, “DRILL: Micro Load Balancing for Low-latency Data
 S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn,
“Ceph: A scalable, high-performance distributed file system,”
 V. Jalaparti, P. Bodik, I. Menache, S. Rao, K. Makarychev,
and M. Caesar, “Network-Aware Scheduling for Data-Parallel
Jobs: Plan When You Can,”
 R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and
A. Akella, “Multi-resource Packing for Cluster Schedulers,”
 Q. Xie, A. Yekkehkhany, and Y. Lu, “Scheduling with multi-level
data locality: Throughput and heavy-traffic optimality,”
 A. Munir, T. He, R. Raghavendra, F. Le, and A. X. Liu, “Network
Scheduling Aware Task Placement in Datacenters,”
C. Filsfils, P. Mohapatra, J. Bettink, P. Dharwadkar, P. De Vriendt,
Y. Tsier, V. Van Den Schrieck, O. Bonaventure, P. Francois
 K. Agarwal, C. Dixon, E. Rozner, and J. Carter, “Shadow MACs:
Scalable label-switching for commodity Ethernet,”
 B. Stephens, A. Cox, W. Felter, C. Dixon, and J. Carter, “PAST:
Scalable Ethernet for data centers,”
 M. R. Nascimento, C. E. Rothenberg, M. R. Salvador, and
M. F. Magalhães, “QuagFlow: Partnering Quagga with OpenFlow,”
 A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta,
“Towards a next generation data center architecture: scalability and
 S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh,
S. Venkata, J. Wanderer, J. Zhou, M. Zhu
 S. A. Jyothi, M. Dong, and P. Godfrey, “Towards a flexible data
center fabric with source routing,”
 N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford, “HULA:
Scalable Load Balancing Using Programmable Data Planes,”
 Y. Cao, M. Xu, X. Fu, and E. Dong, “Explicit multipath congestion
control for data center networks,”
 M. Kheirkhah, I. Wakeman, and
G. Parisis, “MMPTCP: A multipath transport protocol for data centers,”
 L. Liu, D. Li, and J. Wu, “TAPS: Software Defined Task-Level
Deadline-Aware Preemptive Flow Scheduling in Data Centers,”
 H.-W. Tseng, W.-C. Chang, I. Peng, P.-S. Chen
 M. Noormohammadpour, C. S. Raghavendra,
and S. Rao, “DCRoute: Speeding up Inter-Datacenter Traffic Allocation
while Guaranteeing Deadlines,”
 M. Chowdhury, Y. Zhong, and I. Stoica, “Efficient Coflow
Scheduling with Varys,”
 M. Noormohammadpour and C. S. Raghavendra, “Comparison of
Flow Scheduling Policies for Mix of Regular and Deadline Traffic
in Datacenter Environments,”
 A. Wierman and B. Zwart, “Is Tail-Optimal Scheduling Possible?”
 I. A. Rai, G. Urvoy-Keller, M. K. Vernon, and E. W. Biersack,
“Performance Analysis of LAS-based Scheduling Disciplines in
a Packet Switched Network,”
 L. Jose, L. Yan, M. Alizadeh, G. Varghese, N. McKeown,
and S. Katti, “High Speed Networks Need Proactive
 B. A. A. Nunes, M. Mendonca, X. N. Nguyen, K. Obraczka, and
T. Turletti, “A Survey of Software-Defined Networking: Past, Present,
and Future of Programmable Networks,”
 P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown,
M. Izzard, F. Mujica, and M. Horowitz, “Forwarding Metamorphosis:
Fast Programmable Match-action Processing in Hardware
 A. Sivaraman, A. Cheung, M. Budiu, C. Kim, M. Alizadeh,
H. Balakrishnan, G. Varghese, N. McKeown, and S. Licking, “Packet
Transactions: High-Level Programming for Line-Rate Switches,”
 P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown,
J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese
 E. Jeong, S. Wood, M. Jamshed, H. Jeong, S. Ihm, D. Han, and
K. Park, “mTCP: a highly scalable user-level TCP stack for multicore
 S. Han, K. Jang, A. Panda, S. Palkar, D. Han, and S. Ratnasamy, “SoftNIC: A Software NIC to Augment Hardware,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-155, May 2015. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-155.html
 J. Ousterhout, A. Gopalan,
A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro, S. J. Park,
H. Qin, M. Rosenblum
 J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G.
Shin, “Efficient Memory Disaggregation with Infiniswap,”
 B. Stephens, A. L. Cox, A. Singla, J. Carter, C. Dixon, and
W. Felter, “Practical DCB for improved data center networks,”
 M. Russinovich,
Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and
R. Wattenhofer, “Achieving high utilization with software-driven WAN,”
 S. Su, Y. Wang, S. Jiang, K. Shuang, and P. Xu, “Efficient
algorithms for scheduling multiple bulk data transfers in inter-datacenter
 Y. Lin, H. Shen, and L. Chen, “EcoFlow: An Economical and
Deadline-Driven Inter-Datacenter Video Flow Scheduling System,”
 V. Jalaparti, I. Bliznets, S. Kandula, B. Lucier, and I. Menache,
“Dynamic Pricing and Traffic Engineering for Timely
 A. Kumar, S. Jain, U. Naik, A. Raghuraman, N. Kasinadhuni,
E. C. Zermeno, C. S. Gunn, J. Ai, B. Carlin, M. Amarandei-Stavila
 R. Meshenberg, N. Gopalani, and L. Kosewski,
 M. Noormohammadpour, C. S. Raghavendra, S. Rao, and
S. Kandula, “DCCast: Efficient Point to Multipoint Transfers
 J. Cao, C. Guo, G. Lu, Y. Xiong, Y. Zheng, Y. Zhang, Y. Zhu,
C. Chen, and Y. Tian, “Datacast: A Scalable and Efficient Reliable
Group Data Delivery Service for Data Centers,”
 M. Ghobadi and R. Mahajan, “Optical Layer Failures
in a Large Backbone,”