
Problem A
Datacenter Network Multi-path Load Balancing

1 Background

Networking is a key part of distributed systems. In industry, fat trees are employed to build high-performance networks at acceptable cost. In a k-ary fat-tree network, there are multiple paths between any two server nodes, as shown in Figure 1.

Traditional network transmission protocols, such as TCP/IP, designate only a single dedicated route for transmitting data between two nodes and therefore cannot fully utilize a fat-tree network. Multi-path transmission protocols are required to further improve network performance.

\includegraphics[width=\linewidth ]{figure/k-fat-tree.png}
Figure 1: Sample topology for a 4-ary fat tree.

2 Description

During the competition, teams are provided with an abstract network that conforms to the k-ary fat-tree structure.

The structure contains $N+1$ nodes, with $ids$ ranging from $-1$ to $N-1$, which are divided into 5 levels:

  • ${Level}_{0}$ : servers; each server is connected to exactly one access switch (see the last row in Figure 1).

  • ${Level}_{1}$ : access switches; each access switch is connected to several aggregate switches and servers (see the third row in Figure 1).

  • ${Level}_{2}$ : aggregate switches; each aggregate switch is connected to several access switches and several core switches (see the second row in Figure 1).

  • ${Level}_{3}$ : core switches; each core switch is connected to several aggregate switches (see the first row in Figure 1).

  • ${Level}_{4}$ : cluster controller (not shown in Figure 1), which is connected to all switches. There is only 1 cluster controller, with an $id$ equal to $-1$.

The whole game contains $T$ time slices, and we have several messages that need to be transported inside this network.

Any message transport starts and finishes (whether successful or not) within a single time slice.

Each message is sent at some time slice $T_ i$, and successful transmission requires the following conditions to be met:

  • the target node is not the cluster controller, even though the controller node is connected to all switches.

  • if the target node is a server node, it is the final destination node of the message.

  • the target node is connected to the origin node.

  • 1 unit of outbound bandwidth is taken from the origin node.

  • 1 unit of inbound bandwidth is taken from the target node.

  • 1 unit of buffer space is taken from the target node if the target node is not the final destination node of the message.

Accordingly, each node (server, switch, controller) has the following attributes:

  • ${BandwidthIn}_{id}$ : the inbound bandwidth per time slice; the remaining inbound bandwidth is reset when a new time slice starts.

  • ${BandwidthOut}_{id}$ : the outbound bandwidth per time slice; the remaining outbound bandwidth is reset when a new time slice starts.

  • ${BufferSize}_{id}$ : the total buffer size of the node in question. Note that the remaining buffer size is not reset when time slices switch; it only changes when messages are sent and received.
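To make the reset semantics concrete, here is a minimal bookkeeping sketch in Python (the `NodeState` class and its field names are illustrative, not part of the official interface):

```python
class NodeState:
    """Illustrative per-node bookkeeping (not part of the official
    interface): bandwidth quotas reset every time slice, while buffer
    usage persists across time slices."""

    def __init__(self, bandwidth_in, bandwidth_out, buffer_size):
        self.bandwidth_in = bandwidth_in    # -1 means unlimited (servers)
        self.bandwidth_out = bandwidth_out  # -1 means unlimited (servers)
        self.buffer_size = buffer_size      # -1 means unlimited (servers)
        self.used_in = 0                    # inbound bandwidth used this slice
        self.used_out = 0                   # outbound bandwidth used this slice
        self.used_buffer = 0                # buffer in use; NOT reset per slice

    def start_new_time_slice(self):
        # BandwidthIn/BandwidthOut are per-slice quotas, so their usage
        # counters reset; used_buffer intentionally carries over.
        self.used_in = 0
        self.used_out = 0
```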

Messages may fail at different stages of validation. The possible failures are as follows:

  1. The target node is the cluster controller: the whole program crashes.

  2. The target node is a server node but not the final destination node: the whole program crashes.

  3. The connection-relationship check fails: the whole program crashes.

  4. No outbound bandwidth remains for the origin node at $T_ i$: the whole program crashes, since the use of outbound bandwidth is controlled by the sending node itself.

  5. No inbound bandwidth remains for the target node at $T_ i$: the current message (only) fails at $T_ i$, and the outbound bandwidth used at $T_ i$ is not returned.

  6. No buffer remains for the target node: the message fails at $T_ i$, and neither the outbound nor the inbound bandwidth used at $T_ i$ is returned.

Note that a message remains in the buffer of the sender if transmission fails.

Otherwise, the message is successfully transmitted, and 1 unit of buffer space is returned to the origin node, as ownership of the transmitted message passes to the target node.
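As a sketch only (reusing the `NodeState` bookkeeping above; the real validation is performed by the simulator), the checks and bandwidth/buffer accounting could look like this:

```python
def try_transmit(origin, target, target_is_controller,
                 target_is_server, target_is_final_dest, connected):
    """Apply the checks in the order listed above; return True on a
    successful hop. Illustrative only -- the simulator owns this logic."""
    # Failures 1-3: invalid decisions crash the whole program.
    if target_is_controller:
        raise RuntimeError("target node is the cluster controller")
    if target_is_server and not target_is_final_dest:
        raise RuntimeError("server target is not the final destination")
    if not connected:
        raise RuntimeError("origin and target are not connected")
    # Failure 4: origin's outbound bandwidth -- also a crash.
    if origin.bandwidth_out != -1 and origin.used_out >= origin.bandwidth_out:
        raise RuntimeError("no outbound bandwidth left at origin")
    origin.used_out += 1  # consumed and never returned from here on
    # Failure 5: target's inbound bandwidth -- only this message fails.
    if target.bandwidth_in != -1 and target.used_in >= target.bandwidth_in:
        return False
    target.used_in += 1
    # Failure 6: target buffer, checked only for intermediate hops.
    if not target_is_final_dest:
        if target.buffer_size != -1 and target.used_buffer >= target.buffer_size:
            return False
        target.used_buffer += 1
    origin.used_buffer -= 1  # success: ownership passes to the target
    return True
```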

We use $M$ bidirectional edges to represent the physical links between nodes, each edge having the following attributes:

  • ${From}_{i}$ : one endpoint of this bidirectional edge.

  • ${To}_{i}$ : the other endpoint of this bidirectional edge.

  • To simplify this problem, edges impose no transport limitations.

Note that these $M$ edges never involve the cluster controller, since the controller is implicitly connected to all switches.

We have $R$ requests, each of which is split into several messages after being successfully accepted by the corresponding access switch:

  • ${Id}_{j}$ : a unique integer for identifying different requests.

  • ${StartTime}_{j}$ : a request appears when time slice $StartTime_ j$ starts.

  • ${From}_{j}$ : the id of the node that the request tries to access at $StartTime_ j$. This node is an access switch (whose level equals 1).

  • ${Dest}_{j}$ : the id of the node that is the final destination of the request. This node is a server (whose level equals 0).

  • ${Size}_{j}$ : the size of the request; once the request is accepted by access switch $From_ j$, the request is split into $Size_ j$ messages.

  • after a request is split into messages, each message’s major id equals the request id $Id_ j$, while the minor id ranges from 0 to $Size_ j - 1$. This allows you to identify any message by its (major id, minor id) pair.
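As a small illustration (hypothetical helper), splitting a request into its (major id, minor id) message pairs looks like this:

```python
def split_request(request_id, size):
    """Split a request of the given Size into messages identified by
    (major_id, minor_id) pairs. Illustrative helper."""
    return [(request_id, minor_id) for minor_id in range(size)]

# A request with Id = 7 and Size = 3 yields [(7, 0), (7, 1), (7, 2)].
```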

A successful request requires all $Size_ j$ messages to be successfully delivered to server $Dest_ j$. It is also important to note the following:

  • each message might be sent at a different time slice.

  • messages might be delivered in a different order; all that matters is that every message is delivered.

We use $EndTime_ j$ to denote the time slice at which all messages of a request have been delivered; the latency of the request is then ${Latency}_{j} = {EndTime}_{j} - {StartTime}_{j} + 1$.

After the entire simulation is complete, we might have several failed requests and several successful requests, along with the corresponding latencies. This gives us:

  • ${SucceedRequest}$ : the number of successful requests.

  • ${SuccessRate} = \frac{SucceedRequest}{R}$ : the rate at which requests were successful.

  • ${Latency}_{P99}$ : the $Latency$ of the request at the 99th percentile of all successful requests, sorted by latency in ascending order.

  • ${Latency}_{AVG}$ : the average $Latency$ of all successful requests.

  • Note that the P99 fraction is rounded up, meaning the 100th request latency will be considered to be P99 latency if there are a total of 101 successful requests.

For each simulation test case:

\[ {Score} = {SuccessRate} \times 100.0 - \frac{{Latency}_{P99}}{20.0} - \frac{{Latency}_{AVG}}{20.0} \]
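A small Python sketch of the scoring formulas above, including the rounded-up P99 index (the function and variable names are illustrative):

```python
import math

def compute_score(latencies, total_requests):
    """Score a simulation from the latencies of successful requests.

    `latencies` holds Latency_j for every successful request and
    `total_requests` is R. Illustrative helper for the formulas above.
    """
    succeed = len(latencies)
    if succeed == 0:
        return 0.0  # no successful requests; P99/AVG are undefined
    success_rate = succeed / total_requests
    ordered = sorted(latencies)
    # The P99 index is rounded up: with 101 successful requests,
    # ceil(0.99 * 101) = 100, so the 100th latency is taken.
    p99 = ordered[math.ceil(0.99 * succeed) - 1]
    avg = sum(latencies) / succeed
    return success_rate * 100.0 - p99 / 20.0 - avg / 20.0
```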

3 Tasks

You are expected to design an algorithm (only Python 3 is supported) for the switch instances and the cluster-controller instance. An independent algorithm instance is initialized for each switch and for the controller.

For each algorithm instance, required initialization information is as follows:

  • the $Id$, $Level$, $BandwidthIn$, $BandwidthOut$, and $BufferSize$ of the corresponding node.

  • all basic edge information.

  • all basic node information.
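Since each instance receives the full node and edge lists, it can rebuild the topology locally. A sketch under assumed field names (check the parsed objects in attachment/demo for the real ones):

```python
from collections import defaultdict

def build_adjacency(nodes, edges):
    """Rebuild the topology from the initialization data. The field
    names (node.id, node.level, edge.from_id, edge.to_id) are assumed;
    verify them against the demo objects in attachment/demo."""
    adj = defaultdict(set)
    for edge in edges:
        adj[edge.from_id].add(edge.to_id)
        adj[edge.to_id].add(edge.from_id)
    # The M edges never mention the controller (id -1), which is
    # implicitly connected to every switch (levels 1-3).
    for node in nodes:
        if node.level in (1, 2, 3):
            adj[-1].add(node.id)
            adj[node.id].add(-1)
    return adj
```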

You may notice that each algorithm instance (one per switch, plus the controller) appears completely independent.

To make things more realistic, an information interaction model is provided. Its basic rules are as follows:

  • Only instances directly connected by an edge can communicate with each other directly.

  • For each time slice, each algorithm instance can build an information package with a limited basic structure and limited length.

  • In the next time slice, the simulator delivers the package to all other instances connected to the sender, before each instance makes its decision.

You may also design a special pattern and rule to ensure that two nodes can talk to each other, even if they are not directly connected under the basic communication rule.

Note that the cluster controller (we have only one controller) is connected to all switches, meaning that:

  • The controller can receive the information package from all switches.

  • All switches can receive the information package from the controller.
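Because every information package is just a bounded list of integers (at most 256; see the interface notes below), any richer protocol must be encoded by convention. One possible tag-based layout, purely as an illustration:

```python
def encode_status_package(buffer_load, congested_neighbor_ids):
    """Pack a node's status into a flat list of ints. The tag/length
    layout here is one possible convention, not a required format;
    the problem only fixes the 256-int limit per package."""
    package = [1, buffer_load]                     # tag 1: own buffer load
    package += [2, len(congested_neighbor_ids)]    # tag 2: congested neighbors
    package += list(congested_neighbor_ids)
    assert len(package) <= 256, "package exceeds the 256-int limit"
    return package
```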

We will cover the simulation and communication steps in more detail in Chapter 4: Details.

For each algorithm instance, the official interface must be implemented as listed below (see the solution.py demo inside the problem attachment):

  • $AddRequestList$ : the simulator calls this interface on every algorithm instance (switches and the controller, not only access switches) to input newly arrived requests (possibly an empty list) into the corresponding node.

    • $Input$ : the incoming request list.

    • $Output$ : None.

  • $AskRoundSolution$ : the simulator calls this interface on every algorithm instance (switches and the controller) and passes the communication information from the previous time slice. You should return your message transport decisions for the current time slice.

    • $Input$ : neighbor (might be switches or cluster controller) information package list.

    • $Output$ : Outbound message list.

  • $NextRound$ : the simulator returns all messages and message transport results (both successful and failed) to each instance that acted as a message sender or receiver.

    • $Input$ : Related message result list.

    • $Output$ : the communication package to be sent to neighbors (delivered within the next time slice).

    • Note that each instance in each round can return at most 256 integers (each within $[-2147483648, 2147483647]$) within an information package.
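Putting the three interfaces together, a bare skeleton could look like the following. All parameter shapes and field names are assumptions, so match them against the solution.py demo in the attachment:

```python
class Solution:
    """Skeleton algorithm instance. Method names follow the interface
    described above; parameter shapes and field names are assumptions,
    so verify them against the solution.py demo."""

    def __init__(self, node_info, all_nodes, all_edges):
        self.node_info = node_info  # own Id, Level, BandwidthIn, ...
        self.buffered = []          # (major_id, minor_id) pairs we hold

    def AddRequestList(self, requests):
        # Called on every instance each slice; only access switches
        # normally receive non-empty lists. Requests that would overflow
        # the buffer must be ignored (see Chapter 5).
        for request in requests:
            self.buffered.extend(
                (request.id, minor) for minor in range(request.size))

    def AskRoundSolution(self, neighbor_packages):
        # Inspect neighbor packages from the previous slice and return
        # transport decisions: (from_node, (major_id, minor_id), next_hop).
        decisions = []
        return decisions

    def NextRound(self, message_results):
        # Learn which sends succeeded or failed, then return at most
        # 256 ints to broadcast to neighbors next time slice.
        return []
```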

We will simulate the $T$ time slices one by one, letting each algorithm instance decide how to use its outbound bandwidth within each time slice.

You are aiming to get the highest simulation scores possible.

Note that we will not wait indefinitely: the simulation stops after the last time slice is simulated, and no new requests arrive during the last 20 time slices.

4 Details

4.1 Simulation Flow

The simulation actions inside the i-th time slice are carried out in the following order:

  1. Add new requests: call $AddRequestList$ of each algorithm instance.

  2. Add request $j$, during which the following are considered:

    1. If the remaining buffer size of $From_ j$ is insufficient (i.e. ${BufferSize}_{{From}_ j} < {UsedBufferSize}_{{From}_ j} + {Size}_{j}$), the request is considered failed.

    2. To simplify the problem, the bandwidth limit only applies to message transfers; the inbound bandwidth consumed when adding requests to access switches is not counted against the bandwidth limit.

    3. The request is split into messages that occupy part of the buffer of the access switch, and the remaining buffer of the access switch is decreased accordingly.

  3. Algorithm Instance Decision-making: call $AskRoundSolution$ of each algorithm instance.

  4. Each message makes a transport decision, which consists of:

    1. ${From}$ : where the message is sent from; this should be identical to the node id of the current node.

    2. $(MajorId, MinorId)$ : The request-message pair identifying the message to be transferred.

    3. ${Next}$ : where the switch should try to send the message.

    4. In each time slice, a message can be sent at most once (i.e. two decisions for the same message must not be generated within the same time slice).

    5. Note that the program will crash if you attempt to send any message that is not within your buffer space.

  5. Decision validation: the simulator checks each message decision using the rules described in Section 2: Description.

  6. Decision simulation: all message transport decisions in the network are sorted by (MajorId, MinorId) and processed (sent to the next hop) sequentially under the transport rules described above.

  7. Settlement: call $NextRound$ of each algorithm instance.
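Expressed as simulator-side pseudocode (every name below is illustrative, not an official API), one time slice therefore runs roughly as:

```python
def simulate_time_slice(instances, new_requests, sim):
    """Rough simulator-side outline of one time slice, mirroring the
    numbered steps above. All names here are illustrative."""
    # Steps 1-2: deliver new requests to every instance.
    for inst in instances:
        inst.AddRequestList(new_requests.get(inst.id, []))
    # Step 3: collect decisions, handing over last slice's packages.
    decisions = []
    for inst in instances:
        decisions += inst.AskRoundSolution(sim.packages_for(inst.id))
    # Steps 4-6: validate, then process in (MajorId, MinorId) order.
    decisions.sort(key=lambda d: (d.major_id, d.minor_id))
    results = [sim.transmit(d) for d in decisions]
    # Step 7: settlement; store each instance's package for next slice.
    for inst in instances:
        package = inst.NextRound(sim.results_for(inst.id, results))
        sim.store_package(inst.id, package)
```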

4.2 Supported Programming Language

Only Python 3 is currently supported, in order to avoid performance and implementation-specific behaviors across different programming languages.

5 Input

5.1 Data Format

Since the algorithm is embedded within the judging program, participants do not need to read in and parse input files. User solutions receive parsed data objects in the designated formats and can use them freely in the initialization (__init__) stage. Participants can refer to the files under the attachment/demo directory for more detailed information.

To help participants test their solutions locally, a sample dataset is released publicly under the relevant attachment/data directory, namely input.txt. A detailed document about the data format of the file, named README.md, is also available under the same directory.

5.2 Data Range

The number of nodes (N), edges (M), time slices (T), and requests (R) satisfy the following constraints: $1 \le N \le 208$, $1 \le M \le 384$, $1 \le T \le 200$, $1 \le R \le 5000$.

The properties of a node $N_ i$, which is the abstraction of a server or a switch, satisfy $0 \le {node\_ id}_ i < N$, ${Level}_ i \in \{ 0,1,2,3\} $, and

\[ \begin{cases} {BandwidthIn}_ i = -1 \text{ (unlimited)}, & {Level}_ i = 0\\ 100 \le {BandwidthIn}_ i \le 1000, & {Level}_ i \in \{ 1,2,3\} \end{cases} \]

\[ \begin{cases} {BandwidthOut}_ i = -1 \text{ (unlimited)}, & {Level}_ i = 0\\ 1 \le {BandwidthOut}_ i \le 250, & {Level}_ i \in \{ 1,2,3\} \end{cases} \]

\[ \begin{cases} {BufferSize}_ i = -1 \text{ (unlimited)}, & {Level}_ i = 0\\ 400 \le {BufferSize}_ i \le 1200, & {Level}_ i \in \{ 1,2,3\} \end{cases} \]

The properties of edge $E_ i$ satisfy $0 \le {From}_ i, {To}_ i < N$, and it is guaranteed that no parallel edges or self-loops exist in the input data.

The properties of request $R_ i$ satisfy:

$0 \le {Id}_ i < N$, $1 \le {Size}_ i \le 100$, $0 \le {StartTime}_ i \le T - 20$,

$0 \le {From}_ i < N$ with ${Level}_{{From}_ i} = 1$,

$0 \le {Dest}_ i < N$ with ${Level}_{{Dest}_ i} = 0$.

Please note the following:

  • Each instance in each round can return at most 256 integers (each within $[-2147483648, 2147483647]$) within an information package.

  • This graph is a k-ary fat tree, meaning every edge connects nodes on adjacent levels (the level difference is exactly 1).

  • During the $AddRequestList$ interface call, we will not drop requests that exceed the buffer size for you, as you need to know how many requests failed to gain access.

You must receive requests from the request list one by one (in order). Once not enough buffer remains, you must ignore the remaining requests yourself.

(If any bugs exist in your receiving logic, you may attempt to send a request that should have been dropped; this will cause your submission to fail due to the validation logic.)
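A minimal sketch of that receiving loop, applying the per-request buffer check from Section 4.1 (the field names are assumed):

```python
def accept_requests(requests, used_buffer, buffer_size):
    """Process the incoming request list in order, accepting a request
    only when enough buffer remains (the check from Section 4.1) and
    ignoring it otherwise. Field names are assumed."""
    accepted = []
    for request in requests:
        if used_buffer + request.size > buffer_size:
            continue  # insufficient buffer: this request fails access
        used_buffer += request.size
        accepted.append(request)
    return accepted, used_buffer
```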

6 Visualizer

To help you solve the problem, we have developed a visualizer tool, which also makes the problem more interesting.

During a judging simulation, each switch’s real-time statistics can be logged for every time slice.

You can open attachments/demo/visualizer/visualizer.html and drag the log file attachments/demo/visualizer/sample-log.json into the visualizer.

You can then replay simulations and check hot areas (by node color), node buffer sizes (by node diameter), and node statistics (by hovering the mouse over a node), as shown in Figure 2.

For further details, please check attachments/demo/visualizer/README.md.

Note that we won’t offer you simulation log files, as we need to protect the test dataset.

However, you can use attachments/data or generate your own test data and produce log files yourself.

\includegraphics[width=\linewidth ]{figure/visualizer.png}
Figure 2: Visualizer sample.
