## LIBBLE-DL

### Introduction

LIBBLE-DL is the LIBBLE variant for distributed deep learning, implemented on top of PyTorch.

Currently, PyTorch only provides an AllReduce framework for distributed deep learning, whose communication cost is high. Here, we design and develop three distributed deep learning frameworks based on PyTorch: MR-DisPyTorch, RA-DisPyTorch and PS-DisPyTorch. These three distributed deep learning frameworks have lower communication cost than the AllReduce framework in PyTorch. MR-DisPyTorch, RA-DisPyTorch and PS-DisPyTorch handle different kinds of application scenarios, and users can choose a suitable framework according to their specific needs in real applications.

### Tutorial

• MR-DisPyTorch

MR-DisPyTorch is a distributed deep learning framework based on the MapReduce programming model [Dean et al., 2008]. MR-DisPyTorch adopts a synchronous update strategy. It is suited to application scenarios where the network model is small, the number of distributed nodes is small, and the computing resources of the nodes are balanced.

The source code is stored in the dispytorch/mapreduce directory. We define two classes, master and worker, which allow users to create nodes conveniently; their instances represent the master node and worker nodes, respectively. Users can start a distributed deep learning task with the MapReduce programming model as shown in the code below:

import torch
import torch.distributed as dist
import dispytorch

# the default process group must already be initialized,
# e.g. via dist.init_process_group()
rank = dist.get_rank()
world_size = dist.get_world_size()

# args is a dict of the parameters listed below
if rank == 0:
    n = dispytorch.mapreduce.master(**args)
else:
    n = dispytorch.mapreduce.worker(**args)
n.train()


Parameters of master and worker are listed as follows:

| Parameter | Description |
| --- | --- |
| rank | rank of the current process |
| num_workers | number of worker nodes |
| cuda | whether to compute on GPU |
| save_path | path to save the model (default: none) |
| data_loader | training data loader (only on worker nodes) |
| test_loader | test data loader (only on the master node) |
| model | the model architecture |
| criterion | the loss function |
| optim_fn | the optimizer (only on the master node) |
| adjust_epochs | epochs at which the learning rate is decayed by a factor of 10 from the initial LR (only on the master node) |
| num_epochs | total number of epochs to run |
| start_epoch | manual epoch number (useful on restarts) |
| bucket_comm | whether to communicate layer by layer |
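
For concreteness, the following is a minimal sketch of how the args dict in the snippet above might be assembled on the master node (rank 0). The toy model, data, path, and the assumed types of optim_fn and adjust_epochs (an optimizer factory and a list of epoch indices) are illustrative assumptions, not part of LIBBLE-DL.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# toy test set and model, purely for illustration
test_set = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

args = dict(
    rank=0,                               # this sketch builds the master's args
    num_workers=3,
    cuda=torch.cuda.is_available(),
    save_path='./checkpoints',            # hypothetical path
    test_loader=DataLoader(test_set, batch_size=64),     # test data lives on the master
    model=model,
    criterion=nn.CrossEntropyLoss(),
    optim_fn=lambda params: optim.SGD(params, lr=0.1),   # assumed optimizer factory
    adjust_epochs=[30, 60],               # assumed list of LR-decay epochs
    num_epochs=90,
    start_epoch=0,
    bucket_comm=True,                     # communicate layer by layer
)

A worker's args would instead carry its own data_loader and drop the master-only entries.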

• RA-DisPyTorch

RA-DisPyTorch is a decentralized distributed deep learning framework based on the Ring AllReduce programming model [Gibiansky et al., 2017]. RA-DisPyTorch adopts a synchronous update strategy. It is suited to application scenarios where the network model is large, the number of distributed nodes is large, and the computing resources of the nodes are balanced.

The source code is stored in the dispytorch/ring directory. There is only one kind of computation node in the Ring AllReduce programming model, so we define a single node class, each instance of which represents a computation node. Users can start a distributed deep learning task with the Ring AllReduce programming model as shown in the code below:

import torch
import torch.distributed as dist
import dispytorch

# the default process group must already be initialized,
# e.g. via dist.init_process_group()
rank = dist.get_rank()
world_size = dist.get_world_size()

# every process runs the same node class; args is a dict of the parameters listed below
n = dispytorch.ring.node(**args)
n.train()


Parameters of node are listed as follows:

| Parameter | Description |
| --- | --- |
| rank | rank of the current process |
| world_size | number of processes in the distributed group |
| cuda | whether to compute on GPU |
| save_path | path to save the model (default: none) |
| data_loader | training data loader |
| test_loader | test data loader |
| model | the model architecture |
| criterion | the loss function |
| optim_fn | the optimizer |
| adjust_epochs | epochs at which the learning rate is decayed by a factor of 10 from the initial LR |
| num_epochs | total number of epochs to run |
| start_epoch | manual epoch number (useful on restarts) |
| bucket_comm | whether to communicate layer by layer |
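
As a brief illustration, every node in the ring holds both a training and a test loader and the same optimizer settings. A hedged sketch of node args, under the same illustrative assumptions as the MapReduce sketch above, might look as follows.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

rank, world_size = 0, 4   # in practice taken from dist.get_rank() / dist.get_world_size()
train_set = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
test_set = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))

args = dict(
    rank=rank,
    world_size=world_size,
    cuda=torch.cuda.is_available(),
    save_path=None,
    data_loader=DataLoader(train_set, batch_size=64, shuffle=True),
    test_loader=DataLoader(test_set, batch_size=64),
    model=nn.Sequential(nn.Linear(784, 10)),
    criterion=nn.CrossEntropyLoss(),
    optim_fn=lambda params: optim.SGD(params, lr=0.1),   # assumed optimizer factory
    adjust_epochs=[30, 60],                              # assumed list of LR-decay epochs
    num_epochs=90,
    start_epoch=0,
    bucket_comm=True,
)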

• PS-DisPyTorch

PS-DisPyTorch is a distributed deep learning framework based on the Parameter Server programming model [Li et al., 2014]. PS-DisPyTorch supports synchronous, asynchronous and semi-synchronous (stale-synchronous) update strategies. The synchronous strategy is suited to application scenarios where the deep learning model is of medium size, the number of distributed nodes is moderate, and the computing resources of the nodes are balanced. The asynchronous and semi-synchronous strategies are suited to scenarios of the same scale in which the computing resources of the nodes are uneven.

The source code is stored in the dispytorch/ps directory. We define three classes: coordinator, server and worker, whose instances represent the coordinator node, server nodes and worker nodes, respectively. Users can start a distributed deep learning task with the Parameter Server programming model as shown in the code below:

import torch
import torch.distributed as dist
import dispytorch

# the default process group must already be initialized,
# e.g. via dist.init_process_group()
rank = dist.get_rank()
world_size = dist.get_world_size()

# num_servers is the number of server nodes (the length of the servers rank list);
# rank 0 is the coordinator, ranks 1..num_servers are servers, the rest are workers
if rank == 0:
    n = dispytorch.ps.coordinator(**args)
elif 0 < rank <= num_servers:
    n = dispytorch.ps.server(**args)
else:
    n = dispytorch.ps.worker(**args)
n.train()


Parameters of coordinator, server and worker are listed as follows:

| Parameter | Description |
| --- | --- |
| rank | rank of the current process |
| servers | rank list of the server nodes |
| workers | rank list of the worker nodes |
| cuda | whether to compute on GPU |
| save_path | path to save the model (default: none) |
| data_loader | training data loader (only on worker nodes) |
| test_loader | test data loader (only on the coordinator node) |
| num_batches | number of batches in an epoch |
| model | the model architecture |
| criterion | the loss function |
| optim_fn | the optimizer (only on server nodes) |
| time_window | maximal delay time (only on server nodes) |
| adjust_epochs | epochs at which the learning rate is decayed by a factor of 10 from the initial LR (only on server nodes) |
| num_epochs | total number of epochs to run |
| start_epoch | manual epoch number (useful on restarts) |
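
The sketch below illustrates, under the same kind of assumptions as the earlier sketches (a toy model and data set, an assumed optimizer factory for optim_fn, an assumed list of epochs for adjust_epochs, and an assumed integer staleness bound for time_window), how role-dependent args could be put together for the snippet above.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

num_servers = 2
world_size = 7                                       # 1 coordinator + 2 servers + 4 workers
servers = list(range(1, num_servers + 1))            # ranks of the server nodes
workers = list(range(num_servers + 1, world_size))   # ranks of the worker nodes

train_set = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
test_set = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))

# parameters shared by all roles
common = dict(
    servers=servers,
    workers=workers,
    cuda=torch.cuda.is_available(),
    save_path=None,
    num_batches=len(train_set) // 64,
    model=nn.Sequential(nn.Linear(784, 10)),
    criterion=nn.CrossEntropyLoss(),
    num_epochs=90,
    start_epoch=0,
)

coordinator_args = dict(common, rank=0,
                        test_loader=DataLoader(test_set, batch_size=64))
server_args = dict(common, rank=1,
                   optim_fn=lambda params: optim.SGD(params, lr=0.1),  # assumed optimizer factory
                   time_window=4,              # assumed integer staleness bound
                   adjust_epochs=[30, 60])     # assumed list of LR-decay epochs
worker_args = dict(common, rank=3,
                   data_loader=DataLoader(train_set, batch_size=64, shuffle=True))

In a real run, each process would build only the args for its own role, chosen by its rank as in the snippet above.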

### Configure Environment

LIBBLE-DL provides a Docker image for quick environment configuration.

• Prerequisites

The prerequisites are listed below.

• NVIDIA drivers 375.66
• CUDA 8.0
• cuDNN 7.0
• nvidia-docker 1.0

• How to use

Download the tar archive pytorchmpi_cudnn.tar and load the image from it:

docker load -i pytorchmpi_cudnn.tar


Run utils/bootmpipytorch.sh to create the containers and set up passwordless SSH login among them:

./bootmpipytorch.sh arg1 arg2


The first argument indicates the number of containers you want to create. The second argument indicates the host path to bind-mount as a volume. After running the script, we will have N containers named from $node_0$ to $node_{N-1}$. We can enter any one of these containers for further operations with the following command: