LIBBLE-DL

Introduction

LIBBLE-DL is the LIBBLE variant for distributed deep learning, implemented on top of PyTorch.

Currently, PyTorch only provides an AllReduce framework for distributed deep learning, which has a high communication cost. Here, we design and develop three distributed deep learning frameworks based on PyTorch: MR-DisPyTorch, RA-DisPyTorch and PS-DisPyTorch. All three have a lower communication cost than the AllReduce framework in PyTorch. They target different application scenarios, and users can choose the framework that best suits their specific needs in real applications.

Tutorial

Configure Environment

LIBBLE-DL provides a Docker image for quick configuration.

Examples

We give examples of using these three distributed deep learning frameworks. The examples train a 20-layer ResNet model on the CIFAR10 dataset. Both the training schedule and the hyper-parameter settings follow the practice in [He et al., 2016].
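For orientation, the step schedule reported for the CIFAR-10 experiments in [He et al., 2016] can be sketched as below. This is a minimal illustration, not code taken from LIBBLE-DL: the constant names and the `learning_rate` helper are hypothetical, and it is an assumption that the bundled example uses exactly these values.

```python
# Hyper-parameters reported for the CIFAR-10 ResNets in He et al. (2016).
# Assumption: the LIBBLE-DL example follows them unchanged.
BASE_LR = 0.1                  # initial learning rate
MILESTONES = (32_000, 48_000)  # iterations at which the rate is divided by 10
TOTAL_ITERATIONS = 64_000      # training stops after this many iterations
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4
BATCH_SIZE = 128


def learning_rate(iteration):
    """Return the learning rate used at a given (0-based) iteration."""
    if not 0 <= iteration < TOTAL_ITERATIONS:
        raise ValueError("iteration outside the training schedule")
    # Count how many milestones have already been reached.
    drops = sum(iteration >= m for m in MILESTONES)
    return BASE_LR / (10 ** drops)


if __name__ == "__main__":
    for it in (0, 32_000, 48_000):
        print(it, learning_rate(it))
```

In a PyTorch training loop the same schedule would typically be expressed with `torch.optim.lr_scheduler.MultiStepLR`, stepped once per iteration.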

You can find the complete code inside the example/cifar10 directory. The examples use the MPI backend for communication (you can also use any other backend that PyTorch supports). To start the example for the MapReduce programming model with one master node and 4 worker nodes, run the following command:

mpirun -n 5 --hostfile hosts python -m LIBBLE-DL.examples.cifar10.mapreduce

The file hosts contains a list of IP addresses that tells mpirun which hosts to start the processes on. RA-DisPyTorch and PS-DisPyTorch are used in a similar way.
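The command above launches 5 processes: one master and 4 workers. The sketch below shows one plausible way such a layout could map onto MPI ranks and data shards. It is an illustration under stated assumptions, not LIBBLE-DL's actual code: treating rank 0 as the master, the strided sharding, and the helper names `assign_roles` and `shard_indices` are all hypothetical.

```python
NUM_TRAIN_IMAGES = 50_000  # size of the CIFAR10 training set


def assign_roles(world_size):
    """Map each rank to a role. Assumption: rank 0 acts as the master."""
    if world_size < 2:
        raise ValueError("need one master and at least one worker")
    return {rank: "master" if rank == 0 else "worker"
            for rank in range(world_size)}


def shard_indices(num_samples, num_workers, worker_id):
    """Sample indices handled by one worker (strided split, an assumption)."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id out of range")
    return range(worker_id, num_samples, num_workers)


# Layout matching `mpirun -n 5`: ranks 1..4 are workers 0..3.
roles = assign_roles(5)
num_workers = sum(role == "worker" for role in roles.values())
shards = [shard_indices(NUM_TRAIN_IMAGES, num_workers, w)
          for w in range(num_workers)]

if __name__ == "__main__":
    print(roles)
    print([len(s) for s in shards])
```

In a real PyTorch process, the rank and world size would come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()` after calling `torch.distributed.init_process_group(backend="mpi")`. Note that the MPI backend is only available when PyTorch is built from source against an MPI installation.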

Empirical Comparison

Open Source

Development Team