Integrate multiverso into torch project
- (For GPU support only) Install CUDA, cuDNN, Torch and Torch cuDNN bindings according to this guide.
- Install Multiverso shared object by referring to the Build instruction of multiverso project.
- Install multiverso torch package by referring to the Installation instruction of multiverso torch/lua binding.
Load and initialize the multiverso package, then get some useful parameters at the beginning of the project.
-- Load multiverso.
local multiverso = require 'multiverso'
-- Init multiverso.
multiverso.init()
-- Get total number of workers.
multiverso.num_workers = multiverso.num_workers()
-- Get the id for current worker.
multiverso.worker_id = multiverso.worker_id()
-- Easy access to check whether this is master worker.
multiverso.is_master = multiverso.worker_id == 0
Create a Table Handler as an interface for syncing.
- The model variable is a Module class in the torch.nn package used to build neural networks.
- ArrayTableHandler is used in this example as it satisfies most cases.
- Actually, we can sync any variables (tables in Lua or Tensors in torch) with multiverso, but model syncing is used as the example here because it is the most common use case.
- During the initialization, we need to specify the exact size to sync.
-- Get static params and gradParams from model variable.
local params, gradParams = model:getParameters()
-- Create ArrayTableHandler for syncing parameters.
local tbh = multiverso.ArrayTableHandler:new(params:size(1))
Before actual training, we also need to make sure each worker has the same initial model for better training performance.
Multiverso uses a master strategy to initialize the model. Only the init_value from the master worker is used to initialize the model on the server, and then all workers fetch the same initial model.
-- Create ArrayTableHandler for syncing parameters. In the constructor, only
-- the init_value from the master worker will be used to initialize the model.
local tbh = multiverso.ArrayTableHandler:new(params:size(1), params)
-- Wait for the initialization phase to finish.
multiverso.barrier()
-- Get the initial model from the server.
params:copy(tbh:get())
During training, or at any other place where we want to sync something, two steps are needed:
- Add the gradients (delta value) to the server.
- Fetch the newest value from the server.
- The learningRate variable is the learning rate maintained by the program.
- Only gradients (delta values) should be passed to the table handler.
- The fetch step should overwrite other changes to the params variable, so we use params:copy() here.
-- Add the gradients (delta value) to the server.
tbh:add(learningRate * gradParams)
-- Fetch the newest value from the server.
params:copy(tbh:get())
Sometimes, we want to do something like log printing or validation. This kind of procedure should only be performed in the master worker, as we won't be able to see any result from other workers.
if multiverso.is_master then
-- Do something like print or validation here.
end
Sometimes, we want all workers to reach the same state after training distributedly for several epochs, e.g., every 10 epochs.
- The epoch variable is the number of the current epoch.
- The barrier should be placed between the first and the second step of the sync phase.
-- Add the gradients (delta value) to the server.
tbh:add(learningRate * gradParams)
-- Synchronize all workers each several epochs.
if epoch % 10 == 0 then
multiverso.barrier()
end
-- Fetch the newest value from the server.
params:copy(tbh:get())
After training finishes, remember to shut down before exiting.
multiverso.shutdown()
-- This should be the end of the whole project.
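The steps above can be combined into a minimal training-loop sketch. The multiverso calls are exactly the ones documented in this page; the model, criterion, numEpochs, and getBatch pieces are hypothetical placeholders standing in for your own torch training code.

```lua
-- Minimal end-to-end sketch using the multiverso calls shown above.
-- NOTE: model, criterion, numEpochs and getBatch are hypothetical
-- placeholders for your own torch training code.
local multiverso = require 'multiverso'
multiverso.init()
multiverso.num_workers = multiverso.num_workers()
multiverso.worker_id = multiverso.worker_id()
multiverso.is_master = multiverso.worker_id == 0

local params, gradParams = model:getParameters()
-- Only the master's init_value initializes the server-side model.
local tbh = multiverso.ArrayTableHandler:new(params:size(1), params)
multiverso.barrier()
params:copy(tbh:get())

local learningRate = 0.01
for epoch = 1, numEpochs do
    for batch in getBatch() do
        gradParams:zero()
        local output = model:forward(batch.input)
        criterion:forward(output, batch.target)
        model:backward(batch.input, criterion:backward(output, batch.target))
        -- Push the delta to the server.
        tbh:add(learningRate * gradParams)
        -- Bring all workers to the same state every 10 epochs.
        if epoch % 10 == 0 then
            multiverso.barrier()
        end
        -- Fetch the newest value from the server.
        params:copy(tbh:get())
    end
    if multiverso.is_master then
        -- Do master-only logging or validation here.
    end
end
multiverso.shutdown()
```

This sketch only illustrates where each multiverso call fits relative to an ordinary torch training loop; it is not runnable without a working multiverso installation and real model/data code.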
There are some examples demonstrating how to use the multiverso torch/lua binding.