Tensorpack trainers contain the logic for:
Building the graph.
Running the iterations (with callbacks).
Usually you won’t touch these methods directly, but will use the higher-level interface on trainers. You’ll only need to select which trainer to use. But some basic knowledge of how they work is useful:
TowerTrainer is a trainer that uses a “tower function” to build models. All existing trainers in tensorpack are subclasses of TowerTrainer, because this concept is able to cover most types of neural-network training tasks.
What is Tower Function
Following the terminology in TensorFlow, a tower function is a callable that takes input tensors and adds one replicate of the model to the graph.
The concept of tower is used mainly to support:
Data-parallel multi-GPU training, where a replicate is built on each GPU.
Graph construction for inference, where a replicate is built under inference mode.
A user needs to provide a tower function to use TowerTrainer. In particular, when working with the ModelDesc interface, the build_graph method will be part of the tower function.
Rules of Tower Function
The tower function needs to follow some rules:
It may get called multiple times for data-parallel training or inference. As a result:
You’ll need to be careful when modifying global states, e.g. adding ops to collections, setting attributes of a model instance.
To use a tensorflow-hub module, you need to initialize the module outside the tower function, and call the module inside the tower function.
It must respect variable collections:
(Required) Only put variables trainable by gradient descent into the TRAINABLE_VARIABLES collection.
(Recommended) Put non-trainable variables that need to be used in inference into the MODEL_VARIABLES collection.
It must respect variable scope names:
The name of any trainable variable created in the function must look like “variable_scope_name/other/scopes/and/name”. Strictly speaking, the name of any trainable variable must:
Start with the name of the enclosing variable_scope when the tower function is called.
Not use the same variable_scope’s name twice in its name.
Not depend on name_scope’s name.
Not depend on any tensor’s name (because the tensor’s name may depend on name_scope’s name).
Tensorpack layers create variables based on the name given to the layer: e.g., Conv2D('test', x) will open a variable scope named “test”. In order to respect the above rules, the name of the layer must not depend on name_scope’s name or any tensor’s name.
It must respect variable scope reuse:
The creation of any trainable variable must respect the reuse setting of the enclosing variable scope. To respect variable reuse (i.e. sharing), use tf.get_variable instead of tf.Variable in the function.
On the other hand, for a non-trainable variable, it may be desirable to not reuse it between towers. In this case, tf.Variable can be used to ensure creation of new variables in each tower even when reuse=True.
Do not modify the reuse option (e.g., by scope.reuse_variables()) of a variable scope that was not created by you. Doing so affects other people’s code. You can always open new scopes if you need the reuse option.
It must not create scopes or variables containing the name ‘tower’, as it is reserved for special use.
These conventions are easy to follow, and most layer wrappers (e.g., tf.layers/slim/tensorlayer) do follow them. Note that certain Keras layers do not follow these conventions and will need some workarounds if used within tensorpack.
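The scope-name and reuse rules above can be illustrated with a small framework-free sketch. Note the assumptions: ToyVariableStore, good_tower_func, bad_tower_func and the "replica0"/"replica1" scope names are hypothetical stand-ins invented for this illustration, not TensorFlow or tensorpack APIs. The sketch only models the idea that the first call to a tower function creates variables and later calls must find the same names again.

```python
# Toy stand-in for a variable scope's storage; NOT a TensorFlow or tensorpack API.
class ToyVariableStore:
    def __init__(self):
        self.variables = {}

    def get_variable(self, name, initial_value, reuse):
        """Create `name` when reuse is False, or look it up when reuse is True."""
        if reuse:
            if name not in self.variables:
                raise KeyError("cannot reuse missing variable %r" % name)
            return self.variables[name]
        if name in self.variables:
            raise ValueError("variable %r already exists" % name)
        # A one-element list acts as a mutable cell holding the variable's value.
        self.variables[name] = [initial_value]
        return self.variables[name]


def good_tower_func(store, inputs, name_scope, reuse):
    # name_scope is deliberately ignored: the variable name is fixed,
    # so every tower resolves to the same variable.
    w = store.get_variable("linear/W", 2.0, reuse)
    return [w[0] * x for x in inputs]


def bad_tower_func(store, inputs, name_scope, reuse):
    # The variable name depends on the per-tower name scope, which breaks sharing.
    w = store.get_variable(name_scope + "/linear/W", 2.0, reuse)
    return [w[0] * x for x in inputs]


# Build two replicas: the first call creates variables, the second reuses them.
store = ToyVariableStore()
out0 = good_tower_func(store, [1.0, 2.0], "replica0", reuse=False)
out1 = good_tower_func(store, [3.0, 4.0], "replica1", reuse=True)
shared_names = sorted(store.variables)

# The second replica of the bad function cannot find a variable to reuse.
bad_store = ToyVariableStore()
bad_tower_func(bad_store, [1.0], "replica0", reuse=False)
try:
    bad_tower_func(bad_store, [1.0], "replica1", reuse=True)
    reuse_failed = False
except KeyError:
    reuse_failed = True

print(shared_names, out0, out1, reuse_failed)
```

In this toy model, both replicas of good_tower_func share the single variable "linear/W", while the second replica of bad_tower_func fails because its variable name changed with the name scope. Real variable scopes have more machinery, but the failure mode is analogous.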
What You Can Do Inside Tower Function
Call any symbolic functions as long as they follow the above rules.
The tower function will be called under a TowerContext, which can be accessed by get_current_tower_context(). The context contains information about training/inference mode, scope name, etc. You can use the context to build a different graph in different modes.
For data-parallel multi-GPU training, different multi-GPU trainers
implement different distribution strategies.
They take care of device placement, gradient averaging and synchronization
efficiently, and all reach the same performance as the
official TF benchmarks.
It takes only a one-line code change to use them.
Note some common problems when using these trainers:
In each iteration, instead of taking one input tensor for all GPUs and splitting it, each GPU takes its own tensors from the InputSource. So the total batch size across all GPUs is (batch size of InputSource) * #GPU. You may want to change steps_per_epoch or the learning rate appropriately according to the total batch size.
Splitting a tensor for data-parallel training (as done by frameworks like Keras) makes no sense at all. First, it wastes time doing the split, because typically the data was first concatenated by the user. Second, it puts unnecessary shape constraints on the data, since the inputs on each GPU need to have compatible shapes.
The tower function (your model code) will get called once on each GPU. You must follow the above-mentioned rules of tower function.
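The batch-size arithmetic in the first point can be sketched in plain Python. The helper names and the numbers below are hypothetical, and the linear learning-rate scaling is a common heuristic assumed here for illustration, not something this document prescribes.

```python
def total_batch_size(per_gpu_batch, num_gpus):
    """Total batch per iteration: each GPU takes its own batch from the InputSource."""
    return per_gpu_batch * num_gpus


def steps_per_epoch(dataset_size, per_gpu_batch, num_gpus):
    """Iterations needed to see roughly the whole dataset once (remainder dropped)."""
    return dataset_size // total_batch_size(per_gpu_batch, num_gpus)


def linearly_scaled_lr(base_lr, base_total_batch, per_gpu_batch, num_gpus):
    """Assumed heuristic: scale the learning rate in proportion to the total batch size."""
    return base_lr * total_batch_size(per_gpu_batch, num_gpus) / base_total_batch


# Hypothetical setup: 50,000 samples, InputSource batch of 64, 8 GPUs,
# and a reference learning rate of 0.1 tuned for a total batch of 256.
total = total_batch_size(64, 8)
steps = steps_per_epoch(50000, 64, 8)
lr = linearly_scaled_lr(0.1, 256, 64, 8)
print(total, steps, lr)
```

With these numbers the total batch size is 512, so going from 1 to 8 GPUs without adjusting steps_per_epoch would make each epoch cover eight times as much data.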
Distributed training needs the horovod library, which offers a high-performance allreduce implementation. To run distributed training, first install horovod properly, then refer to the documentation of HorovodTrainer.
Tensorpack has implemented some other distributed trainers using TF’s native API, but TensorFlow is not actively supporting its distributed training features, and its native distributed performance isn’t very good even today. Therefore those trainers are not actively maintained and are not recommended for use.