tensorpack.train package¶
Relevant tutorials: Trainers, Training Interface
-
exception
tensorpack.train.
StopTraining
[source]¶ Bases:
Exception
An exception thrown to stop training.
-
class
tensorpack.train.
Trainer
[source]¶ Bases:
object
Base class for a trainer.
-
property
epoch_num
¶ The number of the currently ongoing epoch.
An epoch is defined to cover the moment before calling before_epoch until after calling trigger_epoch; i.e., in the trigger_epoch of epoch 3, self.epoch_num is 3. Keep this in mind if you use self.epoch_num in your callback.
-
property
global_step
¶ The tensorflow global_step, i.e. how many times
hooked_sess.run
has been called.
Note
global_step is incremented after each
hooked_sess.run
returns from TF runtime.
If you make zero or more than one calls to
hooked_sess.run
in one
run_step()
, local_step and global_step may increment at different speeds.
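The divergence described in the note can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not tensorpack code: `ToySession` and `ToyTrainer` are hypothetical stand-ins that merely count calls, with `global_step` incremented once per `run()` and `local_step` once per finished `run_step()`.

```python
class ToySession:
    """Hypothetical stand-in for hooked_sess: counts every run() call."""

    def __init__(self):
        self.global_step = 0

    def run(self):
        # In this toy model, global_step grows once per run() call.
        self.global_step += 1


class ToyTrainer:
    """Hypothetical trainer that counts finished run_step() calls per epoch."""

    def __init__(self, runs_per_step):
        self.hooked_sess = ToySession()
        self.local_step = 0
        self.runs_per_step = runs_per_step

    def run_step(self):
        # A custom run_step may call hooked_sess.run zero, one, or many times.
        for _ in range(self.runs_per_step):
            self.hooked_sess.run()

    def run_epoch(self, steps_per_epoch):
        self.local_step = 0
        for _ in range(steps_per_epoch):
            self.run_step()
            self.local_step += 1


def step_counts(runs_per_step, steps_per_epoch=5):
    """Return (local_step, global_step) after one toy epoch."""
    trainer = ToyTrainer(runs_per_step)
    trainer.run_epoch(steps_per_epoch)
    return trainer.local_step, trainer.hooked_sess.global_step
```

With one `run()` per step the two counters agree; with zero or two calls per step they drift apart, which is the situation the note warns about.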
-
hooked_sess
= None¶ The
tf.train.MonitoredSession
object the trainer is using. It contains all the before_run/after_run
hooks the callbacks have registered. It is used for running the training iterations. Available after initialize()
.
Note that using
hooked_sess.run
will evaluate all the hooks, just like running a training iteration. It may do the following:
Take a datapoint from the InputSource
Increase the global_step
Evaluate some summaries
Typically you should not use
hooked_sess.run
in callbacks, because it is for the “training iteration”. If you just want to evaluate some tensors, use sess.run
if the tensors do not depend on the inputs, or more generally, use before_run/after_run to evaluate the tensors along with the training iterations.
-
initialize
(session_creator, session_init)[source]¶ Create the session and set self.sess. Call self.initialize_hooks(). Finalize the graph.
It must be called after callbacks are set up.
- Parameters
session_creator (tf.train.SessionCreator) –
session_init (sessinit.SessionInit) –
-
initialize_hooks
()[source]¶ Create SessionRunHooks for all callbacks, and hook them onto self.sess to create self.hooked_sess.
A new trainer may override this method to create multiple groups of hooks, which can be useful when the training is not done by a single train_op.
-
is_chief
= True¶ Whether this process is the chief worker in distributed training. Certain callbacks will only be run by chief worker.
-
property
local_step
¶ The number of steps that have finished in the current epoch.
-
main_loop
(steps_per_epoch, starting_epoch, max_epoch)[source]¶ Run the main training loop.
- Parameters
steps_per_epoch, starting_epoch, max_epoch (int) –
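The shape of such a loop can be sketched in plain Python. This is a toy model, not the library's implementation: `toy_main_loop`, `stop_after`, and the local `StopTraining` class are hypothetical, and the sketch assumes that `max_epoch` is inclusive and that the loop itself catches `StopTraining`.

```python
class StopTraining(Exception):
    """Local stand-in for tensorpack.train.StopTraining."""


def toy_main_loop(steps_per_epoch, starting_epoch, max_epoch, run_step, trigger_epoch):
    """Run epochs starting_epoch..max_epoch (assumed inclusive); return an event log."""
    log = []
    try:
        for epoch_num in range(starting_epoch, max_epoch + 1):
            for local_step in range(steps_per_epoch):
                run_step(epoch_num, local_step)
            # epoch_num is still the current epoch while trigger_epoch runs.
            trigger_epoch(epoch_num)
            log.append(epoch_num)
    except StopTraining:
        # Assumption: raising StopTraining ends the loop gracefully.
        log.append("stopped")
    return log


def stop_after(epoch_limit):
    """Build a trigger_epoch callback that stops training at epoch_limit."""
    def trigger_epoch(epoch_num):
        if epoch_num >= epoch_limit:
            raise StopTraining()
    return trigger_epoch


# Four full epochs, then an early stop raised from the second epoch's trigger.
completed = toy_main_loop(3, 1, 4, lambda e, s: None, lambda e: None)
stopped = toy_main_loop(3, 1, 10, lambda e, s: None, stop_after(2))
```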
-
property
max_epoch
¶
-
register_callback
(cb)¶ Register callbacks to the trainer. It can only be called before
Trainer.train()
.
-
run_step
()[source]¶ Defines what to do in one iteration. The default is:
self.hooked_sess.run(self.train_op)
.
The behavior of each iteration can be changed by either setting
trainer.train_op
, or overriding this method.
-
sess
= None¶ The
tf.Session
object the trainer is using. Available after initialize()
.
Using
trainer.sess.run
to evaluate tensors that depend on the training InputSource
may have unexpected effects:
For example, if you use
trainer.sess.run
to evaluate a tensor that depends on the inputs coming from a StagingArea
, it will take a datapoint from the StagingArea
, making the StagingArea
empty, and as a result make the training hang.
-
setup_callbacks
(callbacks, monitors)[source]¶ Set up callbacks and monitors. Must be called after the main graph is built.
- Parameters
callbacks ([Callback]) –
monitors ([MonitorBase]) –
-
property
starting_epoch
¶
-
property
steps_per_epoch
¶
-
train
(callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch=1, max_epoch=9999999)[source]¶ Implemented by three lines:
self.setup_callbacks(callbacks, monitors)
self.initialize(session_creator, session_init)
self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
You can call these methods yourself to have better control over the details if needed.
-
train_with_defaults
(_sentinel=None, callbacks=None, monitors=None, session_creator=None, session_init=None, steps_per_epoch=None, starting_epoch=1, max_epoch=9999999, extra_callbacks=None)[source]¶ Same as
train()
, except:
Add extra_callbacks to callbacks. The default value for extra_callbacks is
DEFAULT_CALLBACKS()
.
Default value for monitors is
DEFAULT_MONITORS()
.
Provide default values for every option except steps_per_epoch.
-
class
tensorpack.train.
TrainConfig
(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999)[source]¶ Bases:
object
A collection of options to be used for single-cost trainers.
Note that you do not have to use
TrainConfig
. You can use the API of Trainer
directly, to have more fine-grained control of the training.
-
__init__
(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999)[source]¶ - Parameters
dataflow (DataFlow) –
data (InputSource) –
model (ModelDesc) –
callbacks (list[Callback]) – a list of
Callback
to use during training.
extra_callbacks (list[Callback]) –
This argument is only used to provide the defaults in addition to
callbacks
. The list of callbacks that will be used in the end is simply callbacks + extra_callbacks
.
It is usually left as None, and the default value for this argument is
DEFAULT_CALLBACKS()
. You can override it when you don’t like any of the default callbacks. For example, if you’d like to let the progress bar print tensors, you can use
extra_callbacks=[ProgressBar(names=['name']), MovingAverageSummary(), MergeAllSummaries(), RunUpdateOps()]
monitors (list[MonitorBase]) – Defaults to
DEFAULT_MONITORS()
.
session_creator (tf.train.SessionCreator) – Defaults to
sesscreate.NewSessionCreator()
with the config returned by tfutils.get_default_sess_config()
.
session_config (tf.ConfigProto) – when session_creator is None, use this to create the session.
session_init (SessionInit) – how to initialize variables of a session. Defaults to doing nothing.
starting_epoch (int) – The index of the first epoch.
steps_per_epoch (int) –
the number of steps (defined by
Trainer.run_step()
) to run in each epoch. Defaults to the input data size. You may want to divide it by the number of GPUs in multi-GPU training.
The number of steps per epoch only affects the schedule of callbacks. It does not affect the sequence of input data seen by the model.
max_epoch (int) – maximum number of epochs to run training.
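The way callbacks and extra_callbacks combine can be sketched without importing the library. The helper names below (`default_callbacks`, `resolve_callbacks`) are hypothetical, and strings stand in for real callback objects; the only behavior modeled is the documented rule that the final list is callbacks + extra_callbacks, with the defaults used when extra_callbacks is None.

```python
def default_callbacks():
    """Names only; stands in for the list returned by DEFAULT_CALLBACKS()."""
    return ["MovingAverageSummary", "ProgressBar", "MergeAllSummaries", "RunUpdateOps"]


def resolve_callbacks(callbacks=None, extra_callbacks=None):
    """Return callbacks + extra_callbacks, filling in defaults for None."""
    if callbacks is None:
        callbacks = []
    if extra_callbacks is None:
        # None means "use the defaults"; an empty list means "no extras at all".
        extra_callbacks = default_callbacks()
    return list(callbacks) + list(extra_callbacks)


with_defaults = resolve_callbacks(["ModelSaver"])
without_extras = resolve_callbacks(["ModelSaver"], extra_callbacks=[])
```

Passing an explicit list for extra_callbacks replaces the defaults entirely, which is why the progress-bar example above repeats the other three default callbacks.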
-
-
class
tensorpack.train.
AutoResumeTrainConfig
(always_resume=True, **kwargs)[source]¶ Bases:
tensorpack.train.config.TrainConfig
Same as
TrainConfig
, but does the following to automatically resume training:
If a checkpoint was found in
logger.get_logger_dir()
, set the session_init option to load it.
If a JSON history was found in
logger.get_logger_dir()
, try to load the epoch number from it and set the starting_epoch option to continue training.
You can choose to let the above two options either overwrite or not overwrite user-provided arguments, as explained below.
Note that this functionality requires the logging directory to obtain necessary information from a previous run. If you have an unconventional setup of the logging directory, this class will not work for you, for example:
If you save the checkpoint to a different directory rather than the logging directory.
If in distributed training the directory is not available to every worker, or the directories are different for different workers.
-
__init__
(always_resume=True, **kwargs)[source]¶ - Parameters
always_resume (bool) – If False, user-provided arguments session_init and starting_epoch will take priority. Otherwise, resume will take priority.
kwargs – same as in
TrainConfig
.
Note
The main goal of this class is to let a training job resume without changing any line of code or command line arguments. So it’s useful to let resume take priority over user-provided arguments sometimes.
For example: if your training starts from a pre-trained model, you would want it to use the user-provided model loader at the beginning, but a “resume” model loader when the job was interrupted and restarted.
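The priority rule can be written down as a small decision function. This is a hedged sketch, not the class's actual code: `resolve_resume` and its arguments are hypothetical, `found_checkpoint` and `resume_epoch` stand for whatever was discovered in the logging directory (None if nothing was found), and plain strings stand in for model loaders.

```python
def resolve_resume(always_resume, user_session_init=None, user_starting_epoch=None,
                   found_checkpoint=None, resume_epoch=None):
    """Pick (session_init, starting_epoch) following the always_resume rule."""
    session_init = user_session_init
    starting_epoch = user_starting_epoch

    # Resume information wins when always_resume is set, or when the user
    # did not provide the corresponding argument.
    if found_checkpoint is not None and (always_resume or user_session_init is None):
        session_init = found_checkpoint
    if resume_epoch is not None and (always_resume or user_starting_epoch is None):
        starting_epoch = resume_epoch

    if starting_epoch is None:
        starting_epoch = 1  # same default as TrainConfig
    return session_init, starting_epoch


# First launch: nothing in the log directory, so the pre-trained loader is used.
first_run = resolve_resume(True, user_session_init="pretrained.npz")
# Restart after interruption: the checkpoint and epoch from the log directory win.
restarted = resolve_resume(True, "pretrained.npz", None, "checkpoint-1000", 6)
# always_resume=False: user-provided arguments keep priority.
user_wins = resolve_resume(False, "pretrained.npz", 3, "checkpoint-1000", 6)
```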
-
tensorpack.train.
DEFAULT_CALLBACKS
()[source]¶ Return the default callbacks, which will be used in
TrainConfig
and Trainer.train_with_defaults()
. They are:
MovingAverageSummary()
ProgressBar()
MergeAllSummaries()
RunUpdateOps()
-
tensorpack.train.
DEFAULT_MONITORS
()[source]¶ Return the default monitors, which will be used in
TrainConfig
and Trainer.train_with_defaults()
. They are:
TFEventWriter()
JSONWriter()
ScalarPrinter()
-
tensorpack.train.
launch_train_with_config
(config, trainer)[source]¶ Train with a
TrainConfig
and a Trainer
, to present the simple and old training interface. It basically does the following 3 things (and you can easily do them by yourself if you need more control):
Set up the input with automatic prefetching heuristics, from config.data or config.dataflow.
Call trainer.setup_graph with the input as well as config.model.
Call trainer.train with the rest of the attributes of config.
See the related tutorial to learn more.
- Parameters
config (TrainConfig) –
trainer (Trainer) – an instance of
SingleCostTrainer
.
Example:
launch_train_with_config(
    config, SyncMultiGPUTrainerParameterServer(8, ps_device='gpu'))
-
class
tensorpack.train.
ModelDesc
[source]¶ Bases:
tensorpack.train.model_desc.ModelDescBase
One subclass of
ModelDescBase
with the assumption of single-cost and single-optimizer training. It has the following constraints in addition to ModelDescBase
:
The build_graph(…) method should return a cost tensor when called under a training context. The cost will be the final cost to be optimized by the optimizer. Therefore it should include necessary regularization.
A subclass is expected to implement
optimizer()
method.
-
class
tensorpack.train.
ModelDescBase
[source]¶ Bases:
object
Base class for a model description.
It is used for the simple training interface described in Training Interface Tutorial.
Subclass is expected to implement
inputs()
and build_graph()
, as they together define a tower function.
-
build_graph
(*args)[source]¶ A subclass is expected to implement this method.
Build the whole symbolic graph. This is supposed to be part of the “tower function” when used with
TowerTrainer
.
- Parameters
args ([tf.Tensor]) – tensors that match the list of inputs defined by
inputs()
.
- Returns
In general it returns nothing, but a subclass may require it to return necessary information to build the trainer. For example, SingleCostTrainer expects this method to return the cost tensor.
-
get_input_signature
()[source]¶ - Returns
A list of
tf.TensorSpec
, which describes the inputs of this model. The result is cached for each instance of ModelDescBase
.
-
inputs
()[source]¶ A subclass is expected to implement this method.
If returning placeholders, the placeholders have to be created inside this method. Don’t return placeholders created in other places.
Also, users should never call this method themselves.
- Returns
list[tf.TensorSpec or tf.placeholder].
-
-
class
tensorpack.train.
SingleCostTrainer
[source]¶ Bases:
tensorpack.train.tower.TowerTrainer
Base class for single-cost trainer.
Single-cost trainer has a
setup_graph()
method which takes (input_signature, input, get_cost_fn, get_opt_fn), and builds the training graph from them.
To use a
SingleCostTrainer
object, call trainer.setup_graph(…); trainer.train(…).
-
AGGREGATION_METHOD
= 0¶ See tf.gradients.
-
COLOCATE_GRADIENTS_WITH_OPS
= True¶ See tf.gradients. It sometimes can heavily affect performance when backward op does not support the device of forward op.
-
GATE_GRADIENTS
= False¶ See tf.gradients.
-
XLA_COMPILE
= False¶ Use
xla.compile()
to compile the tower function. Note that XLA has very strong requirements on the tower function, e.g.:
limited op support
inferrable shape
no summary support
and many tower functions cannot be compiled by XLA. Don’t use it if you don’t understand it.
-
setup_graph
(input_signature, input, get_cost_fn, get_opt_fn)[source]¶ Responsible for building the main training graph for single-cost training.
- Parameters
input_signature ([TensorSpec]) – list of TensorSpec that describe the inputs
input (InputSource) – an InputSource which has to match the input signature
get_cost_fn ([tf.Tensor] -> tf.Tensor) – callable, takes some input tensors and returns a cost tensor.
get_opt_fn (-> tf.train.Optimizer) – callable which returns an optimizer. Will only be called once.
Note
get_cost_fn will be part of the tower function. It must follow the rules of tower functions.
-
-
class
tensorpack.train.
TowerTrainer
[source]¶ Bases:
tensorpack.train.base.Trainer
Base trainers for models that can be built by calling a tower function under a
TowerContext
.
The assumption of a tower function is required by some features that replicate the model automatically. For example, TowerTrainer can create a predictor for you automatically, by calling the tower function.
To use
TowerTrainer
, set tower_func and use it to build the graph. Note that tower_func can only be set once per instance of TowerTrainer.
-
get_predictor
(input_names, output_names, device=0)[source]¶ This method will build the trainer’s tower function under
TowerContext(is_training=False)
, and returns a callable predictor with input placeholders & output tensors in this tower.
This method handles the common case where you run inference with the same tower function you provide to the trainer. If you want to do inference with a different tower function, you can always build the tower yourself, under a “reuse” variable scope and a TowerContext(is_training=False).
- Parameters
- Returns
an
OnlinePredictor
.
Example:
# in the graph:
interesting_tensor = tf.identity(x, name='fun')
# in _setup_graph callback method:
self._predictor = self.trainer.get_predictor(['input1', 'input2'], ['fun'])
# After session is initialized (see Tutorials - Write a Callback), can use it by:
outputs = self._predictor(input1, input2)
The CycleGAN example and DQN example have more concrete use of this method.
-
property
tower_func
¶ A
TowerFunc
instance. See tutorial on tower function for more information.
-
property
towers
¶ Used to access the tower handles by either indices or names.
This property is accessible only after the graph is set up. With
towers()
, you can then access many attributes of each tower:
Example:
# Access the conv1/output tensor in the first training tower
trainer.towers.training()[0].get_tensor('conv1/output')
- Type
-
-
class
tensorpack.train.
NoOpTrainer
[source]¶ Bases:
tensorpack.train.trainers.SimpleTrainer
A special trainer that builds the graph (if given a tower function) and does nothing in each step. It is used to only run the callbacks.
Note that steps_per_epoch and max_epoch are still valid options.
-
class
tensorpack.train.
SimpleTrainer
[source]¶ Bases:
tensorpack.train.tower.SingleCostTrainer
Single-GPU single-cost single-tower trainer.
-
tensorpack.train.
SyncMultiGPUTrainer
(gpus)[source]¶ Return a default multi-GPU trainer, if you don’t care about the details. It may not be the most efficient one for your task.
-
class
tensorpack.train.
SyncMultiGPUTrainerReplicated
(gpus, average=True, mode=None)[source]¶ Bases:
tensorpack.train.tower.SingleCostTrainer
Data-parallel training in “replicated” mode, where each GPU contains a replica of the whole model. It will build one tower on each GPU under its own variable scope. Each gradient update is averaged or summed across GPUs through NCCL.
It is an equivalent of
--variable_update=replicated
in tensorflow/benchmarks.
-
BROADCAST_EVERY_EPOCH
¶ Whether to broadcast the variables every epoch. Theoretically this is a no-op (because the variables are supposed to be in-sync). But this cheap operation may help prevent certain numerical issues in practice.
Note that in cases such as BatchNorm, the variables may not be in sync: e.g., non-master worker may not maintain EMAs.
For benchmark, disable this option.
- Type
-
__init__
(gpus, average=True, mode=None)[source]¶ - Parameters
-
-
class
tensorpack.train.
SyncMultiGPUTrainerParameterServer
(gpus, ps_device=None)[source]¶ Bases:
tensorpack.train.tower.SingleCostTrainer
Data-parallel training in ‘ParameterServer’ mode. It builds one tower on each GPU with shared variable scope. It synchronizes the gradients computed from each tower, averages them and applies them to the shared variables.
It is an equivalent of
--variable_update=parameter_server
in tensorflow/benchmarks.
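The synchronize-average-apply cycle described above can be sketched numerically in plain Python. This is an illustration only: `average_gradients` and `sync_update` are hypothetical helpers, plain floats stand in for tensors, and a vanilla SGD step is assumed in place of whatever optimizer the model provides.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads holds one list per tower, each with one float per variable.
    """
    num_towers = len(tower_grads)
    # zip(*...) regroups the values so each tuple holds one variable's gradients.
    return [sum(grads) / num_towers for grads in zip(*tower_grads)]


def sync_update(variables, tower_grads, learning_rate):
    """Apply one SGD step (an assumption) to the shared variables."""
    averaged = average_gradients(tower_grads)
    return [v - learning_rate * g for v, g in zip(variables, averaged)]


shared = [1.0, 2.0]
grads_from_towers = [[0.5, 1.0], [1.5, 3.0]]  # two towers, two variables
updated = sync_update(shared, grads_from_towers, learning_rate=0.5)
```

Because every tower's gradient is collected before the single update, all towers see the same variables at the next step.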
-
class
tensorpack.train.
AsyncMultiGPUTrainer
(gpus, scale_gradient=True)[source]¶ Bases:
tensorpack.train.tower.SingleCostTrainer
Data-parallel training with async update. It builds one tower on each GPU with shared variable scope. Every tower computes the gradients and independently applies them to the variables, without synchronizing and averaging across towers.
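For contrast with the synchronous trainers, the sketch below applies each tower's gradients on their own, one after another, with no averaging step. It is a toy model with hypothetical names: `async_updates` is not a library function, plain SGD is assumed, and the interpretation of `scale_gradient` as scaling each gradient by 1/number-of-towers is an assumption not stated in the text above.

```python
def async_updates(variables, tower_grads, learning_rate, scale_gradient=True):
    """Apply each tower's gradients independently, without averaging."""
    num_towers = len(tower_grads)
    # Assumption: scale_gradient scales every gradient by 1 / num_towers.
    scale = 1.0 / num_towers if scale_gradient else 1.0
    variables = list(variables)
    # In real async training the arrival order of updates is not fixed;
    # here the towers are simply processed in list order.
    for grads in tower_grads:
        variables = [v - learning_rate * scale * g for v, g in zip(variables, grads)]
    return variables


scaled = async_updates([1.0], [[0.5], [1.5]], learning_rate=0.5)
unscaled = async_updates([1.0], [[0.5], [1.5]], learning_rate=0.5, scale_gradient=False)
```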
-
class
tensorpack.train.
HorovodTrainer
(average=True, compression=None)[source]¶ Bases:
tensorpack.train.tower.SingleCostTrainer
Horovod trainer, supports both multi-GPU and distributed training.
To use for multi-GPU training:
# First, change trainer to HorovodTrainer(), then
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO horovodrun -np 4 --output-filename mylog python train.py
To use for distributed training:
# First, change trainer to HorovodTrainer(), then
horovodrun -np 8 -H server1:4,server2:4 --output-filename mylog \
    python train.py
Note
To reach the maximum speed in your system, there are many options to tune in Horovod installation, horovodrun arguments, and in the MPI command line. See Horovod docs for details.
Due to a TF bug (#8136), you must not initialize CUDA context before the trainer starts training. Therefore TF functions like is_gpu_available() or list_local_devices() must be avoided. You can, however, use tf.config.experimental.list_physical_devices('GPU'), introduced in TF 1.14.
Horovod supports both MPI and gloo. There are a few drawbacks of the MPI backend:
MPI does not like fork(). If your code (e.g. dataflow) contains multiprocessing, it may cause problems.
MPI sometimes fails to kill all processes in the end. Be sure to check it afterwards.
The gloo backend is recommended, though it may come with a very minor slowdown. To use the gloo backend, see the Horovod documentation for more details.
Keep in mind that there is one process running the script per GPU, therefore:
Make sure your InputSource has reasonable randomness.
If your data processing is heavy, doing it in a single dedicated process might be a better choice than doing it repeatedly in each process.
You need to make sure log directories in each process won’t conflict. You can set it only for the chief process, or set a different one for each process.
Callbacks have an option to be run only in the chief process, or in all processes. See
Callback.set_chief_only()
. Most callbacks have a reasonable default already, but certain callbacks may need your customization. Report an issue if you find any bad defaults.
You can use Horovod APIs such as hvd.rank() to know which process you are in and choose a different code path. The chief process has rank 0.
Due to these caveats, see ResNet-Horovod for a full example which has handled these common issues. This example can train ImageNet in roughly an hour following the paper’s setup.
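One way to keep log directories from conflicting is to derive the directory from the process rank. The helper below is a hypothetical sketch, not part of tensorpack or Horovod; it takes the rank as a plain integer so that it runs without either library installed.

```python
def logger_dir_for_rank(base_dir, rank, chief_only=True):
    """Return the log directory this process should use, or None for no logging."""
    if chief_only:
        # Only the chief process (rank 0) gets a log directory.
        return base_dir if rank == 0 else None
    # Otherwise give every process its own directory, named after its rank.
    return "{}-rank{}".format(base_dir, rank)


chief_dir = logger_dir_for_rank("train_log", 0)
worker_dir = logger_dir_for_rank("train_log", 1)
per_process_dir = logger_dir_for_rank("train_log", 2, chief_only=False)
```

In a real script the rank would come from hvd.rank(), and a non-None result would be passed to the logger setup of your choice.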
-
BROADCAST_EVERY_EPOCH
¶ Whether to broadcast the variables every epoch. Theoretically this is a no-op (because the variables are supposed to be in-sync). But this cheap operation may help prevent certain numerical issues in practice.
Note that in cases such as BatchNorm, the variables may not be in sync: e.g., non-master worker may not maintain EMAs.
For benchmark, disable this option.
- Type
-
class
tensorpack.train.
BytePSTrainer
(average=True)[source]¶ Bases:
tensorpack.train.trainers.HorovodTrainer
BytePS trainer. Supports both multi-GPU and distributed training. It achieves better scalability than Horovod in distributed training, if the model is communication-intensive and you have properly set up the machines following its best practices, which require a few more bandwidth servers than Horovod.
To use it, switch the trainer, and refer to BytePS documentation on how to launch server/scheduler/workers.
-
hvd
¶ the byteps module that contains horovod-compatible APIs like rank(), size(). This attribute exists so that downstream code that uses these APIs does not need to worry about which library is being used under the hood.
- Type
module
-