Save and Load models¶
Inspect a TF Checkpoint¶
The ModelSaver callback saves the model to the directory defined by logger.set_logger_dir(), in TensorFlow checkpoint format.
A TF checkpoint typically includes a .data-xxxxx file and a .index file. Both are necessary.
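For context, here is a minimal sketch of how such checkpoints get written during training; MyModel and df are hypothetical placeholders for your own ModelDesc and DataFlow:

```python
from tensorpack import TrainConfig, SimpleTrainer, launch_train_with_config
from tensorpack.callbacks import ModelSaver
from tensorpack.utils import logger

logger.set_logger_dir('train_log/my_experiment')  # checkpoints will be written here

config = TrainConfig(
    model=MyModel(),            # hypothetical ModelDesc
    dataflow=df,                # hypothetical DataFlow
    callbacks=[ModelSaver()],   # writes the .data-xxxxx / .index files every epoch
    max_epoch=10,
)
launch_train_with_config(config, SimpleTrainer())
```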
scripts/ls-checkpoint.py demonstrates how to print all variables and their shapes in a checkpoint.
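If you prefer not to use the script, the same information can be printed with plain TensorFlow; the checkpoint path below is only a placeholder:

```python
import tensorflow as tf

# roughly what scripts/ls-checkpoint.py does
reader = tf.train.NewCheckpointReader('train_log/my_experiment/model-10000')
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```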
scripts/dump-model-params.py can be used to remove unnecessary variables in a checkpoint.
It takes a metagraph file (which is also saved by ModelSaver) and only saves variables that the model needs at inference time.
It dumps the model to a var-name: value dict saved in npz format.
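The resulting npz file is an ordinary numpy archive, so it can be inspected without TensorFlow; 'model.npz' below is a placeholder file name:

```python
import numpy as np

params = dict(np.load('model.npz'))   # {variable name: numpy array}
for name, value in params.items():
    print(name, value.shape, value.dtype)
```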
Load a Model to a Session¶
Model loading (in both training and inference) is done through the session_init interface.
Currently there are two ways a session can be restored:
session_init=SaverRestore(...), which restores a TF checkpoint,
or session_init=DictRestore(...), which restores a dict.
To load multiple models, use ChainInit.
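A sketch of the two loaders, plus ChainInit, assuming placeholder file paths:

```python
import numpy as np
from tensorpack.tfutils.sessinit import SaverRestore, DictRestore, ChainInit

init_from_ckpt = SaverRestore('train_log/my_experiment/model-10000')  # TF checkpoint
init_from_dict = DictRestore(dict(np.load('ResNet50.npz')))           # numpy dict

# restore from several sources, applied in order
init_both = ChainInit([init_from_ckpt, init_from_dict])
```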
Many models in the tensorpack model zoo are provided in the form of a numpy dictionary (.npz),
because it is easier to load and manipulate without requiring TensorFlow.
To load such files into a session, use DictRestore(dict(np.load(filename))).
You can also use get_model_loader(filename),
a small helper that creates a SaverRestore or DictRestore based on the file name.
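For example, a sketch of loading an npz model for inference; MyModel and the tensor names are placeholders for your own model:

```python
from tensorpack import PredictConfig, OfflinePredictor
from tensorpack.tfutils.sessinit import get_model_loader

pred_config = PredictConfig(
    model=MyModel(),                              # hypothetical ModelDesc
    session_init=get_model_loader('model.npz'),   # picks DictRestore from the extension
    input_names=['input'],                        # hypothetical tensor names
    output_names=['prob'],
)
predictor = OfflinePredictor(pred_config)
```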
DictRestore is the most general loader, because you can make whatever changes you need to the dict (e.g., remove or rename variables).
To load a TF checkpoint into a dict in order to make changes, use tf.train.NewCheckpointReader or tensorpack's load_chkpt_vars.
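A sketch of this workflow: read a checkpoint into a dict, edit it, then feed it to DictRestore. The checkpoint path and variable names here are hypothetical:

```python
from tensorpack.tfutils.varmanip import load_chkpt_vars
from tensorpack.tfutils.sessinit import DictRestore

params = load_chkpt_vars('train_log/my_experiment/model-10000')  # {name: numpy array}

# drop variables you don't want to load (hypothetical names)
params.pop('linear/W', None)
params.pop('linear/b', None)

# rename a variable so it matches the name used in the new graph
if 'conv1/W' in params:
    params['backbone/conv1/W'] = params.pop('conv1/W')

session_init = DictRestore(params)
```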
Variable restoring is completely based on name matching between
variables in the current graph and variables in the session_init initializer.
Variables that appear on only one side will be printed as a warning.
Transfer Learning¶
Therefore, transfer learning is trivial. If you want to load a pre-trained model, just use the same variable names. If you want to re-train some layers, just rename either the variables in the graph or the variables in your loader.
Resume Training¶
“resume training” is mostly just “loading the last known checkpoint”. Therefore you should refer to the previous section on how to load a model.
A checkpoint does not resume everything!
The TensorFlow checkpoint only saves TensorFlow variables; any Python state that is not a TensorFlow variable will not be saved or resumed. This means:
- Training epoch number will not be resumed. You can set it by providing a starting_epoch to your resume job (see the sketch after this list).
- State in your callbacks will not be resumed. Certain callbacks maintain state (e.g., the current best accuracy) in Python, which cannot be saved automatically.
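A minimal resume sketch, assuming training stopped after epoch 10; the path, epoch number, MyModel and df are placeholders:

```python
from tensorpack import TrainConfig
from tensorpack.callbacks import ModelSaver
from tensorpack.tfutils.sessinit import SaverRestore

config = TrainConfig(
    model=MyModel(),        # hypothetical ModelDesc
    dataflow=df,            # hypothetical DataFlow
    callbacks=[ModelSaver()],
    session_init=SaverRestore('train_log/my_experiment/model-10000'),
    starting_epoch=11,      # the checkpoint itself does not carry the epoch number
    max_epoch=100,
)
```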
AutoResumeTrainConfig is an alternative to TrainConfig that applies some heuristics to automatically find the latest checkpoint and epoch number to resume from.
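If your tensorpack version provides it, a sketch of using AutoResumeTrainConfig (it takes the same arguments as TrainConfig; MyModel and df remain placeholders):

```python
from tensorpack import AutoResumeTrainConfig
from tensorpack.callbacks import ModelSaver

# looks for the latest checkpoint under the logger directory and resumes from it
config = AutoResumeTrainConfig(
    model=MyModel(),            # hypothetical ModelDesc
    dataflow=df,                # hypothetical DataFlow
    callbacks=[ModelSaver()],
    max_epoch=100,
)
```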