fastNLP.core.optimizer

The optimizer module defines the optimizers used in fastNLP; they are generally passed to the Trainer as an argument.
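
A minimal sketch of the typical usage (the toy model is for illustration only; the Trainer keyword optimizer and the behavior when model_params is None are assumptions inferred from the model_params=None defaults documented below)::

    import torch
    from fastNLP import Adam

    model = torch.nn.Linear(4, 2)

    # Either bind the parameters when constructing the optimizer...
    adam = Adam(lr=1e-3, model_params=model.parameters())

    # ...or leave model_params as None and hand the optimizer to Trainer,
    # e.g. Trainer(..., optimizer=Adam(lr=1e-3)), which then binds it to
    # the trained model's parameters (assumption, see the lead-in).
    adam_for_trainer = Adam(lr=1e-3)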

class fastNLP.core.optimizer.Optimizer(model_params, **kwargs)[source]

Alias: fastNLP.Optimizer, fastNLP.core.optimizer.Optimizer

Optimizer

__init__(model_params, **kwargs)[source]
Parameters:
  • model_params -- a generator. E.g. model.parameters() for PyTorch models.
  • kwargs -- additional parameters.
class fastNLP.core.optimizer.SGD(lr=0.001, momentum=0, model_params=None)[source]

Bases: fastNLP.Optimizer

Alias: fastNLP.SGD, fastNLP.core.optimizer.SGD

SGD
__init__(lr=0.001, momentum=0, model_params=None)[source]
Parameters:
  • lr (float) -- learning rate. Default: 0.001
  • momentum (float) -- momentum. Default: 0
  • model_params -- a generator. E.g. model.parameters() for PyTorch models.
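
For example (a sketch; the toy model is for illustration, and the pass-through of lr and momentum to a torch SGD optimizer at training time is an assumption)::

    import torch
    from fastNLP import SGD

    model = torch.nn.Linear(10, 1)

    # Plain SGD at the default learning rate.
    sgd = SGD(model_params=model.parameters())

    # SGD with momentum; lr and momentum are forwarded to the underlying
    # torch optimizer once training starts (assumption, see the lead-in).
    sgd_momentum = SGD(lr=0.01, momentum=0.9, model_params=model.parameters())
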
class fastNLP.core.optimizer.Adam(lr=0.001, weight_decay=0, betas=(0.9, 0.999), eps=1e-08, amsgrad=False, model_params=None)[source]

Bases: fastNLP.Optimizer

Alias: fastNLP.Adam, fastNLP.core.optimizer.Adam

Adam
__init__(lr=0.001, weight_decay=0, betas=(0.9, 0.999), eps=1e-08, amsgrad=False, model_params=None)[source]
Parameters:
  • lr (float) -- learning rate. Default: 1e-3
  • weight_decay (float) -- weight decay (L2 penalty). Default: 0
  • betas (Tuple[float, float]) -- coefficients used for computing running averages of the gradient and its square. Default: (0.9, 0.999)
  • eps (float) -- term added to the denominator to improve numerical stability. Default: 1e-8
  • amsgrad (bool) -- whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Default: False
  • model_params -- a generator. E.g. model.parameters() for PyTorch models.
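
For example (a sketch; the toy model is for illustration only)::

    import torch
    from fastNLP import Adam

    model = torch.nn.Linear(10, 1)

    # Adam with a small weight decay; the remaining hyperparameters keep
    # the documented defaults (betas=(0.9, 0.999), eps=1e-8, amsgrad=False).
    adam = Adam(lr=1e-3, weight_decay=1e-4, model_params=model.parameters())
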
class fastNLP.core.optimizer.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]

Alias: fastNLP.AdamW, fastNLP.core.optimizer.AdamW

An implementation of AdamW. This optimizer is available in PyTorch since version 1.2.0 (https://github.com/pytorch/pytorch/pull/21250); it is included here to support older PyTorch versions.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]
Parameters:
  • params (iterable) -- iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) -- learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) -- coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) -- term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) -- weight decay coefficient (default: 1e-2)
  • amsgrad (bool, optional) -- whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
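
A short sketch of using it directly in a PyTorch-style training step (the toy model and data are for illustration only)::

    import torch
    from fastNLP import AdamW

    model = torch.nn.Linear(10, 1)

    # Unlike SGD/Adam above, AdamW follows the torch.optim interface and
    # takes the parameters directly as its first argument.
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
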
add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.

Arguments:
param_group (dict): Specifies what Tensors should be optimized along with group specific optimization options.
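
For example, unfreezing an extra module mid-training and giving it its own learning rate (the toy modules are for illustration only)::

    import torch
    from fastNLP import AdamW

    backbone = torch.nn.Linear(10, 10)
    head = torch.nn.Linear(10, 1)

    # Start by optimizing only the head.
    optimizer = AdamW(head.parameters(), lr=1e-3)

    # Later, once the backbone should be fine-tuned as well, add it as a
    # new parameter group with its own learning rate.
    optimizer.add_param_group({"params": backbone.parameters(), "lr": 1e-4})
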
load_state_dict(state_dict)

Loads the optimizer state.

Arguments:
state_dict (dict): optimizer state. Should be an object returned from a call to state_dict().
state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content
    differs between optimizer classes.
  • param_groups - a dict containing all parameter groups
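
Together with load_state_dict() above, this supports the usual checkpoint round trip (a sketch; the file name and toy model are for illustration only)::

    import torch
    from fastNLP import AdamW

    model = torch.nn.Linear(10, 1)
    optimizer = AdamW(model.parameters(), lr=1e-3)

    # Save the optimizer state alongside the model...
    torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()}, "ckpt.pt")

    # ...and restore both later to resume training.
    ckpt = torch.load("ckpt.pt")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
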
step(closure=None)[source]

Performs a single optimization step.

Parameters: closure -- (callable, optional) A closure that reevaluates the model and returns the loss.
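
A sketch of the closure form (the toy model and data are for illustration only)::

    import torch
    from fastNLP import AdamW

    model = torch.nn.Linear(10, 1)
    optimizer = AdamW(model.parameters(), lr=1e-3)
    x, y = torch.randn(8, 10), torch.randn(8, 1)

    def closure():
        # Re-evaluate the model and return the loss, as step(closure) expects.
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss

    optimizer.step(closure)
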
zero_grad(set_to_none: bool = False)

Sets the gradients of all optimized torch.Tensor objects to zero.

Arguments:
set_to_none (bool): instead of setting to zero, set the grads to None.
This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
  1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
  2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
  3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
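
A small sketch of the difference (assuming a PyTorch version whose base Optimizer supports set_to_none, as documented above; the toy model is for illustration only)::

    import torch
    from fastNLP import AdamW

    model = torch.nn.Linear(10, 1)
    optimizer = AdamW(model.parameters(), lr=1e-3)

    model(torch.randn(8, 10)).sum().backward()
    optimizer.zero_grad()                  # gradients become tensors of zeros
    print(model.weight.grad)               # tensor of 0s

    model(torch.randn(8, 10)).sum().backward()
    optimizer.zero_grad(set_to_none=True)  # gradients are dropped entirely
    print(model.weight.grad)               # None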