  1. Optimizer setting
    1. Up-to-date collections (2022-05-07):
      1. MaskFeat (slowfast)
      2. VideoSwin:
      3. My current practice:
  2. Loss quantization
    1. Cross Entropy
    2. GIoU
  3. Miscellaneous
    1. backbone memory record
    2. Position of Normalization Layers
    3. Video backbone benchmark
  4. Video Data augmentation
    1. Training augmentation
      1. RandomResizeCrop and RandomRescale
      2. Mixup
      3. Cutmix
  5. Essay

Optimizer setting

The optimizers and schedules used by well-known teams are usually a safe choice, because their settings are presumably based on extensive grid search. For example, I have been following the settings used by:

However, I don’t have nearly as many computational resources as FAIR. Therefore, I also follow some settings that use fewer training epochs and smaller batch sizes.

Up-to-date collections (2022-05-07):

MaskFeat (slowfast)

Screenshot from 2022-04-26 17-28-41

  • Cosine decay does NOT include restart, and it ends with 0.01 x base_lr
  • Warmup starts from 0.01 x base_lr
  • It seems that all runs in slowfast use max_norm=1.0 for gradient clipping, e.g. MViT has CLIP_GRAD_L2NORM: 1.0 in its config, which maps to torch.nn.utils.clip_grad_norm_. In contrast, the max_norm used by mmaction2 is normally 40 or 20.
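
The schedule described in the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not the slowfast implementation: the linear warmup shape and the `warmup_epochs` argument are assumptions, and only the 0.01 x base_lr start and end ratios come from the notes above.

```python
import math


def lr_at(epoch, base_lr, total_epochs, warmup_epochs,
          start_ratio=0.01, end_ratio=0.01):
    """Learning rate at a (possibly fractional) epoch.

    Linear warmup from start_ratio * base_lr up to base_lr (assumed shape),
    then one cosine decay without restart down to end_ratio * base_lr.
    """
    if epoch < warmup_epochs:
        start_lr = start_ratio * base_lr
        return start_lr + (base_lr - start_lr) * epoch / warmup_epochs
    end_lr = end_ratio * base_lr
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical setting: 30 epochs in total, 5 of them warmup.
for epoch in (0, 5, 15, 30):
    print(epoch, lr_at(epoch, base_lr=3e-4, total_epochs=30, warmup_epochs=5))
```

The learning rate starts and ends at 1% of the base value and peaks at the end of warmup, which is the shape the bullets describe.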


VideoSwin:

Screenshot from 2022-05-07 13-35-38

My current practice:

30 epochs:

# optimizer
optimizer = dict(type='AdamW', lr=3e-4, weight_decay=0.01)  # decay=0.05 if very large backbone
# optimizer = dict(type='AdamW', lr=3e-4, paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)})) # if pretrained backbone
optimizer_config = dict(grad_clip=dict(max_norm=40)) # max_norm=1.0 if very large backbone
# learning policy
lr_config = dict(policy='CosineAnnealing',
total_epochs = 30

data = dict(
    videos_per_gpu=32, # batch size=32x2=64, where 2 is the number of gpus 

50 epochs:

# optimizer
optimizer = dict(
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(policy='step', step=[20, 40]) # or [10, 20] for 25 epochs
total_epochs = 50

200 epochs:

# optimizer
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(policy='CosineAnnealing',
total_epochs = 200

data = dict(
    videos_per_gpu=16, # batch size 16 * 2 = 32, where 2 is the number of gpus 


Loss quantization

Cross Entropy


Prob. of GT   Loss   Comment
0.01          4.61   exploding gradient?
0.5           0.69   meaningful
0.6           0.51   it works
0.7           0.36   good
0.8           0.22   great
0.9           0.11   pretty good
0.99          0.01   perfect
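
The loss column follows from the cross-entropy of a single sample, which is the negative natural log of the probability assigned to the ground-truth class. A small sketch that reproduces the numbers above (the function names are illustrative only):

```python
import math


def cross_entropy_from_gt_prob(p):
    """Cross-entropy of one sample, given the predicted probability of its GT class."""
    return -math.log(p)


def chance_level_loss(num_classes):
    """Loss of a uniform prediction over num_classes classes, i.e. ln(num_classes)."""
    return math.log(num_classes)


gt_probs = [0.01, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
losses = {p: round(cross_entropy_from_gt_prob(p), 2) for p in gt_probs}
for p, loss in losses.items():
    print(f"prob. of GT = {p:<4}  loss = {loss:.2f}")
```

`chance_level_loss` gives a useful reference point: a freshly initialized classifier should start near ln(num_classes), so a loss far above that value hints at a problem.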


GIoU

Copied from


  • Note that GIoU and IoU do not map one-to-one to each other.
  • The brown area marks the sample pairs that overlap each other.
  • Normally the GIoU is close to the IoU when the IoU is close to 1 or when the overlapping samples have similar x/y positions; otherwise the GIoU is smaller than the IoU.
  • GIoU_loss equals 1 - GIoU.
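
To make the relation between the two measures concrete, here is a minimal sketch for axis-aligned 2D boxes. The (x1, y1, x2, y2) box format is an assumption made for illustration.

```python
def iou_and_giou(box_a, box_b):
    """IoU and GIoU of two boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (hull - union) / hull
    return iou, giou


def giou_loss(box_a, box_b):
    return 1.0 - iou_and_giou(box_a, box_b)[1]


print(iou_and_giou((0, 0, 2, 2), (1, 0, 3, 2)))  # same y position: GIoU equals IoU
print(iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # shifted in x and y: GIoU < IoU
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint: IoU is 0, GIoU is negative
```

The disjoint case shows why GIoU is attractive as a loss: it still changes with the distance between the boxes, while the IoU stays at 0.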


Miscellaneous

backbone memory record

Including cls_head:

Model   Source   Params (M)   Memory (M)

Position of Normalization Layers

[1] On Layer Normalization in the Transformer Architecture
[2] A ConvNet for the 2020s
[3] Understanding the Difficulty of Training Transformers
[4] RealFormer: Transformer Likes Residual Attention
[5] Blog

Let’s start with the Transformer. There are two known choices for where to place the normalization layer: Post-LN and Pre-LN. Pre-LN was proposed more recently than Post-LN.

Screenshot from 2022-06-30 13-34-20


  • Post-LN relies on warm-up and is sensitive to it [1], because “the Post-LN Transformer cannot be trained with a large learning rate from scratch”[1].
  • Pre-LN is NOT sensitive to warm-up, and “Pre-LN Transformer converges faster than the Post-LN Transformer”[1].
  • Post-LN is NOT as stable as Pre-LN, but its best achievable performance is better [3][4][5].
  • Pre-LN is recommended for easy training, while Post-LN is recommended when high performance is the target, at the cost of manual engineering, e.g. the Admin initialization [3] and warm-up.
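
The difference between the two placements is only where the normalization sits relative to the residual addition. The NumPy sketch below is a simplification for illustration: the layer norm has no learnable scale or shift, and `sublayer` is a toy linear map standing in for attention or the MLP.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension (no learnable affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def post_ln_block(x, sublayer):
    # Post-LN: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))


def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer input; the residual path stays an identity.
    return x + sublayer(layer_norm(x))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))          # 4 tokens, 8 channels
weight = rng.normal(size=(8, 8)) * 0.1    # toy stand-in for attention / MLP weights


def sublayer(t):
    return t @ weight


post_out = post_ln_block(tokens, sublayer)
pre_out = pre_ln_block(tokens, sublayer)
print(post_out.std(axis=-1))  # close to 1 for every token
print(pre_out.std(axis=-1))   # not renormalized, the skip connection passes through
```

In the Pre-LN block the input reaches the output untouched through the skip connection, which is one common intuition for why it trains more easily without warm-up.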

What about CNNs? Below is the block structure of the recent ConvNeXt:

Screenshot from 2022-06-30 13-52-23


  • The widely used Batch-Normalization layers are now replaced by Layer-Normalization layers. Also, the number of normalization layers is reduced.
  • The Pre-LN is recommended.

Video backbone benchmark

Model      Source         top1    Params (M)   GFLOPs   Memory (M)   Training speed   Testing speed
I3D        original       71.1%   12.29        27.90    1721         18.7 iter/s      80.0 iter/s
I3D        mmaction2      73.3%   27.22        16.74    1751         16.5 iter/s      87.5 iter/s
SlowOnly   mmaction2      75.6%   31.63        84.37    2987         4.9 iter/s       34.5 iter/s
X3D-M      mmaction2      75.6%   2.09         5.15     2257         9.8 iter/s       80.0 iter/s
MViT-B     pytorchvideo   80.2%   36.30        70.83    1511         0.2 iter/s       35.3 iter/s

  • top1 is reported on the Kinetics400 validation set and is copied directly from the paper/repo. These works use different input resolutions and training/testing data augmentation, so the top1 here is for reference only. It makes NO sense to relate the top1 to the other columns. Generally speaking, though, the models lower in the table are newer and should reach higher best accuracy.
  • Params and GFLOPs are computed for the backbone only, while speed and memory include the head.
  • Input (N C T H W) is of shape (1, 3, 16, 224, 224).
  • Params and GFLOPs are computed using the fvcore lib.
  • Speed and memory are evaluated on a single 2080 Ti GPU.
  • Memory refers to the occupied memory reported by nvidia-smi during training.

Video Data augmentation

Training augmentation

RandomResizeCrop and RandomRescale

There are two widely used training data augmentations on the spatial dimensions of the input video:

  • Resize(-1, 256) - RandomResizedCrop(scale=(0.08, 1.0), ratio=(0.75, 1.33)) - Resize(224, 224)
  • RandomRescale(256, 320) - RandomCrop(224, 224)

ResizeCrop    Cls Acc   72.29   74.76   74.23   74.88   73.81
              Reg MAE   16.63   16.42   16.53   16.15   16.26
RescaleCrop   Cls Acc   70.96   74.50   74.48   74.78   74.76
              Reg MAE   16.85   16.29   16.40   16.01   16.01
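
The two spatial pipelines listed above might be written as mmaction2-style config fragments like the following. This is a sketch that assumes the transform and argument names of mmaction2 0.x (`Resize`, `RandomResizedCrop`, `RandomRescale`, `RandomCrop`); decoding, flipping, normalization and formatting steps are left out, and the variable names are illustrative.

```python
# ResizeCrop: Resize(-1, 256) - RandomResizedCrop - Resize(224, 224)
resize_crop_pipeline = [
    dict(type='Resize', scale=(-1, 256)),  # short side to 256, keep aspect ratio
    dict(type='RandomResizedCrop',
         area_range=(0.08, 1.0),
         aspect_ratio_range=(0.75, 1.33)),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
]

# RescaleCrop: RandomRescale(256, 320) - RandomCrop(224, 224)
rescale_crop_pipeline = [
    dict(type='RandomRescale', scale_range=(256, 320)),  # random short side in [256, 320]
    dict(type='RandomCrop', size=224),
]
```

The first pipeline varies both the scale and the aspect ratio of the crop, while the second keeps the aspect ratio and only jitters the scale.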


Mixup

Mixing up two samples for data augmentation.

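Pseudocode:

import torch
from torch.distributions import Beta

beta = Beta(alpha, alpha)
lam = beta.sample()  # named lam because lambda is a reserved word in Python
rand_index = torch.randperm(batch_size)  # random partner for every sample
mixed_images = lam * imgs + (1 - lam) * imgs[rand_index]
mixed_labels = lam * labels + (1 - lam) * labels[rand_index]  # labels are one-hot or soft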

The pdf of lambda under different alpha for mixup: mixup_alpha_pdf

  • From the above figure, we can conclude that a small alpha tends to sample values close to 0 or 1, which corresponds to a weak mixup.
  • The strongest mixup is at lambda = 0.5. As alpha increases, the probability density of lambda around 0.5 increases.
  • In short: the larger the alpha, the stronger the mixup.
  • alpha=1 gives the uniform distribution.
  • alpha=0.8 is adopted in the MViT for training action recognition on Kinetics400.
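
The claim that a larger alpha gives a stronger mixup can be checked numerically with the standard library alone. In the sketch below, "strength" is defined as the average weight of the weaker of the two samples, min(lambda, 1 - lambda); this measure is an illustrative choice and not part of any mixup implementation.

```python
import random


def mean_mix_strength(alpha, num_samples=20000, seed=0):
    """Average of min(lam, 1 - lam) for lam ~ Beta(alpha, alpha).

    0 means no mixing at all, 0.5 means every pair is mixed half and half.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        lam = rng.betavariate(alpha, alpha)
        total += min(lam, 1.0 - lam)
    return total / num_samples


for alpha in (0.2, 0.8, 1.0, 5.0):
    print(f"alpha = {alpha:<4} mean strength = {mean_mix_strength(alpha):.3f}")
```

For alpha=1 the distribution is uniform and the average comes out near 0.25; smaller alpha values fall below that and larger ones rise towards 0.5.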


Cutmix

Cutting subregions from two samples and mixing them for data augmentation:


Similar to mixup, the strongest cutmix is at lambda=0.5, so a larger alpha represents a stronger cutmix. FYI, alpha=1.0 is adopted in the MViT for training action recognition on Kinetics400.

One may argue that lambda=1.0 should be the strongest mixup/cutmix. However, the labels are mixed with the same lambda, so at the extremes nothing is really mixed: with lambda=1 every sample and its label stay unchanged, and with lambda=0 the two samples (and their labels) are simply exchanged.
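
A minimal NumPy sketch of cutmix for a video batch is shown below. It assumes an (N, C, T, H, W) layout, one box shared by all frames and all samples in the batch, and one-hot or soft labels; the box sampling follows the commonly used square-root rule, and none of this is taken from a specific codebase.

```python
import numpy as np


def cutmix(videos, labels, alpha=1.0, rng=None):
    """videos: (N, C, T, H, W) float array, labels: (N, num_classes) array."""
    rng = np.random.default_rng() if rng is None else rng
    n, _, _, h, w = videos.shape
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)  # partner sample for every item in the batch

    # Box covering roughly (1 - lam) of the frame, centred at a random point.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = videos.copy()
    mixed[..., y1:y2, x1:x2] = videos[perm][..., y1:y2, x1:x2]

    # Recompute lam from the clipped box so the labels match the pixels.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels, lam
```

With lam=1 the box is empty and the batch is returned unchanged, and with lam=0 the box covers the whole frame and every sample is replaced by its partner, which matches the argument above that neither extreme mixes anything.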

Visualizing the Beta distribution online


Essay

  • Kinetics 400 (trimmed) occupies 10 MB per video for RGB raw frames, and 30 MB per video for (optical flow + RGB) raw frames.
  • ActivityNet200 (untrimmed) occupies 3.3 GB per video for RGB raw frames, and 10 GB per video for (optical flow + RGB) raw frames.
  • The GMA optical flows of the Kinetics 400 validation set (19404 videos) occupy 61 GB of disk space, i.e. about 3.2 MB per video. So the training set of 240618 videos should occupy roughly 760 GB.
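
The estimate in the last bullet is a simple proportion, sketched here with the numbers quoted above. It assumes 1 GB = 1024 MB and that training videos take the same average space as validation videos.

```python
val_videos = 19404       # Kinetics 400 validation videos
val_flow_gb = 61         # disk size of their GMA optical flows
train_videos = 240618    # Kinetics 400 training videos

per_video_mb = val_flow_gb * 1024 / val_videos
train_flow_gb = per_video_mb * train_videos / 1024

print(f"{per_video_mb:.2f} MB per video")
print(f"{train_flow_gb:.0f} GB for the training set")
```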