  1. Optimizer setting
    1. Up-to-date collections (2022-05-07):
      1. MaskFeat (slowfast)
      2. VideoSwin:
      3. My current practice:
  2. Loss quantization
    1. Cross Entropy
    2. GIoU
  3. Miscellaneous
    1. backbone memory record
    2. Position of Normalization Layers
    3. Video backbone benchmark
  4. Video Data augmentation
    1. Training augmentation
      1. RandomResizedCrop and RandomRescale
      2. Mixup
      3. Cutmix
  5. Essay

Optimizer setting

The optimizers and schedules used by well-known teams should be reliable, because their settings are presumably based on sufficient grid search. For example, I keep following the settings collected below.

However, I have much less computational resource than FAIR. Therefore, I also follow some settings that use fewer training epochs and a smaller batch size.

Up-to-date collections (2022-05-07):

MaskFeat (slowfast)

[Screenshot of the MaskFeat optimizer/schedule settings omitted]

  • Cosine decay does NOT include restart, and it ends at 0.01 x base_lr.
  • Warmup starts from 0.01 x base_lr.
  • It seems that all runs in slowfast use max_norm=1.0 for gradient clipping, e.g. MViT has CLIP_GRAD_L2NORM: 1.0 in its config, which points to torch.nn.utils.clip_grad_norm_. However, the max_norm used by mmaction2 is normally 40 or 20. (A sketch of this schedule is given after this list.)
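
A minimal sketch of the schedule described above, assuming a plain PyTorch training loop (the function and variable names are illustrative, not taken from the slowfast codebase):

import math
import torch

def maskfeat_style_lr(step, total_steps, warmup_steps, base_lr,
                      start_ratio=0.01, end_ratio=0.01):
    # linear warmup from start_ratio * base_lr up to base_lr
    if step < warmup_steps:
        alpha = step / max(1, warmup_steps)
        return base_lr * (start_ratio + (1.0 - start_ratio) * alpha)
    # cosine decay (no restart) from base_lr down to end_ratio * base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (end_ratio + (1.0 - end_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress)))

# inside the training loop (model, optimizer, loss assumed to exist):
# for group in optimizer.param_groups:
#     group['lr'] = maskfeat_style_lr(step, total_steps, warmup_steps, base_lr)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # slowfast-style clipping
# optimizer.step()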

VideoSwin:

[Screenshot of the Video Swin optimizer/schedule settings omitted]

My current practice:

30 epochs:

# optimizer
optimizer = dict(type='AdamW', lr=3e-4, weight_decay=0.01)  # decay=0.05 if very large backbone
# optimizer = dict(type='AdamW', lr=3e-4, paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)})) # if pretrained backbone
optimizer_config = dict(grad_clip=dict(max_norm=40)) # max_norm=1.0 if very large backbone
# learning policy
lr_config = dict(policy='CosineAnnealing',
                 min_lr_ratio=0.01,
                 warmup='linear',
                 warmup_ratio=0.01,
                 warmup_iters=2.5,
                 warmup_by_epoch=True)
total_epochs = 30

data = dict(
    videos_per_gpu=32, # batch size=32x2=64, where 2 is the number of gpus 
    ...

50 epochs:

# optimizer
optimizer = dict(
    type='SGD',
    lr=1e-3,
    momentum=0.9,
    weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(policy='step', step=[20, 40]) # or [10, 20] for 25 epochs
total_epochs = 50

200 epochs:

# optimizer
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)
optimizer_config = dict(grad_clip=dict(max_norm=1.0))
# learning policy
lr_config = dict(policy='CosineAnnealing',
                 min_lr_ratio=0.01,
                 warmup='linear',
                 warmup_ratio=0.01,
                 warmup_iters=20,
                 warmup_by_epoch=True)
total_epochs = 200

data = dict(
    videos_per_gpu=16, # batch size 16 * 2 = 32, where 2 is the number of gpus 
    ...

Tips:

Loss quantization

Cross Entropy


| Prob. of GT | Loss | Comment |
| --- | --- | --- |
| 0 | 100 | |
| 0.01 | 4.61 | :disappointed: exploding gradient? |
| 0.1 | 2.30 | |
| 0.2 | 1.61 | |
| 0.3 | 1.20 | |
| 0.4 | 0.92 | |
| 0.5 | 0.69 | :neutral_face: meaningful |
| 0.6 | 0.51 | :flushed: it works |
| 0.7 | 0.36 | :grinning: good |
| 0.8 | 0.22 | :wink: great |
| 0.9 | 0.11 | :relaxed: pretty good |
| 0.95 | 0.05 | |
| 0.99 | 0.01 | :triumph: perfect |
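
The loss values above are simply the negative log-probability of the ground-truth class, so the table can be regenerated with a few lines of Python:

import math

for p in (0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f'p_gt={p:<5}  CE = {-math.log(p):.2f}')  # e.g. 0.5 -> 0.69, 0.9 -> 0.11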

GIoU

Figure copied from https://giou.stanford.edu/ (a scatter plot of GIoU vs. IoU for sampled box pairs; figure omitted):

  • Note that GIoU and IoU do not map one-to-one to each other.
  • The brown area shows box pairs that overlap with each other.
  • Normally GIoU is close to IoU when IoU is close to 1 or when the overlapping boxes have similar x/y positions; otherwise GIoU is smaller than IoU.
  • GIoU_loss equals 1 - GIoU (see the sketch below).
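
For reference, a minimal plain-Python sketch of GIoU for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format (not tied to any particular library):

def giou(box1, box2):
    # intersection
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    iou = inter / union
    # smallest enclosing box C
    cx1, cy1 = min(box1[0], box2[0]), min(box1[1], box2[1])
    cx2, cy2 = max(box1[2], box2[2]), max(box1[3], box2[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c  # GIoU lies in (-1, 1]

# giou_loss = 1 - giou(pred_box, gt_box), so it lies in [0, 2)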

Miscellaneous

backbone memory record

Settings: input_size=4x16x4x224x224, is_training=True, cls_head included.

| Model | Source | Params (M) | Memory (MB) |
| --- | --- | --- | --- |
| X3D-M | pytorchvideo | 3.79 | 5660 |
| X3D-M | mmaction2 | 3.79 | 6382 |
| I3D | mmaction2 | 28.04 | 3601 |
| I3D | original | 12.70 | 3815 |
| SlowOnly | mmaction2 | - | 3601 |

Position of Normalization Layers

[1] On Layer Normalization in the Transformer Architecture
[2] A ConvNet for the 2020s
[3] Understanding the Difficulty of Training Transformers
[4] RealFormer: Transformer Likes Residual Attention
[5] Blog

Let’s start with the Transformer. There are two well-known choices for the location of normalization: Post-LN and Pre-LN, with Pre-LN proposed more recently than Post-LN.

[Screenshot comparing the Post-LN and Pre-LN block structures omitted]

Conclusion:

  • Post-LN relies on and is sensitive to warm-up [1], because “the Post-LN Transformer cannot be trained with a large learning rate from scratch” [1].
  • Pre-LN is NOT sensitive to warm-up, and “Pre-LN Transformer converges faster than the Post-LN Transformer” [1].
  • Post-LN is NOT as stable as Pre-LN, but its optimal performance is better [3][4][5].
  • Pre-LN is recommended for easy training, while Post-LN is recommended when the highest performance is the target, at the cost of manual engineering, e.g. the Admin initialization [3] and warm-up. (The two placements are sketched after this list.)
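
A minimal PyTorch sketch of the two placements for a single residual sub-layer (`sublayer` stands for either the attention or the MLP module; the class names are illustrative):

import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))  # Post-LN: normalize after the residual add

class PreLNBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))  # Pre-LN: normalize inside the residual branch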

What about CNNs? Below is the block structure of the recent ConvNeXt:

[Screenshot of the ConvNeXt block structure omitted]

Conclusion:

  • The widely used Batch-Normalization layers are replaced by Layer-Normalization layers, and the number of normalization layers is reduced.
  • The Pre-LN-style placement is recommended (see the sketch below).
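
A simplified PyTorch sketch of the ConvNeXt block from [2] (layer scale and stochastic depth omitted), showing the single LayerNorm placed inside the residual branch:

import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)            # the only normalization in the block
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise convs implemented as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)  # residual add, back to (N, C, H, W)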

Video backbone benchmark

| Model | Source | top1 | Params (M) | GFLOPs | Memory (MB) | Training speed | Testing speed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I3D | original | 71.1% | 12.29 | 27.90 | 1721 | 18.7 iter/s | 80.0 iter/s |
| I3D | mmaction2 | 73.3% | 27.22 | 16.74 | 1751 | 16.5 iter/s | 87.5 iter/s |
| SlowOnly | mmaction2 | 75.6% | 31.63 | 84.37 | 2987 | 4.9 iter/s | 34.5 iter/s |
| X3D-M | pytorchvideo | 76.2% | - | 5.15 | - | - | - |
| X3D-M | mmaction2 | 75.6% | 2.09 | 5.15 | 2257 | 9.8 iter/s | 80.0 iter/s |
| MViT-B | pytorchvideo | 80.2% | 36.30 | 70.8 | 3151 | 10.2 iter/s | 35.3 iter/s |

  • top1 is reported on the Kinetics-400 validation set but is directly copied from the corresponding paper/repo. These works use different input resolutions and training/testing augmentations, so the top1 here is only for reference: it makes NO sense to read top1 together with the other columns. Generally speaking, though, the models lower in the table are newer and should reach higher best accuracy.
  • Params and GFLOPs only count the backbone, while speed and memory also involve the head.
  • Input (N C T H W) is of shape (1, 3, 16, 224, 224).
  • Params and GFLOPs are computed using the fvcore lib (see the sketch after this list).
  • Speed and memory evaluations are conducted on a single 2080ti GPU.
  • Memory refers to the occupied memory recorded by nvidia-smi during training.
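
A hedged sketch of how such numbers can be obtained with fvcore, assuming its FlopCountAnalysis and parameter_count utilities; `build_backbone` is a placeholder for however the model is constructed, and fvcore counts multiply-accumulate operations:

import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

model = build_backbone().eval()                   # placeholder constructor
dummy = torch.randn(1, 3, 16, 224, 224)           # (N, C, T, H, W) as in the table
flops = FlopCountAnalysis(model, dummy).total()
params = parameter_count(model)['']               # '' is the key for the whole model
print(f'Params (M): {params / 1e6:.2f}, GFLOPs: {flops / 1e9:.2f}')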

Video Data augmentation

Training augmentation

RandomResizedCrop and RandomRescale

There are two widely used training data augmentations for the spatial view of the input video (both are sketched as transform configs after this list):

  • Resize(-1, 256) - RandomResizedCrop(scale=(0.08, 1.0), ratio=(0.75, 1.33)) - Resize(224, 224)
  • RandomRescale(256, 320) - RandomCrop(224, 224)
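
A hedged sketch of the two pipelines written as mmaction2-style transform configs (transform names and argument names may differ slightly across mmaction2 versions):

# pipeline 1: Resize -> RandomResizedCrop -> Resize
resizedcrop_pipeline = [
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop', area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.33)),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
]
# pipeline 2: RandomRescale -> RandomCrop
rescalecrop_pipeline = [
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=224),
]
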
| Training aug | Metric | Resize(224) | Resize(-1,224)-Center(224) | Resize(-1,256)-Center(224) | Resize(-1,224)-Three(224) | Resize(-1,256)-Three(224) |
| --- | --- | --- | --- | --- | --- | --- |
| ResizeCrop | Cls Acc | 72.29 | 74.76 | 74.23 | 74.88 | 73.81 |
| ResizeCrop | Reg MAE | 16.63 | 16.42 | 16.53 | 16.15 | 16.26 |
| ResizeCrop | Iter/s | 477.3 | 525.0 | 479.8 | 183.7 | 156.1 |
| ResizeCrop | mAP@0.5 | 39.4% | 42.8% | 43.3% | 42.7% | 41.7% |
| RescaleCrop | Cls Acc | 70.96 | 74.50 | 74.48 | 74.78 | 74.76 |
| RescaleCrop | Reg MAE | 16.85 | 16.29 | 16.40 | 16.01 | 16.01 |
| RescaleCrop | Iter/s | 495.8 | 514.6 | 502.4 | 178.8 | 161.8 |
| RescaleCrop | mAP@0.5 | 36.4% | 41.9% | 42.7% | 41.6% | 41.3% |

Mixup

Mixing up two samples for data augmentation:

A minimal PyTorch sketch (alpha, imgs, labels and batch_size are assumed to be defined; labels are one-hot):

import torch
from torch.distributions import Beta

lam = Beta(alpha, alpha).sample()        # mixing coefficient in (0, 1)
rand_index = torch.randperm(batch_size)  # pick a random partner for each sample
mixed_imgs = lam * imgs + (1 - lam) * imgs[rand_index]
mixed_labels = lam * labels + (1 - lam) * labels[rand_index]

The PDF of lambda under different alpha values for mixup: [figure omitted]

  • From the figure above, we can conclude that a small alpha tends to sample values close to 0 or 1, which corresponds to a weak mixup.
  • The strongest mixup occurs at lambda = 0.5. As alpha increases, the probability density of lambda around 0.5 increases.
  • In short: larger alpha, stronger mixup.
  • alpha=1 gives a uniform distribution.
  • alpha=0.8 is adopted in MViT for training action recognition on Kinetics-400. (A quick numerical check follows this list.)
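
A quick numerical check of the claim above, i.e. that the probability mass of lambda near 0.5 grows with alpha:

import torch
from torch.distributions import Beta

for alpha in (0.2, 1.0, 4.0):
    lam = Beta(alpha, alpha).sample((100000,))
    near_half = ((lam - 0.5).abs() < 0.1).float().mean().item()
    print(f'alpha={alpha}: P(|lambda - 0.5| < 0.1) ~ {near_half:.2f}')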

Cutmix

Cutting sub-regions from two samples and mixing them for data augmentation. [Figure omitted]

Similar to mixup, since the strongest cutmix occurs at lambda=0.5, a larger alpha represents a stronger cutmix (a minimal sketch follows). FYI, alpha=1.0 is adopted in MViT for training action recognition on Kinetics-400.
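
A minimal cutmix sketch in the same style as the mixup code above (imgs of shape (N, C, H, W), one-hot labels of shape (N, K), and alpha are assumed to be defined):

import torch
from torch.distributions import Beta

lam = Beta(alpha, alpha).sample()
rand_index = torch.randperm(imgs.size(0))
H, W = imgs.shape[-2:]
cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)    # box area ~ (1 - lam) * H * W
cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()  # random box center
y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
imgs[:, :, y1:y2, x1:x2] = imgs[rand_index, :, y1:y2, x1:x2]           # paste the cut region
lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)                          # correct lambda by the real pasted area
mixed_labels = lam_adj * labels + (1 - lam_adj) * labels[rand_index]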

One may argue that lambda=1.0 should give the strongest mixup/cutmix. However, because the label is also mixed, when lambda=1 the two samples are simply swapped after the mixup/cutmix.

Visualizing the Beta distribution online

Essay

  • Kinetics-400 (trimmed) occupies about 10 MB per video for RGB raw frames, and about 30 MB per video for (optical flow + RGB) raw frames.
  • ActivityNet-200 (untrimmed) occupies about 3.3 GB per video for RGB raw frames, and about 10 GB per video for (optical flow + RGB) raw frames.
  • GMA optical flows of the Kinetics-400 validation set (19404 videos) occupy 61 GB of disk space, i.e., each video occupies about 3.21 MB. So the training set of 240618 videos should occupy about 800 GB. (See the quick check below.)
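
The last estimate is simple arithmetic; a quick check using the numbers above (assuming 1 GB = 1024 MB):

per_video_mb = 61 * 1024 / 19404                 # ~3.2 MB per validation video
train_total_gb = 240618 * per_video_mb / 1024    # ~760 GB, i.e. roughly the 800 GB quoted above
print(per_video_mb, train_total_gb)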