DeepSpeed ZeRO DP strategies
Created by: lehr-fa
Adds DeepSpeed ZeRO DP strategies (https://arxiv.org/abs/1910.02054v3), which allow training with --subsampling-depth=100 --cropping-size=400 --batch-size=1 without running out of memory. However, this requires mixed precision (--precision 16). Unfortunately, the FusedAdam optimizer does not work for some reason, so ZeRO Stage 3 cannot be used at the moment.
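For context, the memory savings come from ZeRO partitioning training state across data-parallel ranks instead of replicating it. Below is a minimal conceptual sketch (not the DeepSpeed implementation) of the Stage 1 idea: the optimizer state for P parameters is sharded across N ranks, so each rank holds only about P/N of it. The `partition` helper is purely illustrative.

```python
def partition(num_params, world_size):
    """Split parameter indices evenly across ranks (ZeRO-style sharding).

    Conceptual sketch only: illustrates why per-rank optimizer-state
    memory drops from O(P) to O(P / world_size) under ZeRO Stage 1.
    """
    base, rem = divmod(num_params, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # distribute the remainder so shard sizes differ by at most one
        size = base + (1 if rank < rem else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


shards = partition(10, 4)
per_rank = [len(s) for s in shards]  # -> [3, 3, 2, 2]
```

Stages 2 and 3 extend the same sharding to gradients and the parameters themselves, which is why Stage 3 yields the largest savings but also depends on optimizer support such as FusedAdam.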