Video instances have a temporal dimension whose length can vary widely, and some action classes can be recognized without considering every frame. To allow fair comparisons between methods and datasets and to reduce the large memory consumption, we define the working memory size of rehearsal methods in terms of stored frames. This creates a new scenario for Class Incremental Learning on video data, in which rehearsal methods must first decide which subset of frames to select from each video and then decide which videos to store given the frames selected per video. We encourage future work on Class Incremental Learning for Video Understanding to focus on this scenario and to improve on our Temporal Consistency Regularization strategy.
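
To make the frame-based budget concrete, the following is a minimal sketch of a rehearsal memory whose capacity is counted in stored frames rather than stored videos. The names `uniform_frame_indices` and `FrameBudgetMemory` are hypothetical, and both the uniform temporal subsampling and the first-come storage order are assumptions made for illustration; they are not the selection rules used by the evaluated methods.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def uniform_frame_indices(num_frames: int, frames_per_video: int) -> List[int]:
    """Pick evenly spaced frame indices (assumed selection rule, for illustration)."""
    if frames_per_video >= num_frames:
        # Short video: keep every frame.
        return list(range(num_frames))
    step = num_frames / frames_per_video
    # Take the centre frame of each of the equal-length temporal segments.
    return [int(step * i + step / 2) for i in range(frames_per_video)]


@dataclass
class FrameBudgetMemory:
    """Rehearsal memory whose size is measured in stored frames."""
    capacity_frames: int
    used_frames: int = 0
    items: Dict[str, List[int]] = field(default_factory=dict)

    def try_store(self, video_id: str, num_frames: int, frames_per_video: int) -> bool:
        """Step 1: choose frames for the video. Step 2: store it if the budget allows."""
        indices = uniform_frame_indices(num_frames, frames_per_video)
        if self.used_frames + len(indices) > self.capacity_frames:
            return False  # storing this video would exceed the frame budget
        self.items[video_id] = indices
        self.used_frames += len(indices)
        return True


memory = FrameBudgetMemory(capacity_frames=10)
memory.try_store("video_a", num_frames=100, frames_per_video=4)  # uses 4 frames
memory.try_store("video_b", num_frames=3, frames_per_video=4)    # short video, uses 3
memory.try_store("video_c", num_frames=50, frames_per_video=4)   # rejected: 7 + 4 > 10
```

Counting frames instead of videos means that a method storing fewer frames per video can keep more videos under the same budget, which is the trade-off the scenario above asks rehearsal methods to manage.
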
Model | Frames per video | Kinetics Mem. Frame Capacity | Kinetics Acc | Kinetics BWF | ActivityNet-Trim Mem. Frame Capacity | ActivityNet-Trim Acc | ActivityNet-Trim BWF | UCF101 Mem. Frame Capacity | UCF101 Acc | UCF101 BWF |
---|---|---|---|---|---|---|---|---|---|---|
iCaRL | 4 | 3.2 × 10⁴ | 30.73% | 40.36% | 1.6 × 10⁴ | 21.63% | 36.98% | 8.08 × 10³ | 80.32% | 17.13% |
iCaRL | 8 | 6.4 × 10⁴ | 32.04% | 38.48% | 3.2 × 10⁴ | 21.54% | 33.41% | 16.16 × 10³ | 81.12% | 18.25% |
iCaRL | 16 | 12.8 × 10⁴ | 31.36% | 38.74% | 6.4 × 10⁴ | 25.27% | 29.71% | 32.32 × 10³ | 81.06% | 18.23% |
iCaRL | ALL | 2 × 10⁶ | 32.04% | 38.74% | 15.5 × 10⁶ | 48.53% | 19.72% | 3.69 × 10⁵ | 80.97% | 18.11% |
iCaRL+TC | 4 | 3.2 × 10⁴ | 35.32% | 34.07% | 1.6 × 10⁴ | 42.99% | 23.82% | 8.08 × 10³ | 73.85% | 26.35% |
iCaRL+TC | 8 | 6.4 × 10⁴ | 36.24% | 33.83% | 3.2 × 10⁴ | 45.73% | 18.90% | 16.16 × 10³ | 74.25% | 25.27% |
iCaRL+TC | 16 | 12.8 × 10⁴ | 36.54% | 33.53% | 6.4 × 10⁴ | 44.04% | 22.82% | 32.32 × 10³ | 75.84% | 23.23% |
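
As a quick arithmetic check on the table, the short sketch below divides each reported memory frame capacity by the number of frames per video. The helper name `implied_videos` is hypothetical, and reading the quotient as "number of stored videos" is an inference from the reported numbers rather than something stated in the text; the ALL rows are left out because the per-video frame count is not fixed there.

```python
# Memory frame capacities from the fixed-frame rows of the table,
# keyed by dataset and then by frames per video.
capacity = {
    "Kinetics": {4: 32_000, 8: 64_000, 16: 128_000},
    "ActivityNet-Trim": {4: 16_000, 8: 32_000, 16: 64_000},
    "UCF101": {4: 8_080, 8: 16_160, 16: 32_320},
}


def implied_videos(capacities: dict) -> dict:
    """Divide each frame capacity by its frames-per-video setting."""
    return {frames: total // frames for frames, total in capacities.items()}


for dataset, capacities in capacity.items():
    print(dataset, implied_videos(capacities))
# Kinetics {4: 8000, 8: 8000, 16: 8000}
# ActivityNet-Trim {4: 4000, 8: 4000, 16: 4000}
# UCF101 {4: 2020, 8: 2020, 16: 2020}
```

Under this reading, the quotient is constant within each dataset, so the frame capacity grows linearly with the number of frames kept per video.
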