Challenge of Remembering from Fewer Frames

The temporal dimension of video instances varies widely, and some action classes can be recognized without observing every frame. To enable fair comparisons across methods and datasets, and to curb the otherwise large memory consumption, we define the working memory size of rehearsal methods in terms of stored frames rather than stored videos. This creates a new scenario for Class Incremental Learning on video data: a rehearsal method must first decide which subset of frames to select from each video, and then decide which videos to store given the frames selected per video. We encourage future work on Class Incremental Learning for Video Understanding to tackle this setting and to improve upon our Temporal Consistency regularization strategy.
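
To make the frame-budget setting concrete, here is a minimal sketch of a rehearsal memory measured in stored frames. The `FrameBudgetMemory` class, the uniform temporal subsampling in `select_frames`, and the greedy first-come admission rule are illustrative assumptions, not the exact procedure used by iCaRL or by our Temporal Consistency strategy (iCaRL, for instance, selects exemplars by herding).

```python
import numpy as np

def select_frames(video: np.ndarray, frames_per_video: int) -> np.ndarray:
    """Step 1: pick a fixed-size subset of frames from a variable-length video.
    Uniform temporal subsampling is an illustrative choice."""
    t = video.shape[0]  # temporal dimension varies across instances
    idx = np.linspace(0, t - 1, num=frames_per_video).round().astype(int)
    return video[idx]

class FrameBudgetMemory:
    """Rehearsal memory whose size is counted in stored frames, not videos."""

    def __init__(self, frame_budget: int, frames_per_video: int):
        self.frame_budget = frame_budget
        self.frames_per_video = frames_per_video
        self.buffer = []  # list of (frames, label) pairs

    @property
    def frames_stored(self) -> int:
        return sum(frames.shape[0] for frames, _ in self.buffer)

    def add(self, video: np.ndarray, label: int) -> bool:
        """Step 2: decide whether to store a video, given its selected frames.
        Greedy admission is a placeholder for a method's own exemplar-
        selection rule (e.g., iCaRL herding)."""
        frames = select_frames(video, self.frames_per_video)
        if self.frames_stored + frames.shape[0] > self.frame_budget:
            return False  # frame budget exhausted
        self.buffer.append((frames, label))
        return True

# e.g., the Kinetics setting with 4 frames per video and a 3.2 x 10^4 budget
memory = FrameBudgetMemory(frame_budget=32_000, frames_per_video=4)
```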

| Model | Frames per video | Kinetics (Mem. Frame Capacity / Acc / BWF) | ActivityNet-Trim (Mem. Frame Capacity / Acc / BWF) | UCF101 (Mem. Frame Capacity / Acc / BWF) |
|---|---|---|---|---|
| iCaRL | 4 | 3.2 × 10⁴ / 30.73% / 40.36% | 1.6 × 10⁴ / 21.63% / 36.98% | 8.08 × 10³ / 80.32% / 17.13% |
| iCaRL | 8 | 6.4 × 10⁴ / 32.04% / 38.48% | 3.2 × 10⁴ / 21.54% / 33.41% | 16.16 × 10³ / 81.12% / 18.25% |
| iCaRL | 16 | 12.8 × 10⁴ / 31.36% / 38.74% | 6.4 × 10⁴ / 25.27% / 29.71% | 32.32 × 10³ / 81.06% / 18.23% |
| iCaRL | ALL | 2 × 10⁶ / 32.04% / 38.74% | 15.5 × 10⁶ / 48.53% / 19.72% | 3.69 × 10⁵ / 80.97% / 18.11% |
| iCaRL+TC | 4 | 3.2 × 10⁴ / 35.32% / 34.07% | 1.6 × 10⁴ / 42.99% / 23.82% | 8.08 × 10³ / 73.85% / 26.35% |
| iCaRL+TC | 8 | 6.4 × 10⁴ / 36.24% / 33.83% | 3.2 × 10⁴ / 45.73% / 18.90% | 16.16 × 10³ / 74.25% / 25.27% |
| iCaRL+TC | 16 | 12.8 × 10⁴ / 36.54% / 33.53% | 6.4 × 10⁴ / 44.04% / 22.82% | 32.32 × 10³ / 75.84% / 23.23% |
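
A useful reading of the Mem. Frame Capacity column: it equals the number of exemplar videos held in memory multiplied by the frames stored per video, so dividing the two recovers the per-dataset video budget implied by the table. The snippet below is a small sanity computation over the 4-frame rows, not part of any benchmark code.

```python
# Mem. Frame Capacity = (videos in memory) x (frames stored per video).
# Recover the implied number of exemplar videos from the 4-frame rows above.
capacities = {
    "Kinetics": (3.2e4, 4),          # 3.2 x 10^4 frames at 4 frames/video
    "ActivityNet-Trim": (1.6e4, 4),  # 1.6 x 10^4 frames at 4 frames/video
    "UCF101": (8.08e3, 4),           # 8.08 x 10^3 frames at 4 frames/video
}
for dataset, (frame_capacity, frames_per_video) in capacities.items():
    videos = frame_capacity / frames_per_video
    print(f"{dataset}: {videos:.0f} videos in memory")
# Kinetics: 8000, ActivityNet-Trim: 4000, UCF101: 2020
```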