Object-Centric Spatio-Temporal Activity Detection and Recognition

Abstract

Our ActEV (Activities in Extended Video) experiments from TRECVID 2018 [5] used a feature pyramid network (FPN) combined with a deformable convolutional network (DCN) to perform accurate, fine-grained object detection. This approach provides a strong baseline for our subsequent action detection and leverages IBM's pioneering work in multi-scale CNNs [1]. Object detection is followed by tracking and action proposals; the latter are generated separately for the three classes of actions: vehicle-turns, vehicle-person interactions, and person-object interactions. Proposals are produced analogously to a region proposal network in object detection, but on activity tubes cropped out of the original video. Our final action classification is based on an ensemble of temporal relational networks.
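A temporal relational network scores a clip by pooling relation terms over sampled k-frame tuples at several temporal scales. The sketch below is only an illustration of that idea in NumPy; the function names, the random linear map standing in for the learned relation module g_θ, and all sizes are our assumptions, not details from the paper.

```python
import numpy as np
from itertools import combinations

def k_frame_relation(features, k, w, rng, n_tuples=3):
    """One k-frame relation term: sample time-ordered k-tuples of
    frame features, concatenate each tuple, transform with a
    stand-in linear map `w` (k*D, H), and average the results."""
    T, D = features.shape
    tuples = list(combinations(range(T), k))       # ascending = time-ordered
    idx = rng.choice(len(tuples), size=min(n_tuples, len(tuples)),
                     replace=False)
    outs = []
    for i in idx:
        x = features[list(tuples[i])].reshape(-1)  # concat k frames -> (k*D,)
        outs.append(np.maximum(x @ w, 0.0))        # ReLU; stand-in for g_theta
    return np.mean(outs, axis=0)                   # (H,)

def multi_scale_relation(features, ks, rng, hidden=8):
    """Sum relation terms over several temporal scales k, mimicking
    a multi-scale temporal relational network's pooled feature."""
    T, D = features.shape
    terms = [k_frame_relation(features, k,
                              rng.standard_normal((k * D, hidden)), rng)
             for k in ks]
    return np.sum(terms, axis=0)                   # (hidden,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))                # 8 frames, 4-dim features
score = multi_scale_relation(feats, ks=(2, 3), rng=rng)
print(score.shape)                                 # (8,)
```

In the real system this pooled feature would feed a classifier head, and the ensemble averages predictions from several such networks.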

Publication
TRECVID
John Henning
Deep Learning Engineer

My interests lie in solving problems at the intersection of research and engineering. I love doing research and then applying it to the problem at hand.
