Object-Centric Spatio-Temporal Activity Detection and Recognition

Mandis Beigi, Lisa M Brown, Quanfu Fan, John Henning, Chung-Ching Lin, Honghui Shi, Chiao-fe Shu, Rogério Schmidt Feris

January 2018

Abstract

Our ActEV (Activities in Extended Video) experiments from TRECVID 2018 [5] used a feature pyramid network (FPN) combined with a deformable convolutional network (DCN) to perform accurate, fine-grained object detection. This approach provides a strong baseline for our subsequent action detection and leverages IBM's pioneering work in multi-scale CNNs [1]. Object detection is followed by tracking and action proposal generation, which is performed separately for each of the three classes of actions: vehicle turns, vehicle-person interactions, and person-object interactions. Proposals are generated analogously to a region proposal network in object detection, but on activity tubes cropped out from the original video. Our final action classification is based on an ensemble of temporal relational networks.
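
As a rough illustration of what "analogous to a region proposal network" could mean along the time axis, the sketch below slides fixed-length windows of several scales over a single object track, much as an RPN places anchors of several sizes over an image. The abstract does not specify the actual proposal mechanism, so the function name `temporal_anchors`, the window lengths, and the stride ratio are all hypothetical choices made for this example; cropping the tube from the video and scoring each proposal are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TemporalProposal:
    """One candidate activity segment on a tracked object (hypothetical type)."""
    track_id: int
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive


def temporal_anchors(
    track_id: int,
    track_start: int,
    track_end: int,
    window_lengths: Tuple[int, ...] = (32, 64, 128),  # assumed scales, in frames
    stride_ratio: float = 0.5,                        # assumed 50% window overlap
) -> List[TemporalProposal]:
    """Slide windows of several lengths along one track's frame range."""
    proposals: List[TemporalProposal] = []
    track_length = track_end - track_start
    for length in window_lengths:
        if length > track_length:
            continue  # this scale does not fit inside the track
        stride = max(1, int(length * stride_ratio))
        # Last valid start is track_end - length, so every window stays in range.
        for start in range(track_start, track_end - length + 1, stride):
            proposals.append(TemporalProposal(track_id, start, start + length))
    return proposals


if __name__ == "__main__":
    for p in temporal_anchors(track_id=7, track_start=0, track_end=128):
        print(p)
```

In a full system, each proposal would presumably be paired with a spatial crop around the tracked object and passed to a classifier; here the output is only the list of candidate frame ranges.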

Type

Conference paper

Publication

TRECVID


