Our ActEV (Activities in Extended Video) experiments for TRECVID 2018 [5] utilized a feature pyramid network (FPN) combined with a deformable convolutional network (DCN) to perform highly accurate, fine-grained object detection. This approach provides a strong baseline for our subsequent action detection and leverages IBM's pioneering work on multi-scale CNNs [1]. Object detection is followed by tracking and action proposals; the latter are generated separately for the three classes of actions: vehicle-turns, vehicle-person-interactions, and person-object-interactions. Proposals are produced analogously to a region proposal network in object detection, but on activity tubes cropped from the original video. Our final action classification is based on an ensemble of temporal relational networks.
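The tube-cropping step above can be illustrated with a minimal sketch: given per-frame bounding boxes from a tracker, extract and resize the corresponding spatio-temporal patch from the video. The function name, box format, and output size below are illustrative assumptions, not the actual system implementation; nearest-neighbor resizing keeps the sketch dependency-free, where a real pipeline would use proper interpolation.

```python
import numpy as np

def crop_activity_tube(frames, track_boxes, out_size=(112, 112)):
    """Crop a spatio-temporal 'tube' from a video, given one
    (x1, y1, x2, y2) tracker box per frame (hypothetical format).
    Returns an array of shape (T, out_h, out_w, C)."""
    out_h, out_w = out_size
    tube = []
    for frame, (x1, y1, x2, y2) in zip(frames, track_boxes):
        patch = frame[y1:y2, x1:x2]
        # nearest-neighbor resize to a fixed spatial size
        ys = np.linspace(0, patch.shape[0] - 1, out_h).astype(int)
        xs = np.linspace(0, patch.shape[1] - 1, out_w).astype(int)
        tube.append(patch[np.ix_(ys, xs)])
    return np.stack(tube)

# toy example: 8 frames of 240x320 RGB video, one static 64x64 track box
video = np.zeros((8, 240, 320, 3), dtype=np.uint8)
boxes = [(100, 80, 164, 144)] * 8
tube = crop_activity_tube(video, boxes)
print(tube.shape)  # (8, 112, 112, 3)
```

A proposal network analogous to an RPN would then score such fixed-size tubes as candidate action instances for the downstream classifier.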