UCL Discovery

Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

Chadha, A; Abbas, A; Andreopoulos, Y; (2017) Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor. IEEE Transactions on Circuits and Systems for Video Technology 10.1109/TCSVT.2017.2786999. (In press). Green open access

Chadha_TCSVT_mvcnn.pdf - Accepted version


Abstract

We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams also enables selective, motion-adaptive MB pixel decoding (also known as MB texture decoding). This in turn allows spatio-temporal video activity regions to be derived at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. To evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive with the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature.
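
The two ideas in the abstract, motion-adaptive selection of macroblocks and late fusion of the two CNN streams' scores, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' method: the abstract does not state how activity regions are chosen or how scores are fused, so the sketch uses a simple magnitude threshold on each MB's motion vector and a weighted average of softmax scores. The function names, the `threshold` and `temporal_weight` parameters, and the toy inputs are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def active_macroblocks(mv_field, threshold=1.0):
    """Return (row, col) indices of macroblocks whose motion-vector
    magnitude exceeds `threshold`.

    Assumption: a plain magnitude threshold stands in for whatever
    motion-adaptive selection rule the paper actually uses.
    """
    active = []
    for r, row in enumerate(mv_field):
        for c, (dx, dy) in enumerate(row):
            if math.hypot(dx, dy) > threshold:
                active.append((r, c))
    return active


def fuse_scores(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Late fusion of two streams by weighted averaging of softmax scores.

    Assumption: weighted averaging is one common fusion rule; the abstract
    only says the two streams' scores are fused.
    """
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * s + w * t for s, t in zip(spatial, temporal)]


def predict(fused_scores):
    """Index of the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Toy 2x2 grid of per-macroblock motion vectors (dx, dy), made up for the demo.
mv_field = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.5, 0.5), (0.0, -2.0)],
]
selected = active_macroblocks(mv_field, threshold=1.0)

# Made-up per-class outputs of a spatial (texture) and a temporal (MV) network.
fused = fuse_scores([2.0, 1.0, 0.1], [0.1, 3.0, 0.2], temporal_weight=0.5)
label = predict(fused)
```

In this toy run only the macroblocks with noticeable motion would be passed on for texture decoding, and the fused scores decide the class. In a real pipeline the motion vectors would come from the codec's bitstream parser and the logits from trained networks, neither of which is modeled here.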

Type: Article
Title: Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/TCSVT.2017.2786999
Publisher version: https://doi.org/10.1109/TCSVT.2017.2786999
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: video coding, classification, deep learning, computer architecture, decoding, three-dimensional displays, two-dimensional displays, training, complexity theory
UCL classification: UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng
URI: https://discovery.ucl.ac.uk/id/eprint/10043769