Dartmouth College Computer Science
Technical Report series

Representations and Models for Large-Scale Video Understanding
Du L. Tran
Dartmouth TR2017-832

Abstract: In this thesis, we investigate different representations and models for large-scale video understanding. These include a mid-level representation for action recognition, a deep-learned representation for video analysis, a generic convolutional network architecture for video voxel prediction, and a new high-level task and benchmark for video comprehension. First, we present EXMOVES, a mid-level representation for scalable action recognition. The entries of the EXMOVES representation are the calibrated outputs of a set of movement classifiers evaluated over spatial-temporal volumes of the input video. Each movement classifier is a simple exemplar-SVM trained on low-level features. EXMOVES requires a minimal amount of supervision while still obtaining good action recognition accuracy, and it is approximately 70 times faster than other mid-level video representations. Second, we propose an effective method for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale video dataset. We show that 3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets. Our learned features, C3D, combined with a simple linear classifier, outperform state-of-the-art methods on four different benchmarks and are comparable with the current best methods on the other two benchmarks. The features are also very compact, efficient to compute, and easy to use. Third, we develop a generic 3D ConvNet architecture for video voxel prediction. Our preliminary results show that this architecture can be applied to different voxel prediction problems with good results. Finally, we propose a new task, namely Video Comprehension, construct a large-scale benchmark for it, develop a set of fundamental baselines, and conduct a human study on the newly proposed benchmark.
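
Illustration (not from the thesis): the abstract's claim that 3D ConvNets suit spatiotemporal feature learning better than 2D ConvNets can be made concrete with a toy example. The sketch below is a naive pure-Python "valid" 3D convolution; the function names, the toy clip, and the kernels are all hypothetical and chosen only for illustration. It assumes a 2D ConvNet is applied to a clip by treating its frames as input channels, which is one common setup but is not stated on this page. Under that assumption, a 3D kernel with small temporal extent keeps a time axis in its output, whereas a kernel that spans every frame collapses time after a single layer.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation.

    clip[t][y][x] is a single-channel video; kernel[dt][dy][dx] is the filter.
    Returns a nested list indexed as out[t][y][x].
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                s = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            s += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


def shape(volume):
    """Return (T, H, W) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Hypothetical toy clip: 8 frames of 6x6 pixels; each pixel equals its frame index.
clip = [[[float(t) for _ in range(6)] for _ in range(6)] for t in range(8)]

# 3x3x3 temporal-difference kernel: (frame t+2) minus (frame t) at the centre pixel.
kernel_3d = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
kernel_3d[0][1][1] = -1.0
kernel_3d[2][1][1] = 1.0

# A kernel spanning all 8 frames mimics a 2D filter with frames stacked as channels.
kernel_all_frames = [[[1.0 / 72.0] * 3 for _ in range(3)] for _ in range(8)]

features_3d = conv3d_valid(clip, kernel_3d)
features_2d_like = conv3d_valid(clip, kernel_all_frames)

print(shape(features_3d))       # time axis preserved: 6 output frames
print(shape(features_2d_like))  # time axis collapsed to a single frame
```

In this toy setting the 3D output still has six time steps, each holding a frame-to-frame difference, so later layers can continue to model motion. The frames-as-channels output has one time step, so any temporal structure must be captured entirely by the first layer.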

Note: Ph.D. Dissertation. Advisor: Lorenzo Torresani.



Bibliographic citation for this report:
   Du L. Tran, "Representations and Models for Large-Scale Video Understanding." Dartmouth Computer Science Technical Report TR2017-832, August 2016.



To receive a paper copy of a report by mail, send your address and the TR number to reports AT cs.dartmouth.edu


Copyright notice: The documents contained in this server are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Technical reports collection maintained by David Kotz.