Dartmouth College Computer Science
Technical Report series
In this thesis, we investigate different representations and models
for large-scale video understanding. These methods include a mid-level
representation for action recognition, a deep-learned representation
for video analysis, a generic convolutional network architecture for
video voxel prediction, and a new high-level task and benchmark of
video comprehension.

First, we present EXMOVES, a mid-level representation for scalable
action recognition. The entries of the EXMOVES representation are the
calibrated outputs of a set of movement classifiers over
spatial-temporal volumes of the input video. Each movement classifier
is a simple exemplar-SVM trained on low-level features. EXMOVES
requires a minimal amount of supervision while still achieving good
action recognition accuracy. It is approximately 70 times faster than
other mid-level video representations. Second, we propose an effective
method for spatiotemporal feature learning using deep 3-dimensional
convolutional networks (3D ConvNets) trained on a large-scale video
dataset. We show that 3D ConvNets are more suitable for spatiotemporal
feature learning than 2D ConvNets. Our learned features, C3D,
with a simple linear classifier outperform state-of-the-art methods on
four different benchmarks and are comparable with the current best methods
on the other two benchmarks. The features are also very compact,
efficient to compute, and easy to use. Third, we develop a generic 3D
ConvNet architecture for video voxel prediction. Our preliminary
results show that our architecture can be applied to different voxel
prediction problems with good results. Finally, we propose a new task,
namely Video Comprehension, construct a large-scale benchmark,
develop a set of fundamental baselines, and conduct a human
study on the newly proposed benchmark.
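
As a toy illustration of why a convolution over three dimensions can capture motion, the following minimal sketch applies a single 3D kernel to a tiny single-channel clip. It is not the C3D architecture from the thesis: the clip, the kernel weights, and the names conv3d_valid and shape are assumptions made up for this example. The kernel spans three frames, so its response depends on how pixel values change over time, which a kernel applied to one frame at a time cannot observe.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over a single-channel clip.

    video:  nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    Returns a nested list indexed [t][y][x] with a reduced size on each axis.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the temporal AND spatial extent of the kernel.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(volume):
    """Return (frames, height, width) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Hypothetical toy clip: 8 frames of 6x6 pixels whose brightness equals the frame index.
clip = [[[float(t) for _ in range(6)] for _ in range(6)] for t in range(8)]

# Hypothetical temporal-difference kernel: 3 frames deep, 3x3 in space.
# Weights are -1/9 on the first frame, 0 on the middle frame, +1/9 on the last.
kernel = [[[w / 9.0] * 3 for _ in range(3)] for w in (-1.0, 0.0, 1.0)]

response = conv3d_valid(clip, kernel)
print(shape(response))  # the output still has a temporal axis
```

With this kernel, each response value is the average brightness of the third frame in the window minus that of the first, so it measures change over time. The output keeps a temporal axis, which lets further 3D layers continue to model motion.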
Ph.D. Dissertation. Advisor: Lorenzo Torresani.
Bibliographic citation for this report:
Du L. Tran, "Representations and Models for Large-Scale Video Understanding." Dartmouth Computer Science Technical Report TR2017-832, August 2016.
To receive a paper copy of a report by mail, send your address and the TR number to reports AT cs.dartmouth.edu
Copyright notice: The documents contained in this server are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Technical reports collection maintained by David Kotz.