Kosta Derpanis, Ph.D.
Associate Professor, Ryerson University
Department of Computer Science
kosta _at_ scs.ryerson.ca


I am currently on leave, serving as a Research Scientist at the Samsung AI Centre in Toronto.

Check out our Ryerson Vision Lab (RVL) page; the lab is co-directed by Dr. Neil Bruce and me.

Interested in visual computing (e.g., computer vision, graphics and virtual reality)? You may want to enrol in the following elective undergraduate courses. For further guidance, please feel free to seek out Prof. Neil Bruce, Prof. McInerney or me.

Feb. 1, 2019 One paper accepted to the International Conference on Learning Representations (ICLR) 2019.

Mar. 2, 2018 Journal paper accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2018.

Feb. 28, 2018 One paper accepted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018.

Jul. 16, 2017 One paper accepted to the IEEE International Conference on Computer Vision (ICCV) 2017.

Mar. 1, 2017 Two papers accepted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017.

Jan. 15, 2017 Papers accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2017.

Jan. 15, 2017 As part of NVIDIA's Academic Hardware Donation Program, I received an NVIDIA Titan X (Pascal) for research in deep learning. Thank you, NVIDIA!

Oct. 13, 2016 Recognized by European Conference on Computer Vision (ECCV) 2016 as an "outstanding reviewer".

Aug. 9, 2016 Adam Harley successfully defended his M.Sc. thesis, which was nominated for Ryerson's Governor General's Gold Medal. Adam is now pursuing a Ph.D. at CMU's Robotics Institute. Congrats, Adam!

Mar. 6, 2016 Paper accepted to IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016.

Dec. 17, 2015 Happy to receive the Faculty of Science Dean's Teaching Award (video).

Nov. 23, 2015 Promoted to Associate Professor.

Jun. 27, 2014 Recognized by IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014 as an "outstanding reviewer".

Mar. 1, 2014 I'm a distal fellow of the NSERC Canadian Field Robotics Network (NCFRN).

Jan. 2013 I'm an Adjunct Professor at York University.


Matthew Tesfaldet (Ph.D. with Prof. Marcus Brubaker)

Jason Yu (M.Sc., co-supervised with Prof. Marcus Brubaker)

Matthew Kowal (M.Sc., co-supervised with Prof. Neil Bruce)


Adam Harley (M.Sc., now a Ph.D. candidate at CMU's Robotics Institute)

Hasan Almawi (M.Sc.)

Domenic Curro (M.Sc.)

Andrei Betlen (Undergraduate thesis)

Christopher Kong (M.Sc. co-supervisor)


Learning What You Can Do Before Doing Anything
Intelligent agents can learn to represent the action spaces of other agents simply by observing them act. Such representations help agents quickly learn to predict the effects of their own actions on the environment and to plan complex action sequences. In this work, we address the problem of learning an agent’s action space purely from visual observation. We use stochastic video prediction to learn a latent variable that captures the scene’s dynamics while being minimally sensitive to the scene’s static content. We introduce a loss term that encourages the network to capture the composability of visual sequences and show that it leads to representations that disentangle the structure of actions. We call the full model with composable action representations Composable Learned Action Space Predictor (CLASP). We show the applicability of our method to synthetic settings and its potential to capture action spaces in complex, realistic visual settings. When used in a semi-supervised setting, our learned representations perform comparably to existing fully supervised methods on tasks such as action-conditioned video prediction and planning in the learned action space, while requiring orders of magnitude fewer action labels.
Oleh Rybkin, Karl Pertsch, Konstantinos G. Derpanis, Kostas Daniilidis and Andrew Jaegle
arXiv 2019 (accepted at ICLR 2019)
Two-Stream Convolutional Networks for Dynamic Texture Synthesis
We introduce a two-stream model for dynamic texture synthesis. Our model is based on pre-trained convolutional networks (ConvNets) that target two independent tasks: (i) object recognition, and (ii) optical flow prediction. Given an input dynamic texture, statistics of filter responses from the object recognition ConvNet encapsulate the per frame appearance of the input texture, while statistics of filter responses from the optical flow ConvNet model its dynamics. To generate a novel texture, a noise input sequence is optimized to simultaneously match the feature statistics from each stream of the example texture. Inspired by recent work on image style transfer and enabled by the two-stream model, we also apply the synthesis approach to combine the texture appearance from one texture with the dynamics of another to generate entirely novel dynamic textures. We show that our approach generates novel, high quality samples that match both the framewise appearance and temporal evolution of input imagery.
Matthew Tesfaldet, Marcus A. Brubaker and Konstantinos G. Derpanis
arXiv 2017 (accepted at CVPR 2018)
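The per-stream appearance statistics here are Gram matrices of ConvNet filter responses, as in image style transfer. A minimal numpy sketch of that building block (function names are my own, not from the paper's code):

```python
import numpy as np

def gram(features):
    """Gram matrix of a (C, H, W) feature map: channel co-occurrence
    statistics that discard spatial layout, i.e., a texture statistic."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_loss(feat_synth, feat_target):
    """Mean squared difference between the two Gram matrices."""
    g1, g2 = gram(feat_synth), gram(feat_target)
    return np.mean((g1 - g2) ** 2)
```

Because the Gram matrix sums over spatial positions, spatially permuting a feature map leaves the loss essentially unchanged, which is exactly what makes it a texture (rather than image) statistic.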
MonoCap: Monocular Human Motion Capture using a CNN Coupled with a Geometric Prior
Recovering 3D full-body human pose is a challenging problem with many applications. It has been successfully addressed by motion capture systems with body worn markers and multiple cameras. In this paper, we address the more challenging case of not only using a single camera but also not leveraging markers: going directly from 2D appearance to 3D geometry. Deep learning approaches have shown remarkable abilities to discriminatively learn 2D appearance features. The missing piece is how to integrate 2D, 3D and temporal information to recover 3D geometry and account for the uncertainties arising from the discriminative model. We introduce a novel approach that treats 2D joint locations as latent variables, whose uncertainty distributions are given by a deep fully convolutional network. The unknown 3D poses are modeled by a sparse representation and the 3D parameter estimates are realized via an Expectation-Maximization algorithm, where it is shown that the 2D joint location uncertainties can be conveniently marginalized out during inference. Extensive evaluation on benchmark datasets shows that the proposed approach achieves greater accuracy over state-of-the-art baselines. Notably, the proposed approach does not require synchronized 2D-3D data for training and is applicable to "in-the-wild" images, which is demonstrated with the MPII dataset.
Xiaowei Zhou, Menglong Zhu, Georgios Pavlakos, Spyridon Leonardos, Kostantinos G. Derpanis, and Kostas Daniilidis
to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2018
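The 2D joint uncertainties that the EM step marginalizes over come from per-joint heatmaps produced by the fully convolutional network. As an illustrative sketch (not the paper's code), a heatmap can be treated as an unnormalized distribution over the joint's image position and summarized by its mean and covariance:

```python
import numpy as np

def heatmap_moments(heatmap):
    """Mean and covariance of a 2D joint heatmap, treated as an
    (unnormalized) distribution over the joint's (x, y) image location."""
    p = heatmap / heatmap.sum()
    h, w = p.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mu = np.array([(p * xs).sum(), (p * ys).sum()])
    dx, dy = xs - mu[0], ys - mu[1]
    cov = np.array([[(p * dx * dx).sum(), (p * dx * dy).sum()],
                    [(p * dx * dy).sum(), (p * dy * dy).sum()]])
    return mu, cov
```

For an isotropic Gaussian heatmap, the recovered mean is its centre and the covariance is approximately sigma^2 times the identity, which is the uncertainty a downstream 3D lifting step can integrate over.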
Segmentation-Aware Convolutional Networks Using Local Attention Masks
We introduce an approach to integrate segmentation information within a convolutional neural network (CNN). This counteracts the tendency of CNNs to smooth information across regions and increases their spatial precision. To obtain segmentation information, we set up a CNN to provide an embedding space where region co-membership can be estimated based on Euclidean distance. We use these embeddings to compute a local attention mask relative to every neuron position. We incorporate such masks in CNNs and replace the convolution operation with a "segmentation-aware" variant that allows a neuron to selectively attend to inputs coming from its own region. We call the resulting network a segmentation-aware CNN because it adapts its filters at each image point according to local segmentation cues, while at the same time remaining fully-convolutional. We demonstrate the merit of our method on two widely different dense prediction tasks, that involve classification (semantic segmentation) and regression (optical flow). Our results show that in semantic segmentation we can replace DenseCRF inference with a cascade of segmentation-aware filters, and in optical flow we obtain clearly sharper responses than the ones obtained with comparable networks that do not use segmentation. In both cases segmentation-aware convolution yields systematic improvements over strong baselines.
Adam W. Harley, Konstantinos G. Derpanis and Iasonas Kokkinos
arXiv 2017 (accepted at ICCV 2017)
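The central idea can be sketched with a toy example, assuming a hand-made embedding and plain averaging in place of learned convolution weights: each pixel attends only to neighbours whose embeddings are close, so averaging does not leak across region boundaries.

```python
import numpy as np

def segmentation_aware_filter(image, embeddings, radius=1, alpha=1.0):
    """Masked local averaging: neighbour (ii, jj) of pixel (i, j) is
    weighted by exp(-alpha * ||e_ij - e_iijj||), a local attention mask
    derived from embedding distances. image: (H, W); embeddings: (H, W, D)."""
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            num, den = 0.0, 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        d = np.linalg.norm(embeddings[i, j] - embeddings[ii, jj])
                        mask = np.exp(-alpha * d)  # attention mask value
                        num += mask * image[ii, jj]
                        den += mask
            out[i, j] = num / den
    return out
```

With alpha = 0 the mask is uniform and this reduces to ordinary box filtering, which blurs across the boundary; with a large alpha and well-separated embeddings, the edge is preserved.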
Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization only and recover 3D pose by a subsequent optimization step. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach allows us to train a ConvNet that outperforms all state-of-the-art approaches on standard benchmarks achieving relative error reduction greater than 35% on average. Additionally, we investigate using our volumetric representation in a related architecture which is suboptimal compared to our end-to-end approach, but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available, and allows us to present compelling results for in-the-wild images.
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis and Kostas Daniilidis
arXiv 2016 (accepted at CVPR 2017)
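To illustrate the volumetric representation, the sketch below decodes a per-voxel likelihood volume for one joint into a 3D coordinate via a soft-argmax (an expectation under the softmax distribution); this is a common readout for such volumes, stated here under my own naming rather than as the paper's exact decoding.

```python
import numpy as np

def voxel_likelihoods_to_joint(logits, grid_min, grid_max):
    """logits: (D, H, W) per-voxel scores over a discretized cube
    [grid_min, grid_max]^3. Returns the expected (x, y, z) coordinate
    under the softmax of the scores."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    axes = [np.linspace(grid_min, grid_max, n) for n in logits.shape]
    zz, yy, xx = np.meshgrid(*axes, indexing="ij")
    return np.array([(p * xx).sum(), (p * yy).sum(), (p * zz).sum()])
```

A sharply peaked volume returns the coordinate of the peak voxel; a diffuse volume returns a weighted average, which is one reason the volumetric output behaves more gracefully than direct coordinate regression.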
Harvesting Multiple Views for Marker-less 3D Human Pose Annotations
Recent advances with Convolutional Networks (ConvNets) have shifted the bottleneck for many computer vision tasks to annotated data collection. In this paper, we present a geometry-driven approach to automatically collect annotations for human pose prediction tasks. Starting from a generic ConvNet for 2D human pose, and assuming a multi-view setup, we describe an automatic way to collect accurate 3D human pose annotations. We capitalize on constraints offered by the 3D geometry of the camera setup and the 3D structure of the human body to probabilistically combine per view 2D ConvNet predictions into a globally optimal 3D pose. This 3D pose is used as the basis for harvesting annotations. The benefit of the annotations produced automatically with our approach is demonstrated in two challenging settings: (i) fine-tuning a generic ConvNet-based 2D pose predictor to capture the discriminative aspects of a subject's appearance (i.e.,"personalization"), and (ii) training a ConvNet from scratch for single view 3D human pose prediction without leveraging 3D pose groundtruth. The proposed multi-view pose estimator achieves state-of-the-art results on standard benchmarks, demonstrating the effectiveness of our method in exploiting the available multi-view information.
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis and Kostas Daniilidis
arXiv 2016 (accepted at CVPR 2017)
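A basic geometric ingredient of combining per-view 2D predictions into a 3D point is triangulation. The paper's combination is probabilistic and more involved; the minimal linear (DLT) two-view case looks like this:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover the 3D point whose projections
    under 3x4 camera matrices P1, P2 are the 2D observations x1, x2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                # null vector of A (homogeneous 3D point)
    return X[:3] / X[3]
```

Each 2D observation contributes two linear constraints on the homogeneous 3D point; the SVD returns the least-squares solution when the constraints are noisy.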
6-DoF Object Pose from Semantic Keypoints
This paper presents a novel approach to estimating the continuous six degree of freedom (6-DoF) pose (3D translation and rotation) of an object from a single RGB image. The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model. Unlike prior work, we are agnostic to whether the object is textured or textureless, as the convnet learns the optimal representation from the available training image data. Furthermore, the approach can be applied to instance- and class-based pose recovery. Empirically, we show that the proposed approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios with a cluttered background. For class-based object pose estimation, state-of-the-art accuracy is shown on the large-scale PASCAL3D+ dataset.
Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis and Kostas Daniilidis
arXiv 2016 (accepted at ICRA 2017)
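When 3D positions for the semantic keypoints are available, the rigid part of the pose has a closed-form least-squares solution (Kabsch/orthogonal Procrustes). The sketch below is that simplified rigid 3D-3D case, not the paper's deformable-model fit from 2D keypoints:

```python
import numpy as np

def rigid_pose(model_pts, obs_pts):
    """Kabsch: least-squares rotation R and translation t such that
    obs_pts ~= model_pts @ R.T + t, for (N, 3) point arrays."""
    mu_m, mu_o = model_pts.mean(0), obs_pts.mean(0)
    H = (model_pts - mu_m).T @ (obs_pts - mu_o)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_o - R @ mu_m
    return R, t
```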
Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness
Recently, convolutional networks (convnets) have proven useful for predicting optical flow. Much of this success is predicated on the availability of large datasets that require expensive and involved data acquisition and laborious labeling. To bypass these challenges, we propose an unsupervised approach (i.e., without leveraging groundtruth flow) to train a convnet end-to-end for predicting optical flow between two images. We use a loss function that combines a data term that measures photometric constancy over time with a spatial term that models the expected variation of flow across the image. Together these losses form a proxy measure for losses based on the groundtruth flow. Empirically, we show that a strong convnet baseline trained with the proposed unsupervised approach outperforms the same network trained with supervision on the KITTI dataset.
Jason Yu, Adam Harley and Konstantinos Derpanis
arXiv 2016 (accepted at ECCV Workshops 2016)
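The proxy loss can be sketched directly: a photometric (brightness constancy) term on the second image warped by the flow, plus a smoothness term on flow gradients. A toy numpy version, with nearest-neighbour warping standing in for the bilinear sampling a trainable model would use:

```python
import numpy as np

def photometric_loss(im1, im2, flow):
    """Charbonnier-style data term: penalize |im1(x) - im2(x + flow(x))|.
    flow[..., 0] is the x displacement, flow[..., 1] the y displacement."""
    h, w = im1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xw = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    yw = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    warped = im2[yw, xw]
    eps = 1e-3
    return np.mean(np.sqrt((im1 - warped) ** 2 + eps ** 2))

def smoothness_loss(flow):
    """Spatial term: total variation of the flow field."""
    return np.mean(np.abs(np.diff(flow, axis=1))) + np.mean(np.abs(np.diff(flow, axis=0)))

def unsupervised_flow_loss(im1, im2, flow, lam=0.5):
    """Combined proxy loss: no groundtruth flow needed."""
    return photometric_loss(im1, im2, flow) + lam * smoothness_loss(flow)
```

For a pair related by a one-pixel horizontal shift, the loss is far lower at the true constant flow than at zero flow, which is the signal that drives training.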
Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video
This paper addresses the challenge of 3D full-body human pose estimation from a monocular image sequence. Here, two cases are considered: (i) the image locations of the human joints are provided and (ii) the image locations of joints are unknown. In the former case, a novel approach is introduced that integrates a sparsity-driven 3D geometric prior and temporal smoothness. In the latter case, the former case is extended by treating the image locations of the joints as latent variables. A deep fully convolutional network is trained to predict the uncertainty maps of the 2D joint locations. The 3D pose estimates are realized via an Expectation-Maximization algorithm over the entire sequence, where it is shown that the 2D joint location uncertainties can be conveniently marginalized out during inference. Empirical evaluation on the Human3.6M dataset shows that the proposed approaches achieve greater 3D pose estimation accuracy over state-of-the-art baselines. Further, the proposed approach outperforms a publicly available 2D pose estimation baseline on the challenging PennAction dataset.
Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Konstantinos Derpanis and Kostas Daniilidis
arXiv (accepted at CVPR 2016)
Learning Dense Convolutional Embeddings for Semantic Segmentation
This paper proposes a new deep convolutional neural network (DCNN) architecture that learns pixel embeddings, such that for any two pixels on the same object, the embeddings are nearly identical. Inversely, the DCNN is trained to produce dissimilar representations for pixels coming from differing objects. Experimental results show that when this embedding network is used to augment a DCNN trained on semantic segmentation, there is a systematic improvement in per-pixel classification accuracy. This strategy is complementary to many others pursued in semantic segmentation, and it is implemented efficiently in a popular deep learning framework, making its integration with existing systems very straightforward.
Adam Harley, Konstantinos Derpanis and Iasonas Kokkinos
arXiv (Workshop paper at ICLR 2016)
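The training signal can be sketched as a pairwise loss over pixel embeddings: pull same-object pairs together, push different-object pairs beyond a margin. This is a generic contrastive formulation; the paper's exact loss may differ.

```python
import numpy as np

def pairwise_embedding_loss(emb, labels, margin=1.0):
    """emb: (N, D) pixel embeddings; labels: (N,) object ids.
    Same-object pairs pay squared distance; different-object pairs
    pay a squared hinge on (margin - distance)."""
    n = emb.shape[0]
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(emb[i] - emb[j])
            if labels[i] == labels[j]:
                loss += d ** 2
            else:
                loss += max(0.0, margin - d) ** 2
            count += 1
    return loss / count
```

The loss reaches zero exactly when same-object embeddings coincide and different-object embeddings are at least a margin apart, i.e., the "nearly identical within, dissimilar across" property described above.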
Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval
This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analysis, and confirms that this representation strategy is superior to a variety of popular handcrafted alternatives. Extensive experiments show that (i) features extracted from CNNs are robust to compression, (ii) CNNs trained on non-document images transfer well to document analysis tasks, and (iii) enforcing region-specific feature-learning is unnecessary given sufficient training data. This work also makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories.
Adam Harley, Alex Ufkes and Konstantinos Derpanis
ICDAR 2015 (Best Student Paper Award)
Single Image 3D Object Detection and Pose Estimation for Grasping
We present a novel approach for detecting objects and estimating their 3D pose in single images of cluttered scenes. Objects are given in terms of 3D models without accompanying texture cues. A deformable parts-based model is trained on clusters of silhouettes of similar poses and produces hypotheses about possible object locations at test time. Objects are simultaneously segmented and verified inside each hypothesis bounding region by selecting the set of superpixels whose collective shape matches the model silhouette. A final iteration on the 6-DOF object pose minimizes the distance between the selected image contours and the actual projection of the 3D model. We demonstrate successful grasps using our detection and pose estimate with a PR2 robot. Extensive evaluation with a novel ground truth dataset shows the considerable benefit of using shape-driven cues for detecting objects in heavily cluttered scenes.
Menglong Zhu, Konstantinos Derpanis, Yinfei Yang, Samarth Brahmbhatt, Mabel Zhang, Cody Phillips and Kostas Daniilidis
ICRA 2014
From Actemes to Action: A Strongly-supervised Representation for Detailed Action Understanding
This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.
Weiyu Zhang, Menglong Zhu and Konstantinos Derpanis
ICCV 2013
Action Spotting and Recognition Based on a Spatiotemporal Orientation Analysis
This paper provides a unified framework for the interrelated topics of action spotting, the spatiotemporal detection and localization of human actions in video, and action recognition, the classification of a given video into one of several predefined categories. A novel compact local descriptor of video dynamics in the context of action spotting and recognition is introduced based on visual spacetime oriented energy measurements. This descriptor is efficiently computed directly from raw image intensity data and thereby forgoes the problems typically associated with flow-based features. Importantly, the descriptor allows for the comparison of the underlying dynamics of two spacetime video segments irrespective of spatial appearance, such as differences induced by clothing, and with robustness to clutter. An associated similarity measure is introduced that admits efficient exhaustive search for an action template, derived from a single exemplar video, across candidate video sequences. The general approach presented for action spotting and recognition is amenable to efficient implementation, which is deemed critical for many important applications. For action spotting, details of a real-time GPU-based instantiation of the proposed approach are provided. Empirical evaluation of both action spotting and action recognition on challenging datasets suggests the efficacy of the proposed approach, with state-of-the-art performance documented on standard datasets.
Konstantinos Derpanis, Mikhail Sizintsev, Kevin J. Cannons, and Richard Wildes
PAMI 2013
Dynamic Scene Understanding: The Role of Orientation Features in Space and Time in Scene Classification
Natural scene classification is a fundamental challenge in computer vision. By far, the majority of studies have limited their scope to scenes from single image stills and thereby ignore potentially informative temporal cues. The current paper is concerned with determining the degree of performance gain in considering short videos for recognizing natural scenes. Towards this end, the impact of multiscale orientation measurements on scene classification is systematically investigated, as related to: (i) spatial appearance, (ii) temporal dynamics and (iii) joint spatial appearance and dynamics. These measurements in visual space, x-y, and spacetime, x-y-t, are recovered by a bank of spatiotemporal oriented energy filters. In addition, a new data set is introduced that contains 420 image sequences spanning fourteen scene categories, with temporal scene information due to objects and surfaces decoupled from camera-induced ones. This data set is used to evaluate classification performance of the various orientation-related representations, as well as state-of-the-art alternatives. It is shown that a notable performance increase is realized by spatiotemporal approaches in comparison to purely spatial or purely temporal methods.
Konstantinos Derpanis, Matthieu Lecce, Kostas Daniilidis and Richard Wildes
CVPR 2012
Spacetime Texture Representation and Recognition Based on a Spatiotemporal Orientation Analysis
This paper is concerned with the representation and recognition of the observed dynamics (i.e., excluding purely spatial appearance cues) of spacetime texture based on a spatiotemporal orientation analysis. The term “spacetime texture” is taken to refer to patterns in visual spacetime, x-y-t, that primarily are characterized by the aggregate dynamic properties of elements or local measurements accumulated over a region of spatiotemporal support, rather than in terms of the dynamics of individual constituents. Examples include image sequences of natural processes that exhibit stochastic dynamics (e.g., fire, water, and windblown vegetation) as well as images of simpler dynamics when analyzed in terms of aggregate region properties (e.g., uniform motion of elements in imagery, such as pedestrians and vehicular traffic). Spacetime texture representation and recognition is important as it provides an early means of capturing the structure of an ensuing image stream in a meaningful fashion. Toward such ends, a novel approach to spacetime texture representation and an associated recognition method are described based on distributions (histograms) of spacetime orientation structure. Empirical evaluation on both standard and original image data sets shows the promise of the approach, including significant improvement over alternative state-of-the-art approaches in recognizing the same pattern from different viewpoints.
Konstantinos Derpanis and Richard Wildes
PAMI 2012
The Structure of Multiplicative Motions in Natural Imagery
A theoretical investigation of the frequency structure of multiplicative image motion signals is presented, e.g., as associated with translucency phenomena. Previous work has claimed that the multiplicative composition of visual signals generally results in the annihilation of oriented structure in the spectral domain. As a result, research has focused on multiplicative signals in highly specialized scenarios where highly structured spectral signatures are prevalent, or introduced a nonlinearity to transform the multiplicative image signal to an additive one. In contrast, in this paper, it is shown that oriented structure is present in multiplicative cases when natural domain constraints are taken into account. This analysis suggests that the various instances of naturally occurring multiple motion structures can be treated in a unified manner. As an example application of the developed theory, a multiple motion estimator previously proposed for translation, additive transparency, and occlusion is adapted to multiplicative image motions. This estimator is shown to yield superior performance over the alternative practice of introducing a nonlinear preprocessing step.
Konstantinos Derpanis and Richard Wildes
PAMI 2009
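The frequency-domain analysis rests on the modulation theorem: a pointwise product of signals corresponds to a convolution of their spectra. A small numpy check of the discrete (circular) version:

```python
import numpy as np

# Modulation theorem for the DFT: FFT(a * b) equals the circular
# convolution of FFT(a) and FFT(b), scaled by 1/N.
rng = np.random.default_rng(0)
N = 64
a, b = rng.standard_normal(N), rng.standard_normal(N)

lhs = np.fft.fft(a * b)

A, B = np.fft.fft(a), np.fft.fft(b)
# rhs[k] = (1/N) * sum_m A[m] * B[(k - m) mod N]
rhs = np.array([np.sum(A * np.roll(B[::-1], k + 1)) for k in range(N)]) / N
# lhs and rhs agree to machine precision.
```

This is why a multiplicative composition of two moving patterns carries the spectra of both components, convolved, rather than simply erasing oriented structure.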
On the Role of Representation in the Analysis of Visual Spacetime
The problems under consideration in this dissertation centre around the representation of visual spacetime, i.e., (visual) image intensity (irradiance) as a function of two-dimensional spatial position and time. In particular, the overarching goal is to establish a unified approach to representation and analysis of temporal image dynamics that is broadly applicable to the diverse phenomena in the natural world as captured in two-dimensional intensity images. Previous research largely has approached the analysis of visual dynamics by appealing to representations based on image motion. Although of obvious importance, motion represents a particular instance of the myriad spatiotemporal patterns observed in image data. A generative model centred on the concept of spacetime orientation is proposed. This model provides a unified framework for understanding a broad set of important spacetime patterns. As a consequence of this analysis, two new classes of patterns are distinguished that have previously not been considered directly in terms of their constituent spacetime oriented structure, namely multiplicative motions (e.g., translucency) and stochastic-related phenomena (e.g., wavy water and windblown vegetation). Motivated by this analysis, a representation is proposed that systematically exposes the structure of visual spacetime in terms of local, spacetime orientation. The power of this representation is demonstrated in the context of the following four fundamental challenges in computer vision: (i) spacetime texture recognition, (ii) spacetime grouping, (iii) local boundary detection and (iv) the detection and spatiotemporal localization of an action in a video stream.
Konstantinos Derpanis (Supervisor: Richard Wildes)
Dissertation, York University, 2010
CIPPRS 2010 Doctoral Dissertation Award, Honorable Mention


"Computer Vision Goes Back to the Future" presented at the Western/SHARCNET Workshop on Deep Learning (PDF, M4V)

"From 3D Models to Images" presented at the NCFRN Annual General Meeting (PDF, clickable MOV)