3D Human Pose Estimation - CVLAB - EPFL

Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation

Most recent approaches to monocular 3D human pose estimation rely on Deep Learning. They typically regress from an image either to 3D joint coordinates directly or to 2D joint locations from which the 3D coordinates are then inferred. Both approaches have their strengths and weaknesses, so we propose a novel architecture designed to deliver the best of both worlds by performing both simultaneously and fusing the information along the way. At the heart of our framework is a trainable fusion scheme that learns how to fuse the information optimally instead of being hand-designed. This yields significant improvements over the state of the art on standard 3D human pose estimation benchmarks.
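
To make the idea of a learned rather than hand-designed fusion more concrete, here is a minimal sketch in Python with NumPy. It is an illustrative assumption, not the architecture from the paper: the sigmoid-shaped schedule, the parameters alpha and beta, and the function names fusion_weights and fuse are all hypothetical. The point is only that, if the depth at which the two streams are merged is controlled by continuous parameters, those parameters can in principle be trained together with the rest of the network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_weights(num_layers, alpha, beta):
    """Soft switch over layer indices 1..num_layers.

    The weight is close to 0 before layer `beta` and close to 1 after it.
    alpha (sharpness) and beta (fusion depth) are hypothetical stand-ins
    for parameters that would be learned during training.
    """
    layers = np.arange(1, num_layers + 1)
    return sigmoid(alpha * (layers - beta))

def fuse(feat_2d, feat_3d, fused_prev, w):
    """Blend the two stream features with the running fused feature.

    w = 0 keeps only the concatenated stream features;
    w = 1 carries only the already-fused feature forward.
    fused_prev must have the same length as the concatenation.
    """
    streams = np.concatenate([feat_2d, feat_3d])
    return (1.0 - w) * streams + w * fused_prev

# Example: five layers, with the switch centered on layer 3.
w = fusion_weights(num_layers=5, alpha=10.0, beta=3.0)
mixed = fuse(np.ones(2), np.zeros(2), np.full(4, 5.0), w=0.5)
```

With a large alpha the schedule behaves almost like a hard choice of fusion layer, while a small alpha spreads the fusion over several layers.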

Results

Our approach is able to disambiguate challenging poses with mirroring and self-occlusion and achieves state-of-the-art performance by fusing 2D and 3D image cues. We provide several example videos on Human3.6m below. The first skeleton depicts our prediction and the second the ground-truth. Best viewed in full-screen mode.

We further provide predictions from HumanEva-I sequences below.

We also demonstrate the performance of our approach on KTH Multiview Football II below.

Our code is available under the terms of the MIT license at the following link: [code].

Direct Prediction of 3D Body Poses from Motion Compensated Sequences

We propose an efficient approach that exploits motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatiotemporal volume of bounding boxes to a 3D pose in the central frame.

We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This allows us to overcome ambiguities effectively and to improve upon the state of the art by a large margin on the Human3.6m, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks.
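
The effect of motion compensation on the input volume can be sketched as follows. This is a simplified illustration under stated assumptions, not our motion compensation algorithm: the per-frame subject centers are taken as given, frames are 2D grayscale arrays, and the helper names crop_centered and build_volume are hypothetical. Each frame is cropped around its own center, so the subject stays at the middle of every slice of the resulting volume.

```python
import numpy as np

def crop_centered(frame, center, size):
    """Crop a size x size window around center=(row, col), zero-padding at borders."""
    half = size // 2
    padded = np.pad(frame, half, mode="constant")
    r, c = int(center[0]), int(center[1])
    # After padding by `half`, original pixel (r, c) sits at (r + half, c + half),
    # so the window starting at (r, c) in padded coordinates is centered on it.
    return padded[r:r + size, c:c + size]

def build_volume(frames, centers, size):
    """Stack per-frame crops into a (T, size, size) spatiotemporal volume."""
    return np.stack([crop_centered(f, c, size) for f, c in zip(frames, centers)])

# Toy sequence: one bright pixel drifting across three 20 x 20 frames.
centers = [(5 + t, 5 + 2 * t) for t in range(3)]
frames = np.zeros((3, 20, 20))
for t, (r, c) in enumerate(centers):
    frames[t, r, c] = 1.0

volume = build_volume(frames, centers, size=7)
```

In the toy sequence the bright pixel drifts in image coordinates but lands at index [3, 3] of every slice of the volume, which is the property a regressor operating on the volume relies on.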

Results

We can disambiguate challenging poses with mirroring and self-occlusion and achieve state-of-the-art performance by combining appearance and motion cues from motion-compensated, rectified spatiotemporal volumes (RSTVs). We provide several example videos below. Note that our results are obtained without temporal smoothing or rigid alignment of the pose predictions.
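
For readers unfamiliar with the last remark, the sketch below shows what a rigid alignment step would look like if it were applied before measuring error. It is a generic illustration and not our evaluation code: the function names mpjpe and rigid_align are hypothetical, the error is the mean Euclidean distance over joints, and the alignment uses the standard Kabsch algorithm (rotation and translation, no scaling).

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints; inputs are (J, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def rigid_align(pred, gt):
    """Rotate and translate pred onto gt (Kabsch algorithm, no scaling)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, _, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return P @ R + mu_g

# Toy check: a pose that differs from the ground truth only by a rigid motion.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
angle = 0.3
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
pred = gt @ Rz + np.array([1.0, 2.0, 3.0])

raw_error = mpjpe(pred, gt)
aligned_error = mpjpe(rigid_align(pred, gt), gt)
```

Alignment removes any global rotation and translation error, so the aligned error in the toy check collapses to numerical noise while the raw error does not. Errors reported without alignment are therefore the stricter measure.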

We obtain RSTVs using our CNN-based motion compensation algorithm. The video below depicts several motion compensation examples on our datasets.

We provide examples of 3D human pose estimation with kernel ridge regression (KRR), kernel dependency estimation (KDE) and deep network (DN) regressors, applied to rectified spatiotemporal volumes (RSTVs). Of the three, RSTV+DN yields the most accurate 3D pose estimates.
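
As a point of reference for the simplest of these regressors, the following is a minimal kernel ridge regression sketch in NumPy. It is a textbook formulation under stated assumptions, not the implementation used in our experiments: the Gaussian kernel, the values of gamma and lam, and the function names are illustrative choices, and the toy inputs stand in for feature vectors that would be extracted from RSTVs.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, Y, gamma, lam):
    """Solve (K + lam * I) alpha = Y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def krr_predict(X_train, alpha, X_new, gamma):
    """Predict outputs for X_new as kernel-weighted sums of alpha."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy multi-output problem: map a scalar input to a 2D target.
X = np.linspace(0.0, 1.0, 20)[:, None]
Y = np.hstack([np.sin(2 * np.pi * X), np.cos(2 * np.pi * X)])
alpha = krr_fit(X, Y, gamma=10.0, lam=1e-6)
Y_hat = krr_predict(X, alpha, np.array([[0.33]]), gamma=10.0)
```

Because Y may have several columns, the same solve handles all output dimensions at once, which is how such a regressor would map one feature vector to a full vector of 3D joint coordinates.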

We provide further visualization for the HumanEva dataset below.

The 3D body pose is recovered from the left camera view and reprojected onto the others. Our method reliably recovers the 3D pose, which reprojects well onto camera views that were not used to compute it.
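
The reprojection step can be sketched with a standard pinhole camera model. The code below is a generic illustration, not our visualization code: the intrinsics K, the camera poses (R, t), and the function names project and to_world are assumptions made for the example. A pose expressed in the coordinates of one camera is first moved to world coordinates and then projected into a second camera.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project (J, 3) world points into a camera with intrinsics K and pose (R, t)."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

def to_world(points_cam, R, t):
    """Invert the camera pose: camera coordinates -> world coordinates."""
    return (points_cam - t) @ R

# Hypothetical setup: the left camera sits at the world origin,
# the second camera is displaced by 5 units along the optical axis.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
R_left, t_left = np.eye(3), np.zeros(3)
R_other, t_other = np.eye(3), np.array([0.0, 0.0, 5.0])

pose_left = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]])   # joints in left-camera coordinates
pose_world = to_world(pose_left, R_left, t_left)
pixels_other = project(pose_world, K, R_other, t_other)
```

If the recovered 3D pose is accurate, the projected joints should fall on the person in the second view even though that view played no part in the estimation.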

References

Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation

B. Tekin; P. Marquez Neila; M. Salzmann; P. Fua

2017. International Conference on Computer Vision (ICCV). p. 3961-3970. DOI: 10.1109/ICCV.2017.425.

Structured Prediction of 3D Human Pose with Deep Neural Networks

B. Tekin; I. Katircioglu; M. Salzmann; V. Lepetit; P. Fua

2016. British Machine Vision Conference (BMVC), York, UK, September 19-22, 2016. p. 130.1-130.11. DOI: 10.5244/C.30.130.

Direct Prediction of 3D Body Poses from Motion Compensated Sequences

B. Tekin; A. Rozantsev; V. Lepetit; P. Fua

2016. Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, June 26-July 1, 2016. p. 991-1000. DOI: 10.1109/CVPR.2016.113.

Contact

Bugra Tekin	[email]
Mathieu Salzmann	[email]
Pascal Fua	[email]