HandMap: Robust hand pose estimation via intermediate dense guidance map supervision

In ECCV 2018. Other authors on the paper: Yongliang (Mac) Yang, Daniel (Dan) Finnegan, Eamonn O’Neill.

[Code] [Resources] [Paper] [Poster] [BibTex]

Problem statement and motivation

The goal is to accurately estimate hand pose, i.e. the 3D location of each joint, given a single depth image.

Why is the topic important?

Robust hand pose estimation is essential for emerging applications in human-computer interaction, such as virtual and mixed reality, computer games, and freehand user interfaces. The hand is also the most flexible and expressive part of the human body, so the study of hand poses provides a rich source of input for human behavior recognition tasks, which in turn benefits computer vision research in general.

Why is the problem difficult?

The difficulties come from several sources:

• Human skin is relatively uniform in color and surface properties, so it provides only weak feature descriptors.
• Strong ambiguity due to self-similarity between fingers.
• The area occupied by the hands in a full-body image is often very small, which means a low signal-to-noise ratio.
• Severe self-occlusion, especially in interactive applications.

Framework

The main idea is to supervise the features extracted by the residual module with guidance maps, and to combine this supervision with the residual link so that learning is strengthened across the entire system.

Pipeline

The pipeline of our algorithm starts from a single depth image. Our baseline method (shown as a solid line) stacks R repetitions of a residual module in a lower-dimensional feature space, then directly regresses the 3D coordinates of each joint, as in a conventional CNN-based framework. In comparison, our proposed method (shown as a dashed line) densely samples geometrically meaningful constraints from the input image, which provide coherent guidance to the feature representation of the residual module.
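As a rough illustration of intermediate supervision, the sketch below builds one dense map per joint and adds a map term to the usual pose regression loss. It is a minimal sketch under stated assumptions, not the paper's implementation: the Gaussian map form, the mean squared error losses, and the names `gaussian_guidance_maps`, `supervised_loss`, `sigma`, and `map_weight` are all hypothetical. The guidance maps actually used are described in the paper.

```python
import numpy as np


def gaussian_guidance_maps(joints_uv, height, width, sigma=2.0):
    """Return one dense map per joint, shape (J, height, width).

    Each map equals 1.0 at the joint's pixel and decays with distance.
    ASSUMPTION: the Gaussian form is an illustrative stand-in only.
    """
    ys, xs = np.mgrid[0:height, 0:width]      # pixel row / column grids
    u = joints_uv[:, 0][:, None, None]        # joint columns, shape (J, 1, 1)
    v = joints_uv[:, 1][:, None, None]        # joint rows, shape (J, 1, 1)
    sq_dist = (xs[None] - u) ** 2 + (ys[None] - v) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def supervised_loss(pred_joints, true_joints, pred_maps, true_maps, map_weight=1.0):
    """Pose regression loss plus an intermediate guidance-map loss.

    ASSUMPTION: both terms are mean squared errors and are summed with a
    single hypothetical weight; the paper may weight or define them differently.
    """
    pose_loss = np.mean((pred_joints - true_joints) ** 2)
    map_loss = np.mean((pred_maps - true_maps) ** 2)
    return pose_loss + map_weight * map_loss


# Two hypothetical joints given as (column, row) pixel coordinates.
joints_uv = np.array([[3.0, 5.0], [10.0, 2.0]])
target_maps = gaussian_guidance_maps(joints_uv, height=16, width=16)
```

In a trained network, `pred_maps` would come from an intermediate layer after the residual modules and `pred_joints` from the final regression layer, so the map term shapes the features that the regressor consumes.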

“Standing on the shoulders of giants”

Our algorithm alone might not fully convince you. But please note that the core Dense GMS Module in our algorithm is “hot-pluggable”: we can easily plug it into other state-of-the-art (SOTA) methods, and achieve better performance due to added robustness. Please check the paper for details about evaluation metrics and performance enhancements.

In short: the best of our standalone algorithms is roughly comparable to the SOTA, but we achieved much better performance after combining our algorithms with the SOTA.

In the poster that I prepared for the ECCV 2018 conference, you can see that:

Future work will explore temporal hand tracking using our framework …

Well, actually this has already been realized:

Please visit the hand tracking project if you are interested.

Appendix: Abstract

This work presents a novel hand pose estimation framework via intermediate dense guidance map supervision. By leveraging the advantage of predicting heat maps of hand joints in detection-based methods, we propose to use dense feature maps through intermediate supervision in a regression-based framework that is not limited to the resolution of the heat map. Our dense feature maps are delicately designed to encode the hand geometry and the spatial relation between local joint and global hand. The proposed framework significantly improves the state-of-the-art in both 2D and 3D on the recent benchmark datasets.
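To make the remark about heat map resolution concrete, the toy example below renders a joint into a heat map and reads it back by taking the argmax, which is one common readout in detection-based methods. It is only a sketch: the image size, map sizes, joint position, and the function name `heatmap_readout` are hypothetical and are not taken from the paper.

```python
import numpy as np


def heatmap_readout(joint_xy, image_size, map_size, sigma=1.0):
    """Render a joint into a map_size x map_size heat map, then read it back.

    The returned estimate is the centre of the strongest cell, expressed in
    image coordinates, so it can only take values on the heat map grid.
    """
    stride = image_size / map_size                 # image pixels per heat map cell
    ys, xs = np.mgrid[0:map_size, 0:map_size]
    cx = (xs + 0.5) * stride                       # cell centres, image coordinates
    cy = (ys + 0.5) * stride
    sq_dist = (cx - joint_xy[0]) ** 2 + (cy - joint_xy[1]) ** 2
    heat = np.exp(-sq_dist / (2.0 * (sigma * stride) ** 2))
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    return np.array([cx[row, col], cy[row, col]])


joint = np.array([37.4, 52.8])                     # hypothetical joint position (x, y)
coarse = heatmap_readout(joint, image_size=128, map_size=32)
fine = heatmap_readout(joint, image_size=128, map_size=128)

coarse_error = np.linalg.norm(coarse - joint)      # limited by the 4-pixel cells
fine_error = np.linalg.norm(fine - joint)          # limited by the 1-pixel cells
```

With these numbers the coarse readout snaps to (38, 54) and the fine readout to (37.5, 52.5), so the error shrinks only as the map grows. A regression output is a continuous value and has no such grid.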

Please take a look at xinghaochen/awesome-hand-pose-estimation: it’s simply the best survey repo on the topic of Hand Pose Estimation.
