ThermoHands

A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images

ACM SenSys 2025

* Denotes equal contribution


Multi-spectral hand pose dataset. Data capture setup with the customized head-mounted sensor platform (HMSP) and exocentric platform recording multi-view multi-spectral images of two-hand actions performed by participants.

Abstract

Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.

Data Capture

This video shows an example of multi-spectral image data captured by our egocentric and exocentric platforms.

3D Hand Pose Annotation

This video shows an example of 3D hand pose annotations. We show the left (blue) and right (red) hand 3D joints projected onto RGB images. From the same viewpoint, we also visualize the corresponding hand mesh annotation.

These figures are selected from different subjects and cropped to highlight the hands in the images. As can be seen, our dataset provides accurate, high-fidelity hand pose annotations for a variety of actions.

Qualitative Results (main)

Qualitative results for different spectra under the well-illuminated office (main) setting. 3D hand joints are projected onto 2D images for visualization. Ground truth hand poses are shown in green and predictions in blue.
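The projection of 3D joints onto a 2D image used for these visualizations can be illustrated with a standard pinhole camera model. The following is a minimal sketch, not the authors' visualization code: the function name `project_joints` and the intrinsic values (`fx`, `fy`, `cx`, `cy`) are hypothetical placeholders rather than the dataset's calibration, and lens distortion is ignored.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_joints(
    joints_cam: List[Point3D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point2D]:
    """Project 3D joints given in camera coordinates (z pointing forward)
    to pixel coordinates with an ideal pinhole model (no distortion)."""
    pixels = []
    for x, y, z in joints_cam:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera (z > 0)")
        u = fx * x / z + cx  # horizontal pixel coordinate
        v = fy * y / z + cy  # vertical pixel coordinate
        pixels.append((u, v))
    return pixels


# Hypothetical intrinsics and two example joints (metres, camera frame).
example_joints = [(0.0, 0.0, 0.5), (0.25, -0.125, 0.5)]
example_pixels = project_joints(example_joints, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In practice, joints annotated in another sensor's frame would first need to be transformed into the target camera's frame using the extrinsic calibration before this step.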

Hand-Object Interaction

Hand-Virtual Interaction

These figures are selected from different subjects performing various actions. As can be seen, the predictions from each spectrum are close to the ground truth annotations, which validates that our dataset can support 3D hand pose estimation research across different spectra.

Qualitative Results (auxiliary)

Qualitative results for thermal vs. RGB (or NIR) under our four auxiliary settings: the glove, darkness, sun glare and kitchen scenarios. We show the left (blue) and right (red) hand 3D joints projected onto 2D images.

Glove

Hand-Object Interaction

Hand-Virtual Interaction

Because handwear such as gloves greatly alters the hand's appearance in RGB images, including its colour and texture, the RGB image-based solution fails to accurately estimate the 3D hand pose in our examples. In contrast, thermal cameras can correctly detect hands even under handwear such as gloves by identifying heat transmission patterns.

Darkness

Hand-Object Interaction

Hand-Virtual Interaction

In the RGB images captured in darkness, the hand contour is hardly recognizable even to the human eye. As a result, the estimated hand joints either exhibit irregular articulation or deviate significantly from their actual locations. Because thermal cameras do not rely on visible light, they are unaffected by changes in lighting conditions and successfully estimate the 3D hand pose in darkness.

Sun Glare

NIR images are prone to interference from sunlight, which contains an NIR component. In comparison, thermal images are less affected and thus yield better performance under strong sun glare.

Kitchen

The thermal camera performs better than the RGB camera when generalizing to an unseen environment. This can be credited to a unique attribute of thermal cameras: they accentuate the hand's structure via temperature differentials, which reduces the effect of background variability. We believe that thermal image-based solutions could also generalize well to other environments, such as dining halls and bathrooms.