Comparison between our Gestura and current state-of-the-art LVLMs on the test sets of our GestureInt dataset. Metrics include BLEU-1∼4 and ACC (where ACC reports both closed-set / open-set results).
Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution, GestureGPT, suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different gesture styles, we introduce a Landmark Processing Module that compensates for LVLMs’ lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model’s ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding, comprising over 300,000 annotated QA pairs. Experimental results show that Gestura achieves an accuracy of 84.73% (closed-set) / 64.14% (open-set) in the exocentric (third-person) setting and 66.14% (closed-set) / 21.71% (open-set) in the egocentric (first-person) setting, approximately 20% and 40% higher than GestureGPT on closed-set and open-set tasks, respectively. Moreover, Gestura achieves over a 100× speedup in response time (1.6 seconds vs. 227 seconds) with an 8B-sized model deployed on a single NVIDIA A100 40GB GPU, and has been validated through real-device experiments with an edge–cloud collaborative setup, bringing free-form gesture understanding markedly closer to practical, real-world deployment.
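To make the role of the Landmark Processing Module concrete, the sketch below shows one minimal way per-frame hand-landmark features could be projected into an LVLM's token embedding space. It is an illustrative assumption rather than the authors' implementation: the class name `LandmarkProjector`, the 21-keypoint (x, y, z) input format, and all layer sizes are hypothetical choices.

```python
import torch
import torch.nn as nn


class LandmarkProjector(nn.Module):
    """Hypothetical sketch: map 21 (x, y, z) hand keypoints per frame into
    the LVLM token embedding space so they can be prepended to the visual
    and text tokens. Dimensions and layer sizes are assumptions."""

    def __init__(self, num_keypoints: int = 21, llm_dim: int = 4096, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_keypoints * 3, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, frames, 21, 3) -> tokens: (batch, frames, llm_dim)
        b, t, k, c = landmarks.shape
        return self.mlp(landmarks.reshape(b, t, k * c))
```

In this sketch, each frame of landmarks becomes one extra token, which is one plausible way anatomical priors could be delivered to the LVLM backbone alongside the visual input.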
Overview of the proposed Gestura framework. Gestura introduces a hierarchical framework with two-phase training for free-form gesture understanding. In Stage 1, pre-training activates the model’s potential for free-form generalization through a multi-view semantic enhancement strategy. In Stage 2, Gestura leverages the well-trained Landmark Processing Module to deliver anatomical and spatial contextual signals to the LVLM backbone, while internalizing advanced reasoning capabilities through Chain-of-Thought (CoT) tuning. This dual mechanism expands the model’s reasoning boundaries, enabling superior generalization in free-form gesture understanding.
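As an illustration of how Stage 2 CoT tuning data might be assembled, the sketch below builds one (prompt, target) pair that steps through hand shape, candidate intents, and a final intent decision. The prompt wording, the `build_cot_example` helper, and the annotation fields are assumptions for illustration and do not reproduce Gestura's actual templates.

```python
# Hypothetical Chain-of-Thought supervision format for Stage 2 tuning.
# The prompt wording and reasoning steps are illustrative assumptions.
COT_PROMPT = (
    "You are given a video clip of a free-form hand gesture together with "
    "its hand landmarks.\n"
    "Step 1: Describe the hand shape and motion trajectory.\n"
    "Step 2: List plausible high-level intents consistent with that motion.\n"
    "Step 3: Choose the single most likely intent and state it on the last line."
)


def build_cot_example(shape_desc: str, candidate_intents: list[str], intent: str) -> dict:
    """Assemble one (prompt, target) pair for CoT supervised fine-tuning."""
    return {
        "prompt": COT_PROMPT,
        "target": (
            f"Step 1: {shape_desc}\n"
            f"Step 2: Plausible intents: {', '.join(candidate_intents)}.\n"
            f"Step 3: The user most likely intends to {intent}."
        ),
    }


# Example usage with made-up annotation fields:
example = build_cot_example(
    shape_desc="An open palm sweeps from right to left at chest height.",
    candidate_intents=["dismiss a notification", "turn a page", "wave goodbye"],
    intent="turn a page",
)
```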
Top-1 / Top-3 / Top-5 accuracy per intent category in an open-world experiment.