Journal of Supercomputing, vol.82, no.6, 2026 (SCI-Expanded, Scopus)
This paper presents ViTSLA, a novel composite architecture for high-precision sign language gesture recognition. The model integrates convolutional inductive bias with transformer-based global attention to improve predictive accuracy and computational efficiency. Extensive experiments are conducted on three datasets using five-fold cross-validation, ablation analysis, and comparative evaluation against standard vision transformer (ViT) variants. The results show classification accuracy consistently exceeding 99.5% with minimal variance across folds, indicating robustness and balanced class-wise performance. Beyond predictive performance, this work addresses the computational demands of training and deploying hybrid deep architectures. Transformer-based attention mechanisms and high-resolution tokenization increase memory consumption and computational complexity, so efficient training requires parallel processing, GPU acceleration, and high-performance computing (HPC) environments to handle large-scale tensor operations and multi-fold cross-validation experiments. Distributed computation further enables scalable training and stable convergence across datasets. Inference runs at over 300 FPS, which is essential for deployment in latency-sensitive applications such as interactive systems and intelligent human–computer interfaces, and the low mean latency, stable P95 behavior, and high throughput confirm this real-time deployment capability. These results show that the proposed architecture achieves high predictive performance while maintaining computational characteristics aligned with supercomputing-driven intelligent vision systems. The developed model achieves a favorable balance between accuracy, parameter efficiency, and computational cost.
The results demonstrate that hybrid convolutional-attention architectures benefit from HPC-enabled parallelism, leading to scalable optimization and deployment-ready performance. These findings establish the relevance of the proposed approach within the domain of supercomputing-driven intelligent vision systems.
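The deployment metrics named in the abstract (mean latency, P95 latency, and throughput in FPS) can be derived from per-inference timings. The following is a minimal sketch, not the authors' benchmarking code: the function names `summarize_latencies` and `benchmark`, the warm-up and run counts, the nearest-rank percentile definition, and the assumption of sequential single-sample inference (so that FPS is 1000 divided by the mean latency in milliseconds) are all illustrative choices.

```python
import math
import statistics
import time


def summarize_latencies(latencies_ms):
    """Return mean latency, nearest-rank P95 latency, and throughput in FPS.

    Assumes one sample per inference call, processed sequentially, so
    throughput is the reciprocal of the mean latency (hypothetical setup).
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # the samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    mean_ms = statistics.fmean(ordered)
    return {
        "mean_ms": mean_ms,
        "p95_ms": ordered[rank - 1],
        "fps": 1000.0 / mean_ms,
    }


def benchmark(infer, sample, warmup=10, runs=100):
    """Time `infer(sample)` over `runs` calls after `warmup` untimed calls.

    `infer` is any callable standing in for the model's forward pass.
    """
    for _ in range(warmup):
        infer(sample)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize_latencies(latencies_ms)
```

When the callable launches asynchronous GPU work, the device would need to be synchronized before each clock read, otherwise the timings measure only the kernel launch and not the inference itself.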