
Multi-Channel Visual-Language Pretraining for Sign Language Translation

Sign language is the main communication medium for the deaf community, yet most of the hearing population knows little about sign language or how to communicate with deaf signers. To facilitate communication, sign language translation (SLT) aims to recognize sign languages and translate them into coherent spoken-language sentences. Sign language involves simultaneous multi-channel expressions such as gestures, facial expressions, and body movements. Current SLT methods often rely on "gloss," a specialized annotation that maps signs to natural language. However, creating gloss annotations is labor-intensive and requires expertise, which limits learning from extensive sign language video datasets. Recent approaches have shown promising results for gloss-free SLT using visual-language pretraining or multi-stream modeling of skeleton data for different body parts. In this work, we propose a novel SLT framework with explicit multi-stream sign language modeling and large-scale gloss-free pretraining to guide our model in learning the intricate interplay between different body parts across a wide range of conversational settings.

TEAM

Hangyu Zhou*, Calvin Qiu*, Nusrat N. Nizam*

ADVISOR

DURATION

2023.09 - Present

PROJECT TYPE

Research, Cornell University

TARGET CONFERENCE

ECCV '24

Proposed Method

Our proposed approach, shown in Figure 1, treats sign language videos as a corpus of parallel streams. The SLT model will be built by extracting co-occurring signals such as upper-body movements, lip patterns, and facial expressions from the skeleton data, combined with large-scale visual-language pretraining on the vast reservoir of sign language videos paired with natural language transcripts.

Feature Encoding. We will encode each visual stream with a dedicated encoder. Pose estimation extracts body keypoints to interpret gestures, while facial landmark detection captures head rotation and the dynamics of mouth and eye movements. Pre-trained models such as MMPose, MediaPipe, OpenHands, OpenPose, and Detectron are available to generate the skeleton and body keypoints. The generated streams are then encoded by the visual encoders and passed on for further processing.
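As a concrete illustration, the sketch below extracts per-frame keypoint streams (body, face, hands) from a video with MediaPipe Holistic. The toolkit choice and the grouping into exactly three streams are assumptions for illustration; any of the toolkits listed above could fill the same role.

```python
# Sketch: extracting per-frame keypoint streams with MediaPipe Holistic.
# The stream grouping (body / face / hands) is an illustrative assumption.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_streams(video_path):
    """Return one (T, K, 3) keypoint array per stream: body, face, hands."""
    streams = {"body": [], "face": [], "hands": []}
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

            def to_array(landmarks, n_points):
                if landmarks is None:  # keypoints may be missing in a frame
                    return np.zeros((n_points, 3), dtype=np.float32)
                return np.array([[p.x, p.y, p.z] for p in landmarks.landmark],
                                dtype=np.float32)

            streams["body"].append(to_array(results.pose_landmarks, 33))
            streams["face"].append(to_array(results.face_landmarks, 468))
            streams["hands"].append(np.concatenate([
                to_array(results.left_hand_landmarks, 21),
                to_array(results.right_hand_landmarks, 21)]))
    cap.release()
    return {name: np.stack(frames) for name, frames in streams.items()}
```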


Visual-Language Pre-training. We will train our model with multiple pre-training tasks to learn the intricate correlations between different input streams and the temporal correlations within each stream. The subsequent joint pre-training will leverage fusion layers to cohesively synthesize these streams, pairing multiple visual encoders with a text encoder and sharing weights across their Transformers, to bridge the semantic gap between visual modalities and natural language. After that, specialized final tasks will be designed to guide the model to draw connections between natural language and the intra-/inter-stream correlations; one example is emotion recognition from facial expressions and upper-body movements.
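A minimal sketch of this multi-stream design in PyTorch is shown below, assuming per-stream projections, a weight-shared Transformer encoder, and concatenation-based fusion; the layer sizes and the fusion strategy are illustrative assumptions rather than the final architecture.

```python
# Sketch: per-stream projection, a shared Transformer encoder (weight sharing
# across streams), and fusion layers that attend jointly over all streams.
import torch
import torch.nn as nn

class MultiStreamFusionEncoder(nn.Module):
    def __init__(self, stream_dims, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # per-stream projections of flattened keypoints into a common space
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in stream_dims.items()})
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # a single Transformer encoder shared by all streams
        self.shared_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        fuse_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # fusion layers mix information across the concatenated streams
        self.fusion = nn.TransformerEncoder(fuse_layer, 2)

    def forward(self, streams):
        # streams: dict of name -> (B, T, dim) tensors
        encoded = [self.shared_encoder(self.proj[name](x))
                   for name, x in streams.items()]
        return self.fusion(torch.cat(encoded, dim=1))  # (B, sum(T), d_model)
```

With flattened 3D keypoints from the feature-encoding step, this encoder could, for instance, be instantiated as MultiStreamFusionEncoder({"body": 99, "face": 1404, "hands": 126}).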


We are introducing a pre-training approach named MCVLP (Multi-Channel Visual-Language Pre-training). This method combines the masked self-supervised learning paradigm with a visual-language pretraining (VLP) model such as CLIP. Our design involves a pretext task that aligns visual and textual representations within a shared multimodal semantic space, guiding the Visual Encoder to learn visual representations indicated by language. Simultaneously, we will incorporate masked self-supervised learning into the pre-training to help the Text Decoder capture the syntactic and semantic aspects of sign language sentences. Subsequently, we will transfer the parameters of the pre-trained VLP model to the encoder-decoder translation model to enhance its translation capabilities.
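The sketch below illustrates the MCVLP objective as described: a CLIP-style contrastive loss aligning pooled video and sentence embeddings in a shared space, plus a masked modeling loss on the text side. The temperature value, the equal weighting of the two terms, and the function names are assumptions for illustration only.

```python
# Sketch of a joint contrastive + masked-modeling objective (hypothetical).
import torch
import torch.nn.functional as F

def mcvlp_loss(video_emb, text_emb, mlm_logits, mlm_labels, temperature=0.07):
    # video_emb, text_emb: (B, d) pooled representations from the two towers
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.T, targets))
    # masked positions carry the true token id, unmasked ones are -100 (ignored)
    mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                          ignore_index=-100)
    return contrastive + mlm
```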

 

Sign Language Translation Network (SLTN). We introduce SLTN, a network tailored for the downstream task, capable of producing sentences from given sign videos without relying on gloss annotations. In our approach, we employ the Transformer as the core framework. The sign video and extracted keypoints are fed through the Visual Encoder pre-trained in the earlier VLP stage. The Text Decoder, also pre-trained in the VLP stage, takes the corresponding sentence and the last encoder hidden state as input and generates the final translated text as output.
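A minimal sketch of this downstream model is given below, assuming the Visual Encoder is a module like the multi-stream encoder sketched earlier and the Text Decoder is an nn.TransformerDecoder-style module with batch-first tensors; the greedy decoding loop and the bos/eos token ids are illustrative assumptions.

```python
# Sketch: encoder-decoder translation reusing the pretrained VLP components.
import torch
import torch.nn as nn

class SLTN(nn.Module):
    def __init__(self, visual_encoder, text_decoder, vocab_size, d_model=256):
        super().__init__()
        self.visual_encoder = visual_encoder   # transferred from the VLP stage
        self.text_decoder = text_decoder       # transferred from the VLP stage
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, streams, prev_tokens):
        memory = self.visual_encoder(streams)          # (B, S, d_model)
        tgt = self.embed(prev_tokens)                  # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        out = self.text_decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                       # (B, T, vocab_size)

    @torch.no_grad()
    def greedy_translate(self, streams, bos_id=1, eos_id=2, max_len=40):
        device = next(self.parameters()).device
        tokens = torch.full((1, 1), bos_id, dtype=torch.long, device=device)
        for _ in range(max_len):
            next_id = self.forward(streams, tokens)[:, -1].argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_id], dim=1)
            if next_id.item() == eos_id:
                break
        return tokens
```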

Figure 1. Proposed weakly supervised multi-stream V-L pretraining approach for SLT. First, visual-language pretraining is performed; the parameters of the pretrained Visual Encoder and Text Decoder are then transferred to the second stage. In the second stage, keypoint features and the sign video are fed to the Visual Encoder, and the resulting visual embeddings are decoded by the autoregressive Text Decoder.
