A code-switching-based approach for low-resource language visual question answering
Graphical Abstract
Abstract
To address the challenges that vision-language models face in low-resource scenarios, such as the lack of large-scale annotated data and of effective transfer methods, a code-switching Chinese Minority pre-trained language model visual question answering (CCMPLM-VQA) method is proposed in this work. A cross-lingual masked modeling approach based on code-switching reduces the model's dependence on annotated training data, and a language adapter (LA) with a novel structure is introduced to improve the multimodal alignment of CCMPLM-VQA. Experiments verify the effectiveness of the proposed method: compared with the best benchmark model, CCMPLM-VQA improves zero-shot performance on a real-world general visual reasoning dataset by approximately 12%, and its zero-shot performance on cross-lingual real-world general visual reasoning datasets exceeds that of existing methods by about 1%.
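As a rough illustration of the code-switching cross-lingual masked modeling idea summarized above, the Python sketch below substitutes words from a bilingual lexicon into a sentence and then masks tokens for a masked-language-modeling objective. The lexicon entries, substitution ratio, and mask ratio are placeholder assumptions for exposition, not the actual data or settings used by CCMPLM-VQA.

```python
import random

MASK_TOKEN = "[MASK]"

def code_switch(tokens, lexicon, switch_prob=0.3):
    """Build a code-switched sentence by replacing some source-language
    tokens with their dictionary translations (illustrative ratio)."""
    switched = []
    for tok in tokens:
        if tok in lexicon and random.random() < switch_prob:
            switched.append(lexicon[tok])   # swap in the low-resource-language word
        else:
            switched.append(tok)
    return switched

def mask_for_mlm(tokens, mask_prob=0.15):
    """Randomly mask tokens; return the corrupted input and the MLM labels."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)              # model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)             # positions ignored by the MLM loss
    return inputs, labels

if __name__ == "__main__":
    # Toy bilingual lexicon with placeholder target-language entries.
    lexicon = {"狗": "<minority_word_dog>", "猫": "<minority_word_cat>"}
    sentence = ["图片", "里", "有", "一只", "狗"]
    switched = code_switch(sentence, lexicon)
    inputs, labels = mask_for_mlm(switched)
    print(inputs, labels)
```

Training on such code-switched, masked sentences lets a single model see source- and target-language words in shared contexts, which is what reduces the need for annotated data in the target language.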