A code-switching-based approach for low-resource language visual question answering

    Abstract: To address the lack of large-scale annotated data and of effective transfer methods that vision-language models face in low-resource scenarios, a Code-switching Chinese Minority Pre-trained Language Model-Visual Question Answering (CCMPLM-VQA) method is proposed. A cross-lingual masked modeling approach based on code-switching reduces the model's dependence on annotated training data, and a language adapter (LA) with a novel structure is introduced to improve the multimodal alignment of CCMPLM-VQA. Experiments verify the effectiveness of the proposed method: compared with the best baseline model, CCMPLM-VQA improves zero-shot performance on a real-world general visual reasoning dataset by approximately 12%, and its zero-shot performance on cross-lingual real-world general visual reasoning datasets also exceeds that of existing comparable methods by about 1%.
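The code-switching cross-lingual masked modeling described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the bilingual lexicon entries, the substitution and masking probabilities, and the function names are all illustrative assumptions. The idea is that tokens in a high-resource sentence are first swapped for their low-resource-language counterparts, and the mixed sequence is then masked for standard masked-language-model training, so cross-lingual signal is learned without parallel annotated data.

```python
import random

# Hypothetical bilingual lexicon (high-resource word -> low-resource translation);
# the paper's actual languages and lexicon are not specified here.
LEXICON = {"red": "dmar", "dog": "khyi"}

MASK_TOKEN = "[MASK]"

def code_switch(tokens, lexicon, switch_prob=0.3, rng=random):
    """Replace a fraction of tokens with their cross-lingual counterparts."""
    return [lexicon[t] if t in lexicon and rng.random() < switch_prob else t
            for t in tokens]

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Standard MLM masking: hide tokens and keep labels for prediction."""
    inputs, labels = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(t)      # the model must recover the original token
        else:
            inputs.append(t)
            labels.append(None)   # position ignored in the loss
    return inputs, labels

# Probabilities forced to 1.0 here only to make the example deterministic.
sentence = ["the", "red", "dog", "runs"]
switched = code_switch(sentence, LEXICON, switch_prob=1.0)
inputs, labels = mask_tokens(switched, mask_prob=1.0)
```

In a real setup, `switched` would feed a subword tokenizer and the masked positions would drive the cross-entropy loss, which is how the dependence on task-specific annotated data is reduced.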

     
