A code-switching-based approach for low-resource language visual question answering

Abstract: To address the challenges that vision-language models face in low-resource scenarios, namely the lack of large-scale annotated data and of effective transfer methods, this work proposes a code-switching-based Chinese minority pre-trained language model visual question answering (CCMPLM-VQA) method. A cross-lingual masked modeling approach built on code-switching reduces the model's dependence on annotated training data, and a language adapter (LA) with a novel structure is introduced to improve the multimodal alignment of CCMPLM-VQA. Experiments verify the effectiveness of the proposed method: compared with the best baseline model, CCMPLM-VQA improves zero-shot performance on the real-world general visual reasoning dataset by approximately 12%, and its zero-shot performance on cross-lingual real-world general visual reasoning datasets exceeds that of existing comparable methods by about 1%.
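The cross-lingual masked modeling with code-switching described above can be illustrated with a minimal sketch: tokens in a source-language sentence are probabilistically replaced with translations from a bilingual lexicon, and the mixed sequence is then masked for a standard MLM objective. The lexicon entries, token lists, and probabilities below are hypothetical toy values, not the paper's actual data or hyperparameters.

```python
import random

# Hypothetical toy bilingual lexicon (source token -> translation);
# the dictionaries used in the actual CCMPLM-VQA work are not specified here.
LEXICON = {"苹果": "alma", "红色": "qizil", "桌子": "üstel"}

MASK_TOKEN = "[MASK]"

def code_switch(tokens, lexicon, switch_prob=0.3, rng=random):
    """Replace some in-lexicon tokens with their cross-lingual translations."""
    return [lexicon[t] if t in lexicon and rng.random() < switch_prob else t
            for t in tokens]

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Standard MLM masking: hide a fraction of tokens as prediction targets."""
    inputs, labels = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(t)      # the model must recover the hidden token
        else:
            inputs.append(t)
            labels.append(None)   # position ignored in the loss
    return inputs, labels

rng = random.Random(1)  # seeded for reproducibility
sentence = ["桌子", "上", "有", "一个", "红色", "苹果"]
switched = code_switch(sentence, LEXICON, rng=rng)
inputs, labels = mask_tokens(switched, rng=rng)
```

Because masked positions in the code-switched sequence may correspond to tokens of either language, recovering them forces the encoder to align representations across languages, which is what reduces the need for annotated target-language training data.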

     
