Code-switched inspired losses for spoken dialog representations

Pierre Colombo; Emile Chapuis; Matthieu Labeau; Chloe Clavel

Publication

EMNLP 2021

Conference paper

Abstract

Spoken dialogue systems need to handle both multiple languages and multilinguality within a conversation (e.g., in the case of code-switching). In this work, we introduce new pretraining losses tailored to learning generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. To scale up training, we automatically build a pretraining corpus of multilingual conversations in five languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora in the same five languages, as well as on two novel multilingual tasks (i.e., multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve better performance in both monolingual and multilingual settings.

Date

30 Aug 2021

Authors

IBM-affiliated at time of publication

Topics

Natural Language Processing
