Search | arXiv e-print repository

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts

Authors: Caroline Brun, Vassilina Nikoulina

Abstract: Large language models (LLMs) are increasingly popular but are also prone to generating bias, toxic or harmful language, which can have detrimental effects on individuals and communities. Although most efforts is put to assess and mitigate toxicity in generated content, it is primarily concentrated on English, while it's essential to consider other languages as well. For addressing this issue, we c… ▽ More Large language models (LLMs) are increasingly popular but are also prone to generating bias, toxic or harmful language, which can have detrimental effects on individuals and communities. Although most efforts is put to assess and mitigate toxicity in generated content, it is primarily concentrated on English, while it's essential to consider other languages as well. For addressing this issue, we create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts and their continuations, annotated with toxicity scores from a widely used toxicity classifier. We evaluate 14 different models from four prevalent open-sourced families of LLMs against our dataset to assess their potential toxicity across various dimensions. We hope that our contribution will foster future research on toxicity detection and mitigation beyond Englis △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: TRAC-2024, Fourth Workshop on Threat, Aggression and Cyberbullying. 20 May 2024

arXiv:2406.14747 [pdf, other]

doi 10.1109/ICASSP48485.2024.10448240

An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks

Authors: Varsha Suresh, Salah Aït-Mokhtar, Caroline Brun, Ioan Calapodescu

Abstract: Self-supervised learning models have revolutionized the field of speech processing. However, the process of fine-tuning these models on downstream tasks requires substantial computational resources, particularly when dealing with multiple speech-processing tasks. In this paper, we explore the potential of adapter-based fine-tuning in developing a unified model capable of effectively handling multi… ▽ More Self-supervised learning models have revolutionized the field of speech processing. However, the process of fine-tuning these models on downstream tasks requires substantial computational resources, particularly when dealing with multiple speech-processing tasks. In this paper, we explore the potential of adapter-based fine-tuning in developing a unified model capable of effectively handling multiple spoken language processing tasks. The tasks we investigate are Automatic Speech Recognition, Phoneme Recognition, Intent Classification, Slot Filling, and Spoken Emotion Recognition. We validate our approach through a series of experiments on the SUPERB benchmark, and our results indicate that adapter-based fine-tuning enables a single encoder-decoder model to perform multiple speech processing tasks with an average improvement of 18.4% across the five target tasks while staying efficient in terms of parameter updates. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: ICASSP 2024

arXiv:2311.01070 [pdf, other]

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Authors: Thomas Palmeira Ferraz, Marcely Zanon Boito, Caroline Brun, Vassilina Nikoulina

Abstract: Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performa… ▽ More Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference. △ Less

Submitted 12 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted to IEEE ICASSP 2024

arXiv:2210.11621 [pdf, other]

SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

Authors: Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, Laurent Besacier

Abstract: In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introd… ▽ More In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19 and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference. Code and pre-trained models: https://github.com/alirezamshi/small100 △ Less

Submitted 20 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022

Journal ref: https://aclanthology.org/2022.emnlp-main.571

arXiv:2205.10828 [pdf, other]

What Do Compressed Multilingual Machine Translation Models Forget?

Authors: Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, Laurent Besacier

Abstract: Recently, very large pre-trained models achieve state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques allow to drastically reduce the size of the models and therefore their inference time with negligible impact on top-tier metrics. However, the general performa… ▽ More Recently, very large pre-trained models achieve state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques allow to drastically reduce the size of the models and therefore their inference time with negligible impact on top-tier metrics. However, the general performance averaged across multiple tasks and/or languages may hide a drastic performance drop on under-represented features, which could result in the amplification of biases encoded by the models. In this work, we assess the impact of compression methods on Multilingual Neural Machine Translation models (MNMT) for various language groups, gender, and semantic biases by extensive analysis of compressed models on different machine translation benchmarks, i.e. FLORES-101, MT-Gender, and DiBiMT. We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. Interestingly, the removal of noisy memorization with compression leads to a significant improvement for some medium-resource languages. Finally, we demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages. Code: https://github.com/alirezamshi/bias-compressedMT △ Less

Submitted 27 June, 2023; v1 submitted 22 May, 2022; originally announced May 2022.

Comments: Accepted to Findings of EMNLP 2022, presented at WMT 2022

Journal ref: https://aclanthology.org/2022.findings-emnlp.317/

arXiv:2112.02721 [pdf, other]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo , et al. (101 additional authors not shown)

Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data split… ▽ More Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter). △ Less

Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

arXiv:1610.01910 [pdf, ps, other]

Toward Automatic Understanding of the Function of Affective Language in Support Groups

Authors: Amit Navindgi, Caroline Brun, Cécile Boulard Masson, Scott Nowson

Abstract: Understanding expressions of emotions in support forums has considerable value and NLP methods are key to automating this. Many approaches understandably use subjective categories which are more fine-grained than a straightforward polarity-based spectrum. However, the definition of such categories is non-trivial and, in fact, we argue for a need to incorporate communicative elements even beyond su… ▽ More Understanding expressions of emotions in support forums has considerable value and NLP methods are key to automating this. Many approaches understandably use subjective categories which are more fine-grained than a straightforward polarity-based spectrum. However, the definition of such categories is non-trivial and, in fact, we argue for a need to incorporate communicative elements even beyond subjectivity. To support our position, we report experiments on a sentiment-labelled corpus of posts taken from a medical support forum. We argue that not only is a more fine-grained approach to text analysis important, but simultaneously recognising the social function behind affective expressions enable a more accurate and valuable level of understanding. △ Less

Submitted 6 October, 2016; originally announced October 2016.

Comments: 9 pages, 1 figure, conference workshop

arXiv:1604.05377 [pdf]

Churn analysis using deep convolutional neural networks and autoencoders

Authors: Artit Wangperawong, Cyrille Brun, Olav Laudy, Rujikorn Pavasuthipaisit

Abstract: Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUえーゆーC of 0.743 on the test dataset using no more than 12 temporal features for each custom… ▽ More Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUえーゆーC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. △ Less

Submitted 18 April, 2016; originally announced April 2016.

arXiv:cs/0506049 [pdf, ps, other]

Exploitation de dictionnaires électroniques pour la désambiguïsation sémantique lexicale

Authors: Caroline Brun, Bernard Jacquemin, Frédérique Segond

Abstract: This paper presents a lexical disambiguation system, initially developed for English and now adapted to French. This system associates a word with its meaning in a given context using electronic dictionaries as semantically annotated corpora in order to extract semantic disambiguation rules. We describe the rule extraction and application process as well as the evaluation of the system. The resu… ▽ More This paper presents a lexical disambiguation system, initially developed for English and now adapted to French. This system associates a word with its meaning in a given context using electronic dictionaries as semantically annotated corpora in order to extract semantic disambiguation rules. We describe the rule extraction and application process as well as the evaluation of the system. The results for French give us insight information on some possible improvments of the nature and content of lexical resources adapted for disambiguation in this framework. △ Less

Submitted 12 June, 2005; originally announced June 2005.

Comments: 25 pp

ACM Class: H.3; H.4; H.5

Journal ref: Traitement Automatique des Langues (TAL) 42, no. 3 (2001) pp. 667-690

arXiv:cs/0506048 [pdf, ps, other]

Enriching a Text by Semantic Disambiguation for Information Extraction

Authors: Bernard Jacquemin, Caroline Brun, Claude Roux

Abstract: External linguistic resources have been used for a very long time in information extraction. These methods enrich a document with data that are semantically equivalent, in order to improve recall. For instance, some of these methods use synonym dictionaries. These dictionaries enrich a sentence with words that have a similar meaning. However, these methods present some serious drawbacks, since w… ▽ More External linguistic resources have been used for a very long time in information extraction. These methods enrich a document with data that are semantically equivalent, in order to improve recall. For instance, some of these methods use synonym dictionaries. These dictionaries enrich a sentence with words that have a similar meaning. However, these methods present some serious drawbacks, since words are usually synonyms only in restricted contexts. The method we propose here consists of using word sense disambiguation rules (WSD) to restrict the selection of synonyms to only these that match a specific syntactico-semantic context. We show how WSD rules are built and how information extraction techniques can benefit from the application of these rules. △ Less

Submitted 12 June, 2005; originally announced June 2005.

Comments: 7 pp

ACM Class: H.3; H.4; H.5

Journal ref: LREC 2002 Workshop Proceedings "Using semantics for informaiton retrival and filtering" (2002) 45-51

Showing 1–10 of 10 results for author: Brun, C