Andrea Piergentili

13 Followers
29 Following
39 Posts
PhD student at University of Trento (Italy) and Fondazione Bruno Kessler @fbk_mt.
Twitter: https://twitter.com/apierg
FBK: https://mt.fbk.eu/author/apiergentili/
Using LLMs as evaluators looks like a very interesting and promising direction, enabling simpler automatic post-editing pipelines. For those interested in fine-grained MT evaluation and APE, I recommend checking out this paper by Lu et al. (2024): https://arxiv.org/abs/2409.14335
#MT #postediting #NLP #AI #evaluation #LLM
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown state-of-the-art performance on reference-free evaluation, the predicted errors do not align well with those annotated by humans, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) an evaluator to provide error annotations, 2) a post-editor to determine whether errors impact quality improvement, and 3) a pairwise quality verifier as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirms the effectiveness of each module and offers valuable insights into evaluator design and LLM selection.

arXiv.org
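The three-role loop the abstract describes (evaluator → post-editor → pairwise verifier) can be sketched in a few lines of Python. The callable interfaces and the toy stand-ins below are hypothetical illustrations of the control flow, not the authors' code:

```python
def mqm_ape(annotate, post_edit, prefer_b, source, translation):
    """MQM-APE filter: keep only those errors whose single-error post-edit
    the pairwise verifier judges better than the original translation.
    annotate / post_edit / prefer_b are LLM-backed callables (in the paper,
    one LLM prompted in three different roles)."""
    kept = []
    for error in annotate(source, translation):
        edited = post_edit(source, translation, error)
        if prefer_b(source, translation, edited):  # the edit improved quality
            kept.append(error)  # impactful error: keep it
    return kept

# Toy stand-ins to show the control flow (not real LLM calls):
annotate = lambda src, tgt: ["major/mistranslation: 'bank'", "minor/punctuation"]
post_edit = lambda src, tgt, err: f"{tgt} [edited for: {err}]"
prefer_b = lambda src, a, b: "mistranslation" in b  # only the first edit helps
print(mqm_ape(annotate, post_edit, prefer_b, "src text", "mt output"))
# -> ["major/mistranslation: 'bank'"]
```

With real LLM calls plugged into the three callables, the surviving error list is what gets reported as the final annotation.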

Interesting take on gender bias in LLMs and retrievers: gender information is highly extractable, but gender bias in RAG tasks is likely not due to the retrievers.

MultiContrievers: Analysis of Dense Retrieval Representations (Goldfarb-Tarrant et al., 2024) http://arxiv.org/abs/2402.15925

MultiContrievers: Analysis of Dense Retrieval Representations

Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBert checkpoints as randomized initialisations to train MultiContrievers, a set of 25 contriever models. We test whether specific pieces of information -- such as gender and occupation -- can be extracted from contriever vectors of Wikipedia-like documents. We measure this extractability via information-theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the contriever representations; (3) there is high sensitivity to both random initialisation and to data shuffle, suggesting that future retrieval research should test across a wider spread of both.

arXiv.org
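For readers unfamiliar with extractability probing, the core idea can be illustrated with a toy nearest-centroid probe on synthetic vectors: a crude stand-in for the paper's information-theoretic probes, with all data and names below invented for illustration:

```python
import random

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def probe_accuracy(train, test):
    """Nearest-centroid probe: how recoverable is the label from the vectors?
    train / test map label -> list of vectors; returns test accuracy."""
    cents = {label: centroid(vs) for label, vs in train.items()}
    def predict(v):
        return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(v, cents[l])))
    pairs = [(label, v) for label, vs in test.items() for v in vs]
    return sum(predict(v) == label for label, v in pairs) / len(pairs)

# Synthetic "document vectors" where a gender attribute shifts one dimension:
random.seed(0)
def vec(shift):
    return [random.gauss(0, 1) + (shift if i == 0 else 0) for i in range(8)]

train = {"F": [vec(+2) for _ in range(50)], "M": [vec(-2) for _ in range(50)]}
test = {"F": [vec(+2) for _ in range(20)], "M": [vec(-2) for _ in range(20)]}
print(probe_accuracy(train, test))  # high accuracy = highly extractable attribute
```

High probe accuracy means the attribute is easily extractable from the representations; the paper's point is that high extractability of gender did not translate into gender-biased retrieval behaviour.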

The paper is finally available on arXiv! 🎊

➡️ https://arxiv.org/abs/2405.08477

Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models

Machine translation (MT) models are known to suffer from gender bias, especially when translating into languages with extensive gendered morphology. Accordingly, they still fall short in using gender-inclusive language, also representative of non-binary identities. In this paper, we look at gender-inclusive neomorphemes, neologistic elements that avoid binary gender markings as an approach towards fairer MT. In this direction, we explore prompting techniques with large language models (LLMs) to translate from English into Italian using neomorphemes. So far, this area has been under-explored due to its novelty and the lack of publicly available evaluation resources. We fill this gap by releasing Neo-GATE, a resource designed to evaluate gender-inclusive en-it translation with neomorphemes. With Neo-GATE, we assess four LLMs of different families and sizes and different prompt formats, identifying strengths and weaknesses of each on this novel task for MT.

arXiv.org

So happy to announce that our paper 'Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models' has been accepted at EAMT 2024! 🎉

Amazing co-authors: Beatrice Savoldi, Matteo Negri, Luisa Bentivogli

See you in Sheffield! 🧐
#MT #AI #LLMs #EAMT

Our pick of the week by @apierg: "Robust Pronoun Use Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?" by @dippedrusk, @lauscher, et al., 2024.

#pickoftheweek #LLM #bias #reasoning #NLP #NLProc

Engaging study by @dippedrusk et al. on pronoun use on a wide range of LLMs: http://arxiv.org/abs/2404.03134

Interestingly, encoder-decoder models are more accurate and more robust to distractions.
Hot take: maybe we shouldn't be using decoder-only LLMs all the time and for everything 🤔 #NLP #AI #aiethics

Robust Pronoun Use Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?

Robust, faithful and harm-free pronoun use for individuals is an important goal for language models as their use increases, but prior work tends to study only one or two of these components at a time. To measure progress towards the combined goal, we introduce the task of pronoun use fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later, independent of potential distractors. We present a carefully-designed dataset of over 5 million instances to evaluate pronoun use fidelity in English, and we use it to evaluate 37 popular large language models across architectures (encoder-only, decoder-only and encoder-decoder) and scales (11M-70B parameters). We find that while models can mostly faithfully reuse previously-specified pronouns in the presence of no distractors, they are significantly worse at processing she/her/her, singular they and neopronouns. Additionally, models are not robustly faithful to pronouns, as they are easily distracted. With even one additional sentence containing a distractor pronoun, accuracy drops on average by 34%. With 5 distractor sentences, accuracy drops by 52% for decoder-only models and 13% for encoder-only models. We show that widely-used large language models are still brittle, with large gaps in reasoning and in processing different pronouns in a setting that is very simple for humans, and we encourage researchers in bias and reasoning to bridge them.

arXiv.org
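The task setup lends itself to a tiny illustration: fix an entity's pronoun in a context sentence, optionally inject distractor sentences, and check whether a model fills the final blank with the original pronoun. Templates and names below are invented, not the paper's dataset:

```python
def build_instance(entity, pronoun, distractors=()):
    """One pronoun-fidelity instance: the context fixes `pronoun` for `entity`,
    distractor sentences introduce other entities and pronouns, and the final
    blank must be filled with the original pronoun."""
    context = f"{entity} said that {pronoun} would arrive late."
    extras = [f"{name} said that {p} would arrive early." for name, p in distractors]
    question = "Then ___ ordered a coffee."
    return " ".join([context, *extras, question]), pronoun

def fidelity_accuracy(model, instances):
    """Share of instances where the model fills the blank with the gold pronoun."""
    return sum(model(text) == gold for text, gold in instances) / len(instances)

text, gold = build_instance("The accountant", "they",
                            distractors=[("The doctor", "she")])
print(text)  # context + 1 distractor + blank
print(gold)  # -> "they"
```

Comparing `fidelity_accuracy` at zero distractors versus several is exactly the kind of contrast behind the reported 34-52% accuracy drops.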

Our pick of the week by @apierg: "Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation" by Xu et al.

https://arxiv.org/abs/2401.08417

#pickoftheweek #MT #LLM

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data, despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22 and WMT'23 test datasets.

arXiv.org
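As a rough sketch of how the CPO objective differs from vanilla SFT: it combines a reference-free, DPO-style preference term with a negative log-likelihood term on the preferred translation. The numbers and simplifications below (e.g., no length normalisation) are illustrative, not the authors' implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cpo_loss(logp_win, logp_lose, beta=0.1):
    """CPO on one preference pair (sketch): a reference-free DPO-style
    preference term plus an NLL anchor on the preferred translation.
    logp_win / logp_lose are the model's sequence log-probabilities of the
    preferred and dispreferred translations."""
    prefer = -math.log(sigmoid(beta * (logp_win - logp_lose)))
    nll = -logp_win  # length normalisation omitted for brevity
    return prefer + nll

# The loss falls as the model separates preferred from dispreferred outputs:
print(round(cpo_loss(-10.0, -10.0), 3))  # -> 10.693 (no separation)
print(round(cpo_loss(-10.0, -20.0), 3))  # -> 10.313 (10-nat separation)
```

Dropping the frozen reference model that DPO carries is what makes this cheap enough to apply on top of ALMA-style MT models.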

Here's a noteworthy study, focusing on one of the most relevant questions in MT: is the 'gold standard'... gold? Or is it gilded?

📄 Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
by Haoran Xu et al.

🔗 http://arxiv.org/abs/2401.08417

#MT #LLM


Our pick of the week by @mgaido91: "Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models" by @JHPang_r, @Fanghua_Ye, @wangly0229, @ShumingShi, @tuzhaopeng et al., 2023.

https://arxiv.org/abs/2401.08350

#translation #MT #pickoftheweek #LLM #languagemodel

Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models

The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings indicate that LLMs effectively lessen the reliance on parallel data for major languages in the pretraining phase. Additionally, the LLM-based translation system significantly enhances the translation of long sentences that contain approximately 80 words and shows the capability to translate documents of up to 512 words. However, despite these significant improvements, the challenges of domain mismatch and prediction of rare words persist. While the challenges of word alignment and beam search, specifically associated with NMT, may not apply to LLMs, we identify three new challenges for LLMs in translation tasks: inference efficiency, translation of low-resource languages in the pretraining phase, and human-aligned evaluation. The datasets and models are released at https://github.com/pangjh3/LLM4MT.

arXiv.org