Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution
Nuo Xu*, Jun Zhao*†, Can Zu, Sixian Li, Lu Chen, Zhihao Zhang, Rui Zheng,
Shihan Dou, Wenjuan Qin, Tao Gui†, Qi Zhang†, Xuanjing Huang
* Equal contributions. † Corresponding authors.
School of Computer Science, Fudan University
Institute of Modern Languages and Linguistics, Fudan University
College of Foreign Languages and Literature, Fudan University
xun22@m.fudan.edu.cn,{zhaoj19,qz,tgui}@fudan.edu.cn
Abstract
Faithfulness, expressiveness, and elegance are the constant pursuits of machine translation. However, traditional metrics like BLEU do not strictly align with human preferences for translation quality. In this paper, we explore leveraging reinforcement learning from human feedback (RLHF) to improve translation quality. It is non-trivial to collect a large high-quality dataset of human comparisons between translations, especially for low-resource languages. To address this issue, we propose a cost-effective preference learning strategy, optimizing reward models by distinguishing between human and machine translations. In this manner, the reward model learns the deficiencies of machine translation compared to human translation and guides subsequent improvements in machine translation. Experimental results demonstrate that RLHF can effectively enhance translation quality and that this improvement benefits other translation directions not trained with RLHF. Further analysis indicates that the model’s language capabilities play a crucial role in preference learning. A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality and align better with real human translation preferences.
1 Introduction
As a crucial technology facilitating communication between disparate languages and cultures, machine translation has long garnered significant attention from both academia and industry Yang et al. (2020). Recently, the emergence of large language models (LLMs) has propelled the field to new frontiers Yang et al. (2023); Zhu et al. (2023); Jiao et al. (2023b); Hendy et al. (2023). Pre-training on massive monolingual datasets has alleviated the reliance on extensive parallel corpora while enhancing translation quality Xu et al. (2024).
To enhance the translation capabilities of models, most prior work has adopted one of two optimization objectives: one is supervised fine-tuning of translation models to maximize the log probability of human translations Yang et al. (2023); Xu et al. (2024); the other uses techniques such as reinforcement learning to directly optimize a similarity score (e.g., the BLEU score Papineni et al. (2002)) between model outputs and human translations Ranzato et al. (2016); Wu et al. (2018); Wieting et al. (2019). Although both approaches have generally performed well, the objectives they optimize are not fully aligned with human preferences for translation faithfulness, expressiveness, and elegance Rei et al. (2020); Stiennon et al. (2020).
Fortunately, reinforcement learning from human feedback (RLHF) has been shown to be effective in aligning model behavior with human societal values Ouyang et al. (2022); Bai et al. (2022). This process integrates reward modeling, where human annotators rank different responses from models based on their preferences, and then normalizes model behavior through a reinforcement learning (RL) phase. However, it is non-trivial to collect a large high-quality preference dataset. Firstly, preference data often comes with noise and ambiguity, as there is low consistency among different human annotators Wang et al. (2024). More importantly, preference data annotation for translation tasks places higher demands on annotators’ linguistic capabilities, a challenge particularly pronounced in low-resource languages.
This paper explores improving translation quality through RLHF and proposes a cost-effective preference learning strategy. We avoid the need to construct expensive preference datasets and instead leverage the inductive bias that high-quality human translations are superior to machine-generated translations. The reward model learns human translation preferences by comparing the quality difference between the two, and subsequently guides the improvement of machine translation quality. To collect such high-quality human translations, we align books with multilingual versions. Our motivation for choosing books as the data source is as follows: 1) the original text is authored by writers and the target language is translated by professional translators, ensuring the quality of both texts; 2) compared to web text, book text typically contains more complex language structures, which is particularly beneficial for learning translation preferences; 3) aligning book text does not require as high a level of linguistic capabilities from annotators and can be assisted with external tools Wang et al. (2023). The experimental results indicate that the reward model effectively learns human translation preferences, and the translation quality of the model is significantly improved.
The main contributions of this paper are as follows: 1) We explore the use of RLHF to improve machine translation quality and propose a cost-effective preference learning strategy that avoids the need for expensive preference data construction; 2) Our experimental results demonstrate that RLHF can improve translation quality, and this improvement can be transferred to target languages not trained with RLHF; 3) Further analysis shows that reward models with strong language capabilities can more sensitively learn differences in translation quality and have stronger resistance to noise in the data.
2 Related works
2.1 Reinforcement Learning from Human Feedback
In recent years, research applying RLHF techniques to tasks involving LLMs has significantly increased (Ouyang et al., 2022; Touvron et al., 2023b), aiming to align the behavior of these models more closely with human preferences.
For instance, Stiennon et al. (2020) employ this technique to enhance the quality of summaries, while Bai et al. (2022) utilize it to enable the model to generate responses that are more harmless and useful.
This technique follows a systematic approach: first, collect task-specific human preference data; then, use this data to train a reward model, which acts as a proxy for human preferences. During reinforcement learning, this reward model provides signals to guide model training.
However, collecting human preference data is non-trivial, time-consuming, and labor-intensive; it places high demands on annotators and is plagued by inconsistent annotation standards among them Bai et al. (2022); Casper et al. (2023); Wang et al. (2024).
2.2 Human-like Alignment in Translation
Achieving human-level machine translation has long been a research goal and continues to receive attention Hassan et al. (2018); Wu et al. (2016); Läubli et al. (2018). In recent years, some studies have focused on improving the quality of machine translation through human feedback and alignment techniques. Kreutzer et al. (2018) gather implicit task-based feedback, enhancing individual word translations and automatic evaluation measures. Jiao et al. (2023a) employ contrastive instruction and error-guided instruction to align LLMs with human feedback. He et al. (2024) leverage a quality estimation model as the reward model to predict human preference feedback.
Across the methods above, the scarcity of human preference data in translation has long been a bottleneck. Our approach differs: it uses meticulously produced human translations as readily available preference data.
3 Improving Translation with RLHF
Figure 1: An overview of modeling translation preferences using RLHF. To achieve cost-effective preference learning, we optimize the reward model in the second step by contrasting the deficiencies of SFT model translations with human expert translations, thus avoiding the expensive labeling of preference data.
To build a translation model that aligns with human translation preferences, we start with a generic pre-trained language model $\pi$ (such as LLaMA Touvron et al. (2023a)) and follow a three-step pipeline: 1) supervised fine-tuning of $\pi$ on parallel corpora yields a model $\pi^{\text{SFT}}$ with basic translation capabilities; 2) training a reward model $r$ on a preference dataset $\mathcal{D}_{\text{RM}}$, so that it assigns high reward scores to translations that adhere to human preferences; 3) using $r$ as a proxy for human preferences to enhance the translation quality of the model through reinforcement learning.
3.1 Supervised Fine-tuning to Acquire Basic Translation Capabilities
Given a parallel corpus $\mathcal{D}_{\text{SFT}} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ represents the source-language text and $y^{(i)}$ represents the corresponding reference translation, we utilize a fixed prompt template $\mathcal{I}$ and construct the training data as follows:
“Translate this from [SRC] to [TGT]:
[SRC]: <$x^{(i)}$> [TGT]: <$y^{(i)}$>”
where ‘SRC’ and ‘TGT’ represent the names of the source language and the target language, respectively. The translation model $\pi$ is optimized via the negative log-likelihood loss on the parallel corpus $\mathcal{D}_{\text{SFT}}$ as follows:
$$\mathcal{L}_{\text{SFT}}(\pi) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{|y|} \log \pi\left(y_t \mid \mathcal{I}(x), y_{<t}\right) \right] \tag{1}$$
The translation model acquires basic translation capabilities by maximizing the probability of reference translations.
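As a rough illustration of this step, the sketch below builds one prompt/target pair from the template and computes the negative log-likelihood from per-token log-probabilities. The function names, the exact whitespace of the template, and the convention of scoring only the reference tokens are assumptions made for illustration, not details taken from the paper.

```python
import math

# Assumed rendering of the fixed prompt template; the exact spacing is a guess.
PROMPT_TEMPLATE = "Translate this from {src_lang} to {tgt_lang}:\n{src_lang}: {source} {tgt_lang}:"


def build_example(src_lang, tgt_lang, source, reference):
    """Return (prompt, target) strings for one parallel pair."""
    prompt = PROMPT_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang, source=source)
    return prompt, " " + reference


def sequence_nll(token_log_probs):
    """Negative log-likelihood of one reference translation.

    token_log_probs[t] is log pi(y_t | prompt, y_<t) for each reference token.
    Prompt tokens are excluded, so only the translation is scored (assumption).
    """
    return -sum(token_log_probs)


def corpus_nll(batch_token_log_probs):
    """Average the per-sequence NLL over a batch of references."""
    return sum(sequence_nll(lp) for lp in batch_token_log_probs) / len(batch_token_log_probs)
```

In practice the log-probabilities would come from the language model's forward pass; here they are plain floats so the arithmetic of the loss is easy to follow.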
3.2 Modeling Translation Preferences
To accurately model human preferences, high-quality preference data is crucial. A common practice for modeling human value preferences is to prompt the model to generate two different outputs in response to a query $x$ and then require annotators to choose their preferred one, i.e., $y_c \succ y_r \mid x$, where $y_c$ and $y_r$ denote the chosen and rejected responses, respectively. However, constructing a large preference dataset for translation tasks requires annotators who are experts or native speakers in the specific languages, which greatly increases the annotation cost. For low-resource languages, finding a sufficient number of qualified annotators may even be impractical.
Unlike the aforementioned approach, we instead leverage the inductive bias that ‘high-quality human translation is superior to machine-generated translation’ to collect preference data at a lower cost. These high-quality human translations are sourced from book data. Our motivation for selecting this data source is as follows: 1) Books’ original texts and their translated versions are completed by authors and professional translators, ensuring high text quality; 2) Book corpora contain more complex language structures compared to web text, which is highly beneficial for preference learning; 3) Aligning book data requires less stringent language proficiency from annotators and can be aided by external tools.
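We optimize our reward model $r$ by contrasting the differences between high-quality human translation and machine translation:
$$\mathcal{L}(r) = -\mathbb{E}_{(x, y_h, y_m) \sim \mathcal{D}_{\text{RM}}} \left[ \log \sigma\left( r(x, y_h) - r(x, y_m) \right) \right] \tag{2}$$
where $x$ represents the source language sentence, while $y_h$ and $y_m$ respectively denote a high-quality human translation and a machine-generated translation, $\sigma$ is the sigmoid function, and $\mathcal{D}_{\text{RM}}$ is the preference dataset.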
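The contrast between human and machine translations can be sketched as a standard pairwise (Bradley-Terry style) loss on scalar reward scores. This is a minimal sketch assuming that form of objective; the function names are hypothetical, and the scores stand in for the outputs of a reward model on the human and machine translation of the same source sentence.

```python
import math


def pairwise_reward_loss(reward_human, reward_machine):
    """Compute -log sigmoid(r_h - r_m) in a numerically stable way."""
    margin = reward_human - reward_machine
    # -log sigmoid(margin) equals log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_reward_loss(pairs):
    """Mean loss over (reward_human, reward_machine) score pairs."""
    return sum(pairwise_reward_loss(h, m) for h, m in pairs) / len(pairs)
```

The loss is small when the human translation already scores well above the machine translation and grows roughly linearly when the ordering is reversed, which is what pushes the reward model to rank human translations higher.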
3.3 Improving Translation via RL Fine-tuning
During the reinforcement learning (RL) phase, we employ the acquired reward function $r$ to provide feedback to the language model. Specifically, we refine the policy model $\pi^{\text{RL}}$ to optimize the following reward objective:
$$r_{\text{total}}(x, y) = r(x, y) - \eta\, \mathrm{KL}\left( \pi^{\text{RL}}(y \mid x) \,\|\, \pi^{\text{SFT}}(y \mid x) \right) \tag{3}$$
where $\eta$ represents a coefficient regulating the extent of the KL penalty and $\pi^{\text{SFT}}$ is the supervised fine-tuned model. The KL divergence component serves two main purposes within this framework. Firstly, it functions as an entropy bonus, maintaining diversity in generation and averting the collapse into singular high-reward responses Jaques et al. (2019). Secondly, it ensures that the output of the RL policy remains within a distribution where the reward model accurately reflects the performance, thereby preventing significant deviations.
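To make the role of the KL penalty concrete, the sketch below combines a reward-model score with a sampled KL estimate for a single translation. The function name, the argument name `eta`, its default value, and the single-sample KL estimate are illustrative assumptions rather than the paper's implementation.

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, eta=0.05):
    """Reward-model score minus eta times a sampled KL estimate.

    logp_rl and logp_sft hold per-token log-probabilities of the SAME sampled
    translation under the RL policy and the frozen SFT model. Their summed
    difference is a single-sample estimate of KL(pi_RL || pi_SFT).
    eta=0.05 is a placeholder value, not one reported in the paper.
    """
    if len(logp_rl) != len(logp_sft):
        raise ValueError("both models must score the same token sequence")
    kl_estimate = sum(p - q for p, q in zip(logp_rl, logp_sft))
    return reward - eta * kl_estimate
```

When the policy assigns the sampled translation the same probability as the SFT model, the penalty vanishes and the total reward equals the reward-model score; the further the policy drifts toward outputs the SFT model finds unlikely, the larger the deduction.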
4 Experimental Setup
4.1 Training Data Collection
We collect and utilize translation training data from three different sources. Details of these datasets are given in Table 1.
| Name of the dataset | Translation direction | Granularity | Training samples |
|---|---|---|---|
| English-Chinese Books | En↔Zh | paragraph-level | |
| Yiyan Corpus | En↔Zh | sentence-level | |
| United Nations Parallel Corpus | En→Zh/Fr/Es/Ru/Ar | sentence-level | |
Table 1: Details of translation training data. In the English-Chinese Books and Yiyan Corpus datasets, we use both directions of the parallel corpora. In the United Nations Parallel Corpus, we use a random sample of sentence pairs from English to each of the other five languages.
Figure 2: The process of constructing the English-Chinese book dataset.
English-Chinese Books. To capture the rich expressive habits of human translators found in book translations, we manually construct an English-Chinese parallel book corpus.
The construction process, shown in Figure 2, can be divided into three steps:
First, alignment at the book level. We manually collect Chinese and English versions of several books, ensuring that both versions are of high quality and that the translations were produced by skilled professional translators.
Next, alignment at the chapter level. We parse each book into text format and then compare the number and content of chapters in the Chinese and English versions for consistency.
Finally, alignment at the paragraph level. We align Chinese and English paragraphs within each chapter through manual comparison and adjustment.
Yiyan Corpus (https://corpus.bfsu.edu.cn/info/1070/1631.htm). To enhance the diversity of the data and strengthen the model’s robustness to inputs of different lengths, we incorporate the Yiyan corpus, an English-Chinese parallel corpus. Specifically, we utilize the academic and novel sections, consisting of parallel sentences translated by human translators at the sentence level.
United Nations Parallel Corpus (UN) Ziemski et al. (2016). For our multilingual experiments, we use the UN training set, which was also manually translated.
This dataset includes parallel data in six languages: English, Chinese, French, Spanish, Russian, and Arabic. We conduct experiments on translation from English to the other five languages.
We randomly sample from this large dataset, keeping only pairs whose English sentence contains at least 30 words to guarantee richer information.
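The length filter and random sampling described for the UN data could look roughly like the sketch below. The function name, the whitespace-based word count, and the fixed seed are assumptions for illustration; the paper does not state how words were counted.

```python
import random


def sample_long_pairs(pairs, n_samples, min_words=30, seed=0):
    """Keep pairs whose English side has at least min_words whitespace-separated
    words, then draw up to n_samples of them uniformly at random."""
    eligible = [(en, tgt) for en, tgt in pairs if len(en.split()) >= min_words]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, min(n_samples, len(eligible)))
```

Filtering before sampling guarantees that every sampled pair meets the length requirement, at the cost of scanning the whole corpus once.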
In the experiment for bidirectional English-Chinese translation, we mix English-Chinese books data with Yiyan Corpus data. For the multilingual experiment, we utilize the UN dataset.
4.2 Model
• Ultra-LLaMA2-7B: The base model of our experiments. A variant of LLaMA2-7B further pretrained on Chinese tokens.
• LLaMA2-7B Touvron et al. (2023b): An LLM trained primarily on English. In certain experiments, we use this model as the control.
4.3 Evaluation
4.3.1 Metrics
When evaluating the quality of translation results, we employ three evaluation methods: GPT-4 comparative evaluation OpenAI (2023), COMET metrics Rei et al. (2020), and human evaluation.
Figure 3: Comparison between preference-optimized models and the SFT model on the En→Zh task. G and H represent GPT-4 and humans as evaluators, respectively.
Figure 4: Comparison between preference-optimized models and the SFT model on the Zh→En task. G and H represent GPT-4 and humans as evaluators, respectively.
GPT-4. Due to its exceptional general-purpose capabilities, the GPT-4 model has emerged as a pioneering approach for evaluating NLP tasks. We present the source text of a given sentence alongside translations from both the SFT and RLHF models, and ask GPT-4 to compare them and select the superior translation. In the prompt used during the tests, we explicitly include multidimensional evaluation criteria such as flexibility, fidelity, and accuracy. To mitigate the impact of comparison order, we swap the positions of the two models’ outputs for each test, conducting two evaluations per test. Refer to Table 5 in the appendix for the complete prompt.
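The paper does not say how the two order-swapped judgments are combined into a single outcome. One plausible rule, sketched below purely as an assumption, counts a win only when both orderings agree on the same model and treats every other case as a tie. The labels and function names are hypothetical.

```python
from collections import Counter


def combine_swapped_verdicts(first_order, swapped_order):
    """Merge two judge verdicts for one source sentence.

    first_order: winner when SFT is shown first ("SFT", "RLHF", or "Tie").
    swapped_order: winner when RLHF is shown first, already mapped back to
    model names. A model wins only if both orderings agree (assumed rule).
    """
    if first_order == swapped_order and first_order in ("SFT", "RLHF"):
        return first_order
    return "Tie"


def win_rates(verdict_pairs):
    """Return the fraction of SFT wins, RLHF wins, and ties."""
    counts = Counter(combine_swapped_verdicts(a, b) for a, b in verdict_pairs)
    total = len(verdict_pairs)
    return {label: counts[label] / total for label in ("SFT", "RLHF", "Tie")}
```

Requiring agreement across both orderings is conservative: a judge that simply prefers whichever translation appears first contributes ties rather than spurious wins.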
COMET. COMET is a neural framework for training multilingual machine translation evaluation models. It has been shown to have high correlation with human assessment and has become an increasingly widely used metric for machine translation evaluation Kocmi et al. (2021). We select the reference-free quality evaluation model wmt22-cometkiwi-da Rei et al. (2022).
We compare the translation abilities of the two models (SFT and RLHF) by comparing the COMET scores of their translations of the same source text.
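Given per-segment quality scores for both systems (for example, from a reference-free COMET model), the relative comparison can be tallied as in the sketch below. The function name and the `tie_margin` parameter are hypothetical; the paper does not state how ties are decided for COMET.

```python
def comet_win_rates(sft_scores, rlhf_scores, tie_margin=0.0):
    """Compare per-segment quality scores of two systems on the same sources.

    A segment is a tie when the absolute score gap is <= tie_margin
    (tie_margin is a hypothetical knob, not a value from the paper).
    """
    if len(sft_scores) != len(rlhf_scores):
        raise ValueError("score lists must cover the same segments")
    wins = {"SFT": 0, "RLHF": 0, "Tie": 0}
    for sft, rlhf in zip(sft_scores, rlhf_scores):
        if abs(sft - rlhf) <= tie_margin:
            wins["Tie"] += 1
        elif rlhf > sft:
            wins["RLHF"] += 1
        else:
            wins["SFT"] += 1
    total = len(sft_scores)
    return {label: count / total for label, count in wins.items()}
```

Because both systems translate the same sources, the comparison is paired: each segment contributes exactly one outcome, so the three rates sum to one.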
Human Evaluation. When evaluating bidirectional English-Chinese translation, we also incorporate human evaluation.
Proficient bilingual native speakers conduct assessments to compare translation quality.
4.3.2 Test Sets
We utilize the WMT23 test sets Kocmi et al. (2023) and the Flores-200 devtest sets Costa-jussà et al. (2022) to assess the model’s performance. Note that WMT23 does not cover all directions for the multilingual experiment, but as we employ comparative reference-free evaluation, we only use English data from the WMT23 test sets as the source.
5 Results and Discussions
5.1 Main Results
Faithfulness
Input: The synthesis of the pharmaceutical compound acetylsalicylic acid, commonly known as aspirin, marked a significant advancement in modern medicine.
SFT: 阿司匹林的合成标志着现代医学的一个重要进步。
RLHF: 乙酰水杨酸(阿司匹林)这种药物的合成,标志着现代医学的一个重要进步。
Commentary: In the translation by RLHF, the term ‘乙酰水杨酸这种药物’ corresponds to ‘the pharmaceutical compound acetylsalicylic acid’ in the input text, while in the translation by SFT, this expression is missing, reflecting an improvement in translation faithfulness.
Expressiveness
Input: After years of practice, running a marathon was a piece of cake for her.
SFT: 经过多年的练习,对她来说,跑马拉松就像吃蛋糕一样简单。
RLHF: 经过多年的锻炼,跑马拉松对她来说已是小菜一碟了。
Commentary: In the SFT translation, ‘像吃蛋糕一样简单’ is a literal translation of "a piece of cake" in the input text. In contrast, the RLHF translation, ‘小菜一碟’, is a more authentic Chinese expression, vivid and expressive. This case reflects an enhancement in the expressive power of the translation.
Elegance
Input: As the crimson hues of dusk melded with the cerulean tapestry of the night sky, the poet pondered over verses that could encapsulate the ephemeral beauty of the twilight.
SFT: 夜幕降临,天空中的蓝色帷幕与黄昏的红色调和在一起,诗人开始思考如何用诗句来捕捉这短暂的美好。
RLHF: 暮色渐浓,绯红的余晖与夜空的青蓝交织,诗人思忖着如何用诗句来捕捉这转瞬即逝的美景。
Commentary: Both ‘转瞬即逝’ and ‘短暂’ can be used to convey the meaning of ‘ephemeral’ in the input text, but the former implies a sense of regret and sorrow for the fleeting nature of beautiful things, while the latter is a neutral term, simply describing temporal brevity. This example demonstrates an improvement in the elegance of the translation.
Table 2: A case study on modeling human translation preferences through RLHF. The yellow background text reflects the improved translation quality of RLHF compared to SFT.
Is it feasible to model translation preferences without explicit preference annotations?
This paper explores the feasibility of modeling human translation preferences in the absence of explicit preference annotations. By comparing the deficiencies of machine translation with human translation, the reward model learns human translation preferences, thus circumventing the need for costly preference data annotation. In this subsection, we empirically validate the effectiveness of this approach. Specifically, we use high-quality English-Chinese parallel corpora (see Section 4.1) as preferred data, while translations generated by the SFT model (itself fine-tuned on book data held out in advance) serve as dispreferred data. From Figures 3 and 4, we observe that on the WMT23 and FLORES datasets, our preference-optimized model achieves significantly higher win rates than the SFT model, regardless of whether the evaluator is GPT-4 or human. This indicates that with access to high-quality parallel corpora, even in the absence of explicit preference annotations, we can learn human translation preferences and improve the translation quality of the model. In Table 2, we illustrate the quality improvement of translations after preference optimization through three cases.
Figure 5: Comparison between the preference-optimized model and the SFT model in the En→Zh translation direction after replacing the base model in Figure 3 with LLaMA.
Table 3: Results of preference modeling in five translation directions (En→Fr, En→Es, En→Ru, En→Zh, En→Ar) on the UN dataset. For each test set (WMT23 and FLORES) and each evaluator (GPT-4 and COMET), the table reports the SFT Win, RLHF Win, and Tie rates.
The language capability of reward model is crucial for preference learning.
In the previous part of the experiment, we use Ultra-LLaMA as the base model, a variant of LLaMA further pretrained on Chinese tokens. To investigate the impact of language capability on preference learning, we replace the base model with the original LLaMA, which has relatively weaker Chinese processing capability. We construct the SFT model using the same experimental data and training scheme as in the previous section and further optimize it for human preferences. As observed from Figure 5, the win rate of the preference-optimized model against the SFT model decreases significantly, and it even loses to the SFT model in human evaluations. It is worth noting that the SFT model trained on the original LLaMA has inherently weaker translation capabilities than the SFT model based on Ultra-LLaMA, so the quality gap between its generated translations and human translations is more pronounced. Intuitively, this should make learning easier for the reward model. However, the reward model built on the original LLaMA fails to effectively model human translation preferences. Therefore, we believe that the language capability of the reward model plays an important role in preference learning.
Table 4: Cross-lingual transfer results of translation preferences. Rows give the translation direction optimized by RLHF (En→Zh or En→Ar); columns give the transferred translation direction (En→Fr, En→Es, En→Ru, En→Zh, En→Ar). For each evaluator (GPT-4 and COMET), the table reports the SFT Win, RLHF Win, and Tie rates.
5.2 The Impact of the Inherent Nature of Human Translation
Figure 6: Quality analysis of the UN datasets.
The book dataset used in the previous section has high textual quality, contains complex linguistic structures and grammatical phenomena, and is diverse in its domain sources. In contrast, the UN dataset comes from a specific domain, governmental documents, and lacks complex linguistic structures and rhetorical devices. In this section, we conduct multilingual experiments using the UN dataset to explore the influence of intrinsic properties of the data on preference learning.
For simple domain-specific parallel corpora, the quality of machine translations is comparable to human translations.
As shown in Figure 6 (left), using COMET as the evaluation metric, we find that the difference in quality between translations from the SFT model and human translations is minimal. For French and Spanish in particular, only a small share of human translations outperform translations from the SFT model. This indicates that when parallel corpora do not contain complex linguistic sources or sentence structures, the SFT model can already achieve results comparable to human translations. Clearly, the inductive bias that "human translations are superior to translations from the SFT model" is no longer valid for such datasets.
Similar translation quality increases the difficulty of preference learning.
To explore preference learning on the United Nations dataset, we first remove the portion of the data with small differences in COMET scores, retaining data pairs with relatively clear preference tendencies. However, as shown in Figure 6 (right), in the French and Spanish directions, many SFT translations still outperform human translations. Therefore, we reannotate the pairs based on COMET scores to construct a preference dataset. As shown in Table 3, translation models optimized for preferences significantly outperform the SFT model in all five translation directions in terms of COMET scores. This is expected, since our preference labels are derived from COMET scores. However, learned preferences may not necessarily generalize or align with human preferences. The GPT-4 evaluation results in Table 3 indicate that in the English to Spanish and Russian directions, the preference-optimized model has only a slight advantage, and in the case of French, it even loses to the SFT model. This is mainly because the difference between SFT and human translations is minimal in French. In contrast, in the English to Arabic direction, the preference-optimized model improves consistently and significantly, mainly because the preference data itself contains distinct differences, making it easier for the reward model to learn generalizable translation preferences.
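The filtering and COMET-based relabeling described above can be sketched as follows. The function name, the dictionary keys, and the `drop_fraction` parameter are hypothetical, and the sketch assumes that per-translation quality scores have already been computed.

```python
def build_comet_preferences(examples, drop_fraction):
    """Build preference pairs from scored human and machine translations.

    examples: dicts with 'source', 'human', 'machine', 'human_score', and
    'machine_score'. The drop_fraction of pairs with the smallest absolute
    score gap is discarded, then the higher-scoring side is labeled chosen.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(examples, key=lambda e: abs(e["human_score"] - e["machine_score"]))
    kept = ranked[int(len(ranked) * drop_fraction):]
    preferences = []
    for e in kept:
        human_better = e["human_score"] >= e["machine_score"]
        preferences.append({
            "source": e["source"],
            "chosen": e["human"] if human_better else e["machine"],
            "rejected": e["machine"] if human_better else e["human"],
        })
    return preferences
```

Unlike the book-data setting, the chosen side here is not always the human translation: the label follows the score, so a machine translation that scores higher becomes the preferred response.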
5.3 Transferability Analysis
With the powerful Chinese capabilities of the reward model and the notable quality disparities in Arabic preference data, translation models have achieved effective alignment with human preferences in both English-to-Chinese and English-to-Arabic directions. In this section, we explore through experiments whether learned translation preferences can be transferred across languages. As observed from Table 4, when RLHF training is performed solely on English-to-Chinese translation, the learned human preferences transfer effectively to other languages and consistently improve performance. Similarly, when English-to-Arabic translation is used as the source task, improvements are also evident in tasks such as English-to-French and English-to-Russian translation. This indicates that aligning with human preferences in other translation directions and transferring them can be a viable strategy when the current translation direction lacks reward models with strong language capabilities or high-quality preference data.
6 Conclusions
This paper explores modeling translation preferences with RLHF to improve the quality of machine translation. We propose a cost-effective preference learning strategy that optimizes reward models by contrasting deficiencies in machine translation with human translation, thereby learning human preferences while avoiding expensive preference data annotation. Further analysis suggests that the language capability of the reward model and the nature of the data itself affect the effectiveness of preference learning. Additionally, learned preferences exhibit cross-lingual transfer, which may be beneficial for preference modeling in low-resource languages.
Limitations
Due to cost limitations, we only collected English-Chinese aligned book data as a substitute for preference data, without covering more translation directions. Additionally, our human evaluations were limited to English-Chinese translation, with GPT-4 used as a proxy for manual evaluations in other translation directions. In the future, we will attempt to align with human translation preferences in more languages, especially low-resource languages, and conduct comprehensive manual evaluations in more translation directions.
References
Bai et al. (2022)
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022.
Training a helpful and harmless assistant with reinforcement learning from human feedback.
Casper et al. (2023)
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Ségerie, Micah Carroll, Andi Peng, Phillip J. K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca D. Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. 2023.
Open problems and fundamental limitations of reinforcement learning from human feedback.
CoRR, abs/2307.15217.
Costa-jussà et al. (2022)
Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Y. Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022.
No language left behind: Scaling human-centered machine translation.
CoRR, abs/2207.04672.
Hassan et al. (2018)
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018.
Achieving human parity on automatic Chinese to English news translation.
CoRR, abs/1803.05567.
Jiao et al. (2023a)
Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023a.
Parrot: Translating during chat using large language models tuned with human translation and feedback.
In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15009–15020. Association for Computational Linguistics.
Kocmi et al. (2023)
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popovic, and Mariya Shmatova. 2023.
Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet.
In Proceedings of the Eighth Conference on Machine Translation, WMT 2023, Singapore, December 6-7, 2023, pages 1–42. Association for Computational Linguistics.
Kocmi et al. (2021)
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021.
To ship or not to ship: An extensive evaluation of automatic metrics for machine translation.
In Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, pages 478–494. Association for Computational Linguistics.
Kreutzer et al. (2018)
Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. 2018.
Can neural machine translation be improved with user feedback?
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 3 (Industry Papers), pages 92–105. Association for Computational Linguistics.
Läubli et al. (2018)
Samuel Läubli, Rico Sennrich, and Martin Volk. 2018.
Has machine translation achieved human parity? A case for document-level evaluation.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4791–4796. Association for Computational Linguistics.
Ouyang et al. (2022)
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022.
Training language models to follow instructions with human feedback.
Papineni et al. (2002)
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.
Bleu: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA. Association for Computational Linguistics.
Rei et al. (2020)
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020.
COMET: A neural framework for MT evaluation.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2685–2702. Association for Computational Linguistics.
Rei et al. (2022)
Ricardo Rei, Marcos V. Treviso, Nuno Miguel Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte M. Alves, Luísa Coheur, Alon Lavie, and André F. T. Martins. 2022.
CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task.
In Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, pages 634–645. Association for Computational Linguistics.
Stiennon et al. (2020)
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020.
Learning to summarize from human feedback.
CoRR, abs/2009.01325.
Touvron et al. (2023a)
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a.
Llama: Open and efficient foundation language models.
Touvron et al. (2023b)
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b.
Llama 2: Open foundation and fine-tuned chat models.
CoRR, abs/2307.09288.
Wang et al. (2024)
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. 2024.
Secrets of RLHF in large language models part II: Reward modeling.
Wieting et al. (2019)
John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019.
Beyond BLEU: training neural machine translation with semantic similarity.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy. Association for Computational Linguistics.
Wu et al. (2018)
Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018.
A study of reinforcement learning for neural machine translation.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621, Brussels, Belgium. Association for Computational Linguistics.
Wu et al. (2016)
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016.
Google’s neural machine translation system: Bridging the gap between human and machine translation.
CoRR, abs/1609.08144.
Ziemski et al. (2016)
Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016.
The United Nations parallel corpus v1.0.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016. European Language Resources Association (ELRA).
Appendix A Implementation Details
SFT stage.
In the English-Chinese model, we use of the dataset, with a learning rate of , training for epochs; in the multilingual model, approximately of the training data is used for epoch, with a learning rate of .
RM training stage.
The reward model is initialized from the previous stage’s SFT model. In the English-Chinese model, the remaining of the training data are used to form chosen-rejected pairs with the data generated by the SFT model; in the multilingual model, the remaining of the training data is utilized, and only the top of high-confidence data selected by the COMET model is used to train the RM. Training continues with dynamic batch processing until early stopping criteria are met.
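The pair construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the human translation is the chosen side and the SFT output the rejected side, uses the standard pairwise (Bradley-Terry) ranking loss, and treats the confidence scores and the kept fraction as hypothetical inputs, since neither their exact definition nor the value is recoverable from this appendix. All names (`PreferencePair`, `build_pairs`, `keep_top_fraction`, `pairwise_loss`) are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    source: str    # source-language text
    chosen: str    # human translation (assumed preferred)
    rejected: str  # SFT model output (assumed dispreferred)


def build_pairs(sources, human_refs, sft_outputs):
    """Pair each human translation with the SFT output for the same source."""
    return [PreferencePair(s, h, m)
            for s, h, m in zip(sources, human_refs, sft_outputs)]


def keep_top_fraction(pairs, confidences, fraction):
    """Keep the `fraction` of pairs with the highest confidence score.

    `confidences` is a hypothetical per-pair score (for example, one
    derived from a quality metric); original order is preserved.
    """
    k = max(1, int(len(pairs) * fraction))
    ranked = sorted(zip(confidences, range(len(pairs))), reverse=True)
    keep = sorted(i for _, i in ranked[:k])
    return [pairs[i] for i in keep]


def pairwise_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # Two algebraically equal branches, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the human translation further above the machine translation, which is the behavior the training stage relies on.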
RL stage.
For the English-Chinese model, we reuse the inputs from the RM stage’s training data as queries, and for the multilingual model, we use English monolingual book data obtained from web crawling as queries. We set the KL divergence penalty coefficient to , and trained until early stopping criteria were met.
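To make the role of the KL penalty concrete, the sketch below shows the form commonly used in RLHF, where the reward model score is reduced by a coefficient times a sampled estimate of the KL divergence between the policy and the reference (SFT) model. The appendix does not spell out the exact formulation, so this is an assumption; the function name and the `beta` value in the usage note are hypothetical.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Return the reward model score minus beta times a sampled KL estimate.

    `policy_logprobs` and `ref_logprobs` hold per-token log-probabilities
    of the same generated translation under the policy and the reference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    # Sum over tokens of log pi(y_t) - log pi_ref(y_t): a one-sample KL estimate.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate
```

For instance, `kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)` lowers the reward because the policy assigns the sampled tokens higher probability than the reference does; a larger `beta` keeps the policy closer to the SFT model.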
You are a translation expert, and I need your help in impartially judging the quality of two translations. The judging criteria are as follows:
Flexibility of Translation: A good translation is not confined to the original form, and it should be smooth and clear. Poor-quality translations appear rigid and awkward, merely translating word-for-word according to the original form.
Fidelity of Translation: A good translation should faithfully reflect the content of the original text. It should not introduce content that does not exist in the original, nor should it omit content present in the original.
Accuracy and Elegance of Phrasing: In a good translation, phrases and wording should adhere to the conventions of the target language, and they should be as accurate and elegant as possible.
Next, I will provide you with the original text and two translations. Please let me know which one is better according to these criteria. Please give your judgment directly and do not output additional explanations.
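As an illustration of how a prompt like the one above might be assembled for a model-based judge, the sketch below appends the original text and two candidate translations to the instructions. The field labels ("Original text", "Translation A", "Translation B") and the idea of querying both presentation orders are assumptions for illustration, not details confirmed by the paper; no judge API is called.

```python
def build_judge_prompt(instructions, source, translation_a, translation_b):
    """Append the source text and two candidates to the judging instructions."""
    return (
        f"{instructions.strip()}\n\n"
        f"Original text: {source}\n"
        f"Translation A: {translation_a}\n"
        f"Translation B: {translation_b}\n"
    )


def both_orders(instructions, source, first, second):
    """Build the prompt in both presentation orders.

    Querying both orders is one common way to check whether a judge's
    verdict depends on candidate position (an assumption here).
    """
    return [
        build_judge_prompt(instructions, source, first, second),
        build_judge_prompt(instructions, source, second, first),
    ]
```

A verdict that flips when the two candidates are swapped can then be treated as a tie or discarded, depending on the evaluation protocol.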