Introduction
Neural machine translation (NMT) systems have developed rapidly in recent years and have been successful in practice [1]. The quality of an NMT system depends significantly on its training data. Parallel corpora are often unavailable for low-resource language pairs, particularly in specific domains; this lack of parallel data is one of the most significant challenges of NMT [2].
Monolingual data have traditionally been used to train language models, improving the fluency of statistical machine translation (SMT) [3], [4]. Additional monolingual data are important for NMT because sufficient parallel data are unavailable for low-resource language pairs or domains. Unlike parallel corpora, monolingual text in most domains is relatively easy to obtain for each language [5], [6]. However, monolingual text is typically crawled from websites and often contains spelling and grammar errors, especially Vietnamese texts in specific domains such as sports, legal, and medical [5], [7].
Back-translation has recently become a prominent approach to data augmentation [8], [10], in which a reverse system is built to translate monolingual text from the target language to the source language. This is difficult for low-resource language pairs or specific domains because we do not have sufficient parallel data to train a high-quality reverse machine translation system.
Recent advances in neural machine translation have resulted in human parity for several popular language pairs [11]. These language pairs have large-scale parallel data, which low-resource pairs lack.
Google Translate is a well-known machine translation system. Google's neural machine translation (GNMT) has been implemented for many language pairs; currently, 133 languages, including Vietnamese, are supported. According to Google's experiments, translation quality depends on the language pair and the translation domain [11], [12]. However, many language pairs and domains exhibit poor translation performance. Currently, there are no studies on the accuracy of Google Translate for the English-Vietnamese language pair across domains. Nevertheless, Google Translate's performance is expected to improve for all supported languages. Therefore, we want to leverage the advantages of the Google Translate application for data augmentation rather than fully train a reverse machine translation system.
We present a data augmentation proposal for English-Vietnamese neural machine translation, which can also be used for domain adaptation tasks in machine translation. In this study, we focus on this task and choose the legal domain for all experiments.
The main contributions of this study are as follows.
We propose to use Google Translate as the reverse machine translation system in the back-translation step. This proposal not only reduces experiment time, because we do not need to train a reverse system, but also makes sense for low-resource language pairs.
We propose using the grammar and spelling error correction model to improve the input text for back-translation so that the output pseudo-parallel data are of better quality.
We propose generating a phrase dictionary by pruning the phrase-table.
This paper is structured as follows. Section II summarizes related work. Our proposal is described in Section III. Experiments, results, and discussion are presented in Section IV. Finally, the conclusion and future works are presented in Section V.
Related Work
In SMT, the phrase-table contains all the phrase pairs found in the training data, which includes a large amount of noise. To reduce noise, the authors of [13] presented a method to prune out unlikely phrase pairs and improve translation quality. In this study, we used this method to generate a dictionary to enrich training data for machine translation.
Recent studies on data augmentation for NMT have focused on the exploitation of monolingual data [14], [15], [16], [17], [18]. A successful method is back-translation [19], in which an NMT system is trained in the reverse direction (from the target side to the source side) and then used to translate monolingual data from the target language to the source language. The parallel data generated by back-translation are concatenated with the original parallel data to train a source-to-target machine translation model.
A method for data augmentation using back-translation for context-sensitive NMT was presented in [20]. The authors first obtained large-scale pseudo-parallel corpora by back-translating target-side monolingual corpora and then investigated their impact on the translation performance of context-aware NMT models. These NMT models were trained with small parallel corpora and large-scale pseudo-parallel corpora on the IWSLT2017 English-Japanese and English-French datasets.
In [21], the authors presented an unsupervised domain-adaptation method for NMT with iterative back-translation. The authors highlighted an iterative back-translation scheme to make better use of in-domain text. This scheme iteratively translates the source language to the target language and trains a translation model to map the generated data back to the original source sentences. The same generation and training process was then performed in the reverse direction. This process was repeated until convergence was achieved.
In [22], the authors present a data augmentation method for adapting an NMT system from the general domain to the legal domain. The proposed method uses the Google Translate application for back-translation to exploit its advantages.
To improve the quality of input data for back-translation, it is important to correct errors in the sentences, particularly spelling and grammar errors. Several methods have been proposed for spelling and grammatical error correction.
In [23], the authors presented a method for grammatical error correction using neural machine translation in Chinese. This method consists of two stages: removing surface errors and building a grammatical error correction system using neural machine translation. In [24], the authors presented a method that combines statistical machine translation with neural machine translation to build an automated grammatical error correction system.
In [25], the authors presented a method for correcting grammatical errors in the Vietnamese language. The authors treat grammatically incorrect text as the source language and grammatically correct text as the target language, casting the problem as a machine translation task.
In this work, because monolingual text contains noise and errors, we use the grammar and spelling correction model of [26] to improve the quality of the monolingual text that serves as input to the back-translation method mentioned in [23]. In addition, we provide three improvements:
We extend the back-translation method to a transformer model, including experiments and hyperparameter optimization, and we perform error analysis for the neural machine translation models.
This is the first proposal to improve the quality of the pseudo-parallel data generated by back-translation.
We create a dictionary by pruning the phrase-table as a data enrichment method for the NMT model.
Proposed Method
In this section, we provide an overview of the NMT system. We then introduce the back-translation method and the phrase-table pruning method. Finally, the proposed method is presented.
A. Neural Machine Translation
Given a source sentence $x = (x_1, \ldots, x_m)$ and a target sentence $y = (y_1, \ldots, y_n)$, NMT models the conditional log-probability of the target sentence given the source sentence: \begin{equation*} \log p(y|x) = \sum _{t=1}^{n}\log p(y_{t}|y_{< t}, x) \tag{1}\end{equation*}
As in [25], we use the attentional NMT architecture proposed in [29]. The encoder, which is a bidirectional recurrent neural network, reads the source sentence and generates a sequence of source representations $h_1, \ldots, h_m$. The decoder predicts each target word $y_t$ from the previously generated words, its hidden state $s_t$, and a context vector $c_t$: \begin{equation*} p(y_{t}|y_{< t}, x) \propto \exp \left \{{g(y_{t-1}, s_{t}, c_{t})}\right \} \tag{2}\end{equation*}
where $g$ is a nonlinear function and the context vector $c_t$ is an attention-weighted sum of the source representations: \begin{equation*} c_{t} =\sum _{i=1}^{m} \alpha _{ti}h_{i} \tag{3}\end{equation*}
Given a parallel corpus of $N$ sentence pairs $\{(x_n, y_n)\}_{n=1}^{N}$, the model parameters $\theta$ are estimated by maximizing the log-likelihood: \begin{equation*} \theta ^{*} = \arg \max _{\theta } \sum _{n=1}^{N}\log p(y_{n}|x_{n};\theta) \tag{4}\end{equation*}
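For concreteness, the following NumPy sketch illustrates Eq. (3) with toy values; all names and dimensions are illustrative rather than taken from our systems.

```python
# Illustrative sketch of Eq. (3): the context vector c_t is the
# attention-weighted sum of the source representations h_i.
import numpy as np

m, d = 4, 8                          # source length, representation size (toy values)
h = np.random.randn(m, d)            # source representations h_1..h_m
scores = np.random.randn(m)          # unnormalized alignment scores for step t
alpha_t = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
c_t = (alpha_t[:, None] * h).sum(axis=0)          # c_t = sum_i alpha_ti * h_i
assert np.isclose(alpha_t.sum(), 1.0)             # the weights form a distribution
```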
Recently, a novel deep neural network model, transformer [31], with an innovative multi-head attention mechanism was introduced. It has become a state-of-the-art model for many artificial intelligence tasks, including machine translation [32], [33]. Compared with other NMT models, including recurrent ones, the transformer not only provides better translation results but can also be trained in a shorter period of time [31]. In this study, a transformer model was used as the baseline translation system.
The encoder maps an input sequence of symbol representations $x = (x_1, \ldots, x_n)$ to a sequence of continuous representations $z = (z_1, \ldots, z_n)$. Given $z$, the decoder then generates an output sequence $y = (y_1, \ldots, y_m)$ of symbols one element at a time [31].
1) Encoder
The encoder is composed of a stack of $N = 6$ identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [31].
2) Decoder
The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack [31].
B. Back-Translation
The back-translation method is a technique that employs target-language monolingual data in training a machine translation system without changing its network architecture. Given a sentence-aligned parallel dataset $D$ and a monolingual target-language dataset $T = (Y_m)_{m=1}^{M}$, the method proceeds in three steps:
First, a translation model in the reverse direction, NMT$_{Y \rightarrow X}$, is trained with the parallel dataset $D$.
Second, with the translation model NMT$_{Y \rightarrow X}$, the monolingual target-language dataset $T$ is translated back into source-language translations $S = (X_m)_{m=1}^{M}$, which are then paired with $T$ to constitute a synthetic parallel dataset $D_{syn} = \{(X_m, Y_m)\}_{m=1}^{M}$.
Third, the synthetic parallel dataset $D_{syn}$ and the original parallel dataset $D$ are combined to train the main translation model NMT$_{X \rightarrow Y}$.
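As a schematic summary, the following Python sketch expresses the three steps; train_nmt and the translate method of the returned model are hypothetical placeholders for a real NMT toolkit, not an actual API.

```python
# Schematic sketch of back-translation; train_nmt(pairs) is a hypothetical
# helper that trains a model exposing translate(sentence).
def back_translate_training(parallel_d, mono_target_t, train_nmt):
    # Step 1: train the reverse model NMT_{Y->X} on D with sides swapped.
    reverse_model = train_nmt([(y, x) for (x, y) in parallel_d])

    # Step 2: translate each monolingual target sentence back into the
    # source language and pair it, yielding the synthetic corpus D_syn.
    d_syn = [(reverse_model.translate(y), y) for y in mono_target_t]

    # Step 3: train the main model NMT_{X->Y} on D combined with D_syn.
    return train_nmt(parallel_d + d_syn)
```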
Parallel data synthesized by back-translation have merits and demerits compared with human-translated parallel data. Synthetic parallel data are inferior because they are generated automatically by a possibly inaccurate machine translation system. However, they contain no sentence-boundary mismatches between the target sentences and the obtained source sentences, in contrast to human-translated data, where a source sentence can correspond to multiple target sentences. The method makes effective use of the available resources and achieves substantial accuracy.
Despite its simplicity, this method has been shown to be powerful for phrase-based translation [37], NMT [19] and unsupervised MT [38].
C. Data Augmentation
Our proposal can be used for most language pairs and is especially effective for low-resource pairs. Although English-Vietnamese is not a low-resource language pair, the available parallel data are not yet at the scale needed to build a high-quality NMT system. Therefore, in this paper, we present and conduct experiments to augment parallel data for the English-Vietnamese NMT system.
Unlike parallel data, monolingual data are often available from multiple sources, such as the internet and documents. Therefore, we use monolingual data to augment the parallel data for training. To augment the data, we select the back-translation method, in which we must train a reverse NMT system from Vietnamese to English and use it to translate Vietnamese texts; the input sentences and their translations are then paired to form pseudo-parallel data. Because English-Vietnamese parallel data in the legal domain are scarce, we do not have enough parallel text to train a sufficiently good reverse NMT system. Therefore, our first proposal is to use the Google Translate application as the reverse machine translation system. This takes advantage of Google Translate, which not only reduces experiment time, because we do not need to train a reverse system, but also makes sense for low-resource settings such as the English-Vietnamese pair.
In practice, Google Translate produces better translations when the source sentences are properly structured, have full sentence components such as subjects and predicates, and contain no misspellings.
A caveat of the back-translation method is noise in the output parallel text when the input text contains grammar errors or misspellings; this is particularly harmful to machine translation systems. Another side effect is that back-translation generates large amounts of approximate translations that differ from how humans translate. When such data are used for training, they can introduce bias and yield translations that are less similar to human translations.
Therefore, our second proposal in this paper is to use a grammar error correction model for Vietnamese to correct errors in the Vietnamese input texts before they are passed to Google Translate, thereby improving back-translation. We combine the grammar error correction model with the back-translation method, in which back-translation is performed using Google Translate to generate additional parallel data.
Currently, transformer-based NMT systems are the state of the art. However, this architecture does not translate long sentences effectively [39]. To improve the NMT system, we propose a data augmentation technique based on pruning the phrase-table. First, we filter all long sentences from a parallel corpus and generate a phrase-table by training an SMT system. We then prune out unlikely phrase pairs, which typically have low probability values. The result is a parallel corpus of phrase pairs that are translations of each other.
The proposed method, illustrated in Figure 1, consists of three steps: 1) check and correct grammatical errors in the legal-domain Vietnamese text, and create a phrase dictionary; 2) use Google Translate as the reverse model in the back-translation method; 3) synthesize a parallel corpus by pairing the corrected text from Step 1 with the Google Translate output from Step 2. This synthetic parallel corpus is combined with the original parallel corpus and the phrase dictionary to train a domain NMT model.
1) Step 1
Check and correct grammatical errors. We use a model to check and correct grammatical errors in Vietnamese text. We treat grammatical correction as a machine translation task, in which grammatically incorrect text and grammatically correct text are the source and target languages, respectively, as in [25]. We also use the NMT architecture to train a model for the grammar error correction task that improves on the models in [25]. We use the transformer-based NMT architecture throughout this work, as it is the current state of the art. Step 1 is illustrated in Figure 2.
2) Step 2
Back-translation using Google Translate. Our purpose is to generate a domain parallel corpus from monolingual data for the English-Vietnamese language pair. We use the Google Translate application to translate monolingual text from Vietnamese to English; in this case, Google Translate acts as a high-quality reverse system for back-translation, generating a domain parallel corpus from monolingual data. This parallel corpus is added to the original parallel corpus, as described in Step 3, to train the NMT system. Step 2 is illustrated in Figure 3.
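A minimal sketch of Steps 1 and 2 is shown below, assuming the unofficial googletrans package (whose API may change) and a hypothetical correct_grammar() wrapper around the Spell+Vi_GEC model.

```python
# Sketch only: googletrans is an unofficial client, and correct_grammar()
# is a hypothetical stand-in for the Spell+Vi_GEC model.
from googletrans import Translator

translator = Translator()

def build_pseudo_parallel(vi_sentences, correct_grammar):
    pairs = []
    for vi in vi_sentences:
        vi_clean = correct_grammar(vi)   # Step 1: fix spelling/grammar errors
        en = translator.translate(vi_clean, src="vi", dest="en").text  # Step 2
        pairs.append((en, vi_clean))     # (source, target) pseudo-parallel pair
    return pairs
```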
3) Step 3
Training an English-Vietnamese neural machine translation system. The parallel training data are obtained from the original parallel corpus and the parallel corpus generated in Step 2; the two corpora are mixed at different ratios. We train an NMT system from English to Vietnamese and then evaluate its quality when translating texts in different domains. This is the most interesting scenario, as it allows us to trace changes in system quality as the mixing ratio of the two types of parallel data changes. Step 3 is illustrated in Figure 4.
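The mixing itself is a simple concatenation at a chosen ratio followed by shuffling; the sketch below, with illustrative names, corresponds to configurations such as Baseline_Syn50 and Baseline_Syn100.

```python
# Sketch of Step 3's data mixing; n_synthetic values such as 50000 or
# 100000 mirror the synthetic-data amounts used in our experiments.
import random

def mix_corpora(original_pairs, synthetic_pairs, n_synthetic):
    mixed = original_pairs + synthetic_pairs[:n_synthetic]
    random.shuffle(mixed)   # avoid ordering bias during training
    return mixed
```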
D. Generating a Dictionary by Pruning the Phrase-Table
The phrase-table plays an important role in a traditional phrase-based statistical machine translation (SMT) system: during translation, the system relies heavily on the phrase-table to generate outputs. Phrase-table pruning was introduced for SMT in [13]. Each line of the phrase-table contains an English phrase, a Vietnamese phrase, and their alignment points.
Each phrase pair has five features:
the phrase translation probability $\phi(f|e)$,
the lexical weighting $\mathrm{lex}(f|e)$,
the inverse phrase translation probability $\phi(e|f)$,
the inverse lexical weighting $\mathrm{lex}(e|f)$,
the phrase penalty, currently always $e = 2.718$.
The first four features are probabilities and take values between zero and one; the fifth feature is constant.
The phrase-table contains all phrase pairs found in the training data, including noise. To reduce noise, we can prune out unlikely phrase pairs, which typically have low probability values. The main idea of phrase-table pruning is to remove phrase pairs from the phrase-table to make it smaller, ideally removing the least useful pairs first.
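A minimal sketch of this pruning is shown below, assuming a Moses-style phrase-table in which each line has the form "src ||| tgt ||| scores ||| alignment ||| counts"; the score column order varies by Moses version, so the indices used here are an assumption.

```python
# Sketch: keep only phrase pairs whose translation probabilities are high
# in both directions, and emit them as dictionary entries.
def prune_phrase_table(table_path, dict_path, min_prob=0.5):
    with open(table_path, encoding="utf-8") as fin, \
         open(dict_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.split(" ||| ")
            src, tgt = fields[0], fields[1]
            scores = [float(s) for s in fields[2].split()]
            # Assumed layout: scores[0] and scores[2] are the two phrase
            # translation probabilities (one per direction).
            if min(scores[0], scores[2]) >= min_prob:
                fout.write(f"{src}\t{tgt}\n")   # one dictionary entry
```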
Experiments
In this section, we describe the experiments, including statistics for datasets, pre-processing and settings, and the process of training and evaluating the NMT systems. For comparison, experiments were carried out with datasets in the general and legal domains.
A. Outline of the Experiments
We train a model for checking and correcting grammar in Vietnamese texts, and transformer-based NMT systems under three scenarios: (1) the IWSLT2015 parallel corpus only, (2) synthetic data only, and (3) a mixture of the parallel corpus and synthetic data. We trained five NMT systems and evaluated translation quality on general- and legal-domain data. We also compared the translation quality of our systems with Google Translate (GT). The proposed method consists of two main components.
1) Check and Correct Grammatical Errors for Vietnamese
Currently, there is no complete toolkit for checking and correcting grammatical errors in Vietnamese. Therefore, we trained a model for Step 1 (check and correct grammatical errors for Vietnamese). This model is trained using the transformer-based machine translation architecture in two settings: with and without spelling-error correction of the training data; these systems are named Spell+Vi_GEC and Vi_GEC, respectively.
2) Neural Machine Translation Systems
The system built using only the IWSLT2015 parallel corpus. We trained a machine translation system to serve as the basis for evaluation. This system was trained on a general-domain parallel corpus provided by the IWSLT 2015 workshop and is called the Baseline system.
The system built using synthetic data only. This system represents the case where no parallel data are available, but monolingual data can be translated using an existing MT system. Here, we used 100,000 Vietnamese sentences in the legal domain and Google Translate to perform the back-translation; the resulting synthetic data were used to train a new NMT system. This system is called the Synthetic system.
The systems built using a mixture of the parallel corpus and synthetic data. This is the most interesting scenario, allowing us to trace changes in translation quality as we change the mixture ratio between the synthetic data and the original parallel data. We trained two NMT systems: the first on IWSLT2015 parallel data (133k sentence pairs) plus synthetic parallel data (50k sentence pairs), and the second on IWSLT2015 parallel data (133k sentence pairs) plus synthetic parallel data (100k sentence pairs). These systems are called Baseline_Syn50 and Baseline_Syn100.
When using the Vi_GEC model for back-translation, we also build two additional systems, beyond the mixtures of the above datasets, with data augmentation by pruning the phrase-table. The resulting NMT systems are named Baseline_Syn50, Baseline_Syn100, Baseline_Syn50_Pru, Baseline_Syn100_Pru, and Synthetic.
Our NMT systems are evaluated in the general and legal domains. We also compare their translation quality with that of Google Translate on the same domain test sets.
B. Data and Pre-Processing
1) Data
We experiment with datasets for the English-Vietnamese language pair. In all experiments, we consider two domains: the legal domain and the general domain.
Data for training the Vi_GEC system. We consider the Vietnamese grammatical error detection and correction task as a machine translation problem: grammatically incorrect text and the corresponding grammatically correct text are treated as the source and target languages, respectively, and a machine translation model translates text from grammatically incorrect to grammatically correct. Because this model is based on machine translation, we need a Vietnamese parallel corpus with grammatically incorrect source-side data and grammatically correct target-side data. To this end, we gathered a training set in the general domain:
First, we crawled Vietnamese data from several news websites such as https://dantri.com.vn and https://vnexpress.net, collecting about 300,000 Vietnamese sentences.
Second, we created parallel data for training the Vi_GEC system. We used the Vietnamese data collected above as the target side and created the source side by manually introducing several types of grammar and spelling errors (an illustrative sketch of this error injection follows).
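For illustration, one way to inject such errors is sketched below; the error types and rates here are assumptions for exposition, not the exact procedure we used.

```python
# Illustrative error injection: swap adjacent characters (spelling error)
# or drop a word (grammar error) with small probability.
import random

def inject_errors(sentence, p=0.1):
    noisy = []
    for w in sentence.split():
        r = random.random()
        if r < p / 2 and len(w) > 1:
            i = random.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]   # adjacent character swap
        elif r < p:
            continue                                   # word deletion
        noisy.append(w)
    return " ".join(noisy)

# Each (inject_errors(s), s) pair is one Vi_GEC training example.
```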
A summary of the datasets is presented in Table 1.
Data for Back-translation. We crawled Vietnamese data from several legal-domain websites such as https://vanban.chinhphu.vn/ and https://vbpl.vn/, collecting about 100,000 sentences.
Data for Neural Machine Translation systems.
We use the English-Vietnamese parallel corpus provided by IWSLT2015 (133k sentence pairs). This corpus is in the general domain. The tst2012/tst2013 datasets were selected as a validation set (val) and a test set, respectively.
For evaluation, we use 500 sentence pairs in the legal domain and 1,246 sentence pairs in the general domain (the tst2013 dataset).
A summary of parallel and monolingual data is presented in Table 2.
Data for Pruning the phrase-table. To generate the phrase-table, we used about 10,000 parallel sentences in the legal domain, with sentence lengths from 40 to 256 tokens. A summary of this dataset is presented in Table 3.
2) Pre-Processing
The English language has explicit word boundaries (white spaces). We pre-process all datasets with the tokenization script from the standard Moses toolkit [40] for the English source side of the parallel data. In Vietnamese, white spaces separate syllables rather than words: a Vietnamese word consists of one or more syllables, and the Moses tokenizer does not segment sentences into words. Hence, we use vnTokenizer [41] for word segmentation and apply the Moses tokenizer only to separate punctuation such as dots, commas, and other special symbols.
For data cleaning, we use the clean-corpus-n.perl script in Moses to remove sentence pairs in the parallel data that contain more than 40 tokens.
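The same cleaning step can be expressed directly in Python; the sketch below mirrors the Moses script under the stated 40-token limit (file names are illustrative).

```python
# Drop sentence pairs in which either side is empty or exceeds max_len tokens.
def clean_corpus(src_path, tgt_path, max_len=40):
    kept = []
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            if 0 < len(src.split()) <= max_len and 0 < len(tgt.split()) <= max_len:
                kept.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return kept
```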
C. Settings
We trained the NMT systems and the Vi_GEC system using the OpenNMT toolkit [42] with the transformer architecture [31], a state-of-the-art open-source neural machine translation system. The settings are the same for all systems in the experiments: the encoder-decoder translation architecture is based on the transformer with 2048 hidden units in the feed-forward sub-layers and 8 attention heads; we use 6 layers in both the encoder and the decoder; and the dimensionality of the word embeddings and hidden layers is set to 512.
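These hyperparameters match the base transformer configuration; for reference, an equivalent model can be instantiated in PyTorch as sketched below (our systems were actually trained with OpenNMT, so this is only a configuration illustration).

```python
# Reference sketch of the stated architecture in PyTorch.
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # word-embedding / hidden dimensionality
    nhead=8,                # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # hidden units in the feed-forward sub-layers
)
```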
To generate the phrase-table, we use the Moses toolkit [40] with standard settings; this is a state-of-the-art open-source phrase-based SMT system. To prune the phrase-table, we use the technique in [13].
For evaluation, we used the standard BLEU (Bilingual Evaluation Understudy) metric [43]. The translated output of the test set is compared against manually translated references of the same set.
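A minimal example of such scoring is given below, assuming the sacrebleu package; the sentences are placeholders.

```python
# Corpus-level BLEU with sacrebleu; hypotheses and references are
# illustrative placeholders.
import sacrebleu

hypotheses = ["the representative in vietnam may sign contracts"]
references = [["the representative in viet nam may sign contracts"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```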
D. Results
We conducted the test scenarios described in the outline section. The first component of the proposed method is the grammar and spelling correction model of Step 1. We trained the Vi_GEC and Spell+Vi_GEC models for comparison and evaluated them by BLEU score; the process of building these models is presented in detail in [25]. The BLEU score of the Vi_GEC model is 89.70 and that of the Spell+Vi_GEC model is 92.18, as shown in Table 4. We therefore chose the Spell+Vi_GEC model for all experiments in the next step. Note that both BLEU scores are high because the word order of the incorrect source text and the corrected target text is usually identical, and incorrect Vietnamese words can be fixed well.
In Step 2, we use the Spell+Vi_GEC model to preprocess documents and then pass them to Google Translate for back-translation. In the experiments, we compared the quality of the NMT systems with and without the Spell+Vi_GEC model. Table 5 presents the BLEU scores of the NMT systems that do not use the Spell+Vi_GEC system before Google Translate, and Table 6 presents the results of the systems that do.
Table 5 shows the results when the Spell+Vi_GEC system is not used to correct grammatical and spelling errors in the texts prior to back-translation with Google Translate. When translating in the general domain, the Baseline NMT system achieved a BLEU score of 28.3. Because the Baseline system is trained on a general-domain parallel corpus, its BLEU score dropped to 19.83 in the legal domain. Using the back-translation method improves translation results over the Baseline system.
In particular, the Baseline_Syn100 system improved translation quality in the legal domain by up to 13.93 BLEU points over the Baseline system. It achieved the highest BLEU score of all systems in the legal domain (the Baseline_Syn50 and Synthetic systems scored 33.91 and 32.80, respectively). Google Translate achieved a BLEU score of 46.47 in the general domain, which dropped to 32.05 in the legal domain. These results show that our Baseline_Syn100 system is better than the other systems in the legal domain. Figure 5 compares the quality of the systems in the general and legal domains. To assess the impact of the number of back-translated sentences on the quality of the augmented data, we mixed the synthetic data with the original data at different ratios. The results show that the BLEU scores of the NMT systems improve as the number of back-translated sentences increases, both on the general-domain test set and in the legal domain.
When using Spell+Vi_GEC, the translation quality of the systems is improved. In particular, the Baseline_Syn100 system with Spell+Vi_GEC achieves the highest BLEU score, reaching 36.70 in the legal domain (Table 6).
Table 6 and Figure 6 compare the quality of the NMT systems when using the Spell+Vi_GEC system. Figure 8 compares the BLEU scores with and without phrase-table pruning.
Figure 7 compares the BLEU scores of the same system when translating in the general and legal domains in two settings: with and without the Spell+Vi_GEC system. The Baseline_Syn100 system with Spell+Vi_GEC has the highest BLEU score in both the legal and general domains.
The experimental results show that the back-translation technique is capable of generating synthetic parallel data. The method is simple and effective, especially when Google Translate is used as the reverse system and the monolingual data for back-translation are checked and corrected for grammatical errors. In the legal domain of the English-Vietnamese language pair, we observe an improvement of 2.94 BLEU points over the system that does not use the Spell+Vi_GEC system before Google Translate back-translation (from 33.76 to 36.70) and an improvement of 16.87 BLEU points over the Baseline system (from 19.83 to 36.70).
In summary, our proposed method helps to increase the accuracy of the translated sentences of the back-translation method. As a result, the quality of the augmented data is better, as shown by the BLEU score of the NMT systems.
E. Discussion
Data augmentation is an effective way to address the problem of limited data: it adds value to the base data with information derived from internal and external sources. The idea of back-translation is very effective in machine translation, especially for low-resource languages or domains. Since our dataset was very limited, with only 133,316 training sentence pairs, we augmented it using the back-translation technique. Back-translation lets us nearly double the size of the training dataset, making it more valuable for training machine translation models.
Examples of the back-translation results with Google Translate, with and without Spell+Vi_GEC, are illustrated in Table 7. The Vietnamese sentences in the first column contain several types of errors, such as spelling errors and sentence-structure errors. The sentences in the third and fourth columns are the back-translation results without and with the Spell+Vi_GEC model, respectively. The phrases in bold mark the translation phenomena under consideration. More specifically, consider the sentences in rows 1 through 5: the back-translated sentences produced with the Spell+Vi_GEC model are better translations or closer in meaning to the reference sentence.
The Vietnamese sentence in the first row, "Đại dien tại Việt Nm có tham quyền ký hợp đồng dưới tên của doanh nghiệp nước ngoài.", has two spelling errors:
Incorrect: "Đại dien tại Việt Nm", "tham quyền"; Correct: "Đại diện tại Việt Nam", "thẩm quyền".
Google translated the incorrect phrases above as "The representative in Viet Nam Nm" and "the right", which is incorrect. When the Spell+Vi_GEC model corrects these spelling errors, the Google Translate result improves. The sentence in the last row has similar errors.
The sentence in the third row:
"Qua bản báo cáo cho ta thấy được thực trạng ô nhiễm môi trường hiện nay". This sentence has two parts: the first, "Qua bản báo cáo", is an adverbial, and the second, "cho ta thấy được thực trạng ô nhiễm môi trường hiện nay", is a predicate; hence, the sentence lacks a subject. It should be corrected as: "Bản báo cáo cho ta thấy được thực trạng ô nhiễm môi trường hiện nay". Thus, the Google Translate result is also better when we use the Spell+Vi_GEC model on the input sentence ("The report" vs. "Through the report").
The sentences in the seventh and eighth rows are not corrected for grammatical errors, so the Google translation results are not good, as follows:
The sentence in the seventh row is "Có s traăo chính trị giữa haai nước."
Incorrect: "s traăo", "haai nước"; Correct: "sự trao đổi", "hai nước".
Similarly, the sentence in the eighth row is "Nhìn bao quát vù uê từ trên đỉnh đồ."
Incorrect: "vù uê", "đỉnh đồ"; Correct: "vùng quê", "đỉnh đồi".
However, the Spell+Vi_GEC model can only correct the phrase "đỉnh đồ" into "đỉnh đồi". Therefore, the Google translation results are not good in these situations. These phrases are special cases that the Spell+Vi_GEC model has not learned, similar to the unknown-word situation in machine translation. In the future, we will improve the Spell+Vi_GEC model to address these problems.
To assess the impact of the number of back-translated sentences on the quality of the augmented data, we mixed the synthetic data with the original data at different ratios. The results are shown in Tables 5 and 6. The BLEU score increases with the number of back-translated sentences, which means that the quality of the augmented data also increases. However, the back-translation system (Google Translate) and our Spell+Vi_GEC model are not as accurate as humans, so the augmented dataset contains more noise as the number of back-translated sentences increases.
From the above analysis, our proposed method of augmenting data for machine translation is effective. The experimental results show that the translation quality of the reverse system plays an important role in the back-translation method. Building such a high-performance reverse model requires large amounts of parallel data, so building a good reverse system from the available parallel data is not simple for every language pair or domain.
Conclusion
We proposed a method to augment parallel data for NMT. This method combines a grammatical error correction model with Google Translate for back-translation. It does not require changes to the neural network architecture and can easily be applied to other language pairs; thus, it is simple, effective, and highly applicable in practice for low-resource language pairs or domains. The proposed method improved translation quality by 16.87 BLEU points over the baseline system in the legal domain. In the future, we will study methods to reduce noise in back-translation and evaluate the proposed method on other domains of the English-Vietnamese language pair.