How to Develop a Neural Machine Translation System from Scratch
Develop a Deep Learning Model to Automatically Translate from German to English in Python with Keras, Step-by-Step.
Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.
Neural machine translation is the use of deep neural networks for the problem of machine translation.
In this tutorial, you will discover how to develop a neural machine translation system for translating German phrases to English.
After completing this tutorial, you will know:
- How to clean and prepare data ready to train a neural machine translation system.
- How to develop an encoder-decoder model for machine translation.
- How to use a trained model for inference on new input phrases and evaluate the model skill.
Let’s get started.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
- German to English Translation Dataset
- Preparing the Text Data
- Train Neural Translation Model
- Evaluate Neural Translation Model
Python Environment
This tutorial assumes you have a Python 3 SciPy environment installed.
You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.
The tutorial also assumes you have NumPy and Matplotlib installed.
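If you are unsure whether your setup meets these requirements, a quick sanity check is to print the installed versions. The snippet below is a minimal sketch that assumes the TensorFlow backend; if you use Theano, import theano in place of tensorflow.

```python
# sanity check: print the versions of Python and the key libraries
import sys
import numpy
import tensorflow
import keras

print('Python: %s' % sys.version)
print('NumPy: %s' % numpy.__version__)
print('TensorFlow: %s' % tensorflow.__version__)
print('Keras: %s' % keras.__version__)
```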
If you need help with your environment, see this post:
German to English Translation Dataset
In this tutorial, we will use a dataset of German to English terms used as the basis for flashcards for language learning.
The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. The dataset is comprised of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.
The page provides a list of many language pairs, and I encourage you to explore other languages:
The dataset we will use in this tutorial is available for download here:
Download the dataset to your current working directory and decompress; for example:
```
unzip deu-eng.zip
```
You will have a file called deu.txt that contains 152,820 pairs of English and German phrases, one pair per line with a tab separating the two languages.
For example, the first 5 lines of the file look as follows:
```
Hi.	Hallo!
Hi.	Grüß Gott!
Run!	Lauf!
Wow!	Potzdonner!
Wow!	Donnerwetter!
```
We will frame the prediction problem as follows: given a sequence of words in German as input, translate or predict the corresponding sequence of words in English.
The model we will develop will be suitable for some beginner German phrases.
Preparing the Text Data
The next step is to prepare the text data ready for modeling.
Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.
For example, here are some observations I note from reviewing the raw data:
- There is punctuation.
- The text contains uppercase and lowercase.
- There are special characters in the German.
- There are duplicate phrases in English with different translations in German.
- The file is ordered by sentence length with very long sentences toward the end of the file.
Did you note anything else that could be important?
Let me know in the comments below.
A good text cleaning procedure may handle some or all of these observations.
Data preparation is divided into two subsections:
- Clean Text
- Split Text
1. Clean Text
First, we must load the data in a way that preserves the Unicode German characters. The function below called load_doc() will load the file as a blob of text.
```python
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
```
Each line contains a single pair of phrases, first English and then German, separated by a tab character.
We must split the loaded text by line and then by phrase. The function to_pairs() below will split the loaded text.
```python
# split a loaded document into sentences
def to_pairs(doc):
	lines = doc.strip().split('\n')
	pairs = [line.split('\t') for line in lines]
	return pairs
```
We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:
- Remove all non-printable characters.
- Remove all punctuation characters.
- Normalize all Unicode characters to ASCII (e.g. Latin characters).
- Normalize the case to lowercase.
- Remove any remaining tokens that are not alphabetic.
We will perform these operations on each phrase for each pair in the loaded dataset.
The clean_pairs() function below implements these operations.
```python
# clean a list of lines
def clean_pairs(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for pair in lines:
		clean_pair = list()
		for line in pair:
			# normalize unicode characters
			line = normalize('NFD', line).encode('ascii', 'ignore')
			line = line.decode('UTF-8')
			# tokenize on white space
			line = line.split()
			# convert to lowercase
			line = [word.lower() for word in line]
			# remove punctuation from each token
			line = [word.translate(table) for word in line]
			# remove non-printable chars from each token
			line = [re_print.sub('', w) for w in line]
			# remove tokens with numbers in them
			line = [word for word in line if word.isalpha()]
			# store as string
			clean_pair.append(' '.join(line))
		cleaned.append(clean_pair)
	return array(cleaned)
```
Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file ready for use.
The function save_clean_data() uses the pickle API to save the list of clean text to file.
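It is only a few lines, and it also appears in the complete listing below.

```python
# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)
```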
Pulling all of this together, the complete example is listed below.
```python
import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_pairs(doc):
	lines = doc.strip().split('\n')
	pairs = [line.split('\t') for line in lines]
	return pairs

# clean a list of lines
def clean_pairs(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for pair in lines:
		clean_pair = list()
		for line in pair:
			# normalize unicode characters
			line = normalize('NFD', line).encode('ascii', 'ignore')
			line = line.decode('UTF-8')
			# tokenize on white space
			line = line.split()
			# convert to lowercase
			line = [word.lower() for word in line]
			# remove punctuation from each token
			line = [word.translate(table) for word in line]
			# remove non-printable chars from each token
			line = [re_print.sub('', w) for w in line]
			# remove tokens with numbers in them
			line = [word for word in line if word.isalpha()]
			# store as string
			clean_pair.append(' '.join(line))
		cleaned.append(clean_pair)
	return array(cleaned)

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(100):
	print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
```
Running the example creates a new file in the current working directory with the cleaned text called english-german.pkl.
Some examples of the clean text are printed for us to evaluate at the end of the run to confirm that the clean operations were performed as expected.
```
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]
...
```
2. Split Text
The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.
This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary.
Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.
You can explore developing a model on the fuller dataset as an extension; I would love to hear how you do.
We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.
Further, we will then take the first 9,000 of those as examples for training and use the remaining 1,000 examples to test the fit model.
Below is an example of loading the clean data, splitting it, and saving the split portions of data to new files.
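The listing below is a minimal sketch of this step. It assumes the cleaned pairs were saved to english-german.pkl by the previous script; the output filenames english-german-both.pkl, english-german-train.pkl, and english-german-test.pkl are illustrative choices, and shuffling before the split is one reasonable way to avoid putting only the very longest of the 10,000 phrases into the test set.

```python
from pickle import load
from pickle import dump
from numpy.random import shuffle

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load the full set of cleaned phrase pairs
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce the dataset to the first 10,000 (shortest) examples
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# shuffle so train and test contain a mix of phrase lengths
shuffle(dataset)
# split into 9,000 training and 1,000 test examples
train, test = dataset[:9000], dataset[9000:]
# save the reduced, train, and test sets to new files
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')
```

Running the script reports the three saved files, which the model training and evaluation steps can later load.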