How to Get Started with Deep Learning for Natural Language Processing (7
Deep Learning for NLP Crash Course.
Bring Deep Learning methods to Your Text Data project in 7 Days.
We are awash with text, from books, papers, blogs, tweets, news, and increasingly text from spoken utterances.
Working with text is hard as it requires drawing upon knowledge from diverse domains such as linguistics, machine learning, statistical methods, and these days, deep learning.
Deep learning methods are starting to out-compete the classical and statistical methods on some challenging natural language processing problems with single, simpler models.
In this crash course, you will discover how you can get started and confidently develop deep learning for natural language processing problems using Python in 7 days.
This is a big and important post. You might want to bookmark it.
Let’s get started.
Who Is This Crash-Course For?
Before we get started, let’s make sure you are in the right place.
The list below provides some general guidelines as to who this course was designed for.
Don’t panic if you don’t match these points exactly; you might just need to brush up in one area or another to keep up.
You need to know:
- You need to know your way around basic Python, NumPy and Keras for deep learning.
You do NOT need to know:
- You do not need to be a math wiz!
- You do not need to be a deep learning expert!
- You do not need to be a linguist!
This crash course will take you from a developer that knows a little machine learning to a developer who can bring deep learning methods to your own natural language processing project.
Note: This crash course assumes you have a working Python 2 or 3 SciPy environment with at least NumPy, Pandas, scikit-learn and Keras 2 installed. If you need help with your environment, you can follow the step-by-step tutorial here:
Crash-Course Overview
This crash course is broken down into 7 lessons.
You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.
Below are 7 lessons that will get you started and productive with deep learning for natural language processing in Python:
- Lesson 01: Deep Learning and Natural Language
- Lesson 02: Cleaning Text Data
- Lesson 03: Bag-of-Words Model
- Lesson 04: Word Embedding Representation
- Lesson 05: Learned Embedding
- Lesson 06: Classifying Text
- Lesson 07: Movie Review Sentiment Analysis Project
Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.
The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to look for help on deep learning, natural language processing and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).
I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.
Post your results in the comments, I’ll cheer you on!
Hang in there, don’t give up.
Note: This is just a crash course. For a lot more detail and 30 fleshed out tutorials, see my book on the topic titled “
Lesson 01: Deep Learning and Natural Language
In this lesson, you will discover a concise definition for natural language, deep learning and the promise of deep learning for working with text data.
Natural Language Processing
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.
The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.
The problem of understanding text is not solved, and may never be, primarily because language is messy. There are few rules. And yet we can easily understand each other most of the time.
Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.
A property of deep learning is that the performance of this type of model improves by training it with more examples and by increasing its depth or representational capacity.
In addition to scalability, another often cited benefit of deep learning models is their ability to perform automatic feature extraction from raw data, also called feature learning.
Promise of Deep Learning for NLP
Deep learning methods are popular for natural language, primarily because they are delivering on their promise.
Some of the first large demonstrations of the power of deep learning were in natural language processing, specifically speech recognition and, more recently, machine translation.
The 3 key promises of deep learning for natural language processing are as follows:
- The Promise of Feature Learning. That is, that deep learning methods can learn the features from natural language required by the model, rather than requiring that the features be specified and extracted by an expert.
- The Promise of Continued Improvement. That is, that the performance of deep learning in natural language processing is based on real results and that the improvements appear to be continuing and perhaps speeding up.
- The Promise of End-to-End Models. That is, that large end-to-end deep learning models can be fit on natural language problems offering a more general and better-performing approach.
Natural language processing is not “solved”, but deep learning is required to get you to the state-of-the-art on many challenging problems in the field.
Your Task
For this lesson you must research and list 10 impressive applications of deep learning methods in the field of natural language processing. Bonus points if you can link to a research paper that demonstrates the example.
Post your answer in the comments below. I would love to see what you discover.
More Information
In the next lesson, you will discover how to clean text data so that it is ready for modeling.
Lesson 02: Cleaning Text Data
In this lesson, you will discover how you can load and clean text data so that it is ready for modeling, both manually and with the NLTK Python library.
Text is Messy
You cannot go straight from raw text to fitting a machine learning or deep learning model.
You must clean your text first, which means splitting it into words and normalizing issues such as:
- Upper and lower case characters.
- Punctuation within and around words.
- Numbers such as amounts and dates.
- Spelling mistakes and regional variations.
- Unicode characters
- and much more…
Manual Tokenization
Generally, we refer to the process of turning raw text into something we can model as “tokenization”, where we are left with a list of words or “tokens”.
We can manually develop Python code to clean text, and often this is a good approach given that each text dataset must be tokenized in a unique way.
For example, the snippet of code below will load a text file, split tokens by whitespace and convert each token to lowercase.
```python
filename = '...'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lowercase
words = [word.lower() for word in words]
```
You can imagine how this snippet could be extended to handle and normalize Unicode characters, remove punctuation and so on.
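As one possible extension, the sketch below adds Unicode normalization and punctuation removal to the whitespace split. It assumes Python 3, and the function name `clean_tokens` and the sample sentence are made up for illustration. Dropping every non-ASCII character is a blunt choice that suits English text; other languages will need a gentler rule.

```python
import string
import unicodedata

def clean_tokens(text):
    """Split text on whitespace, then normalize case, Unicode and punctuation."""
    # map every ASCII punctuation character to None so translate() deletes it
    table = str.maketrans('', '', string.punctuation)
    cleaned = []
    for word in text.split():
        # decompose accented characters, then drop anything that is not ASCII
        word = unicodedata.normalize('NFKD', word)
        word = word.encode('ascii', 'ignore').decode('ascii')
        # strip punctuation and convert to lowercase
        word = word.translate(table).lower()
        # skip tokens that became empty (for example a lone dash)
        if word:
            cleaned.append(word)
    return cleaned

# illustrative sample text, not taken from any dataset
sample = "The café's menu - it's GREAT, really!"
print(clean_tokens(sample))
```

Note that removing the apostrophe turns "it's" into "its"; whether that is acceptable depends on your problem, which is why tokenization is often tailored to each dataset.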
NLTK Tokenization
Many of the best practices for tokenizing raw text have been captured and made available in a Python library called the Natural Language Toolkit or NLTK for short.
You can install this library using pip by typing the following on the command line:
```
sudo pip install -U nltk
```
After it is installed, you must also install the datasets used by the library, either via a Python script:
```python
import nltk
nltk.download()
```
or via a command line:
```
python -m nltk.downloader all
```
Once installed, you can use the API to tokenize text. For example, the snippet below will load and tokenize an ASCII text file.
```python
# load data
filename = '...'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
```
There are many tools available in this library and you can further refine the clean tokens using your own manual methods, such as removing punctuation, removing stop words, stemming and much more.
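A minimal sketch of such manual refinement is shown below. It works on a hand-written token list shaped like tokenizer output, and the stop word set is a tiny made-up example. NLTK ships its own stop word lists and stemmers, which are not used here so that the snippet runs without any downloads.

```python
# hypothetical token list, shaped like the output of a tokenizer
tokens = ['The', 'quick', 'brown', 'fox', ',', 'it', 'jumped',
          'over', 'the', 'lazy', 'dog', '.']

# tiny illustrative stop word list; a real project would use a much longer one
stop_words = {'the', 'it', 'a', 'an', 'and', 'of', 'over'}

# keep alphabetic tokens only, which drops standalone punctuation and numbers
words = [t.lower() for t in tokens if t.isalpha()]
# remove stop words
words = [w for w in words if w not in stop_words]
print(words)
```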
Your Task
Your task is to locate a free classical book on the Project Gutenberg website, download the ASCII version of the book and tokenize the text and save the result to a new file. Bonus points for exploring both manual and NLTK approaches.
Post your code in the comments below. I would love to see what book you choose and how you chose to tokenize it.
More Information
In the next lesson, you will discover the bag-of-words model.
Lesson 03: Bag-of-Words Model
In this lesson, you will discover the bag-of-words model and how to encode text using this model so that you can train a model using the scikit-learn and Keras Python libraries.
Bag-of-Words
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document.
A vocabulary is chosen, where perhaps some infrequently used words are discarded. A given document of text is then represented using a vector with one position for each word in the vocabulary and a score for each known word that appears (or not) in the document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
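To make the idea concrete, here is a small pure-Python sketch of a count-based bag-of-words encoding. The two documents and the helper name `encode` are invented for illustration, and simple word counts stand in for the score; other scoring schemes are possible.

```python
from collections import Counter

# two contrived documents
docs = ["the cat sat on the mat", "the dog sat"]

# build a sorted vocabulary of every known word across the documents
vocab = sorted(set(word for doc in docs for word in doc.split()))

def encode(doc, vocab):
    """Return one count per vocabulary word; word order in doc is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(encode(doc, vocab))

# the same words in a different order produce the same vector
print(encode("the cat sat", vocab) == encode("sat the cat", vocab))
```

The last line prints `True`, which is exactly the "bag" property: the encoding keeps which words occur and how often, and throws away their order.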
Bag-of-Words with scikit-learn
The scikit-learn Python library for machine learning provides tools for encoding documents for a bag-of-words model.
An instance of the encoder can be created, trained on a corpus of text documents and then used again and again to encode training, test, validation and any new data that needs to be encoded for your model.
There is an encoder to score words based on their count called CountVectorizer, one for using a hash function of each word to reduce the vector length called HashingVectorizer, and one that uses a score based on word occurrence in the document and the inverse occurrence across all documents called TfidfVectorizer.
The snippet below shows how to train the TfidfVectorizer bag-of-words encoder and use it to encode multiple small text documents.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```
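The inverse document frequency scores can also be worked out by hand, which helps when interpreting the numbers. The sketch below uses only the standard library and assumes the smoothed formula ln((1 + n) / (1 + df)) + 1, which is my understanding of the scikit-learn default. The `tokenize` helper is a simplified stand-in for the library's own tokenizer, and the final per-document normalization is not reproduced.

```python
import math
import re

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

def tokenize(doc):
    # lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# document frequency: in how many documents does each word appear?
df = {}
for doc in docs:
    for word in set(tokenize(doc)):
        df[word] = df.get(word, 0) + 1

n_docs = len(docs)
# smoothed inverse document frequency (assumed formula, see the text above)
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1
       for word, count in df.items()}

for word in sorted(idf):
    print(word, round(idf[word], 4))
```

A word such as "the" that appears in every document gets the lowest possible score of 1.0, while words that appear in only one document score highest.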
Bag-of-Words with Keras
The Keras Python library for deep learning also provides tools for encoding text using the bag-of-words model in the Tokenizer class.
As above, the encoder must be trained on source documents and then can be used to encode training data, test data and any other data in the future. The API also has the benefit of performing basic tokenization prior to encoding the words.
The snippet below demonstrates how to train and encode some small text documents using the Keras API and the ‘count’ type scoring of words.
```python
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
```
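To see how a count matrix like this is laid out, the pure-Python sketch below builds one by hand for the same five documents. It is not the Keras implementation. It assumes that words are numbered from 1 in order of descending frequency, that ties keep first-seen order, and that column 0 is left unused. It also relies on Python 3.7+ dictionary ordering, and the `tokenize` helper is a simplified stand-in.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def tokenize(doc):
    # lowercase and split on anything that is not a letter or digit
    return [w for w in re.split(r'[^a-z0-9]+', doc.lower()) if w]

# count words across all documents, remembering first-seen order
word_counts = Counter(w for doc in docs for w in tokenize(doc))
# most frequent word gets index 1 (stable sort keeps first-seen order on ties)
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i + 1 for i, w in enumerate(ranked)}

# one row per document, one column per index (including the unused column 0)
matrix = []
for doc in docs:
    row = [0] * (len(word_index) + 1)
    for w in tokenize(doc):
        row[word_index[w]] += 1
    matrix.append(row)

print(word_index)
for row in matrix:
    print(row)
```

Each row has nine columns for an eight-word vocabulary, and "work" takes index 1 because it is the only word that appears twice.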
Your Task
Your task in this lesson is to experiment with the scikit-learn and Keras methods for encoding small contrived text documents for the bag-of-words model. Bonus points if you use a small standard text dataset of documents to practice on and perform data cleaning as part of the preparation.
Post your code in the comments below. I would love to see what APIs you explore and demonstrate.
More Information
In the next lesson, you will discover word embeddings.
Lesson 04: Word Embedding Representation
In this lesson, you will discover the word embedding distributed representation and how to develop a word embedding using the Gensim Python library.
Word Embeddings
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.
They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
Word embedding methods learn a real-valued vector representation for a predefined fixed sized vocabulary from a corpus of text.
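One common way to compare two such vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned from a corpus and typically have tens or hundreds of dimensions.

```python
import math

# hypothetical 3-dimensional embeddings, hand-picked for illustration only
embedding = {
    'king':  [0.8, 0.6, 0.1],
    'queen': [0.7, 0.7, 0.1],
    'apple': [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# related words should score close to 1.0, unrelated words noticeably lower
print(cosine_similarity(embedding['king'], embedding['queen']))
print(cosine_similarity(embedding['king'], embedding['apple']))
```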
Train Word Embeddings
You can train a word embedding distributed representation using the Gensim Python library for topic modeling.
Gensim offers an implementation of the word2vec algorithm, developed at Google for the fast training of word embedding representations from text documents.
You can install Gensim using pip by typing the following on your command line:
```
pip install -U gensim
```
The snippet below shows how to define a few contrived sentences and train a word embedding representation in Gensim.
```python
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the trained model
print(model)
# summarize vocabulary (in Gensim 4+, use model.wv.key_to_index instead of model.wv.vocab)
words = list(model.wv.vocab)
print(words)
# access vector for one word via the model's word vectors
print(model.wv['sentence'])
```
Use Embeddings
Once trained, the embedding can be saved to file to be used as part of another model, such as the front-end of a deep learning model.
You can also plot a projection of the distributed representation of words to get an idea of how the model believes words are related. A common projection technique that you can use is Principal Component Analysis, or PCA, available in scikit-learn.
The snippet below shows how to train a word embedding model and then plot a two-dimensional projection of all words in the vocabulary.
```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# fit a 2D PCA model to the vectors
# (in Gensim 4+, use model.wv.key_to_index instead of model.wv.vocab)
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
# label each point with its word
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
```
Your Task
Your task in this lesson is to train a word embedding using Gensim on a text document, such as a book from Project Gutenberg. Bonus points if you can generate a plot of common words.
Post your code in the comments below. I would love to see what book you choose and any details of the embedding that you learn.
More Information
In the next lesson, you will discover how a word embedding can be learned as part of a deep learning model.
Lesson 05: Learned Embedding
In this lesson, you will discover how to learn a word embedding distributed representation for words