The essential guide to how NLP works

Understanding how to build an NLP pipeline

Let’s look at a piece of text:

London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.

This paragraph contains several useful facts. It would be great if a computer could read this text and understand that London is a city, London is located in England, London was settled by Romans and so on. But to get there, we have to first teach our computer the most basic concepts of written language and then move up from there.

Steps to build an NLP Pipeline

Step 1: Sentence Segmentation

The first step in the pipeline is to break the text apart into separate sentences. That gives us this:

  1. “London is the capital and most populous city of England and the United Kingdom.”
  2. “Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia.”
  3. “It was founded by the Romans, who named it Londinium.”

We can assume that each sentence in English is a separate thought or idea. It will be a lot easier to write a program to understand a single sentence than to understand a whole paragraph.

Coding a Sentence Segmentation model can be as simple as splitting apart sentences whenever you see a punctuation mark. But modern NLP pipelines often use more complex techniques that work even when a document isn’t formatted cleanly.
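As a quick sketch, here's what that naive punctuation-based splitting might look like in Python (the regex and variable names here are just illustrative, not from the original pipeline):

    import re

    text = ("London is the capital and most populous city of England and the "
            "United Kingdom. Standing on the River Thames in the south east "
            "of the island of Great Britain, London has been a major "
            "settlement for two millennia. It was founded by the Romans, "
            "who named it Londinium.")

    # Naive approach: treat ". ", "! " or "? " as a sentence boundary.
    # This breaks on abbreviations like "Mr." or "U.K.", which is exactly
    # why real pipelines use trained segmentation models instead.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        print(sentence)

Running this prints the three sentences above, one per line.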

Step 2: Word Tokenization

Now that we’ve split our document into sentences, we can process them one at a time. Let’s start with the first sentence from our document:

“London is the capital and most populous city of England and the United Kingdom.”

The next step in our pipeline is to break this sentence into separate words or tokens. This is called tokenization. Here's the result:

“London”, “is”, “the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “England”, “and”, “the”, “United”, “Kingdom”, “.”

Tokenization is easy to do in English. We’ll just split apart words whenever there’s a space between them. And we’ll also treat punctuation marks as separate tokens since punctuation also has meaning.
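As a rough sketch, a simple space-and-punctuation tokenizer could look like this (again, the regex is illustrative, not taken from the original article):

    import re

    sentence = ("London is the capital and most populous city of England "
                "and the United Kingdom.")

    # Grab runs of word characters, or any single punctuation character,
    # so that "Kingdom." becomes the two tokens "Kingdom" and ".".
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    print(tokens)

This prints the token list shown above, with the final period as its own token.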

Step 3: Predicting Parts of Speech for Each Token

Next, we’ll look at each token and try to guess its part of speech — whether it is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence will help us start to figure out what the sentence is talking about.

We can do this by feeding each word (and some extra words around it for context) into a pre-trained part-of-speech classification model.
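The article doesn't name a specific model here, but as one possible example, the spaCy library ships with a pre-trained tagger that does this (assuming spaCy and its small English model en_core_web_sm are installed):

    import spacy

    # Assumes: pip install spacy
    #          python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("London is the capital and most populous city of England "
              "and the United Kingdom.")
    for token in doc:
        print(f"{token.text:<10}{token.pos_}")

This prints each token next to a tag like PROPN (proper noun), NOUN or ADJ.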

With this information, we can already start to glean some very basic meaning. For example, we can see that the nouns in the sentence include “London” and “capital”, so the sentence is probably talking about London.

Step 4: Text Lemmatization

In English (and most languages), words appear in different forms. Look at these two sentences:

I had a pony.

I had two ponies.

Both sentences talk about the noun pony, but they are using different inflections. When working with text in a computer, it is helpful to know the base form of each word so that you know that both sentences are talking about the same concept. Otherwise the strings “pony” and “ponies” look like two totally different words to a computer.

In NLP, we call this process lemmatization — figuring out the most basic form or lemma of each word in the sentence.

The same thing applies to verbs. We can also lemmatize verbs by finding their root, unconjugated form. So “I had two ponies” becomes “I [have] two [pony]”.

Lemmatization is typically done with a look-up table of the lemma forms of words based on their part of speech, plus some custom rules to handle words you've never seen before.

Here’s what our sentence looks like after lemmatization adds in the root form of our verb:

“London [be] the capital and most populous city of England and the United Kingdom.”
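Continuing the spaCy sketch from Step 3 (one possible library choice, not necessarily what the original pipeline used), each token carries its lemma alongside its part of speech:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # same model as the Step 3 sketch

    doc = nlp("London is the capital and most populous city of England "
              "and the United Kingdom.")
    for token in doc:
        print(f"{token.text:<10}{token.lemma_}")

Here “is” comes back as its base form “be”; most of the other tokens in this sentence are already in their root form.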