A Quick Introduction to Natural Language Processing (NLP)
https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA
[Overview] Natural language processing has become an important branch of artificial intelligence. It studies the theories and methods that enable humans and computers to communicate effectively in natural language. This article offers a brief introduction to NLP to help readers get up to speed quickly.
Author | George Seif
Translation | Xiaowen
An easy introduction to Natural Language Processing
Using computers to understand human language
Computers are great at working with standardized and structured data such as database tables and financial records, and they can process that data far faster than we humans can. But we humans don't communicate in "structured data", and we certainly don't speak binary! We communicate with words, which are unstructured data.
Unfortunately, computers struggle with unstructured data because there are no standardized techniques for processing it. When we program a computer in a language like C, Java, or Python, we are essentially giving it a set of rules to operate by. With unstructured data, those rules are abstract and hard to define concretely.
There is a lot of unstructured natural language on the internet; sometimes even Google doesn't know what you're searching for!
How humans and computers understand language
Humans have been writing for thousands of years. Over that time, our brains have gained enormous experience in understanding natural language. When we read something on a piece of paper or in a blog post on the internet, we understand what it really means in the real world. We feel the emotions the text evokes, and we often picture what the thing described would look like in real life.
Natural language processing (NLP) is a subfield of artificial intelligence dedicated to enabling computers to understand and process human language, bringing them closer to a human-level understanding of it. Computers don't yet have the intuitive grasp of natural language that humans do; they can't really understand what a piece of language is actually trying to say. In short, computers can't read between the lines.
That said, recent advances in machine learning (ML) have enabled computers to do quite a few useful things with natural language! Deep learning lets us write programs that perform tasks such as language translation, semantic understanding, and text summarization. All of these add real-world value by letting you understand and run computations over large blocks of text without doing the work by hand.
Let's start with a quick primer on how NLP works conceptually. After that, we'll dig into some Python code so you can start using NLP yourself!
The real reason NLP is hard
The process of reading and understanding language is far more complex than it seems at first glance. There is a lot going on in truly grasping what a piece of text means in the real world. For example, what do you think the following text means?
"Steph Curry was on fire last night. He totally destroyed the other team"
To a person, the meaning is obvious. We know Steph Curry is a basketball player, and even if you don't, we know he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed", we know it means Steph Curry played really well last night and beat the other team.
Computers tend to take things far too literally. Taken literally, we'd see "Steph Curry" and, based on the capitalization, assume it's a person, a place, or something else important. But then we see that Steph Curry "was on fire"... a computer might tell you that someone literally set Steph Curry on fire yesterday! ...oops. After that, the computer might conclude that Curry has destroyed the other team... they no longer exist... great...
Steph Curry was literally on fire!
But not everything machines do is that crude. Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language! Let's see how to do that in a few lines of code with a couple of simple Python libraries.
Solving NLP problems with Python code
To see how NLP works, we'll use the following text from Wikipedia as our running example:
Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics—Kindle e-readers, Fire tablets, Fire TV, and Echo—and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.
A few libraries we'll need
First, we'll install a few useful Python NLP libraries that will help us analyze this text.
### Installing spaCy, general Python NLP lib
pip3 install spacy

### Downloading the English dictionary model for spaCy
python3 -m spacy download en_core_web_lg

### Installing textacy, basically a useful add-on to spaCy
pip3 install textacy
Entity analysis
Now that everything is installed, we can run a quick entity analysis on our text. Entity analysis goes through the text and identifies all the important words, or "entities", in it. By "important" we really mean words that carry some kind of real-world semantic meaning or significance.
Check out the code below, which does all of the entity analysis for us:
# coding: utf-8
import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics—Kindle e-readers, Fire tablets, Fire TV, and Echo—and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)
We first load spaCy's trained ML model and set up the text we want to process, then run the model over the text to extract the entities. When you run that code, you get the following output:
Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo - LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG
The short codes[1] next to the text are labels indicating the type of entity we're looking at. It looks like our model did a pretty good job! Jeff Bezos is indeed a person, the date is correct, Amazon is an organization, and Seattle and Washington are both geopolitical entities (i.e. countries, cities, states, etc.). The only tricky bit is that things like Fire TV and Echo are actually products, not organizations. The model also missed the other products Amazon sells ("video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry"), probably because they sit in a huge list and therefore look relatively unimportant.
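If you're unsure what a label like NORP or GPE stands for, spaCy's documentation spells each one out. As a minimal sketch, here is a plain lookup table with descriptions copied from the spaCy docs, covering just the labels that appeared above (kept as a dict so it runs without the model installed):

```python
# Entity labels seen in the output above, with their documented meanings.
# Descriptions are taken from spaCy's entity-type documentation.
ENTITY_LABELS = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "DATE": "Absolute or relative dates or periods",
    "ORDINAL": "'first', 'second', etc.",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
}

def describe(label):
    """Return a human-readable description for an entity label."""
    return ENTITY_LABELS.get(label, "Unknown label")

print(describe("GPE"))   # Countries, cities, states
print(describe("NORP"))  # Nationalities or religious or political groups
```

When spaCy itself is installed, spacy.explain('GPE') returns the same kind of description directly.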
Overall, though, our model has accomplished what we wanted. Imagine a huge document hundreds of pages long: this NLP model could quickly give you a sense of what the document is about and what its key entities are.
Operating on entities
Let's try something more applied. Say you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. With spaCy we can write a very handy scrubbing function that removes any entity category we don't want to see, as shown below:
# coding: utf-8
import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics—Kindle e-readers, Fire tablets, Fire TV, and Echo—and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."

### Replace a specific entity with the word "PRIVATE"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        return "[PRIVATE] "
    else:
        return token.text_with_ws

### Loop through all the entities in a piece of text and apply entity replacement
def scrub(text):
    doc = nlp(text)
    # Merge each multi-token entity into a single token so it is
    # replaced exactly once (retokenize works on spaCy 2.1+ and 3.x).
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)

print(scrub(text))
It works great! And this is actually a very powerful technique. People use Ctrl+F all the time to find and replace words in a document. But with NLP we can find and replace specific entities, taking their semantic meaning into account rather than just their raw text.
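The core mechanic of entity-aware replacement can be sketched without any model at all: given tokens tagged with entity types, redaction is just a filter over the tags. The token list and tags below are hand-labeled purely for illustration (a real pipeline would get them from spaCy):

```python
# Toy illustration of entity-aware redaction. Each token carries an
# entity tag; tokens whose tag is in the private set are replaced.
tagged_tokens = [
    ("Amazon", "ORG"), ("was", ""), ("founded", ""), ("by", ""),
    ("Jeff", "PERSON"), ("Bezos", "PERSON"), ("in", ""), ("1994", "DATE"),
]

def redact(tokens, private_types=frozenset({"PERSON", "ORG"})):
    """Replace tokens whose entity type is private with a placeholder."""
    out = []
    for text, ent_type in tokens:
        out.append("[PRIVATE]" if ent_type in private_types else text)
    return " ".join(out)

print(redact(tagged_tokens))
# [PRIVATE] was founded by [PRIVATE] [PRIVATE] in 1994
```

Note that the date survives: the decision is made on the entity type, not on the surface text, which is exactly what Ctrl+F cannot do.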
Extracting information from text
The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us do a few things that are more advanced than simple out-of-the-box processing.
One of the algorithms it implements is semi-structured statement extraction. Essentially, this algorithm parses some of the information that spaCy's NLP model can extract and, based on that, pulls out more specific information about certain entities! In short, we can extract certain "facts" about an entity of our choosing.
Let's see what this looks like in code. For this one we'll take the full summary of the Washington, D.C. Wikipedia page:
# coding: utf-8
import spacy
import textacy.extract

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13-member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Extract semi-structured statements
statements = textacy.extract.semistructured_statements(document, "Washington")

### Print the results
print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1
Our NLP model found three useful facts about Washington, D.C. in this text:
(1) Washington, D.C. is the capital of the United States
(2) Washington's population, and the fact that it is a metropolitan area
(3) It is home to many national monuments and museums
The best part is that these are the most important pieces of information in this block of text!
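Conceptually, semi-structured statement extraction looks for patterns of the form "entity + cue verb + fact". As a rough illustration only (not textacy's actual algorithm, which works over spaCy's dependency parse rather than raw text), a naive surface-level version can be sketched with a regular expression:

```python
import re

# Crude sketch of "entity is/was <fact>." extraction using a regex.
# This only matches simple surface patterns; textacy's version is far
# more robust because it uses the parsed sentence structure.
def naive_statements(text, entity):
    pattern = re.compile(
        rf"{re.escape(entity)}\s+(is|was|has been)\s+([^.]+)\.",
        re.IGNORECASE,
    )
    return [(verb, fact.strip()) for verb, fact in pattern.findall(text)]

sample = ("Washington is the capital of the United States. "
          "Washington was named after George Washington.")
for verb, fact in naive_statements(sample, "Washington"):
    print(verb, "->", fact)
# is -> the capital of the United States
# was -> named after George Washington
```

This toy version breaks as soon as the fact spans a clause boundary or the entity is referred to by a pronoun, which is exactly why the real extraction is done over the parse tree.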
Going deeper with NLP
That concludes our simple introduction to NLP. We've learned a lot, but this was only a small taste...
NLP has many more great applications, such as language translation, chatbots, and more specific and complex analyses of text documents. Much of this work today is done with deep learning, in particular Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
If you'd like to play with more NLP yourself, the spaCy docs[2] and the textacy docs[3] are a great place to start! You'll find many examples of how to work with parsed text and how to extract very useful information from it. Everything is quick and simple, and you can get some real value out of it. When you're ready, it's time to do bigger and better things with deep learning!
Reference links:
[1] https://spacy.io/usage/linguistic-features#entity-types
[2] https://spacy.io/api/doc
[3] http://textacy.readthedocs.io/en/latest/
Original article:
https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1
-END-