
Exploring the UN General Debates with Dynamic Topic Models

The General Debate Dataset

A corpus of 7,507 speeches given at the General Debate from 1970 through 2015 is hosted on Kaggle. The dataset was originally released last year (see here for the paper) by researchers in the UK and Ireland, who used it to study the positions of different countries on various policy dimensions.

Each speech is tagged with the year and session it was given and the ISO Alpha-3 code of the country that the speaker represented.

Sample rows from the dataset

Data Preprocessing

A key observation about this dataset is that each speech touches on a multitude of topics. If every speech discusses both poverty and terrorism, a topic model trained on whole speeches as bag-of-words documents has no way of learning that terms like “poverty” and “terrorism” should represent different topics.

To counter this problem, I tokenize each speech into paragraphs and treat each paragraph as a separate document for the analysis. A simple rule-based approach that looks for sentences separated by a newline character performs reasonably well at paragraph tokenization for this dataset. After this step, the number of documents jumps from 7,507 (full speeches) to 283,593 (paragraphs).
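
The split itself is only a few lines of pandas. In the sketch below, the CSV file name and the year/country/text column names are assumptions about the Kaggle export rather than the exact code used:

```python
# A minimal sketch of the newline-based paragraph split described above.
# The CSV file name and the "year"/"country"/"text" column names are
# assumptions about the Kaggle export, not the post's exact code.
import pandas as pd

def split_into_paragraphs(speech):
    """Treat runs of text separated by newlines as paragraphs."""
    return [p.strip() for p in speech.split("\n") if p.strip()]

speeches = pd.read_csv("un-general-debates.csv")

paragraph_rows = [
    {"year": row.year, "country": row.country, "text": paragraph}
    for row in speeches.itertuples(index=False)
    for paragraph in split_into_paragraphs(row.text)
]

# Sort by year up front: the DTM later expects documents grouped by time slice.
paragraphs_df = pd.DataFrame(paragraph_rows).sort_values("year").reset_index(drop=True)
```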

After expanding each speech into multiple documents, I word tokenize, normalize each term by lowercasing and lemmatizing, and trim low frequency terms from the vocabulary. The end result is a vocabulary of 7,054 terms and a bag of words representation for each document that can be used as input to the DTM.
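
Roughly, that pipeline looks like the sketch below. Using spaCy for lemmatization and a frequency cutoff of 20 are illustrative choices here, not the exact settings behind the 7,054-term vocabulary:

```python
# A sketch of the normalization pipeline: lemmatize, lowercase, drop stop words,
# trim rare terms, and build bag-of-words vectors. spaCy and the no_below=20
# cutoff are illustrative choices, not exact settings from the analysis.
import spacy
from gensim.corpora import Dictionary

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(paragraph):
    """Return lowercased lemmas, keeping alphabetic, non-stop-word tokens."""
    return [
        tok.lemma_.lower()
        for tok in nlp(paragraph)
        if tok.is_alpha and not tok.is_stop
    ]

tokenized_docs = [tokenize(p) for p in paragraphs_df["text"]]

# Trim low-frequency (and extremely common) terms from the vocabulary.
dictionary = Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation for each paragraph, used as input to the DTM.
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
```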

Word, sentence, and paragraph tokenizers for the General Debate corpus
Sample rows after paragraph tokenization
Sample rows after word tokenization of paragraphs

Inference

To determine the number of topics to use, I ran LDA on a few different time slices individually with different numbers of topics (10, 15, 20, 30) to get a feel for the problem. Through manual inspection, 15 seemed to produce the most interpretable topics, so that is what I settled on for the DTM. More experimentation and rigorous quantitative evaluation could certainly improve this.
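
For reference, this kind of per-slice comparison can be done with plain LDA in gensim; the year and training settings in the sketch below are illustrative, and the corpus, dictionary, and DataFrame come from the preprocessing sketches above:

```python
# Eyeball topic interpretability on a single time slice with vanilla LDA,
# reusing `corpus`, `dictionary`, and `paragraphs_df` from the sketches above.
# The chosen year and `passes` setting are illustrative.
from gensim.models import LdaModel

slice_bow = [
    bow for bow, year in zip(corpus, paragraphs_df["year"]) if year == 1990
]

for num_topics in (10, 15, 20, 30):
    lda = LdaModel(slice_bow, num_topics=num_topics, id2word=dictionary, passes=5)
    print(f"--- {num_topics} topics ---")
    for topic_id, terms in lda.show_topics(num_topics=num_topics, num_words=8):
        print(topic_id, terms)
```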

Using gensim's Python wrapper to the original DTM C++ code, inferring the parameters of a DTM is then straightforward, albeit slow. Inference took about 8 hours on an n1-standard-2 (2 vCPUs, 7.5 GB memory) instance on Google Cloud Platform. However, I ran it on a single core, so this time could probably be cut down if you can get the parallelized version of the original C++ code to compile.
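
For concreteness, the fitting step looks roughly like the sketch below, which uses gensim's DtmModel wrapper (available in gensim versions before 4.0) around a compiled binary of the original C++ code; the binary path is a placeholder:

```python
# Fit the DTM via gensim's wrapper around Blei's original C++ implementation.
# Requires gensim < 4.0 (the wrappers module was removed in 4.x) and a compiled
# `dtm` binary; the binary path below is a placeholder.
from gensim.models.wrappers import DtmModel

# One time slice per year: counts of documents in each consecutive year
# (paragraphs_df is already sorted by year, so corpus order matches).
time_slices = paragraphs_df.groupby("year").size().tolist()

dtm = DtmModel(
    "/path/to/dtm/main",   # compiled binary of the original DTM C++ code
    corpus=corpus,
    time_slices=time_slices,
    num_topics=15,
    id2word=dictionary,
)

# Top terms for one topic in the first and last time slices.
print(dtm.show_topic(topicid=0, time=0, topn=10))
print(dtm.show_topic(topicid=0, time=len(time_slices) - 1, topn=10))
```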

Results

The model discovered very interpretable topics, and I examine a few in depth here. Specifically, I show for a few topics the top terms at a sample of time slices as well as plots of probabilities of notable terms over time. A complete list of the topics discovered by the model and their top terms can be found in the appendix at the end of this article.

Human Rights

We the peoples of the United Nations determined … to reaffirm faith in fundamental human rights, in the dignity and worth of the human person, in the equal rights of men and women and of nations large and small.

It is no surprise, then, that human rights is a perennial topic at the General Debate and one that the model was able to discover. Despite the notion of gender equality appearing in the charter quoted above, the model shows that it still took quite some time for the terms “woman” and “gender” to really catch on. Also, note the rising use of “humankind” coupled with the decline in the use of “mankind”.