Detecting Fake News With NLP

Which one is fake?

As an example of the problem, let’s try to pick out the fake news story from the four headlines below.

It is hard to detect at first sight, isn't it? Let’s see the answer.

The site such and such

Now let’s try to pick out the real news story from the fake ones.

And here is the answer…

As the examples above show, fake and real news are hard to tell apart unless we are careful and actively looking for them. Even when we are looking for signs of fakeness and have only four options, detecting the fake or the real story is challenging. If we don’t read them all carefully, we are often mistaken.

How to detect fake news?

As human beings, when we read a sentence or a paragraph, we can interpret the words in light of the whole document and understand the context. In this project, we teach a computer to read and understand the differences between real news and fake news using Natural Language Processing (NLP). We will do this with a TF-IDF vectorizer. TF-IDF measures how important a word is to a given article relative to the entire corpus. We will discuss this in the last section.
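To make the idea of word importance concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes whitespace tokenization and the textbook weighting tf × ln(N / df); the function name `tf_idf` and the sample headlines are made up for illustration, and library vectorizers typically use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical headlines, not taken from the project's data set.
docs = [
    "the senate passed the budget bill",
    "the shocking secret the senate hides",
    "the markets rallied after the budget vote",
]
weights = tf_idf(docs)

# Print the first document's terms from most to least important.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A word such as "the" appears in every document, so its idf is zero and it carries no weight, while a word that appears in only one document is weighted most heavily.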

Data Set

I collected more than 200,000 articles and filtered them by topic and date range. Eventually, I had 52,000 articles from 2016–2017 in the Business, Politics, U.S. News, and The World categories. 12,000 of them were labeled as fake news and 40,000 as real news. I used the NYT API and The Guardian Post API to get real news, and I used Kaggle’s fake news data set for the fake news.

Fake News Word Cloud

Machine Learning Algorithms

Since this is a classification problem, I started with Logistic Regression, Random Forest, and XGBoost. Logistic Regression is a simple algorithm, whereas Random Forest and XGBoost are more advanced. I expected the advanced models to perform better, but the results were surprising.

After vectorizing the documents with TF-IDF, Random Forest gave an F1 score of 82% and XGBoost gave 65%. The F1 score is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives. In our case, both are equally important, so I optimized my model for the F1 score. After that, I got 95% accuracy from Logistic Regression using the body of the article, and reduced the number of columns from 8 million to 5,700 by running a grid search over the vectorizer and the regression model.
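As a reminder of how the F1 score ties false positives and false negatives together, here is a small sketch that computes precision, recall, and F1 from confusion-matrix counts. The function name and the counts are hypothetical; they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: a low value on either side drags the score down.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 fake articles caught, 20 real ones flagged by mistake,
# 40 fake ones missed.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a model cannot score well by trading many false negatives for few false positives or the other way around, which is why it suits a problem where both mistakes matter equally.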

The figure above shows the number of columns on the x-axis and the size of the coefficients before the grid search. The figure below shows the same after the grid search.

After getting a really high F1 score and accuracy, every data scientist should ask: “Am I overfitting? What is my bias-variance trade-off? Can I get the same score with less data?”

To answer those questions, I plotted learning curves using the headline. They show the bias-variance trade-off and whether I need more data or whether less than I have would be enough.

Learning Curve From Headline

After generating the learning curves for both the headline and the body, I realized that using the headline is not necessarily helpful for solving my problem.
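The mechanics of a learning curve can be sketched in a few lines: train on growing portions of the training set and record the score on both the training portion and a held-out test set. Everything below is an assumption for illustration. The word-vote classifier is a toy stand-in for the Logistic Regression model, and the headlines are invented.

```python
from collections import defaultdict


def learning_curve(train, test, sizes, fit, score):
    """Train on the first n samples for each n; return (n, train_score, test_score) rows."""
    rows = []
    for n in sizes:
        subset = train[:n]
        model = fit(subset)
        rows.append((n, score(model, subset), score(model, test)))
    return rows


def fit_word_votes(samples):
    """Toy model: each word votes +1 per fake article and -1 per real article it appears in."""
    votes = defaultdict(int)
    for text, label in samples:
        for word in text.split():
            votes[word] += 1 if label == "fake" else -1
    return votes


def predict(votes, text):
    total = sum(votes.get(word, 0) for word in text.split())
    return "fake" if total > 0 else "real"


def accuracy(votes, samples):
    return sum(predict(votes, text) == label for text, label in samples) / len(samples)


# Hypothetical headlines, not from the project's data set.
train = [
    ("shocking secret revealed", "fake"),
    ("senate passes budget bill", "real"),
    ("you will not believe this shocking cure", "fake"),
    ("markets rally after budget vote", "real"),
    ("secret cure doctors hate", "fake"),
    ("court rules on election bill", "real"),
]
test = [
    ("shocking cure revealed", "fake"),
    ("senate vote on budget", "real"),
]

rows = learning_curve(train, test, sizes=[2, 4, 6], fit=fit_word_votes, score=accuracy)
for n, train_score, test_score in rows:
    print(f"n={n} train={train_score:.2f} test={test_score:.2f}")
```

Plotting the two score columns against n gives the learning curve. A persistent gap between the training and test scores points to variance, while two curves that converge at a low score point to bias.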

Website

Here is what the site looks like…

Home Page

After copying and pasting the article, we click the check button and the result is…

An Example Show Case

In the graph below, we can see that the train and test curves are far apart, which shows us that the level of variance is high.

But using the body, we can see that there is low bias and low variance, and we draw the conclusion that more data helps to improve the metrics.

Future Work

Combining LDA with cross-validation across news media agencies to check whether they have the same perspective on a given topic. The model will detect the topic and the facts of a news article, then do the same for trustworthy articles and compare the given article’s topic with theirs. In this case, each news agency will have a weight based on its trustworthiness, and then we will set a threshold. If the weight is above the threshold, we will label the article as real news; if not, it will be labeled as fake news.
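One possible reading of the weighting and threshold step is sketched below. It assumes the topic and fact comparison has already produced a yes/no agreement per agency; the agency names, trust weights, threshold value, and the function `label_article` are all hypothetical, since this part of the project has not been built.

```python
def label_article(agreements, trust_weights, threshold=0.5):
    """Label an article from trust-weighted agreement across agencies.

    agreements:    {agency: True if its coverage matches the article, else False}
    trust_weights: {agency: trust weight, higher means more trustworthy}
    """
    total = sum(trust_weights[agency] for agency in agreements)
    if total == 0:
        return "fake", 0.0
    agreeing = sum(trust_weights[agency] for agency, agrees in agreements.items() if agrees)
    score = agreeing / total
    return ("real" if score > threshold else "fake"), score


# Hypothetical agencies and weights.
trust_weights = {"agency_a": 0.9, "agency_b": 0.6, "agency_c": 0.3}
agreements = {"agency_a": True, "agency_b": True, "agency_c": False}

label, score = label_article(agreements, trust_weights)
print(label, round(score, 3))
```

Normalizing by the total weight keeps the score between 0 and 1, so the same threshold can be used no matter how many agencies covered the topic.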

Code Available

The code is available at www.github.com/genyunus/Detecting_Fake_News
