Detecting Fake News With NLP

阿新 • Published: 2019-01-12

Which one is fake?

As an example of the problem, let’s try to spot the fake news among the four headlines below.

It is hard to detect at first sight, isn't it? Let’s see the answer.

Now let’s try to pick out the real news from the fake ones.

And here is the answer…

As we can see from the examples above, unless we are careful and actively looking for fake or real news, it is hard to tell them apart. Even when we are careful, looking for fakeness, and have only four options, it is challenging to spot the fake or the real news. If we don’t read them all carefully, we are often mistaken.

How to detect fake news?

As human beings, when we read a sentence or a paragraph, we can interpret the words in light of the whole document and understand the context. In this project, we teach a computer how to read and understand the differences between real news and fake news using Natural Language Processing (NLP). We will do this using a TF-IDF vectorizer. TF-IDF is used to determine the importance of a word in a given article relative to the entire corpus. We will discuss this in the last section.
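To make the idea of word importance concrete, here is a minimal sketch of the textbook TF-IDF formula in plain Python. The three toy documents are invented for illustration and are not from the project's data set. Library vectorizers such as scikit-learn's TfidfVectorizer use a smoothed and normalized variant by default, so their numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the senate passes the budget bill",
    "shocking secret the senate hides",
    "the budget bill vote is delayed",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (term count / doc length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]

# Print the terms of the second document, most important first.
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A word such as "the", which appears in every document, gets a weight of zero, while a word that is frequent in one article but rare in the rest of the corpus gets a high weight.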

Data Set

I collected more than 200,000 articles and filtered them by topic and date range. Eventually, I had 52,000 articles from 2016–2017 in Business, Politics, U.S. News, and The World. 12,000 of them were labeled as fake news and 40,000 were real news. I used the NYT API and The Guardian Post API to get real news, and Kaggle’s fake news data set for the fake news.

Machine Learning Algorithms

Since it is a classification problem, I started with Logistic Regression, Random Forest, and XGBoost. Logistic Regression is a simple algorithm, whereas Random Forest and XGBoost are more advanced. I expected the advanced models to perform better, but the results were surprising.

After vectorizing the documents with TF-IDF, Random Forest gave an F1 score of 82% and XGBoost gave 65%. The F1 score is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives. In our case, both are equally important, so I optimized my model for the F1 score. After that, I got 95% accuracy from Logistic Regression using the body of the article, and reduced the number of columns from 8 million to 5,700 by running a grid search over the vectorizer and the regression model.

The figure above shows the columns on the x-axis and the size of their coefficients before the grid search. The figure below shows the same after the grid search.
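A grid search over both the vectorizer and the regression model can be sketched with scikit-learn roughly as follows. This is an illustration under assumptions, not the project's actual code: the toy texts, the parameter grid, and the step names "tfidf" and "clf" are all hypothetical, and the real search space that led to 5,700 columns is not given in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; 0 = real, 1 = fake.
real = [
    "senate passes the annual budget bill",
    "central bank keeps interest rates unchanged",
    "court rules on trade policy dispute",
    "minister announces new transport funding",
    "city council approves housing plan",
    "company reports quarterly earnings growth",
]
fake = [
    "shocking secret the government hides from you",
    "miracle cure banned by doctors exposed",
    "leaked hoax proves shocking conspiracy",
    "you will not believe this secret miracle",
    "exposed the banned truth they hide",
    "shocking leaked conspiracy stuns everyone",
]
texts = real + fake
labels = [0] * len(real) + [1] * len(fake)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: vectorizer settings control the number of columns,
# C controls the regularization strength of the regression model.
param_grid = {
    "tfidf__max_features": [20, 50],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# scoring="f1" selects the combination with the best cross-validated F1 score.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

n_columns = len(search.best_estimator_.named_steps["tfidf"].vocabulary_)
print("best parameters:", search.best_params_)
print("best cross-validated F1:", round(search.best_score_, 3))
print("columns kept by the vectorizer:", n_columns)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.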

After getting a really high F1 score and accuracy, every data scientist should ask: “Am I overfitting? What is my bias-variance trade-off? Can I get the same score with less data?”

In order to answer those questions, I plotted learning curves using the headline, which show the bias-variance trade-off and whether I need more data or whether less than I have is enough.
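Learning curves of this kind can be computed with scikit-learn's learning_curve helper. The sketch below is a hedged illustration: the synthetic documents, word lists, and helper name make_doc are hypothetical stand-ins, and it prints the scores instead of plotting them.

```python
import random

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (hypothetical); 0 = real, 1 = fake.
rng = random.Random(0)
real_words = ["senate", "budget", "vote", "report", "court", "policy", "trade", "minister"]
fake_words = ["shocking", "secret", "hoax", "exposed", "miracle", "conspiracy", "banned", "leaked"]
shared_words = ["the", "news", "today", "says", "people", "government"]

def make_doc(topic_words):
    """Build one toy document from class-specific and shared words."""
    words = rng.choices(topic_words, k=4) + rng.choices(shared_words, k=4)
    rng.shuffle(words)
    return " ".join(words)

texts = [make_doc(real_words) for _ in range(30)] + [make_doc(fake_words) for _ in range(30)]
labels = np.array([0] * 30 + [1] * 30)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Train on 50%, 75%, and 100% of each training fold and score with F1.
train_sizes, train_scores, test_scores = learning_curve(
    model,
    texts,
    labels,
    train_sizes=[0.5, 0.75, 1.0],
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="f1",
    shuffle=True,
    random_state=0,
)

# Average over folds; the gap between the two curves is what to inspect.
for size, train, test in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={size:3d}  train F1={train:.3f}  test F1={test:.3f}  gap={train - test:.3f}")
```

If the test score is still rising at the largest training size, more data is likely to help. If both curves have flattened close together, less data than you have may already be enough.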

After generating the learning curves for the headline and the body, I realized that using the headline is not necessarily helpful for solving my problem.

Website

Here is what the site looks like…

After copying and pasting the article, we click the Check button and the result is…

In the graph below, the training and test curves are far apart, which indicates that the level of variance is high.

But using the body, we can see that there is low bias and low variance, and we draw the conclusion that more data helps to improve the metrics.

Future Work

Combine LDA with cross-validation across news media agencies to check whether they have the same perspective on a given topic. The model will detect the topic and the facts of a news article, do the same for trustworthy articles, and compare the given article’s topic with that of the trustworthy articles. In this case, each news agency will have a weight based on its trustworthiness, and we will set a threshold. If the weight is above the threshold, we will label the article as real news; if not, it will be labeled as fake news.

Code Available

The code is available at www.github.com/genyunus/Detecting_Fake_News