Build a predictive model on Watson Studio using CSV data set from Tweets

阿新 • Published: 2019-01-16

We live in an era in which the focus has shifted towards data. The amount of data generated and consumed grows every day, by somewhere around 5 exabytes. Everything we do generates data, be it turning a light on and off or commuting from home to work. This data can be turned into information that yields insights, predictions, and patterns. Data mining, or data science, is the term that has set the industry abuzz: it is the process of discovering patterns, insights, and associations in data.

In this how-to guide we’ll learn how to take data and build a predictive model on it to get insights. The intended audience includes developers, general users with basic knowledge of programming, and organizations that want to enhance customer experience. The guide shows how to create a predictive model on Watson Studio, a cloud-based environment for data scientists. With it, you can predict and optimize your Twitter interactions so that your tweets get the most traffic.

Learning objectives

After completing this how-to, the reader will be able to:

Use Watson Studio to build a predictive model from any CSV data.
Extract user information from Twitter.
Use Twitter data to predict and optimize their Twitter interactions.

Prerequisites

Estimated time

This tutorial should take around 45 minutes to complete.

Steps

Use sample data or get your own?

The first thing we’ll need is a set of tweets to analyze. The next two steps show how to collect them yourself, but if you’re not interested in doing that, we provide a sample data set:

: Tweets from Ufone, a phone operator, cleaned up and ready for Watson Studio. (Use this one!)

: Same as above, but raw, taken directly from tweepy. (Only added for completeness.)

Step 1. Getting Twitter API access (optional)

If you’re using the sample data, then skip to Step 3.

Before we use tweepy to get tweets we need to generate OAuth consumer and access token keys and secrets. There are various guides that show how to do this, but the Twitter UI changes over time, so it’s best to follow along at https://developer.twitter.com. You’ll end up with these keys and secrets:

Consumer API Key
Consumer API Secret
Access Token
Access Token Secret

These can be revoked and regenerated, but as with any other key, you should keep these secret.
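One common way to keep the keys out of the script itself is to read them from environment variables. The sketch below is only an illustration: the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are hypothetical, not something Twitter or tweepy requires.

```python
import os

# Hypothetical environment variable names; pick any names you like.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be assigned to the credential variables at the top of the script in Step 2, so the secrets never need to be saved in the file.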

Step 2: Saving Tweets to CSV format (optional)

Again, if you’re using the sample data, then skip to Step 3.

Now that we’ve got our Twitter API keys and secrets, we can use tweepy to save tweets into a CSV file. Free developer accounts on Twitter limit the number of tweets that can be retrieved, but that’s enough for our purposes.

If you don’t have Python, download and install the latest version, and then install tweepy. If you have pip installed, this can be done with pip install tweepy.

Copy the code below into a new file and save it as tweets.py. There are a few lines to update at the top: add values to the variables for the keys, the secrets, and the Twitter handle you want to analyze.

import csv
import tweepy

# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""
screen_name = ""


def get_all_tweets():
    # initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    alltweets = []

    # request the most recent 200 tweets, the max allowed per call
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # keep paging backwards until a request returns no tweets
    # (the 3200 tweet limit is hit, or the timeline is exhausted)
    while len(new_tweets) > 0:
        alltweets.extend(new_tweets)
        print("...%s tweets downloaded so far" % (len(alltweets)))

        # the next page holds everything older than the oldest tweet so far
        oldest = alltweets[-1].id - 1
        print("getting tweets before id: %s" % (oldest))
        new_tweets = api.user_timeline(screen_name=screen_name,
                                       count=200,
                                       max_id=oldest)

    return alltweets


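def write_tweets_to_csv(tweets):
    # transform the tweepy tweets into a list of rows; the text stays a
    # str so it is not written out as a b'...' bytes literal
    outtweets = [[tweet.id_str, tweet.created_at,
                  tweet.text, tweet.retweet_count,
                  tweet.favorite_count] for tweet in tweets]

    # write the csv; newline='' stops the csv module from emitting blank
    # rows on Windows, and utf-8 keeps non-ASCII tweet text intact
    with open('%s_tweets.csv' % screen_name, 'w',
              newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text", "Retweets", "Favorites"])
        writer.writerows(outtweets)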


if __name__ == '__main__':
    # download the timeline of the account set in screen_name above
    tweets = get_all_tweets()
    write_tweets_to_csv(tweets)

Run the script with python tweets.py in a terminal. A CSV file will be written, containing the tweets and some information about each one, for example:

You can remove the id and created_at columns, and remove empty rows to clean the data a bit.
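The cleanup can also be scripted. The sketch below assumes the column names written by the script above. It also makes one assumption the tutorial does not state: since Step 6 predicts a column named hour, it derives hour from created_at before dropping that column. The function names clean_rows and clean_csv are hypothetical.

```python
import csv
from datetime import datetime


def clean_rows(rows):
    """Yield cleaned dicts: drop empty rows, drop id/created_at, add hour."""
    for row in rows:
        text = (row.get("text") or "").strip()
        created = (row.get("created_at") or "").strip()
        if not text or not created:
            continue  # skip empty or incomplete rows
        # Assumes created_at looks like "2019-01-16 09:30:00", optionally
        # followed by a UTC offset such as "+00:00".
        hour = datetime.strptime(created[:19], "%Y-%m-%d %H:%M:%S").hour
        yield {
            "hour": hour,
            "text": text,
            "Retweets": row["Retweets"],
            "Favorites": row["Favorites"],
        }


def clean_csv(in_path, out_path):
    """Read the raw CSV from the script above and write a cleaned copy."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=["hour", "text", "Retweets", "Favorites"])
        writer.writeheader()
        writer.writerows(clean_rows(csv.DictReader(src)))
```

Calling clean_csv with the generated file name and a new output name would produce a file with only the hour, text, Retweets, and Favorites columns.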

Step 3: Log into Watson Studio

IBM Watson Studio is an easy-to-use, collaborative, cloud-based environment for data scientists, where they can use tools like Scala, R, Jupyter Notebooks, etc.

Log into https://dataplatform.cloud.ibm.com/ and choose to create a New Project. The Complete option will work for this tutorial.

In the new project wizard, enter a Name and Description. You will also be required to create a new Object Storage service or choose an existing one during project creation. Once the project is created, you’ll see a project overview, for example:

From the overview we can add an asset by clicking Add to project. In this case we’ll click Model to add a new model.

Step 4: Create a new model

Give your model a Name and Description. Set the Model type option to Model builder and choose Manual for this exercise.

Before proceeding we need to associate two services: an Apache Spark service and a Machine Learning service. For each, you can use the UI to create a new one or select an existing one. For an example of how to do that with Apache Spark, refer to this IBM Code Tutorial. The procedure for Machine Learning is the same.

Step 5: Add data to the model

We’re now going to add the CSV file to the model. Click Add Data Assets and browse to either the generated CSV file or the saved sample CSV file. The data should appear in the dashboard, for example:

Click on the Next button to continue. Loading the data may take a few minutes.

Step 6: Select a training technique

For this example we’re trying to predict the best time to send a tweet, so set Column value to predict to hour. Leave Feature columns unchanged, set to All. The important choice here is the technique: we’ll use Regression. We’ll also leave the Validation Split unchanged.

Note that because the column to predict is hour, which has around 20 distinct values, Watson Studio will suggest Multiclass classification. In this case, however, the best technique for our data is Regression.
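One possible intuition for preferring Regression here (our reading, not a statement from Watson Studio): hour is an ordered number, so predicting 14 when the true hour is 15 is a near miss. A regression-style error reflects that distance, whereas a classification-style score only records hit or miss. The toy numbers below are made up for illustration and are not taken from the tutorial’s data set.

```python
def mean_absolute_error(actual, predicted):
    """Average distance, in hours, between predictions and true values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Fraction of predictions that match the true hour exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up example values, for illustration only.
actual_hours = [9, 14, 20, 11]
near_misses = [10, 15, 19, 12]   # always off by one hour
far_misses = [21, 2, 8, 23]      # always off by twelve hours

print(mean_absolute_error(actual_hours, near_misses))  # 1.0
print(mean_absolute_error(actual_hours, far_misses))   # 12.0
print(accuracy(actual_hours, near_misses))             # 0.0
print(accuracy(actual_hours, far_misses))              # 0.0
```

Both sets of predictions score the same under exact-match accuracy, but the absolute error separates the near misses from the far ones.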

We also need to add estimators. To do that, click Add Estimators, select all available choices, then click Add.

Once we have our technique and estimators selected we can click Next. This starts training and testing on the data, and will take a few minutes to fully complete.

Step 7: Wrapping up

The results show how accurate each estimator is, with the best-performing estimator at the top; here it is Isotonic Regression. Click on the first one and select the Save option, for example:

Once saved, you will be redirected to an overview of the model, for example:

From here, we can create a web deployment so our model is accessible over a REST call.

Congratulations! Your model is saved, deployed, and you can start testing it out with the generated cURL, Java, JavaScript and Python snippets.
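The generated snippets show the exact URL, headers, and request body for your deployment, so copy those rather than relying on this sketch. As a rough illustration only, the helper below assumes the scoring request takes a JSON body that pairs a list of column names with rows of values, and assumes the feature columns are text, Retweets, and Favorites. The function name build_scoring_payload, the payload layout, and the example values are all hypothetical.

```python
import json


def build_scoring_payload(fields, rows):
    """Build a JSON body pairing column names with one or more value rows.

    The {"fields": ..., "values": ...} layout is an assumption; check it
    against the snippets generated for your own deployment.
    """
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row needs one value per field")
    return json.dumps({"fields": list(fields),
                       "values": [list(r) for r in rows]})


# Hypothetical feature columns and values, for illustration only.
body = build_scoring_payload(
    ["text", "Retweets", "Favorites"],
    [["Our new data bundle is live!", 12, 40]],
)
print(body)
```

Validating the row length before sending catches a mismatch between the columns the model was trained on and the values being scored.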

Summary

In this tutorial we learned how to extract user data from Twitter and build a predictive model on it to optimize future tweeting and grow the user’s audience. The same approach to building a model on Watson Studio can be applied to any other CSV file, and the resulting model can be used from a web application. We also learned how to deploy the model as a web deployment so that it can be called over REST.
