1. 程式人生 > >Scraping Data into Google Spreadsheets

Scraping Data into Google Spreadsheets

The UK parliament has a bunch of different committees who are tasked with conducting inquiries and producing reports about a range of different topics. The individual inquiries normally have a fixed duration and they hear evidence from a range of individuals known as witnesses. The UK parliament website

maintains a list of open inquiries that is updated on an ad hoc basis.

parliament-uk-logo

Rather than manually checking the webpage and making a spreadsheet of these inquiries by hand, this blog shows you how you can set up a midnight job (using launchd) to run a Python script which scrapes information from the webpage and automatically updates a Google sheet instead.

Getting the Data

I’ve blogged about pulling data from web pages using Beautiful Soup and the Python requests library before, so I won’t repeat that here. If you do want to see what my code for scraping the open inquiries webpage looks like you can see it here.

Spreading the Data

Once you’ve got some data you need a spreadsheet to write it to. You can create a sheet directly from Python, but it’ also pretty quick to just do it yourself by hand. That’s what I did. I made a sheet with the name “Policy Spreadsheet

” and I made five tabs: All, Commons, Lords, Joint, Public.

headers

footer

Next is the most complicated bit… getting the credentials.

I did see some blogs that said you could just login using your normal Google username and password with the function gspread.login(username,password), however this seems to be deprecated – or at least it threw an error when I tried it. Which basically leaves OAuth2 using the Google API.

It’s free and anyone can sign up, it’s just perhaps not the most intuitive process.

Getting authorized:

  1. Create a new project.
  2. Enable the Google Drive API.
  3. Create credentials —> Web Server —> Application Data.
  4. Make a service account —> Project —> Editor.
  5. Download the JSON file containing the credentials and rename it client_secret.json

You can follow these steps in a video on this blog.

If you already have a project started and you just want to add some more credentials, you can simply go to the Credentials tab and add a service account using Create Credentials.

To connect you need the following lines of code:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
client = gspread.authorize(creds)

Once you’ve got the client_secret.json file you should look inside and copy the client_email. Go to the google sheet you want to write to and make sure it’s shared with that email address.

I’m using the gspread library, which is a nice wrapper around the Google sheets API. It is pip installable:

pip install gspread

To access the sheet from Python using gspread it’s really straight forward:

sheet = client.open("Policy Spreadsheet")

You can also access all of the tabs and their worksheets using:

sheet1 = sheet.worksheet("All")
sheet2 = sheet.worksheet("Commons")
sheet3 = sheet.worksheet("Lords")
sheet4 = sheet.worksheet("Joint")
sheet5 = sheet.worksheet("Public")

I read another blog which said it was possible to navigate sheets using sheet.sheet1, sheet.sheet2 etc but that gave me an error, and looking at the gspread docs I’m pretty sure that the only one you can use is sheet.sheet1, which is a short cut to the first sheet. The higher ones won’t work.

If you aren’t sure what the names of all your worksheets are you can use:

print sheet.worksheets()

Then you want to write in some content. I’m going to add in all the data that I scraped from the parliament webpage. I’ll read the info out of each dict in the list and then write it into the sheet.

import datetime

# find today's date to tag input data:
now = datetime.datetime.now()
indate = now.strftime("%Y-%m-%d")

for inquiry in inquiries:

        # extract info from dict:
        itype = inquiry['type']
        name = inquiry['inquiry name']
        committee = inquiry['committee']
        deadline = inquiry['deadline']
        status = inquiry['status']
        link = inquiry['link']

        # compose the contents of the row you want to add:
        row = [name,itype,committee,"",deadline,status,link,indate,indate]

        # specify the index of the row (2 here because of header row):
        index = 2

        # add into "All" sheet first:
        sheet1.add_rows(index)
        sheet1.insert_row(row,index)

        # then add into appropriate tab:
        if itype=='Commons':
            sheet2.add_rows(index)
            sheet2.insert_row(row,index)

        elif itype=='Lords':
            sheet3.add_rows(index)
            sheet3.insert_row(row,index)

        elif itype=='Joint':
            sheet4.add_rows(index)
            sheet4.insert_row(row,index)

        elif itype=='Public':
            sheet5.add_rows(index)
            sheet5.insert_row(row,index)

I found that using sheet.insert_row() without sheet.add_rows() first just overwrote the previous contents of the row rather than adding in a new one.

I’ve used index=2 because I have columns headings in the first row.

Updating the Data

I’m planning to run this code on a regular cadence to keep updating the sheet with information about new inquiries. One thing I want to keep track of is the deadline for evidence, which sometimes changes. I don’t want to have to re-write the whole sheet each time (you’ll quickly work out that it’s really slow to write things to Google sheets), so I just want to search and replace deadline info.

To do this I’m going to match the item using the name of the inquiry, which should be unique. I can find the index of the row using sheet.find(), which returns a gspread.Cell object with the properties row, col and val.

# find index of row:
index = sheet.find(name).row
# replace row contents:
sheet.insert_row(row,index)

Ideally I’d also like to then sort the spreadsheet by the deadline column, i.e. put the most pressing items at the top, but there doesn’t seem to be a functionality to do this with gspread