Web Scraping HTML Tables with Python

阿新 • • 發佈：2018-12-19

Web Scraping HTML Tables with Python

https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059

Pokemon Database Website

Starting off, we will try scraping the online Pokemon Database (http://pokemondb.net/pokedex/all).

Inspect HTML

Before moving forward, we need to understand the structure of the website we wish to scrape. This can be done by clicking right-clicking the element we wish to scrape and then hitting “Inspect”

. For our purpose, we will inspect the elements of the table, as illustrated below:

Inspecting cell of HTML Table

Based on the HTML codes, the data are stored in after <tr>..</tr>. This is the row information. Each row has a corresponding <td>..</td> or cell data information.

Import Libraries

We will need requests for getting the HTML contents of the website and lxml.html for parsing the relevant fields. Finally, we will store the data on a Pandas Dataframe.

import requests
import lxml.html as lh
import pandas as pd

Scrape Table Cells

The code below allows us to get the Pokemon stats data of the HTML table.

url='http://pokemondb.net/pokedex/all'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

For sanity check, ensure that all the rows have the same width. If not, we probably got something more than just the table.

#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

OUTPUT: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

Looks like all our rows have exactly 10 columns. This means all the data collected on tr_elements are from the table.

Parse Table Header

Next, let’s parse the first row as our header.

tr_elements = doc.xpath('//tr')

#Create empty list
col=[]
i=0

#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print '%d:"%s"'%(i,name)
    col.append((name,[]))

OUTPUT:
1:”#”
2:”Name”
3:”Type”
4:”Total”
5:”HP”
6:”Attack”
7:”Defense”
8:”Sp. Atk”
9:”Sp. Def”
10:”Speed”

Creating Pandas DataFrame

Each header is appended to a tuple along with an empty list.

#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

Just to be sure, let’s check the length of each column. Ideally, they should all be the same.

[len(C) for (title,C) in col]

OUTPUT: [800, 800, 800, 800, 800, 800, 800, 800, 800, 800]

Perfect! This shows that each of our 10 columns has exactly 800 values.

Now we are ready to create the DataFrame:

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

Looking at the top 5 cells on the DataFrame:

df.head()

There you have it! Now you have a Pandas DataFrame with all the information needed!

Additional Information

This tutorial is a subset of a 3 part series:

The series covers:

Scraping a Pokemon Website
Data Analysis
Building a GUI Pokedex

Web Scraping HTML Tables with Python

Web Scraping HTML Tables with Python

Inspect HTML

Import Libraries

Scrape Table Cells

Parse Table Header

Creating Pandas DataFrame

Additional Information

Web Scraping HTML Tables with Python

Web Scraping and Crawling with Scrapy and MongoDB

Web Scraping YouTube Videos in Python

"Web Scraping with Python"筆記（一）

An Intro to Web Scraping With lxml and Python

Dataset creation and cleaning: Web Scraping using Python

1.0 An Introduction to Web Scraping using Python

Show HN: VBUILD, to compile “.vue” files to HTML, with python (NO Node.js)

Modern Web Automation With Python and Selenium

Practical Introduction to Web Scraping in Python

Web Scraping with Scrapy and MongoDB

Web scraping with Node.JS and Cheerio

python----使用re正則表示式刷選資料，去重，列表，取特定行資料（適用於web的html回包資料提取）

Website Scraping with Python 閱讀筆記

Web接口開發-基於Python

web學習---html，js，php，mysql一個動態網頁獲取流程

Selenium with Python 001 - 安裝篇

Selenium with Python 003 - 頁面元素定位

Selenium with Python 007 - Cookie處理

Selenium with Python 008 - WebDriver 元素等待

Web Scraping HTML Tables with Python

Web Scraping HTML Tables with Python

Inspect HTML

Import Libraries

Scrape Table Cells

Parse Table Header

Creating Pandas DataFrame

Additional Information

相關推薦