Parsing tables and their headers in HTML

阿新 • Published: 2021-08-22

The raw data looks like this:

The parsed data looks like this:

Supplementary code

import requests
import pandas as pd
from lxml import etree

# Extractor is the class listed further down; define or import it
# before running this script.

url = "http://www.stats.gov.cn/tjsj/zxfb/202108/t20210816_1820554.html"
response = requests.get(url)
response.encoding = 'utf-8'
html = etree.HTML(response.text)  # build an element tree that supports XPath

# Select every table on the page with the class "MsoNormalTable".
elements = html.xpath('//table[@class="MsoNormalTable"]')
for ele in elements:
    # The table header is the element two siblings before the table's parent.
    header = ele.getparent().getprevious().getprevious()
    print(header.xpath('string(.)'))
    # Serialize the table back to a string and hand it to Extractor.
    docs = etree.tostring(ele, encoding='utf-8').decode('utf-8')
    extractor = Extractor(docs)
    extractor.parse()
    data = extractor.return_list()
    # print(pd.DataFrame(data))
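
The header lookup above depends on the page layout: it assumes the title sits exactly two siblings before the table's parent element. A minimal sketch of that sibling walk on an inline snippet follows. The snippet, its caption text, and the helper name `header_text` are made up for illustration; only `getparent()`, `getprevious()` and `xpath()` come from lxml.

```python
from lxml import etree

# Hypothetical markup that mimics the assumed layout:
# title paragraph, spacer paragraph, then a div wrapping the table.
snippet = """
<div>
  <p class="title">Table 1: hypothetical caption</p>
  <p>spacer</p>
  <div><table class="MsoNormalTable"><tr><td>a</td></tr></table></div>
</div>
"""


def header_text(table, hops=2):
    """Return the text of the element `hops` siblings before the table's parent.

    Returns None when the chain of siblings is shorter than expected,
    instead of raising AttributeError on a None node.
    """
    node = table.getparent()
    for _ in range(hops):
        if node is None:
            return None
        node = node.getprevious()
    return None if node is None else node.xpath('string(.)').strip()


doc = etree.HTML(snippet)
table = doc.xpath('//table[@class="MsoNormalTable"]')[0]
title = header_text(table)
print(title)
```

Returning `None` rather than chaining `getprevious()` calls directly makes it easier to notice when a table on the page does not follow the assumed layout.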

# Extractor
# -*- coding: utf-8 -*-

import csv
import os

from bs4 import BeautifulSoup, Tag

class Extractor(object):
    def __init__(self, input, id_=None, **kwargs):
        # TODO: should divide this class into two subclasses
        # to deal with string and bs4.Tag separately

        # validate the input
        if not isinstance(input, str) and not isinstance(input, Tag):
            raise Exception('Unrecognized type. Valid input: str, bs4.element.Tag')

        soup = BeautifulSoup(input, 'html.parser').find() if isinstance(input, str) else input

        # locate the target table
        if soup.name == 'table':
            self._table = soup
        else:
            self._table = soup.find(id=id_)

        if 'transformer' in kwargs:
            self._transformer = kwargs['transformer']
        else:
            self._transformer = str

        self._output = []

    def parse(self):
        self._output = []
        row_ind = 0
        col_ind = 0
        for row in self._table.find_all('tr'):
            # record the smallest row_span, so that we know how many rows
            # we should skip
            smallest_row_span = 1

            for cell in row.children:
                if cell.name in ('td', 'th'):
                    # check multiple rows
                    # pdb.set_trace()
                    row_span = int(cell.get('rowspan')) if cell.get('rowspan') else 1

                    # try updating smallest_row_span
                    smallest_row_span = min(smallest_row_span, row_span)

                    # check multiple columns
                    col_span = int(cell.get('colspan')) if cell.get('colspan') else 1

                    # find the right index
                    while True:
                        if self._check_cell_validity(row_ind, col_ind):
                            break
                        col_ind += 1

                    # insert into self._output
                    try:
                        self._insert(row_ind, col_ind, row_span, col_span, self._transformer(cell.get_text()))
                    except UnicodeEncodeError:
                        raise Exception('Failed to encode text; consider passing a custom transformer via kwargs')

                    # update col_ind
                    col_ind += col_span

            # update row_ind
            row_ind += smallest_row_span
            col_ind = 0
        return self

    def return_list(self):
        return self._output

    def write_to_csv(self, path='.', filename='output.csv'):
        # newline='' keeps the csv module from emitting blank lines on Windows
        with open(os.path.join(path, filename), 'w', newline='', encoding='utf-8') as csv_file:
            table_writer = csv.writer(csv_file)
            for row in self._output:
                table_writer.writerow(row)
        return

    def _check_validity(self, i, j, height, width):
        """
        check if a rectangle (i, j, height, width) can be put into self._output
        """
        return all(self._check_cell_validity(ii, jj) for ii in range(i, i+height) for jj in range(j, j+width))

    def _check_cell_validity(self, i, j):
        """
        check if a cell (i, j) can be put into self._output
        """
        if i >= len(self._output):
            return True
        if j >= len(self._output[i]):
            return True
        if self._output[i][j] is None:
            return True
        return False

    def _insert(self, i, j, height, width, val):
        # pdb.set_trace()
        for ii in range(i, i+height):
            for jj in range(j, j+width):
                self._insert_cell(ii, jj, val)

    def _insert_cell(self, i, j, val):
        while i >= len(self._output):
            self._output.append([])
        while j >= len(self._output[i]):
            self._output[i].append(None)

        if self._output[i][j] is None:
            self._output[i][j] = val
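
The core idea of `_insert` and `_check_cell_validity` is that a cell with `rowspan` or `colspan` has its text copied into every grid position it covers, and later cells skip positions that are already filled. A compact sketch of that idea, independent of BeautifulSoup, is shown here. The function name `expand_spans` and the sample rows are hypothetical; cells are given as `(text, rowspan, colspan)` tuples rather than parsed from HTML.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a list-of-lists grid."""
    grid = []
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while r < len(grid) and c < len(grid[r]) and grid[r][c] is not None:
                c += 1
            for rr in range(r, r + rowspan):
                # Grow the grid so that position (rr, c + colspan - 1) exists.
                while rr >= len(grid):
                    grid.append([])
                while len(grid[rr]) < c + colspan:
                    grid[rr].append(None)
                # Copy the text into every empty position the cell covers.
                for cc in range(c, c + colspan):
                    if grid[rr][cc] is None:
                        grid[rr][cc] = text
            c += colspan
    return grid


# Hypothetical two-row header followed by one data row.
rows = [
    [("Region", 2, 1), ("Index", 1, 2)],
    [("MoM", 1, 1), ("YoY", 1, 1)],
    [("row1", 1, 1), ("x", 1, 1), ("y", 1, 1)],
]
grid = expand_spans(rows)
for line in grid:
    print(line)
```

Because the spanning header text is repeated in each covered position, every row of the result has the same number of columns, which is what makes the output convenient to load into `pd.DataFrame`.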

Changes in the sales prices of commercial residential housing in 70 large and medium-sized cities, July 2021

References

Summary of commonly used lxml module methods - 彭世瑜's blog - CSDN
lxml/Python : get previous-sibling - Stack Overflow
The python3 parsing library lxml - Py.qi - Cnblogs
yuanxu-li/html-table-extractor: extract data from html table
