Python Web Crawling and Information Extraction (China University MOOC)

阿新 • Published: 2019-01-01

Python Web Crawling and Information Extraction

Targeted crawler for Taobao product price comparison
Targeted crawler for stock data

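1. Targeted crawler for Taobao product price comparison

Functional description

Goal: retrieve the information on Taobao search result pages

Key point: how the Taobao search interface handles pagination

Technical route: requests-re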
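
The pagination can be sketched as follows. In this article's crawler each result page is selected by the s query parameter, which advances by 44 per page. The helper name build_page_urls and the keyword "bag" are hypothetical and only illustrate the URL pattern, assuming the search interface still behaves this way.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # offset step used by this article's crawler


def build_page_urls(keyword, depth):
    """Return the search URLs for the first `depth` result pages."""
    base = "https://s.taobao.com/search?"
    return [base + urlencode({"q": keyword, "s": ITEMS_PER_PAGE * page})
            for page in range(depth)]


urls = build_page_urls("bag", 3)
for url in urls:
    print(url)
```

Using urlencode rather than string concatenation also percent-encodes non-ASCII keywords.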

The code is as follows:

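#CrowTaobaoPrice.py
import requests
import re

def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""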

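def parsePage(ilt, html):
    """Extract [price, title] pairs from the page text and append them to ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
            # Split on the first ':' only (titles may contain ':') and
            # strip the surrounding quotes instead of calling eval().
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price , title])
    except IndexError:
        # The two lists can differ in length on an unexpected page layout.
        print("")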

def printGoodsList(ilt):
    """Print the collected goods as a tab-separated table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '書包'  # search keyword ("schoolbag")
    depth = 3  # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)  # each page holds 44 items
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)

main()

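Workflow:
Step 1: submit the product search request and fetch the pages in a loop
Step 2: for each page, extract the product titles and prices
Step 3: print the information to the screen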
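
Step 2 can be tried offline on a small string. The sample text below is a hypothetical fragment imitating the view_price and raw_title fields the crawler looks for. This sketch uses regex capture groups rather than the split-based parsing in parsePage, so it is a variation on the technique, not the code above.

```python
import re

# Hypothetical fragment imitating the data embedded in a search page.
sample = ('{"raw_title":"Canvas backpack","view_price":"129.00"},'
          '{"raw_title":"School bag: blue","view_price":"59.90"}')

# The parentheses capture only the value, so no quote stripping is needed.
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)

goods = list(zip(prices, titles))
print(goods)
```

zip stops at the shorter list, so a page with a missing title or price does not raise an error.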

2. Targeted crawler for stock data


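1. Functional description

Goal: retrieve the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: saved to a file
Technical route: requests-bs4-re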

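2. Principles for choosing a site to crawl

Selection criteria: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol
How to check: browser developer tools (F12), viewing the page source, and so on
Mindset: do not get stuck on a single site; try several information sources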
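
The Robots check can also be done in code. Below is a minimal sketch using the standard library's urllib.robotparser. The rules and the example.com URLs are hypothetical and parsed offline; for a real site one would load its robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
sample_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

allowed = parser.can_fetch("*", "https://example.com/stock/sh600000.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```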

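Program structure: getStockList collects the stock codes from the East Money list page, getStockInfo fetches each stock's page from Baidu Stocks and appends the parsed fields to a file, and main wires the two together.

The code is as follows: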

#CrawBaiduStocksA.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

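def getStockList(lst, stockURL):
    """Collect stock codes such as sh600000 from the links on the list page."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # No href attribute, or the link carries no stock code.
            continue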

def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page and append its fields to fpath, one dict per line."""
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})

            # '股票名稱' means "stock name"; the other keys come from the page itself.
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            # <dt> elements hold the field names, <dd> elements the values.
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except Exception:
            # Report the failure (for example a page without the expected div) and move on.
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
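
The stock-code pattern used in getStockList can be checked on its own. The href values below are hypothetical examples of the kind of links a stock list page might contain; the pattern is the one from the code above.

```python
import re

STOCK_CODE = re.compile(r"s[hz]\d{6}")  # 'sh' or 'sz' followed by six digits

# Hypothetical href values of the kind found on a stock list page.
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]

codes = []
for href in hrefs:
    match = STOCK_CODE.search(href)
    if match:  # skip links that carry no stock code
        codes.append(match.group())
print(codes)
```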

Code optimization

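1. Faster encoding handling: pass the known encoding in directly instead of letting r.apparent_encoding analyze the content of every page
2. Add a dynamic progress display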
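
The progress display relies on the carriage return character: printing '\r' moves the cursor back to the start of the line, so each new percentage overwrites the previous one. A minimal sketch follows; the helper name format_progress is hypothetical.

```python
import sys


def format_progress(done, total):
    """Build a progress line that overwrites the previous one via '\\r'."""
    return "\rProgress: {:.2f}%".format(done * 100 / total)


total = 4
lines = [format_progress(i, total) for i in range(1, total + 1)]
for line in lines:
    sys.stdout.write(line)
    sys.stdout.flush()  # show the update immediately, without a newline
print()
```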

The optimized code is as follows:

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page, append its fields to fpath, and show progress."""
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})

            # '股票名稱' means "stock name"; the other keys come from the page itself.
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except Exception:
            # Skip stocks whose page is missing or laid out differently.
            continue
        finally:
            # Runs on every iteration, including the continue paths, so
            # skipped stocks still advance the progress display.
            count = count + 1
            print("\rProgress: {:.2f}%".format(count*100/len(lst)),end="")

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
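
Because each line of the output file is the str() form of a dict, it can be read back with ast.literal_eval. This is a sketch under that assumption; the sample lines and the '今開' field are hypothetical stand-ins for real output.

```python
import ast

# Hypothetical lines in the format produced by f.write(str(infoDict) + '\n').
sample_lines = [
    "{'股票名稱': 'Example A', '今開': '10.50'}\n",
    "{'股票名稱': 'Example B', '今開': '7.32'}\n",
]

# literal_eval parses Python literals only, so it is safer than eval().
records = [ast.literal_eval(line) for line in sample_lines if line.strip()]
names = [rec['股票名稱'] for rec in records]
print(names)
```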
