
Data Collection and Fusion Technology: Experiment 4

Gitee link: 歐翔實驗四 (Experiment 4)

1. Assignment ①:

  • Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl book data from the Dangdang site using the Scrapy + XPath + MySQL storage route.

  • Candidate site: http://search.dangdang.com/?key=python&act=input

  • Keyword: students may choose freely

  • Output:

    The MySQL output format is shown below

2. Approach

2.1 Page analysis and item location

Each book entry sits in an li element; the XPath locator is "//ul[@class='bigimg']/li".
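Before writing the spider, the locator can be sanity-checked interactively in the Scrapy shell (assuming Scrapy is installed; the >>> lines run inside the shell):

scrapy shell "http://search.dangdang.com/?key=python&act=input"
>>> books = response.xpath("//ul[@class='bigimg']/li")
>>> len(books)  # should equal the number of books listed on the page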

2.2 Information extraction

The required fields are then extracted from each li:

books = response.xpath("//ul[@class='bigimg']/li")  # one li per book
for book in books:
    title = book.xpath("./a/@title").extract_first()
    price = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
    author = book.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
    date = book.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
    publisher = book.xpath("./p[@class='search_book_author']/span/a[@name='P_cbs']/text()").extract_first()
    detail = book.xpath("./p[@class='detail']/text()").extract_first()
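items.py is not shown in the write-up; here is a minimal sketch whose field names match what the spider produces and DdspiderPipeline reads (the class name DdspiderItem is an assumption):

import scrapy


class DdspiderItem(scrapy.Item):
    # keys match those read in DdspiderPipeline.process_item
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    date = scrapy.Field()
    price = scrapy.Field()
    detail = scrapy.Field()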
2.3 Writing to the database

pipelines.py

import pymysql


class DdspiderPipeline:

    def open_spider(self, spider):
        # connect to MySQL once at startup and create the database/table if needed
        self.count = 1
        self.con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8")
        self.cursor = self.con.cursor()
        self.cursor.execute("CREATE DATABASE IF NOT EXISTS books")
        self.cursor.execute("USE books")
        self.cursor.execute("CREATE TABLE IF NOT EXISTS books("
                            "id int primary key,"
                            "bTitle varchar(512),"
                            "bAuthor varchar(256),"
                            "bPublisher varchar(256),"
                            "bDate varchar(32),"
                            "bPrice varchar(16),"
                            "bDetail text)"
                            "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        print("opened")
        self.opened = True

    def process_item(self, item, spider):
        try:
            if self.opened:
                # self.count doubles as the integer primary key
                self.cursor.execute("insert into books (id, bTitle, bAuthor, bPublisher, bDate, bPrice, bDetail) values "
                                    "(%s,%s,%s,%s,%s,%s,%s)", (self.count, item['title'], item['author'], item['publisher'],
                                                               item['date'], item['price'], item['detail']))
                self.count += 1
        except Exception as err:
            print(err)
        return item

    def close_spider(self, spider):
        # commit once at the end instead of after every insert
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")
        # self.count started at 1, so subtract 1 for the number of books stored
        print("Total books crawled:", self.count - 1)
2.4 Modify settings.py
BOT_NAME = 'ddSpider'

SPIDER_MODULES = ['ddSpider.spiders']
NEWSPIDER_MODULE = 'ddSpider.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'ddSpider.pipelines.DdspiderPipeline': 300,
}
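With the pipeline registered, the spider is started from the project root. The spider name 'dd' below is hypothetical; substitute the name declared in the spider class:

scrapy crawl dd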
2.5 Results

3. Reflections

Reproducing this assignment was not difficult. Through this experiment I became proficient with the serialized output of Item and Pipeline data in Scrapy. It was also my first time using MySQL, and I gained an initial understanding of how to interact with a MySQL database from Python via pymysql. My grasp of XPath also deepened considerably.

1. Assignment ②:

  • Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage route.

  • Candidate site: China Merchants Bank: http://fx.cmbchina.com/hq/

  • Output: MySQL storage and output format:

    Id Currency TSP CSP TBP CBP Time
    1 港幣 (HKD) 86.60 86.60 86.26 85.65 15:36:30
    2 ......

2. Approach

2.1 Page analysis and data location

The page is fairly simple: the required data sits in tr elements, but the first tr is the table header and must be skipped. The XPath locator is "//table[@class='data']//tr".

2.2 Information extraction

The exchange-rate fields are extracted from each tr, skipping the header row and filling one item per row:

forex_data = response.xpath("//table[@class='data']//tr")
for count, f in enumerate(forex_data[1:], start=1):  # [1:] skips the header row
    item = forexSpiderItem()
    item["id"] = str(count)  # row number doubles as the primary key
    item["currency"] = f.xpath("./td[1]/text()").extract_first().strip()
    item["TSP"] = f.xpath("./td[4]/text()").extract_first().strip()
    item["CSP"] = f.xpath("./td[5]/text()").extract_first().strip()
    item["TBP"] = f.xpath("./td[6]/text()").extract_first().strip()
    item["CBP"] = f.xpath("./td[7]/text()").extract_first().strip()
    item["time"] = f.xpath("./td[8]/text()").extract_first().strip()
    yield item
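As in Assignment ①, items.py is not shown; a minimal sketch with exactly the fields the pipeline reads:

import scrapy


class forexSpiderItem(scrapy.Item):
    id = scrapy.Field()
    currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    time = scrapy.Field()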
2.3 Writing to the database

pipelines.py

import pymysql


class ForexspiderPipeline:
    def open_spider(self, spider):
        try:
            self.con = pymysql.connect(host='localhost', port=3306, user='root', password='123456', charset='utf8')
            self.cursor = self.con.cursor()
            self.cursor.execute("CREATE DATABASE IF NOT EXISTS forex")
            self.cursor.execute("USE forex")
            self.cursor.execute("CREATE TABLE IF NOT EXISTS forexs(Id varchar(8) primary key, Currency varchar(32), "
                                "TSP varchar(8), CSP varchar(8), TBP varchar(8), CBP varchar(8), Time varchar(16))"
                                "ENGINE=InnoDB DEFAULT CHARSET=utf8")
            self.opened = True
            print("opened")
        except Exception as e:
            print(e)
            self.opened = False

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute("INSERT INTO forexs VALUES (%s,%s,%s,%s,%s,%s,%s)",
                                    (item["id"], item["currency"], item["TSP"], item["CSP"],
                                     item["TBP"], item["CBP"], item["time"]))
                # debug output: echo each stored row
                print("inserted:", item["id"], item["currency"], item["TSP"], item["CSP"],
                      item["TBP"], item["CBP"], item["time"])
        except Exception as e:
            print(e)
        return item

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
            print("closed")
2.4 Modify settings.py (same changes as in Assignment ①)
2.5 Results

Because the Id column is declared as varchar, the displayed table is not sorted numerically by Id (varchar sorts lexicographically, so '10' comes before '2').
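A query-side workaround that keeps the schema unchanged is to cast Id to a number when ordering. A sketch, reusing the connection parameters from the pipeline:

import pymysql

con = pymysql.connect(host='localhost', port=3306, user='root', password='123456',
                      database='forex', charset='utf8')
cursor = con.cursor()
# CAST restores numeric ordering even though Id is stored as varchar
cursor.execute("SELECT * FROM forexs ORDER BY CAST(Id AS UNSIGNED)")
for row in cursor.fetchall():
    print(row)
con.close()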

3. Reflections

Largely the same as Assignment ①; it further consolidated my understanding of using pymysql with a MySQL database.

1. Assignment ③:

  • Requirement: become proficient with Selenium for locating HTML elements, crawling Ajax-loaded pages, and waiting for HTML elements; crawl stock data for the three boards 滬深A股 (SSE & SZSE A shares), 上證A股 (SSE A shares) and 深證A股 (SZSE A shares) using the Selenium framework + MySQL storage route.

  • Candidate site: East Money: http://quote.eastmoney.com/center/gridlist.html#hs_a_board

  • Output: MySQL storage and output format as below; the column headers should be English names designed by the students themselves, e.g. 序號 (No.): id, 股票程式碼 (stock code): bStockNo……

    No. Code Name Latest Change% ChangeAmt Volume Turnover Amplitude High Low Open PrevClose
    1 688093 N世華 28.47 62.22% 10.92 261.3K 760M 22.34 32.0 28.08 30.2 17.55
    2......

2. Approach

2.1 Page analysis and data location

The stock rows sit in tr elements; the XPath locator is '//tbody/tr'. The three boards 滬深A股, 上證A股 and 深證A股 are reached by changing the URL suffix to 'hs_a_board', 'sh_a_board' and 'sz_a_board' respectively.

2.2 Information extraction
stocks = driver.find_elements(By.XPATH, '//tbody/tr')
for s in stocks:
    stock_data = []
    # td[4] is skipped: it is not one of the required output fields
    for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
        stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
    stocks_data.append(stock_data)
2.3 Pagination handling
# click through to the next page
driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
sleep(3)
# scroll back to the top
driver.execute_script("window.scrollTo(0,0);")
sleep(3)
pages -= 1
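The fixed sleep(3) works but is slow and fragile. Since the assignment also covers waiting for HTML elements, an explicit wait can replace it; a sketch, assuming the same locators as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# remember the current first row, click "next", then wait (up to 10s) until
# that row goes stale, i.e. the Ajax refresh has replaced the table
old_first_row = driver.find_element(By.XPATH, '//tbody/tr')
driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
WebDriverWait(driver, 10).until(EC.staleness_of(old_first_row))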
2.4 Code

The spider function

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def spider(key, pages):
    stocks_data = []
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)  # options= replaces the deprecated chrome_options=
    driver.get('http://quote.eastmoney.com/center/gridlist.html#' + key)
    while pages:
        stocks = driver.find_elements(By.XPATH, '//tbody/tr')
        for s in stocks:
            stock_data = []
            for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
                stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
            stocks_data.append(stock_data)
        # click through to the next page
        driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
        sleep(3)
        # scroll back to the top
        driver.execute_script("window.scrollTo(0,0);")
        sleep(3)
        pages -= 1
    driver.quit()
    return stocks_data

The main function

import pymysql


def main():
    con = pymysql.connect(host='localhost', port=3306, user='root', password='123456', charset='utf8')
    cursor = con.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS stock")
    cursor.execute("USE stock")
    print("opened")
    keys = ['hs_a_board', 'sh_a_board', 'sz_a_board']
    names = ['滬深A股', '上證A股', '深證A股']
    for i in range(3):
        # one table per board, named after the board
        cursor.execute("CREATE TABLE IF NOT EXISTS " + names[i] + "(sNo varchar(8), sId varchar(16) primary key, "
                       "sName varchar(32), sPrice varchar(8), sChangePer varchar(8), sChangeAmount varchar(8), "
                       "sTransactionVolume varchar(16), sTradingVolume varchar(16), sAmplitude varchar(8), sHigh varchar(8), "
                       "sLow varchar(8), sToday varchar(8), sYesterday varchar(8))"
                       "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        stocks = spider(keys[i], 3)  # crawl 3 pages per board
        print(len(stocks))
        for s in stocks:
            print(s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12])
            cursor.execute("INSERT INTO " + names[i] + " VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                           (s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12]))
    con.commit()
    con.close()
    print("closed")


if __name__ == '__main__':
    main()
2.5 Results

深證A股 (SZSE A shares)

3. Reflections

This assignment deepened my understanding of Selenium. Although Selenium is slow, it runs directly inside the browser, so every step the browser performs can be watched clearly, and it offers richer functionality.