[Data Collection and Fusion] Second Practice Assignment

阿新 • Published: 2021-10-14

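Assignment ①

1.1 Assignment description

Requirement: scrape the 7-day weather forecast for a given set of cities from the China Weather site (http://www.weather.com.cn) and store it in a database.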

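No.	Region	Date	Weather	Temperature
1	Beijing	7th (today)	Sunny with some clouds, showers or thunderstorms in the northern mountains, turning sunny then cloudy	31℃/17℃
2	Beijing	8th (tomorrow)	Cloudy turning sunny, scattered showers or thunderstorms in the north, turning sunny	34℃/20℃
3	Beijing	9th (day after tomorrow)	Sunny turning cloudy	36℃/22℃
4	Beijing	10th (Saturday)	Overcast turning to showers	30℃/19℃
5	Beijing	11th (Sunday)	Showers	27℃/18℃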
6......

1.2 Procedure:

The core code is as follows:

1. Create a class for the weather database

import sqlite3


class WeatherDB:
    def openDB(self):
        self.con = sqlite3.connect("weathers.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute(
                "create table weathers (wCity varchar(16),wDate varchar(16),"
                "wWeather varchar(64),wTemp varchar(32),"
                "constraint pk_weather primary key (wCity,wDate))")
        except sqlite3.OperationalError:
            # The table already exists: clear the rows left by the previous run
            self.cursor.execute("delete from weathers")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        try:
            self.cursor.execute(
                "insert into weathers (wCity,wDate,wWeather,wTemp) values (?,?,?,?)",
                (city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        self.cursor.execute("select * from weathers")
        rows = self.cursor.fetchall()
        # chr(12288) is the full-width space, used as the fill character so CJK text lines up
        print("{0:^16}{1:{5}^16}{2:{5}^16}{3:{5}^32}{4:{5}^32}".format("No.","city","date", "weather","temp",chr(12288)))
        # Number the rows locally instead of relying on a global counter
        for num, row in enumerate(rows, start=1):
            print("{0:^16}{1:{5}^16}{2:{5}^16}{3:{5}^32}{4:{5}^32}".format(str(num),row[0],row[1],row[2],row[3],chr(12288)))
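
openDB detects an existing table by letting create table fail and then deleting the old rows. As a minimal alternative sketch, assuming the same schema, `create table if not exists` avoids the exception, and `insert or replace` lets the composite primary key (wCity, wDate) overwrite a row instead of raising an error. The helper names `open_weather_db` and `upsert`, the in-memory database, and the sample values are illustrative and not part of the original code.

```python
import sqlite3

SCHEMA = (
    "create table if not exists weathers ("
    "wCity varchar(16), wDate varchar(16), wWeather varchar(64), wTemp varchar(32), "
    "constraint pk_weather primary key (wCity, wDate))"
)


def open_weather_db(path=":memory:"):
    """Open the database and make sure the weathers table exists (hypothetical helper)."""
    con = sqlite3.connect(path)
    con.execute(SCHEMA)
    return con


def upsert(con, city, date, weather, temp):
    """Insert a forecast row, replacing any existing row with the same (city, date)."""
    con.execute(
        "insert or replace into weathers (wCity, wDate, wWeather, wTemp) values (?,?,?,?)",
        (city, date, weather, temp),
    )


con = open_weather_db()
# Sample values for illustration only
upsert(con, "北京", "7日（今天）", "晴", "31℃/17℃")
upsert(con, "北京", "7日（今天）", "多雲", "30℃/18℃")  # same key: replaces the first row
rows = con.execute("select * from weathers").fetchall()
print(rows)
con.close()
```

Whether replacing or rejecting a duplicate is the right behaviour depends on the assignment; the original class keeps the first row and prints the error.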

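2. Scrape the weather data and insert it into the database

import urllib.request

from bs4 import BeautifulSoup, UnicodeDammit


class WeatherForecast: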
    def __init__(self):
        self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode = {"北京": "101010100", "上海": "101020100", "廣州": "101280101", "深圳": "101280601"}

    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    temp = li.select("p[class='tem']")[0].text.strip()
                    # print(city,date,weather,temp)
                    self.db.insert(city, date, weather, temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

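    def process(self, cities):
        self.db = WeatherDB()  # create the weather database object
        self.db.openDB()  # open the database

        for city in cities:
            self.forecastCity(city)  # scrape and store the forecast for each city in turn

        self.db.show()  # print the rows stored in the database
        self.db.closeDB()  # commit and close the database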
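
The scraper stores the temperature as the raw page string, for example 31℃/17℃ in the table from the assignment description. If numeric values are wanted later, the string can be split afterwards. The following is a small sketch; the function name `parse_temp` is hypothetical, and the single-value case is an assumption about what the page may return rather than something shown in the post.

```python
import re


def parse_temp(temp):
    """Split a string such as '31℃/17℃' into (high, low) integers.

    If only one number is present it is returned as the low value and high
    is None. That single-value case is an assumption, not taken from the post.
    """
    values = [int(v) for v in re.findall(r"-?\d+", temp)]
    if len(values) >= 2:
        return values[0], values[1]
    if len(values) == 1:
        return None, values[0]
    return None, None


print(parse_temp("31℃/17℃"))
print(parse_temp("17℃"))
```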

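1.3 Results:

Database:

1.4 Takeaways:

This experiment reproduced the earlier weather-scraping example and deepened my understanding of web-scraping code. We also learned how to write a database class and insert data into the database, and how to connect to the database in PyCharm and view its tables.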

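Assignment ②

2.1 Assignment description

Requirement: use requests and an extraction method of your choice to scrape stock information from a target site and store it in a database.

Candidate sites: Eastmoney (https://www.eastmoney.com/), Sina Finance stocks (http://finance.sina.com.cn/stock/)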

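2.2 Steps:

1. Open the site, press F12, go to the Network panel, filter by JS, and find the file that carries the stock data. The stock information looks like this:

Open the corresponding URL to inspect it:

2. Core code:

Fetching the data: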

import json
import re

import requests


def getHtml(url):
    try:
        header = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        resp = requests.get(url,headers=header)
        html = resp.text
        return html
    except Exception as err:
        print(err)

num = 1  # running row number
def getContent(html):
    # Use a regular expression to capture the contents of the "diff" array
    stocks = re.findall(r"\"diff\":\[(.*?)\]",html,re.M|re.S)
    # Parse the captured text as a JSON list; json.loads replaces eval so remote text is never executed
    stocks = json.loads("[" + stocks[0] + "]")
    global num
    result = []  # scraped rows are collected here
    for stock in stocks:
        stockcode = stock["f12"]
        stockname = stock["f14"]
        newprice = stock["f2"]
        diefu = stock["f3"]
        diee = stock["f4"]
        dealnum = stock["f5"]
        deale = stock["f6"]
        zhenfu = stock["f7"]
        most = stock["f15"]
        least = stock["f16"]
        today = stock["f17"]
        yesterday = stock["f18"]
        # Each field is read from the dict by its key
        result.append([num,stockcode,stockname,newprice,diefu,diee,dealnum,deale,zhenfu,most,least,today,yesterday])
        num += 1

    return result
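
The request URL carries a `cb=jQuery...` parameter, so the response is JSON wrapped in a callback call. Instead of cutting out the "diff" array with a regular expression, the wrapper can be stripped and the whole payload decoded. The sketch below assumes the response has the shape `callback({...});`; the function name `parse_jsonp` and the shortened sample response (including its stock names and prices) are made up for illustration.

```python
import json
import re


def parse_jsonp(text):
    """Strip a callback wrapper such as 'cb({...});' and decode the JSON inside."""
    # Greedy match: from the first '(' to the last ')' in the response
    match = re.search(r"\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("no JSONP payload found")
    return json.loads(match.group(1))


# Hypothetical, heavily shortened response; the values are not real quotes.
sample = (
    'jQuery123_456({"data":{"total":2,"diff":['
    '{"f12":"600000","f14":"AAA","f2":9.5},'
    '{"f12":"600001","f14":"BBB","f2":"-"}]}});'
)

payload = parse_jsonp(sample)
for stock in payload["data"]["diff"]:
    print(stock["f12"], stock["f14"], stock["f2"])
```

Decoding the full payload also gives access to fields outside "diff", such as the total count in this sample, which could be used to decide how many pages exist.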

Creating the database class:

import sqlite3


# Stock database
class stockDB:
    # Open the database and create (or clear) the table
    def openDB(self):
        self.con = sqlite3.connect("stocks.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table stocks (Num varchar(16), stockCode varchar(16),stockName varchar(16),Newprice varchar(16),RiseFallpercent varchar(16),RiseFall varchar(16),Turnover varchar(16),Dealnum varchar(16),Amplitude varchar(16),max varchar(16),min varchar(16),today varchar(16),yesterday varchar(16))")
        except sqlite3.OperationalError:
            # The table already exists: clear the rows left by the previous run
            self.cursor.execute("delete from stocks")

    # Commit and close
    def closeDB(self):
        self.con.commit()
        self.con.close()

    # Insert one row
    def insert(self,Num,stockcode,stockname,newprice,risefallpercent,risefall,turnover,dealnum,Amplitude,max,min,today,yesterday):
        try:
            self.cursor.execute("insert into stocks(Num,stockCode,stockName,Newprice,RiseFallpercent,RiseFall,Turnover,Dealnum,Amplitude,max,min,today,yesterday) values (?,?,?,?,?,?,?,?,?,?,?,?,?)",
                                (Num,stockcode,stockname,newprice,risefallpercent,risefall,turnover,dealnum,Amplitude,max,min,today,yesterday))
        except Exception as err:
            print(err)
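
The insert method writes one row per call. Since getContent already returns a list of rows, a whole page can also be written in one call with `executemany`. This is only a sketch: it uses an in-memory database and a schema shortened to four columns, and the rows are made-up values, not scraped data.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
con.execute("create table stocks (Num, stockCode, stockName, Newprice)")  # shortened schema

# Made-up rows in [num, code, name, price] order, like the start of each result row
rows = [
    [1, "600000", "AAA", 9.5],
    [2, "600001", "BBB", 12.3],
]
con.executemany("insert into stocks values (?,?,?,?)", rows)
con.commit()

count = con.execute("select count(*) from stocks").fetchone()[0]
print(count)
con.close()
```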

Storing into the database:

if __name__ == "__main__":
    s = "{0:^10}\t{1:{13}^10}\t{2:{13}^10}\t{3:{13}^10}\t{4:{13}^10}\t{5:{13}^10}\t{6:{13}^10}\t{7:{13}^10}\t{8:{13}^10}\t{9:{13}^10}\t{10:{13}^10}\t{11:{13}^10}\t{12:{13}^10}"
    print(s.format("No.","Code","Name","Latest","Change %","Change","Volume","Turnover","Amplitude","High","Low","Open","Prev close",chr(12288)))
    stockdb = stockDB()  # create the database object
    stockdb.openDB()  # open the database
    for page in range(1, 6):
        # My student ID ends in 5, so only five pages are scraped
        url = "http://19.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240009917002240502182_1634088844934&pn=" + str(page) + "&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1634088845178"
        html = getHtml(url)
        stocks = getContent(html)
        for stock in stocks:
            print(s.format(*stock, chr(12288)))
            stockdb.insert(*stock)  # store the row in the database
    stockdb.closeDB()

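2.3 Results:

Database:

2.4 Takeaways:

1. I learned how to capture network traffic. Many responses share the same name, so you have to open the Preview pane in the Network panel of the F12 tools, check which response contains the data you want to scrape, and then take its URL.

2. I also learned how to write a database class and pass the scraped data into the database.

3. This experiment not only reinforced what I had learned about scraping page content but also taught me new things, laying some groundwork for my later work on crawlers.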

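Assignment ③

3.1 Assignment description

Requirement: scrape the information for all institutions in the 2021 main list of the Best Chinese Universities Ranking (https://www.shanghairanking.cn/rankings/bcur/2021) and store it in a database. Also record the browser F12 debugging and analysis process as a GIF and add it to the blog post.

3.2 Steps:

Open the site, press F12, go to the Network panel, filter by JS, and find the file that carries the ranking data:

Open the URL of that file. It displays as garbled text in the browser, but that does not matter, because the code can still parse it:

Core code:

Fetching the data: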

import re

import requests


def getHtml(url):
    try:
        header = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        resp = requests.get(url,headers=header)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception as err:
        print(err)


def getContent(html):
    r1 = 'univNameCn:"(.*?)"'
    r2 = 'score:(.*?),'

    namelist = re.findall(r1,html,re.S|re.M)
    scorelist = re.findall(r2,html,re.S|re.M)
    collegedb = collegeDB()
    collegedb.openDB()
    print("{0:^10}\t{1:{3}^10}\t{2:{3}^10}".format("Rank", "University", "Total score", chr(12288)))
    num = 1
    for i in range(len(namelist)):
        print("{0:^10}\t{1:{3}^10}\t{2:{3}^10}".format(num, namelist[i], scorelist[i], chr(12288)))
        collegedb.insert(num,namelist[i],scorelist[i])
        num += 1
    collegedb.closeDB()  # commit the inserts; without this the rows are never saved
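
getContent pairs the two regex result lists by position, so it only works when both patterns match the same number of times. A small sketch of that pairing with an explicit length check is shown below. The sample text is a hypothetical fragment written in the same key:value style as the patterns expect, and the names and scores in it are invented.

```python
import re

# Hypothetical fragment; university names and scores are made up.
sample = (
    '{univNameCn:"大學A",score:90.5,rank:1},'
    '{univNameCn:"大學B",score:88.0,rank:2}'
)

names = re.findall(r'univNameCn:"(.*?)"', sample)
scores = re.findall(r"score:(.*?),", sample)

# Fail loudly instead of silently pairing a name with the wrong score
if len(names) != len(scores):
    raise ValueError("name and score lists are out of step")

for rank, (name, score) in enumerate(zip(names, scores), start=1):
    print(rank, name, float(score))
```

The rank here is simply the position in the list, as in the original code; it assumes the file lists the universities in ranking order.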

Creating the database class:

import sqlite3


class collegeDB:

    def openDB(self):
        self.con = sqlite3.connect("colleges.db") # connect to the database; it is created automatically if it does not exist
        self.cursor = self.con.cursor() # create a cursor
        try:
            self.cursor.execute("create table colleges(Rank varchar(10),Schoolname varchar(10),Score varchar(10))")
        except sqlite3.OperationalError:
            # The table already exists: clear the rows left by the previous run
            self.cursor.execute("delete from colleges")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self,rank,schoolname,score):
        try:
            self.cursor.execute("insert into colleges(Rank,Schoolname,Score) values (?,?,?)", (rank, schoolname, score))
        except Exception as err:
            print(err)
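
One consequence of declaring Rank as varchar is that SQLite gives the column TEXT affinity, so the integer ranks are stored as text and `order by Rank` sorts them as strings. The sketch below compares a varchar column with an integer column in an in-memory database; the table names `as_text` and `as_int` are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table as_text (Rank varchar(10))")  # TEXT affinity, as in collegeDB
con.execute("create table as_int (Rank integer)")       # INTEGER affinity
for rank in (1, 2, 10):
    con.execute("insert into as_text values (?)", (rank,))
    con.execute("insert into as_int values (?)", (rank,))

text_order = [r[0] for r in con.execute("select Rank from as_text order by Rank")]
int_order = [r[0] for r in con.execute("select Rank from as_int order by Rank")]
print(text_order)  # string ordering: '10' sorts before '2'
print(int_order)   # numeric ordering
con.close()
```

If the table is only ever read back in insertion order this makes no difference, but declaring Rank as integer (and Score as real) would be the safer choice for any sorted query.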