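Data Collection and Fusion Technology: Lab 5

阿新 • • Published: 2021-12-07

Assignment ①:

Requirements:

Become proficient with Selenium: locating HTML elements, scraping Ajax-loaded page data, and waiting for HTML elements.

Use the Selenium framework to scrape product information and images for one product category from JD.com.
Candidate site: http://www.jd.com/
Keyword: chosen freely by the student
Output: the MySQL output looks like this: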


mNo	mMark	mPrice	mNote	mFile
000001	三星Galaxy	9199.00	三星Galaxy Note20 Ultra 5G...	000001.jpg
000002......

1) Scraping mobile phone data from JD.com

1. Locate the search box on the page and enter the keyword "手機" (mobile phone)

        self.driver.get(url)
        keyInput = self.driver.find_element_by_id("key")
        keyInput.send_keys(key)
        keyInput.send_keys(Keys.ENTER)

2. Write the main body of the crawler and inspect the page:

Elements are located with XPath. Because JD.com loads images lazily, the image link is stored in either the src attribute or the data-lazy-img attribute.

            lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
            for li in lis:
                if count > 413:
                    break
                # The image URL is in either the src or the data-lazy-img attribute
                try:
                    src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
                except Exception:
                    src1 = ""

                try:
                    src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
                except Exception:
                    src2 = ""
                try:
                    price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
                except Exception:
                    price = "0"

                try:
                    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
                    # The brand is the first space-separated word of the description
                    mark = note.split(" ")[0]
                    mark = mark.replace("愛心東東\n", "")
                    mark = mark.replace(",", "")
                    note = note.replace("愛心東東\n", "")
                    note = note.replace(",", "")

                except Exception:
                    note = ""
                    mark = ""
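
The src / data-lazy-img fallback above can also be expressed as one small helper. The following is only a sketch: the function name pick_image_url and the example host are made up for illustration and are not part of the original crawler.

```python
from urllib.parse import urljoin


def pick_image_url(src, lazy_src, page_url):
    """Return an absolute image URL, preferring src over data-lazy-img.

    Either attribute may be missing (None or empty string); an empty
    string is returned when neither is present.
    """
    for candidate in (src, lazy_src):
        if candidate:
            # Lazy-loaded links are often protocol-relative ("//host/path"),
            # so resolve them against the URL of the page being scraped.
            return urljoin(page_url, candidate)
    return ""


# Hypothetical values, for illustration only
page = "https://search.jd.com/Search?keyword=phone"
image_url = pick_image_url("", "//img.example.com/n7/a.jpg", page)
```

Resolving the link at the moment it is chosen means later steps only ever handle one absolute URL instead of two candidates.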

Pagination:

            # Look for the next-page button; if the disabled variant is present, this is the last page
            if count < 413:
                try:
                    self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next disabled']")
                except Exception:
                    nextPage = self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next']")
                    time.sleep(5)
                    nextPage.click()
                    time.sleep(5)
                    self.processSpider()

Set the file name for the downloaded image:

self.No = self.No + 1
# Zero-pad the running number to six digits, e.g. 000001
no = str(self.No).zfill(6)
print(no, mark, price)
# The file name is the number plus the extension taken from the image URL
if src1:
    src1 = urllib.parse.urljoin(self.driver.current_url, src1)
    p = src1.rfind(".")
    mFile = no + src1[p:]
elif src2:
    src2 = urllib.parse.urljoin(self.driver.current_url, src2)
    p = src2.rfind(".")
    mFile = no + src2[p:]
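
As a small variation on the naming step above, the extension can be taken from the path part of the URL, so that a query string after the file name does not end up in the extension. This is a sketch under that assumption; image_filename is a hypothetical helper name and the URLs are invented.

```python
import os
from urllib.parse import urlsplit


def image_filename(number, image_url, width=6):
    """Build a name such as 000001.jpg from a serial number and an image URL."""
    stem = str(number).zfill(width)
    # Only the path component is inspected, so "?v=1.2" style suffixes are ignored
    ext = os.path.splitext(urlsplit(image_url).path)[1]
    return stem + ext


# Hypothetical URLs, for illustration only
name_plain = image_filename(1, "https://img.example.com/n7/abc.jpg")
name_query = image_filename(23, "https://img.example.com/n7/abc.png?v=1.2")
```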

Download the images quickly using multiple threads:

                if src1 or src2:
                    T = threading.Thread(target=self.download, args=(src1, src2, mFile))
                    T.daemon = False
                    T.start()
                    self.threads.append(T)
                else:
                    mFile = ""
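
The threads collected in self.threads have to be waited for before the program exits. One alternative, shown here only as a sketch, is a thread pool from the standard library, which caps the number of simultaneous downloads and waits for all of them. The names download_all and fake_download are hypothetical, and the stand-in download function does no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(jobs, download, max_workers=8):
    """Run download(url, filename) for every job; return the names that succeeded.

    jobs is a list of (url, filename) pairs. Failures are reported and skipped.
    """
    done = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url, name): name for url, name in jobs}
        for future, name in futures.items():
            try:
                future.result()  # re-raises any exception from the worker
                done.append(name)
            except Exception as err:
                print("download failed", name, err)
    return done


def fake_download(url, name):
    """Stand-in for the real download method; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise IOError("simulated network error")


finished = download_all(
    [("u1", "000001.jpg"), ("bad", "000002.jpg"), ("u3", "000003.jpg")],
    fake_download,
)
```

Leaving the with block joins every worker, so no separate loop over a thread list is needed.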

Define the download function:

    def download(self, src1, src2, mFile):
        # Try src1 first and fall back to src2 if it yields no data
        data = None
        if src1:
            try:
                req = urllib.request.Request(src1, headers=MySpider.headers)
                resp = urllib.request.urlopen(req, timeout=10)
                data = resp.read()
            except Exception:
                pass
        if not data and src2:
            try:
                req = urllib.request.Request(src2, headers=MySpider.headers)
                resp = urllib.request.urlopen(req, timeout=10)
                data = resp.read()
            except Exception:
                pass
        if data:
            print("download begin", mFile)
            # os.path.join keeps the path portable across operating systems
            with open(os.path.join(MySpider.imagePath, mFile), "wb") as fobj:
                fobj.write(data)
            print("download finish", mFile)

Create the image directory (and clear any images left from a previous run):

        imagePath = "download"

        try:
            if not os.path.exists(MySpider.imagePath):
                os.mkdir(MySpider.imagePath)
            images = os.listdir(MySpider.imagePath)
            for img in images:
                s = os.path.join(MySpider.imagePath, img)
                os.remove(s)
        except Exception as err:
            print(err)

3. Set up the MySQL database

        # Connect to the MySQL database
        print("opened")
        try:
            self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                       password="hts2953936", database="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.opened = True
            # flag = False
        except Exception as err:
            print(err)
            self.opened = False

Insert the data into the table:

# Insert one row
self.cursor.execute("insert into phone (Pno,Pmark, Pprice, Pnote, PmFile) values (%s,%s,%s,%s,%s)",(no, mark, price, note, mFile))
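
The insert above presupposes an existing phone table and, since pymysql connections do not autocommit by default, a commit somewhere later in the program. The sketch below shows the whole round trip with the standard-library sqlite3 module swapped in for MySQL so that it runs without a server. The column types and the save_rows helper are assumptions, not the author's schema.

```python
import sqlite3

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS phone (
    Pno    TEXT PRIMARY KEY,  -- zero-padded serial number (assumed type)
    Pmark  TEXT,
    Pprice TEXT,
    Pnote  TEXT,
    PmFile TEXT
)
"""


def save_rows(con, rows):
    """Create the table if needed and insert all rows in one transaction."""
    con.execute(CREATE_SQL)
    # sqlite3 uses ? placeholders; pymysql uses %s as in the statement above
    con.executemany(
        "INSERT INTO phone (Pno, Pmark, Pprice, Pnote, PmFile) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    con.commit()


con = sqlite3.connect(":memory:")
save_rows(con, [("000001", "三星Galaxy", "9199.00", "三星Galaxy Note20 Ultra 5G...", "000001.jpg")])
stored = con.execute("SELECT Pno, Pprice FROM phone").fetchall()
```

In both drivers the values are passed separately from the SQL text, so quotes or commas in a product description cannot break the statement.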

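Results
Console output:

Results in the database:

Images:

Assignment 1 Gitee link

2) Reflections

This lab reproduced an earlier experiment. It reinforced using Selenium to scrape JD.com data and download images, reviewed Selenium's scraping and pagination techniques, and reinforced database operations.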

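Assignment ②:

Requirements:

Become proficient with Selenium: locating HTML elements, simulating a user login, scraping Ajax-loaded page data, and waiting for HTML elements.

Use the Selenium framework with MySQL to simulate logging in to the MOOC site, retrieve the information for the courses already taken in the student's own account, and store it in MySQL (course ID, course name, institution, progress, course status, course image URL). Also save the images to the imgs folder in the local project root, naming each image after its course.

Candidate site: China University MOOC: https://www.icourse163.org

Output: MySQL storage and output format

Column headers should be named in English, for example Id for the course ID and cCourse for the course name; students define and design the headers themselves: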

Id	cCourse	cCollege	cSchedule	cCourseStatus	cImgUrl
1	Python網路爬蟲與資訊提取	北京理工大學	已學3/18課時	2021年5月18日已結束	http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg
2......

1) Scraping MOOC data with Selenium

Write the start function

Initialize the driver

chrome_options = Options()
# Uncomment to run Chrome headless (no visible window)
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-gpu')
# Create the driver with these options
self.driver = webdriver.Chrome(options=chrome_options)
url = 'https://www.icourse163.org/'
self.driver.get(url)

Maximize the window (which makes elements easier to find) and hide the navigator.webdriver flag so the page cannot detect Selenium

self.driver.maximize_window()
self.driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})

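On the MOOC home page, simulate clicking the login button

Then simulate clicking "other login methods"

Next, click "log in with phone number"

The code is as follows: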

loginbutton = self.driver.find_element_by_xpath('//div[@class="_1Y4Ni"]/div')
time.sleep(3)
loginbutton.click()
time.sleep(3)
button2 = self.driver.find_element_by_xpath('//span[@class="ux-login-set-scan-code_ft_back"]')
button2.click()
time.sleep(3)
button3 = self.driver.find_element_by_xpath('//ul[@class="ux-tabs-underline_hd"]/li[position()=2]')
button3.click()
time.sleep(3)

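At this point, the next step is to locate the two text boxes and enter data with the send_keys method.

Note, however, that the text box nodes are stored inside the document of an iframe node:

As a result, find_element_by_xpath cannot locate the text boxes directly. The iframe node must be located first, and only after calling switch_to.frame can the text boxes be reached.

The code is as follows: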

frame = self.driver.find_element_by_xpath('/html/body/div[position()=13]/div[position()=2]/div/div/div/div/div/div[position()=1]/div/div[position()=1]/div[position()=2]/div[position()=2]/div/iframe')
self.driver.switch_to.frame(frame)

Find the two text boxes and enter their contents with send_keys

account = self.driver.find_element_by_xpath('/html/body/div[position()=2]/div[position()=2]/div[position()=2]/form/div/div[position()=2]/div[position()=2]/input')
account.send_keys('18016776126')
password = self.driver.find_element_by_xpath('/html/body/div[2]/div[2]/div[2]/form/div/div[4]/div[2]/input[2]')
password.send_keys("hts2953936")

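After entering them, click the login button

loginbutton2 = self.driver.find_element_by_xpath('/html/body/div[2]/div[2]/div[2]/form/div/div[6]/a')
loginbutton2.click()
time.sleep(10)

The time.sleep(10) is there because a slider-puzzle CAPTCHA sometimes appears and has to be dragged by hand. Automating that step is time-consuming, so it is left for future study.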
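
A fixed sleep either wastes time or is too short when the CAPTCHA takes longer to solve. Selenium provides WebDriverWait for waiting on a condition. The plain-Python stand-in below, a sketch only, shows the same polling idea without needing a browser. The helper name wait_until and its parameters are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in condition that becomes true on the third poll
polls = []


def page_ready():
    polls.append(1)
    return len(polls) >= 3


ready = wait_until(page_ready, timeout=2, interval=0.01)
```

With a real driver the condition could be a lambda that calls find_elements_by_xpath for a node that exists only after login. An empty list is falsy, so polling continues until the node appears or the timeout expires.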

After the page loads, simulate clicking "My Courses"

mycourses = self.driver.find_element_by_xpath('/html/body/div[position()=4]/div[position()=2]/div[position()=1]/div/div/div[position()=1]/div[position()=3]/div[position()=4]/div')
mycourses.click()
time.sleep(3)

With that, the simulated Selenium login has brought us to the course page, and the next stage of scraping can begin.

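Write the processSpider function to scrape the page information

Locate the nodes that contain the course information

body = self.driver.find_elements_by_xpath('//div[@class="course-card-wrapper"]')

Iterate over the nodes to scrape the data, download the images, and insert the rows into the database: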

for i in body:
    count += 1
    cid = count
    img = i.find_element_by_xpath('.//div[@class="img"]/img').get_attribute('src')
    schedule = i.find_element_by_xpath('.//span[@class="course-progress-text-span"]').text
    college = i.find_element_by_xpath('.//div[@class="school"]').text
    title = i.find_element_by_xpath('.//div[@class="title"]/div/span[position()=2]').text
    coursestatus = i.find_element_by_xpath('.//div[@class="course-status"]').text
    downloadurl = img
    file = "C:/Users/86180/Desktop/Data Collection/imgs/" + "course no." + str(count) + " pic no."+".jpg"
    urllib.request.urlretrieve(downloadurl, filename=file)
    print("course no." + str(count) + " download completed")
    print("insert into mooc (cid,cCourse,cCollege,cShedule,cCourseStatus,cImgUrl) values (%s,%s,%s,%s,%s,%s)",(cid,title,college,schedule,coursestatus,img))
    # Insert the row into the database
    if self.opened:
        self.cursor.execute("insert into mooc (cid,cCourse,cCollege,cShedule,cCourseStatus,cImgUrl) values (%s,%s,%s,%s,%s,%s)",
        (cid,title,college,schedule,coursestatus,img))
    print("-------------------------------")
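
The assignment asks for each image to be saved in the imgs folder under the course name, whereas the loop above numbers the files. The sketch below shows one way to derive a safe file name from a course title. The helper course_image_path, the set of replaced characters, and the .jpg fallback are assumptions and not part of the original code.

```python
import os
import re


def course_image_path(title, image_url, folder="imgs"):
    """Return folder/<sanitized title><ext> for a course image."""
    # Replace characters that are invalid in Windows file names, and whitespace
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "untitled"
    # Take the extension from the URL without its query string; default to .jpg
    ext = os.path.splitext(image_url.split("?", 1)[0])[1] or ".jpg"
    return os.path.join(folder, safe + ext)


path_a = course_image_path(
    "Python網路爬蟲與資訊提取",
    "http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg",
)
path_b = course_image_path("C/C++: intro?", "http://edu-image.nosdn.127.net/abc.png?x=1")
```

The returned path could then be passed as the filename argument of urllib.request.urlretrieve, after making sure the folder exists.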

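After one page has been scraped, move on to the next page:

On the page, the ul[@class="ux-pager"] node holds the pager. Among its li children, the second-to-last one (the sibling just before the next-page button) holds the number of the last page of enrolled courses.

So read the page number from that node: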

page = self.driver.find_element_by_xpath('//ul[@class="ux-pager"]/li[position()=last()-1]/a').text

Simulate the click with Selenium and call processSpider recursively to turn the page:

if flag != int(page):
    flag +=1
    nxpgbutton = self.driver.find_element_by_xpath('//li[@class="ux-pager_btn ux-pager_btn__next"]/a')
    nxpgbutton.click()
    time.sleep(5)
    self.processSpider()
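
The recursive call works for a handful of pages, but each page adds a stack frame. An equivalent loop is sketched below with the Selenium parts abstracted into two callables. The names crawl_pages, scrape_page and go_next are hypothetical stand-ins for the scraping body and the next-page click.

```python
def crawl_pages(scrape_page, go_next, last_page):
    """Call scrape_page() on pages 1..last_page, calling go_next() between pages."""
    rows = []
    page = 1
    while True:
        rows.extend(scrape_page())
        if page >= last_page:
            break
        go_next()  # in the real crawler: click the next button, then wait
        page += 1
    return rows


# Fake three-page site, for illustration only
pages = [["course A", "course B"], ["course C"], ["course D"]]
state = {"index": 0}


def fake_scrape():
    return pages[state["index"]]


def fake_next():
    state["index"] += 1


all_rows = crawl_pages(fake_scrape, fake_next, last_page=3)
```

The loop stops after scraping the last page, so the next button is never clicked once it is disabled.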

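Finally, the results:

In the database:

Images:

Assignment 2 Gitee link

Reflections:

This lab exercised simulated login with Selenium. The procedure is to locate each button in turn and click it with the click method, and to enter data into the login text boxes with send_keys. The text boxes on the MOOC site are unusual in that they are stored in the #document of an iframe and cannot be located directly: driver.switch_to.frame() must be called to enter the new HTML document under that node before locating them. The scraping after login was familiar. One problem came up during simulated login: after several logins within a short time, a slider-puzzle CAPTCHA sometimes pops up.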

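Assignment ③:
Requirements:
Understand Flume's architecture and key features, and learn to use Flume for log collection.
Complete the Flume log collection lab, which consists of the following steps:
Task 1: Enable the MapReduce Service
Task 2: Generate test data with a Python script
Task 3: Configure Kafka
Task 4: Install the Flume client
Task 5: Configure Flume to collect data

Lab procedure
Task 1: Enable the MapReduce Service
(Purchase and enable the MapReduce Service and set up the environment for the later tasks.)
Task 2: Generate test data with a Python script

Step 1: Write the Python script
1) Connect to the server with Xshell 7
2) Go to the /opt/client/ directory and write the Python script with vi: vi autodatapython.py
(It is easier to upload the local autodatapython.py file to /opt/client/ on the server with Xftp 7 than to use vi/vim, where mistakes are easy to make and hard to correct. The same applies to uploading the Flume client later.)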

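Step 2: Create the directory
Use mkdir to create the directory flume_spooldir under /tmp. The data generated by the Python script is placed in this directory, and Flume later monitors the directory to read the data.
Command:

mkdir /tmp/flume_spooldir/

Step 3: Test run
Run the Python script to generate 100 test records
Commands: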

python  autodatapython.py  "/tmp/flume_spooldir/test.txt"  100
more  /tmp/flume_spooldir/test.txt
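
The contents of autodatapython.py are not reproduced in this write-up. As a rough idea of what such a generator does, here is a minimal sketch that takes the same two arguments (output file and record count). The record fields and the names make_record and write_records are invented for illustration and will differ from the course-provided script.

```python
import random
import sys
import time


def make_record(rng):
    """Build one comma-separated line; the fields are illustrative only."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    user = "user%04d" % rng.randint(1, 9999)
    action = rng.choice(["view", "click", "buy"])
    return ",".join([stamp, user, action])


def write_records(path, count, seed=None):
    """Write count generated records to path, one per line; return count."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as out:
        for _ in range(count):
            out.write(make_record(rng) + "\n")
    return count


if __name__ == "__main__" and len(sys.argv) == 3:
    # Same argument order as the command above: <output file> <record count>
    write_records(sys.argv[1], int(sys.argv[2]))
```

Writing the file directly into /tmp/flume_spooldir/ is what lets the Flume spooling source pick it up later.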

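Task 3: Configure Kafka
Step 1: Set the environment variables
First set the environment variables by running the source command so that they take effect
Step 2: Create a topic in Kafka (replace the IP with that of your own ZooKeeper; the port usually stays the same)
Run the following command to create the topic, substituting your actual ZooKeeper IP

/opt/client/Kafka/kafka/bin/kafka-topics.sh --create --zookeeper 172.16.0.74:2181/kafka --partitions 1 --replication-factor 1 --topic fludesc

Step 3: View the topic information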

/opt/client/Kafka/kafka/bin/kafka-topics.sh --list  --zookeeper 172.16.0.74:2181/kafka

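Task 4: Install the Flume client
Step 1: Open the Flume service page
Go to the MRS Manager cluster management page, open Service Management, and click flume to enter the Flume service

Step 2: Click Download Client

Click OK and wait for the download
When the download finishes, a dialog shows which server the client was downloaded to (this machine is the Master node); the path is /tmp/MRS-client:

Step 3: Extract the downloaded Flume client file
Log in to the elastic cloud server from the previous step with Xshell 7 and go to the /tmp/MRS-client directory
Run the following command to extract the archive, which yields the checksum file and the client configuration package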

tar -xvf MRS_Flume_Client.tar

Step 4: Verify the package
Run:

sha256sum -c MRS_Flume_ClientConfig.tar.sha256

The output shown on screen indicates that the package was verified successfully.
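
For reference, the check performed here amounts to hashing the archive and comparing the result with the recorded digest. A minimal Python sketch of that comparison follows. The helper names sha256_of and verify are made up, and the example hashes a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar")
with open(demo_path, "wb") as f:
    f.write(b"abc")
expected = hashlib.sha256(b"abc").hexdigest()
ok = verify(demo_path, expected)
```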

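Step 5: Extract the MRS_Flume_ClientConfig.tar file
Run the following command:

tar -xvf MRS_Flume_ClientConfig.tar

Step 6: Install the Flume environment variables
Run the following command to install the client runtime environment into the new directory /opt/Flumeenv; the directory is created automatically during installation.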

sh  /tmp/MRS-client/MRS_Flume_ClientConfig/install.sh  /opt/Flumeenv

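Check the installation output. The following result means the client runtime environment was installed successfully:

Components client installation is complete

Configure the environment variables by running:

source /opt/Flumeenv/bigdata_env

Step 7: Extract the Flume client
Run: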

cd /tmp/MRS-client/MRS_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz

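Step 8: Install the Flume client
Install Flume into the new directory /opt/FlumeClient; the directory is created automatically during installation.
Run:

sh /tmp/MRS-client/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient

"-d" specifies the Flume client installation path.
The following output means the Flume client was installed successfully:

install flume client successfully

Step 9: Restart the Flume service
Run the following commands: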

cd /opt/FlumeClient/fusioninsight-flume-1.6.0
sh bin/flume-manage.sh restart

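Task 5: Configure Flume to collect data
Step 1: Edit the configuration file (replace the IP with that of your own Kafka; the port usually stays the same)
Go to the Flume installation directory

cd /opt/FlumeClient/fusioninsight-flume-1.6.0/

In the conf directory, edit the file properties.properties (again, it is easier to copy the contents into a local file, change the IP address, and upload the file to the server with Xshell than to paste into vi/vim directly)
Step 2: Create a consumer to consume the data in Kafka
Run: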

/opt/client/Kafka/kafka/bin/kafka-console-consumer.sh  --topic fludesc  --bootstrap-server 192.168.0.54:9092  --new-consumer  --consumer.config  /opt/client/Kafka/kafka/config/consumer.properties

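Note: the IP given to bootstrap-server here is the IP of the Kafka broker.
Once the consumer is running, open another Xshell 7 window (right-click the session --> open in the right tab group) and rerun the Python script command from Step 3 of Task 2 (section 2.2.1) to generate another batch of data. Then check whether data arrives in Kafka. As shown, the data has been consumed:

Reflections

I came to understand Flume's architecture and key features and learned to use Flume for log collection. I also learned how to request and release resources on the Huawei Cloud platform.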
