Node: Scraping Web Page Data with Puppeteer

阿新 • Published: 2020-11-20

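What is Puppeteer?

Puppeteer is a Node library with a high-level API for controlling headless Chrome or Chromium over the DevTools protocol. In headless mode it can simulate most actions a human user would perform in the browser.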

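How it differs from cheerio

cheerio is essentially a library for manipulating HTML documents with jQuery-like syntax. When you scrape with cheerio you only work with the static HTML returned by the request, so if the page loads its data dynamically through AJAX, that data cannot be scraped. Puppeteer, by contrast, runs a real browser environment: it requests the site and executes the site's own logic, and data inside the page is then retrieved dynamically over the WebSocket protocol. It can simulate user actions (clicking, scrolling, hovering, and so on), supports page navigation and managing multiple pages, and can even inject scripts from Node into the browser environment to run there. In short, almost anything you can do to a web page by hand it can do too, along with some things you cannot.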
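As a rough illustration of "injecting a Node-side script into the browser", the sketch below passes a function to `page.evaluate`, which runs it inside the page after the page's own scripts have filled in the DOM. The helper names `readTexts` and `scrapeDynamicTexts` are hypothetical, and the sketch assumes the caller has already created a Puppeteer `Page` with `browser.newPage()`.

```javascript
// Runs inside the browser page, where `document` is the live DOM after the
// page's own scripts (including its AJAX calls) have executed.
function readTexts(selector) {
  return Array.from(document.querySelectorAll(selector), el => el.textContent.trim());
}

// Hypothetical helper: `page` is a Puppeteer Page created by the caller.
async function scrapeDynamicTexts(page, url, selector) {
  // Wait until network activity settles so AJAX-loaded content has a chance to arrive
  await page.goto(url, { waitUntil: 'networkidle2' });
  // Wait for the dynamically rendered elements to exist
  await page.waitForSelector(selector);
  // Send readTexts to the browser and return its result to Node
  return page.evaluate(readTexts, selector);
}
```

It could be called as `const texts = await scrapeDynamicTexts(page, url, '.some-row')`, where `.some-row` is a placeholder selector. A plain HTTP request parsed with cheerio would see none of these elements if they are only created by the page's scripts.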

Example:

Scrape data from a web page and save it to a database.

// Scrape epidemic statistics
const chalk = require("chalk"); // chalk 4.x; chalk 5 is ESM-only and cannot be loaded with require()
const fs = require("fs");
const puppeteer = require("puppeteer");
const mysql = require("mysql");

// Database settings: adjust these to your own environment
const sqlInfo = {
    host: '***.***.***.***',
    user: 'root',
    password: 'root',
    database: '****',
    port: 3306
};
// Create the MySQL connection
const con = mysql.createConnection(sqlInfo);
// Open the connection
con.connect();

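puppeteer.launch({
    headless: false, // run with a visible browser window instead of headless mode
    // executablePath: "./Chromium/chrome-win/chrome.exe", // needed when puppeteer was installed with `yarn add puppeteer --ignore-scripts`, which skips the Chromium download: point this at the full path of the chrome.exe from a Chromium build you downloaded and unpacked yourself
    // maximum time in milliseconds to wait for the browser to start
    timeout: 15000,
    // ignore HTTPS certificate errors when visiting https pages
    ignoreHTTPSErrors: true,
    // open DevTools automatically; when this is true, headless is always false
    devtools: true,
}).then(async browser => {
    const page = await browser.newPage();
    await page.goto('https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_aladin_banner', { waitUntil: "networkidle2" });
    // Click the "more" button before reading the table; the class names used here may change when the site is updated
    const moreBtn = await page.$(".Common_1-1-287_3lDRV2");
    if (!moreBtn) {
        throw new Error("More button not found; the page's class names may have changed");
    }
    await moreBtn.click();
    await page.waitForSelector('.VirusTable_1-1-287_3m6Ybq');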
	
    // Build one object per matched element; each match is treated as one table row
    const data = await page.$$eval('.VirusTable_1-1-287_3m6Ybq', rows => {
        return rows.map(row => {
            return {
                'area_name': row.children[0].children[0].children[1].innerText,
                'newAdd': row.children[1].innerText,
                'nowHas': row.children[2].innerText,
                'total': row.children[3].innerText,
                'cure': row.children[4].innerText,
                'death': row.children[5].innerText,
            };
        });
    });
    const json = JSON.stringify(data, null, 2);
    console.log(chalk.green("All data scraped:\n", json));
    // Save a copy of the scraped data as JSON
    fs.writeFile('YQ.json', json, 'utf8', function (error) {
        if (error) {
            console.log(chalk.red(error));
            return;
        }
        console.log(chalk.blue('Data written to file successfully!'));
    });
    await browser.close();


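    // Insert the rows into the database
    for (let i = 0; i < data.length; i++) {
        con.query("insert into yiqing(area, newAdd, nowHas, total, cure, death) values(?,?,?,?,?,?)",
            [
                data[i].area_name,
                data[i].newAdd,
                data[i].nowHas,
                data[i].total,
                data[i].cure,
                data[i].death
            ],
            function (err) {
                // An insert can still fail even when the connection is healthy, so check the error
                if (err) {
                    // Print the error
                    console.log(err);
                } else {
                    // This row was inserted successfully
                    console.log(chalk.blue('Row ' + (i + 1) + ' inserted into the database'));
                }
            });
    }
    // Close the connection so the process can exit; queries already queued run before it closes
    con.end();
}).catch(err => console.log(err))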
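The loop above sends one INSERT per row and stores the cell text as-is. As an alternative, the sketch below converts the numeric cells to integers and builds a nested array, which the `mysql` package expands into a single multi-row INSERT when bound to `VALUES ?`. The helper names `toInt`, `toRows`, and `insertAll` are hypothetical, and it is only an assumption that the cells may contain thousands separators or a "-" placeholder, or that the `yiqing` columns are numeric.

```javascript
// Convert display text such as "1,234" to an integer.
// Returns null for blanks or placeholders like "-" (assumed formats, not confirmed by the page).
function toInt(text) {
  const cleaned = String(text).replace(/[,\s]/g, '');
  return /^-?\d+$/.test(cleaned) ? parseInt(cleaned, 10) : null;
}

// Turn the scraped objects into an array of arrays, one inner array per table row.
function toRows(records) {
  return records.map(r => [
    r.area_name,
    toInt(r.newAdd),
    toInt(r.nowHas),
    toInt(r.total),
    toInt(r.cure),
    toInt(r.death),
  ]);
}

// Insert all rows with one query; `con` is a connection from mysql.createConnection().
function insertAll(con, records, callback) {
  const rows = toRows(records);
  if (rows.length === 0) {
    // Nothing to insert; "VALUES ?" with an empty list would be invalid SQL
    return callback(null);
  }
  const sql = 'INSERT INTO yiqing (area, newAdd, nowHas, total, cure, death) VALUES ?';
  // The nested array is expanded into (..), (..), ... by the mysql package
  con.query(sql, [rows], callback);
}
```

In the script above this would replace the `for` loop with something like `insertAll(con, data, err => { if (err) console.log(err); })`. One query means one round trip and one error to handle, at the cost of losing the per-row success message.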

Scraped data:

Data written to the database:
