ylbtech-Demo-Node.js: Writing a Web Crawler with Node.js

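A Node.js Crawler Is Simpler Than You Think

Introduction

This post walks through a simple crawler written in Node.js. It is easy to follow even for front-end beginners, and it is a satisfying small skill to pick up.

The idea behind a crawler can be summarized as: request a URL -> receive the HTML -> parse the HTML.

In this post we will scrape movie information from the Douban TOP250 list.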

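Tools

The essential tool for this crawler is cheerio.
About cheerio: cheerio is a fast, flexible, and lean implementation of core jQuery, designed for manipulating the DOM on the server. You can think of it simply as a very convenient tool for parsing HTML.

Before using it, install it from the terminal: npm install cheerio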

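Crawler steps explained

1. Pick the page URL and fetch the page data with an HTTP GET request

Douban TOP250 link: https://movie.douban.com/top250

First we send a GET request with Node's built-in https module to fetch the page's full HTML.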

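const https = require('https');
https.get('https://movie.douban.com/top250', function (res) {
    // Decode chunks as UTF-8 so multi-byte characters split across chunks stay intact
    res.setEncoding('utf8');
    // The response arrives in chunks, so we assemble it ourselves
    let html = '';
    // Append each chunk as it arrives
    res.on('data', function (chunk) {
        html += chunk;
    });
    // All chunks have been received
    res.on('end', function () {
        console.log(html);
    });
}).on('error', function (err) {
    console.error('Request failed:', err.message);
});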

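Note that in the code above the response data arrives in chunks, so we have to concatenate the chunks ourselves:

res.on('data', function (chunk) {
    html += chunk;
});

Once all chunks have arrived, we can print the result to check that we received the complete page:

res.on('end', function () {
    console.log(html);
});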
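As an alternative to appending each chunk to a string, the chunks can be kept as Buffers and decoded once at the end, which avoids corrupting a multi-byte character that happens to be split across two chunks. The following is a minimal sketch of that idea, not the article's own code: `joinChunks` and `fetchHtml` are hypothetical helper names, and the `User-Agent` header and the status-code check are assumptions about what a site may require.

```javascript
const https = require('https');

// Join the raw Buffer chunks first, then decode once, so a multi-byte
// character split across two chunks is decoded correctly.
function joinChunks(chunks) {
  return Buffer.concat(chunks).toString('utf8');
}

// Hypothetical helper: fetch a page and pass the decoded body to
// callback(err, html).
function fetchHtml(url, callback) {
  // Assumption: some sites reject requests that lack a browser-like User-Agent.
  const options = { headers: { 'User-Agent': 'Mozilla/5.0' } };
  https.get(url, options, function (res) {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the connection can be released
      callback(new Error('Unexpected status code: ' + res.statusCode));
      return;
    }
    const chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () { callback(null, joinChunks(chunks)); });
  }).on('error', callback);
}

// Offline demonstration: the bytes of '電影' split in the middle of a character.
const whole = Buffer.from('電影', 'utf8');
const parts = [whole.subarray(0, 2), whole.subarray(2)];
console.log(joinChunks(parts)); // prints the original two characters
```

With this helper, the request step would read `fetchHtml('https://movie.douban.com/top250', function (err, html) { ... })`, and the parsing code would go inside the callback.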

2. Parse out the content we need with cheerio

const cheerio = require('cheerio');
res.on('end', function () {
    console.log(html);
    const $ = cheerio.load(html);
    let allFilms = [];
    $('li .item').each(function () {
        // Inside each(), `this` is the element for the current movie
        // Passing `this` as the context scopes the selector to that movie,
        // similar to calling this.querySelector in the browser
        const title = $('.title', this).text();
        const star = $('.rating_num', this).text();
        const pic = $('.pic img', this).attr('src');
        // console.log(title, star, pic);
        // Store the record; without a database, we save it to a JSON file with fs
        allFilms.push({
            title, star, pic
        });
    });
});

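You can inspect the page source to see which tags hold the content you need, then pick it out with $. Here I extract each movie's title, rating, and poster image URL.

At this point you can see that a Node.js crawler is quite simple to implement: study the HTML we fetched, pull out the content we need, and save it locally, and the job is essentially done.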
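The request shown earlier fetches a single page of the list. Below is a minimal sketch of how the remaining pages might be addressed. It assumes the list is paginated 25 movies at a time through a `start` query parameter, which is an assumption about the site rather than something this article states, and `buildPageUrls` is a hypothetical helper.

```javascript
// Assumption (not stated in the article): the list is split into pages of
// 25 movies, selected with a `start` query parameter.
const BASE_URL = 'https://movie.douban.com/top250';
const PAGE_SIZE = 25; // assumed page size
const TOTAL = 250;

// Hypothetical helper: build the URL for every page of the list.
function buildPageUrls(baseUrl, total, pageSize) {
  const urls = [];
  for (let start = 0; start < total; start += pageSize) {
    const url = new URL(baseUrl);
    url.searchParams.set('start', String(start));
    urls.push(url.toString());
  }
  return urls;
}

const pageUrls = buildPageUrls(BASE_URL, TOTAL, PAGE_SIZE);
console.log(pageUrls.length); // 10
console.log(pageUrls[1]);     // https://movie.douban.com/top250?start=25
```

Each URL could then go through the same request and parsing steps, ideally one after another rather than all at once, to keep the load on the site low.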

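Saving the data

Next we save the data. I store it in a file named films.json.
To write the data to a file, we bring in the fs module:

const fs = require('fs');
fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
    if (err) {
        console.error('Failed to write file:', err.message);
        return;
    }
    console.log('File written');
});

The file-writing code must go inside the res.on('end') handler, so the data is written only after it has been fully read.
Once the write completes, open films.json and you will find the scraped data.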
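Instead of opening the file by hand, the result can also be checked by reading it back. Here is a small sketch of that round trip, under some assumptions: `saveFilms` is a hypothetical helper, the sample record is made up, and the file goes to the system temp directory so the example does not touch a real films.json.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: serialize the film list as indented JSON and write it
// synchronously; fs.writeFileSync throws if the write fails.
function saveFilms(filePath, films) {
  fs.writeFileSync(filePath, JSON.stringify(films, null, 2), 'utf8');
}

// Made-up sample record with the same fields the crawler collects.
const sample = [
  { title: 'Example Film', star: '9.0', pic: 'https://example.com/poster.jpg' }
];

const target = path.join(os.tmpdir(), 'films-example.json');
saveFilms(target, sample);

// Read the file back and parse it to confirm the data survived the round trip.
const loaded = JSON.parse(fs.readFileSync(target, 'utf8'));
console.log(loaded.length); // 1
```

The third argument to JSON.stringify only adds indentation, which makes the saved file easier to read; the parsed data is the same either way.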

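Downloading the images

What we scraped for each image is only its URL. What if we want to save the images locally? We do the same thing as when fetching the page: request each image URL and write each image to a local file.

function downloadImage(allFilms) {
    // Make sure the output directory exists before writing into it
    fs.mkdirSync('./images', { recursive: true });
    for (let i = 0; i < allFilms.length; i++) {
        const picUrl = allFilms[i].pic;
        // Request the image URL, collect the bytes, then write them to a file
        https.get(picUrl, function (res) {
            // Image data is binary, so keep the chunks as Buffers
            const chunks = [];
            res.on('data', function (chunk) {
                chunks.push(chunk);
            });
            res.on('end', function () {
                fs.writeFile(`./images/${i}.png`, Buffer.concat(chunks), function (err) {
                    if (err) {
                        console.error(`Failed to save image ${i}:`, err.message);
                        return;
                    }
                    console.log(`Image ${i} downloaded`);
                });
            });
        }).on('error', function (err) {
            console.error(`Failed to download image ${i}:`, err.message);
        });
    }
}

Downloading an image follows the same steps as fetching the page data. We save each file with a .png extension; the bytes are written exactly as received, without any format conversion.
With the download function written, calling it inside res.on('end') finishes the job.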
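Because the image bytes are saved as received, one option is to take the file extension from the image URL instead of fixing it to .png. The sketch below assumes the URL path ends in a file name with an extension and falls back to .jpg otherwise; `imagePath` is a hypothetical helper and the example URL is made up.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: build the local path for the image at `index`,
// keeping the extension found in the URL and falling back to .jpg.
function imagePath(dir, picUrl, index) {
  const ext = path.extname(new URL(picUrl).pathname) || '.jpg';
  return path.join(dir, `${index}${ext}`);
}

// Use a temp directory here; `recursive: true` means no error is raised
// if the directory already exists.
const dir = path.join(os.tmpdir(), 'images-example');
fs.mkdirSync(dir, { recursive: true });

const p = imagePath(dir, 'https://example.com/posters/p480747492.webp', 0);
console.log(path.basename(p)); // 0.webp
```

Inside the download function, the result of `imagePath` would take the place of the hard-coded file name passed to fs.writeFile.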

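Full source code

// Request a URL -> receive the HTML -> parse the HTML
const https = require('https');
const cheerio = require('cheerio');
const fs = require('fs');
// Request the TOP250 page, the same GET a browser sends when you enter the URL
https.get('https://movie.douban.com/top250', function (res) {
    // console.log(res);
    // Decode chunks as UTF-8 so multi-byte characters split across chunks stay intact
    res.setEncoding('utf8');
    // The response arrives in chunks, so we assemble it ourselves
    let html = '';
    // Append each chunk as it arrives
    res.on('data', function (chunk) {
        html += chunk;
    });
    // All chunks have been received
    res.on('end', function () {
        console.log(html);
        const $ = cheerio.load(html);
        let allFilms = [];
        $('li .item').each(function () {
            // Inside each(), `this` is the element for the current movie
            // Passing `this` as the context scopes the selector to that movie,
            // similar to calling this.querySelector in the browser
            const title = $('.title', this).text();
            const star = $('.rating_num', this).text();
            const pic = $('.pic img', this).attr('src');
            // console.log(title, star, pic);
            // Store the record; without a database, we save it to a JSON file with fs
            allFilms.push({
                title, star, pic
            });
        });
        // Write the array to the JSON file
        fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
            if (err) {
                console.error('Failed to write file:', err.message);
                return;
            }
            console.log('File written');
        });
        // Download the images
        downloadImage(allFilms);
    });
}).on('error', function (err) {
    console.error('Request failed:', err.message);
});

function downloadImage(allFilms) {
    // Make sure the output directory exists before writing into it
    fs.mkdirSync('./images', { recursive: true });
    for (let i = 0; i < allFilms.length; i++) {
        const picUrl = allFilms[i].pic;
        // Request the image URL, collect the bytes, then write them to a file
        https.get(picUrl, function (res) {
            // Image data is binary, so keep the chunks as Buffers
            const chunks = [];
            res.on('data', function (chunk) {
                chunks.push(chunk);
            });
            res.on('end', function () {
                fs.writeFile(`./images/${i}.png`, Buffer.concat(chunks), function (err) {
                    if (err) {
                        console.error(`Failed to save image ${i}:`, err.message);
                        return;
                    }
                    console.log(`Image ${i} downloaded`);
                });
            });
        }).on('error', function (err) {
            console.error(`Failed to download image ${i}:`, err.message);
        });
    }
}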

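Summary

Crawlers are not exclusive to Python; Node.js makes them convenient and simple as well. It is a nice small skill for front-end beginners to pick up and a real help in learning Node.js. Feel free to leave a comment and discuss.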

Reference: https://juejin.im/post/6844904167429898247

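Author: ylbtech
Source: http://ylbtech.cnblogs.com/
Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed in a prominent position on the page; otherwise the author reserves the right to pursue legal liability.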
