Node crawler framework: a simple, easy-to-follow introductory Node crawler example

阿新 • Published: 2021-01-02

Tags: node crawler framework

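Preface

This article walks through a crawler project built on koa. It is aimed at people who have only recently started learning front-end development. Working through the project should give you a basic understanding of Node crawlers and let you write some simple crawlers of your own. Project repository: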

Fe-Icy/firm-spider (github.com)

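Starting a koa server

Koa (koajs): next-generation web framework for Node.js (koa.bootcss.com)

koa is a next-generation web framework for the Node.js platform. Starting a Node server with koa is very simple: three lines of code are enough to start an HTTP server.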

const Koa = require('koa')
const app = new Koa()
app.listen(8080)

Easy enough to pick up at a glance, isn't it? For more on koa, see the official documentation (koa.bootcss.com). If you are comfortable with Node.js, you can get started with koa in minutes.

Crawler analysis

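What is the purpose of a crawler? It is simple: to fetch the data we want from a site. Whatever the method or language, as long as the data comes back, the goal is met. Analyzing target sites shows two cases, though. Some sites are static: no API requests are visible from the front end, so the data has to be extracted by parsing the page itself. I call this static scraping. Other pages render data that the front end requests from an API; here we can take the API address directly and simulate the request in the crawler. I call this dynamic scraping. Based on this distinction, I designed a simple general-purpose crawler.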
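
To make the distinction concrete, here is a minimal sketch that guesses the mode from a response's Content-Type header. The helper name `pickMode` and the rule itself are assumptions made for illustration; they are not part of the project.

```javascript
// Hypothetical helper: guess the scraping mode from a Content-Type header.
// JSON responses are treated as 'dynamic' (an API), everything else as 'static' (an HTML page).
const pickMode = (contentType = '') => {
  const type = contentType.split(';')[0].trim().toLowerCase()
  return type === 'application/json' || type.endsWith('+json') ? 'dynamic' : 'static'
}

console.log(pickMode('application/json; charset=utf-8')) // dynamic
console.log(pickMode('text/html; charset=gbk'))          // static
```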

Global configuration

For convenience, I set up a few global parameters and helper methods:

const path = require('path')
const base = require('app-root-dir')

// Global require helper: loads a module by path, relative to the project root by default
global.r = (p = base.get(), m = '') => require(path.join(p, m))

// Global path configuration
global.APP = {
  R: base.get(),
  C: path.resolve(base.get(), 'config.js'),
  P: path.resolve(base.get(), 'package.json'),
  A: path.resolve(base.get(), 'apis'),
  L: path.resolve(base.get(), 'lib'),
  S: path.resolve(base.get(), 'src'),
  D: path.resolve(base.get(), 'data'),
  M: path.resolve(base.get(), 'model')
}
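
As a rough illustration of what the global `r` helper resolves to, the sketch below repeats the same path logic using only Node's `path` module. The root `/project` is a made-up value standing in for `base.get()`, and `resolveModule` is a hypothetical name; it returns the path instead of requiring it.

```javascript
const path = require('path')

// Made-up project root standing in for base.get() from app-root-dir
const root = path.join(path.sep, 'project')

// Same joining rule as global.r, but returning the path instead of requiring it
const resolveModule = (p = root, m = '') => path.join(p, m)

const lib = path.resolve(root, 'lib') // corresponds to APP.L

console.log(resolveModule(lib, 'superagent'))               // on POSIX: /project/lib/superagent
console.log(resolveModule(path.resolve(root, 'config.js'))) // on POSIX: /project/config.js
```

So `r(APP.L, 'superagent')` loads `lib/superagent` under the project root, and `r(APP.C)` loads the config file itself.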

To keep things in one place, all target page addresses are written to a single configuration file:

// All targets
const targets = {
  // Tech community
  juejinFront: {
    url: 'https://web-api.juejin.im/query',
    method: 'POST',
    options: {
      headers: {
        'X-Agent': 'Juejin/Web',
        'X-Legacy-Device-Id': '1559199715822',
        'X-Legacy-Token': 'eyJhY2Nlc3NfdG9rZW4iOiJoZ01va0dVNnhLV1U0VGtqIiwicmVmcmVzaF90b2tlbiI6IkczSk81TU9QRjd3WFozY2IiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ==',
        'X-Legacy-Uid': '5c9449c15188252d9179ce68'
      }
    }
  },
  // Image site
  pixabay:  {
    url: 'https://pixabay.com'
  }
}

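As shown above, some targets are static pages and others are dynamic APIs. Simulating requests to the latter requires extra request headers, and POST requests also need a JSON body; all of this is configured here in one place.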
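
A small sketch of how one target entry could be turned into request arguments. The `targets` entries here are made-up examples, and `toRequestArgs` is a hypothetical helper, not a function from the project.

```javascript
// Hypothetical target entries, shaped like the ones in the config above
const targets = {
  api: { url: 'https://example.com/query', method: 'POST', options: { headers: { 'X-Agent': 'Demo' } } },
  page: { url: 'https://example.com' }
}

// Build the [url, config] pair for a request; json is only attached to POST targets
const toRequestArgs = (target, json) => {
  const { url, method = 'GET', options = {} } = target
  const headers = options.headers || {}
  const requestOptions = method === 'POST' ? { headers, json } : { headers }
  return [url, { method, options: requestOptions }]
}

console.log(toRequestArgs(targets.api, { after: '' }))
console.log(toRequestArgs(targets.page))
```

Keeping the defaults (`GET`, empty headers) in one helper means a static target only needs a `url` in the config.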

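Common libraries

To parse static pages I use the cheerio library.

cheerio is roughly jQuery for the Node environment: it parses a page and extracts information from it, and its API is much the same as jQuery's, so you can think of it as server-side jQuery. Here is a simple wrapper: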

const cheerio = require('cheerio')

const $ = html => cheerio.load(html, {
  ignoreWhitespace: true,
  xmlMode: true
})

const $select = (html, selector) => $(html)(selector)

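// Attribute of the first top-level node in the fragment
// (the function returned by cheerio.load() has no attr() of its own, so select a node first)
const $attr = (html, attr) => $(html).root().children().first().attr(attr)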


module.exports = {
  $,
  $select,
  $attr
}

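superagent is a full-featured server-side HTTP library. It can fetch static pages for cheerio to parse, and it can also fetch the data returned by dynamic APIs. I wrapped it as follows: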

// Wrapper around the superagent library
const superagent = require('superagent')
const { isEmpty } = require('lodash')

// Some pages need charset conversion, e.g. to utf-8
const charset = require('superagent-charset')
const debug = require('debug')('superAgent')

charset(superagent)

const allowMethods = ['GET', 'POST']

const errPromise = new Promise((resolve, reject) => {
  return reject('no url or method is not supported')
}).catch(err => err)


/*
 * options holds the POST data and the headers, e.g.
 * {
 *    json: { a: 1 },
 *    headers: { accept: 'json' }
 * }
 */

// mode selects dynamic or static scraping; unicode is the page encoding, used for static pages
const superAgent = (url, {method = 'GET', options = {}} = {}, mode = 'dynamic', unicode = 'gbk') => {
  if(!url || !allowMethods.includes(method)) return errPromise
  const {headers} = options

  let postPromise 

  if(method === 'GET') {
    postPromise = superagent.get(url)
    if(mode === 'static') {
      // Static pages are decoded according to their encoding
      postPromise = postPromise.charset(unicode)
    }
  }

  if(method === 'POST') {
    const {json} = options
    // A POST request sends a JSON body
    postPromise = superagent.post(url).send(json)
  }

  // Set request headers here if any are configured
  if(headers && !isEmpty(headers)) {
    postPromise = postPromise.set(headers)
  }

  return new Promise(resolve => {
    return postPromise
      .end((err, res) => {
        if(err) {
          console.log('err', err)
          // Do not throw; resolve with a message instead
          return resolve(`There is a ${err.status} error has not been resolved`)
        }
        // Static page: return the page text
        if(mode === 'static') {
          debug('output html in static mode')
          return resolve(res.text)
        }
        // API: return the response body
        return resolve(res.body)
      })
  })
}

module.exports = superAgent
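
Because the wrapper resolves with a message string when a request fails, callers have to inspect the type of the result to detect errors. One possible alternative, sketched here with stand-in request functions, is to resolve with a small result object. `safeRequest`, `succeed`, and `fail` are hypothetical names, and this is not how the project does it.

```javascript
// Hypothetical wrapper: run an async request function and never reject.
// Success and failure are told apart by the `ok` flag instead of by the value's type.
const safeRequest = async (requestFn) => {
  try {
    return { ok: true, data: await requestFn() }
  } catch (err) {
    return { ok: false, error: err.message || String(err) }
  }
}

// Stand-ins for real requests
const succeed = () => Promise.resolve({ items: [1, 2, 3] })
const fail = () => Promise.reject(new Error('404 Not Found'))

const demo = async () => {
  const [good, bad] = await Promise.all([safeRequest(succeed), safeRequest(fail)])
  console.log(good) // { ok: true, data: { items: [ 1, 2, 3 ] } }
  console.log(bad)  // { ok: false, error: '404 Not Found' }
  return [good, bad]
}

demo()
```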

The fetched data also needs to be read and written:

const fs = require('fs')
const path = require('path')
const debug = require('debug')('readFile')

// Reads files from the data folder by default
module.exports = (filename, filepath = APP.D) => {
  const file = path.join(filepath, filename)
  if(fs.existsSync(file)) {
    return fs.readFileSync(file, 'utf8')
  } else {
    debug(`Error: the file does not exist`)
  }
}

const fs = require('fs')
const path = require('path')
const debug = require('debug')('writeFile')


// Writes to the matching file under the data folder by default
module.exports = (filename, data, filepath) => {
  // Tab-indented JSON
  const writeData = JSON.stringify(data, null, '\t')
  const dir = filepath || APP.D
  const lastPath = path.join(dir, filename)
  if(!fs.existsSync(dir)) {
    // recursive also creates missing parent folders such as data/
    fs.mkdirSync(dir, { recursive: true })
  }
  try {
    // writeFileSync takes no callback; it throws on failure
    fs.writeFileSync(lastPath, writeData)
  } catch (err) {
    debug(`Error: some error occurred, the code is ${err.code}`)
  }
}
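
The two helpers can be tried together in a quick round trip. This is a standalone sketch: it writes to a temporary directory instead of `APP.D`, and `writeJson` / `readJson` are hypothetical names that mirror the modules above.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Temporary directory standing in for the project's data folder (APP.D)
const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'spider-data-'))

// Write data as tab-indented JSON, creating the folder if needed
const writeJson = (filename, data, dir = dataDir) => {
  fs.mkdirSync(dir, { recursive: true })
  fs.writeFileSync(path.join(dir, filename), JSON.stringify(data, null, '\t'))
}

// Read the raw JSON text back, or undefined if the file is missing
const readJson = (filename, dir = dataDir) => {
  const file = path.join(dir, filename)
  return fs.existsSync(file) ? fs.readFileSync(file, 'utf8') : undefined
}

const movieDir = path.join(dataDir, 'movie')
writeJson('demo.json', { paging: { total: 1 }, data: [{ title: 'hello' }] }, movieDir)
const text = readJson('demo.json', movieDir)
console.log(JSON.parse(text).paging.total) // 1
```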

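With everything in place, we can start scraping.

Scraping a dynamic API

Taking the Juejin community as an example, we need to analyze its requests and simulate them.

The screenshots I posted earlier were flagged as violating the platform rules; if you are interested, see the GitHub repository.

Juejin's article feed works like this: the response for one page contains an `after` marker, and the request for the next page must put that `after` value in the POST JSON. The other parameters are static and can be hard-coded while scraping.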

const { get } = require('lodash')
const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)
const writeFile = r(APP.L, 'writeFile')
const { juejinFront } = targets

let totalPage = 10 // fetch only ten pages

const getPostJson = ({after = ''}) => {
  return {
    extensions: {query: {id: '653b587c5c7c8a00ddf67fc66f989d42'}},
    operationName: '',
    query: '',
    variables: {limit: 10, category: '5562b415e4b00c57d9b94ac8', after, order: 'POPULAR', first: 20}
  }
}

// Holds all article data
let data = []
let paging = {}

const fetchData = async (params = {}) => {
  const {method, options: {headers}} = juejinFront
  const options = {method, options: {headers, json: getPostJson(params)}}
  // Send the request
  const res = await superAgent(juejinFront.url, options)
  const resItems = get(res, 'data.articleFeed.items', {})
  // Fall back to empty values when the response has no feed data (e.g. a failed request)
  const { edges = [], pageInfo = {} } = resItems
  data = data.concat(edges)
  paging = {
    total: data.length,
    ...pageInfo
  }
  if(pageInfo.hasNextPage && totalPage > 1) {
    totalPage--
    // Wait for the next page so callers can await the whole crawl
    await fetchData({after: pageInfo.endCursor})
  } else {
    // Once all requests are done, write the result to the data folder
    writeFile('juejinFront.json', {paging, data})
  }
}

module.exports = fetchData
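
The cursor-based paging described above can also be sketched in isolation as a loop. The `pages` object is fake data standing in for the real API, and `fetchPage` / `fetchAll` are hypothetical names.

```javascript
// Fake feed: three pages linked by cursors, standing in for the real API
const pages = {
  '': { edges: ['a', 'b'], pageInfo: { hasNextPage: true, endCursor: 'c1' } },
  c1: { edges: ['c', 'd'], pageInfo: { hasNextPage: true, endCursor: 'c2' } },
  c2: { edges: ['e'], pageInfo: { hasNextPage: false, endCursor: '' } }
}

// Hypothetical page fetcher; a real one would POST { after } to the API
const fetchPage = async (after = '') => pages[after]

// Follow the cursor until the feed ends or maxPages is reached
const fetchAll = async (maxPages = 10) => {
  let items = []
  let after = ''
  for (let page = 0; page < maxPages; page++) {
    const { edges, pageInfo } = await fetchPage(after)
    items = items.concat(edges)
    if (!pageInfo.hasNextPage) break
    after = pageInfo.endCursor
  }
  return items
}

fetchAll().then(items => console.log(items)) // [ 'a', 'b', 'c', 'd', 'e' ]
```

A loop keeps the page limit and the accumulated items local to one call, so the function can be run more than once without resetting module-level state.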

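Scraping static HTML

Taking a certain movie site as an example:

The site has list pages and detail pages. The magnet links are on the detail pages, and the detail page links come from the list pages. So we request a list page first, extract the detail page URLs, then request each detail page and parse it for the magnet links.

The URLs on a list page can be extracted from the a tags under .co_content8 ul table. The DOM nodes cheerio returns form an array-like object whose each() API is the counterpart of an array's forEach method, and we use it to collect the links. Extracting magnet links on a detail page works the same way. The code uses async/await syntax (standardized in ES2017), an effective way to fetch data asynchronously.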

const path = require('path')
const debug = require('debug')('fetchMovie')
const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)
const writeFile = r(APP.L, 'writeFile')
const {$, $select} = r(APP.L, 'cheerio')

const { movie } = targets

// Movie genres, obtained by analyzing the site
const movieTypes = {
  0: 'drama', 
  1: 'comedy', 
  2: 'action', 
  3: 'love', 
  4: 'sciFi', 
  5: 'cartoon', 
  7: 'thriller',
  8: 'horror', 
  14: 'war',
  15: 'crime',
}

const typeIndex = Object.keys(movieTypes)

// Page analysis gives the list selector: '.co_content8 ul table'
const fetchMovieList = async (type = 0) => {
  debug(`fetch ${movieTypes[type]} movie`)
  // Movie records: title, detail link, magnet links, score
  let data = []
  let paging = {}
  let currentPage = 1
  const totalPage = 30 // number of list pages to fetch
  while(currentPage <= totalPage) {
    const url = movie.url + `/${type}/index${currentPage > 1 ? '_' + currentPage : ''}.html`
    const res = await superAgent(url, {}, 'static')
    // A cheerio collection of the matching nodes
    const $ele = $select(res, '.co_content8 ul table')
    // Collect the detail-page anchors first; each() does not wait for async callbacks
    const anchors = []
    $ele.each((index, ele) => {
      const li = $(ele).html()
      $select(li, 'td b .ulink').last().each((idx, e) => anchors.push(e))
    })
    // Request the detail pages and wait for all of them before writing the file
    const infos = await Promise.all(anchors.map(async e => {
      const link = movie.url + e.attribs.href
      const { magneto, score } = await fetchMoreInfo(link)
      return {title: $(e).text(), link, magneto, score}
    }))
    // Sort by score, highest first
    data = data.concat(infos).sort((a, b) => b.score - a.score)
    paging = { total: data.length }
    writeFile(`${movieTypes[type]}Movie.json`, { paging, data }, path.join(APP.D, `movie`))
    currentPage++
  }
}

// Get the magnet links: '.bd2 #Zoom table a'
const fetchMoreInfo = async link => {
  if(!link) return null
  let magneto = []
  let score = 0
  const res = await superAgent(link, {}, 'static')
  $select(res, '.bd2 #Zoom table a').each((index, ele) => {
    // No longer filtering here; some movies have no magnet link
    // if(/^magnet/.test(ele.attribs.href)) {}
    magneto.push(ele.attribs.href)
  })
  $select(res, '.position .rank').each((index, ele) => {
    score = Math.min(Number($(ele).text()), 10).toFixed(1)
  })
  return { magneto, score }
}

// Fetch all movie genres concurrently; the returned promise settles when every genre is done
const fetchAllMovies = () => {
  return Promise.all(typeIndex.map(index => fetchMovieList(index)))
}

module.exports = fetchAllMovies
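
One thing to watch for when combining each() with async callbacks: like an array's forEach, it does not wait for the promises the callbacks return. The sketch below shows the difference with plain arrays and a fake request. `fetchDetail` and the links are made up for illustration.

```javascript
// Stand-in for a detail-page request that resolves after a short delay
const fetchDetail = link => new Promise(resolve => {
  setTimeout(() => resolve({ link, score: link.length }), 10)
})

const links = ['/a.html', '/bb.html', '/ccc.html']

// forEach with an async callback: the loop returns before any request finishes
const collectWithForEach = () => {
  const data = []
  links.forEach(async link => data.push(await fetchDetail(link)))
  return data.length // still 0 at this point
}

// Promise.all over map: waits for every request, results keep the input order
const collectWithPromiseAll = async () => {
  const data = await Promise.all(links.map(link => fetchDetail(link)))
  return data.length
}

console.log(collectWithForEach()) // 0
collectWithPromiseAll().then(n => console.log(n)) // 3
```

Anything that depends on the collected results, such as writing them to a file, should therefore run after the `Promise.all` has resolved.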

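Data processing

The fetched data could be stored in a database; for now I write it to local files. Local data can also serve as the data source for an API. For example, I can expose the movie data through a local API and use it as a server for local development: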

const path = require('path')
const router = require('koa-router')()
const readFile = r(APP.L, 'readFile')
const formatPaging = r(APP.M, 'formatPaging')

// router.prefix('/api');
router.get('/movie/:type', async ctx => {
  const {type} = ctx.params
  const totalData = readFile(`${type}Movie.json`, path.join(APP.D, 'movie'))
  const formatData = await formatPaging(ctx, totalData)
  ctx.body = formatData
})

module.exports = router.routes()

I maintain the pagination manually here, so the data can also be delivered to the front end as a feed:

// Build pagination data manually
const {getQuery, addQuery} = r(APP.L, 'url')
const {isEmpty} = require('lodash')

module.exports = (ctx, originData) => {
  return new Promise((resolve) => {
    const {url, header: {host}} = ctx
    if(!url || isEmpty(originData)) {
      return resolve({
        data: [],
        paging: {}
      })
    }
    const {data, paging} = JSON.parse(originData)
    const query = getQuery(url)
    const limit = parseInt(query.limit) || 10
    const offset = parseInt(query.offset) || 0
    const isEnd = offset + limit >= data.length
    const prev = addQuery(`http://${host}${url}`, {limit, offset: Math.max(offset - limit, 0)})
    const next = addQuery(`http://${host}${url}`, {limit, offset: Math.max(offset + limit, 0)})
    const formatData = {
      data: data.slice(offset, offset + limit),
      paging: Object.assign({}, paging, {prev, next, isEnd})
    }
    return resolve(formatData)
  })
}
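
The project's `url` helpers are not shown in this article, so here is a self-contained sketch of the same offset/limit paging built on Node's `URL` class. The `getQuery` and `addQuery` implementations below are assumptions about what such helpers might look like, and `paginate` is a hypothetical name.

```javascript
// Hypothetical stand-ins for the project's url helpers (getQuery / addQuery)
const getQuery = url => Object.fromEntries(new URL(url, 'http://localhost').searchParams)
const addQuery = (url, params) => {
  const u = new URL(url)
  Object.entries(params).forEach(([key, value]) => u.searchParams.set(key, String(value)))
  return u.toString()
}

// Slice a list by offset/limit and build prev/next links
const paginate = (list, fullUrl) => {
  const query = getQuery(fullUrl)
  const limit = parseInt(query.limit, 10) || 10
  const offset = parseInt(query.offset, 10) || 0
  return {
    data: list.slice(offset, offset + limit),
    paging: {
      total: list.length,
      isEnd: offset + limit >= list.length,
      prev: addQuery(fullUrl, { limit, offset: Math.max(offset - limit, 0) }),
      next: addQuery(fullUrl, { limit, offset: offset + limit })
    }
  }
}

const page = paginate([1, 2, 3, 4, 5], 'http://localhost:8080/movie/action?limit=2&offset=2')
console.log(page.data)        // [ 3, 4 ]
console.log(page.paging.next) // http://localhost:8080/movie/action?limit=2&offset=4
```

The front end can keep requesting `paging.next` until `isEnd` is true, which gives the feed behaviour described above.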

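If it suits you, write the data to a database instead, and you get the full crawler, back end, and front end pipeline, haha.

Run npm run start to start the web service, and you can then see the API.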

✨✨✨

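Of course, there is far more to crawlers than this. Some sites restrict crawlers and require building an IP pool and rotating IPs from time to time; others require simulating a login. There is still a lot to learn. If you like the project, feel free to open an issue so we can discuss and learn together.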

Fe-Icy/firm-spider
