Scraping the 51job Recruitment Site with a Python Crawler

阿新 • Published: 2019-01-09

I have been learning web scraping recently and wrote a Python crawler tool, which I am recording here.

# A Python tool for scraping 51job; with minor changes it can scrape other sites
# edit by mengqi Date: 2018-07-11
# encoding: utf-8


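import csv                # the scraped data is written to a CSV file, so this module is needed
from urllib import request, error
from lxml import etree    # element tree, used for XPath parsing
import random             # picks one of five browser User-Agent strings at random, to reduce the chance of detection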


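# 1. get_html() fetches the given url and returns the page as an HTML string decoded with encode
def get_html(url, encode='utf-8'):
    html_str2 = None  # returned as None if the request or the decoding fails
    try: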
        ua_value1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) " \
                    "Gecko/20100101 Firefox/61.0"
        ua_value2 = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ' \
                    '(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'
        ua_value3 = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 ' \
                    '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
        ua_value4 = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)' \
                    ' AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16'
        ua_value5 = 'Mozilla/5.0 (Windows NT 6.1; WOW64)' \
                    ' AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER'
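        # Build a pool of User-Agent strings to imitate a browser visit
        ua = (ua_value1, ua_value2, ua_value3, ua_value4, ua_value5)  # a tuple, so the entries cannot be changed by accident
        req = request.Request(url)   # 3. build the request object
        req.add_header("User-Agent", random.choice(ua))   # 4. add a User-Agent to the request headers
        response = request.urlopen(req)    # 5. send the request and get the server's response object
        html_str2 = response.read().decode(encode)  # 6. read the page source (the response body) from the response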

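    except error.HTTPError:           # must come before URLError, because HTTPError is a subclass of it
        print('HTTP request error')
    except error.URLError:            # raised when the URL cannot be reached
        print('URL request error')
    except Exception:
        print('Program error')
    return html_str2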


def crawl_onepage(html_str1):              # parse the HTML string with XPath and collect the results in the list rows
    html_ = etree.HTML(html_str1)       # parse the HTML string into an HTML document (root element)
    html = etree.ElementTree(html_)     # wrap the document in an element tree
    # Extract the data with XPath
    div_el = html.xpath('//div[@id="resultList"]/div[@class="el"]')  # list of every div with class="el" inside the element with id="resultList"
    rows = list()
    # Loop over the rows; each el holds one job posting
    for index, el in enumerate(div_el):  # el is an element of the HTML document
        el = etree.ElementTree(el)  # as above: wrap the element in its own element tree, so the absolute paths below start at this div
        title = el.xpath('/div/p/span/a/@title')  # job title
        title = title[0] if title else None
        link = el.xpath('/div/p/span/a/@href')  # URL of the detail page
        link = link[0] if link else None
        company = el.xpath('/div/span[@class="t2"]/a/@title')  # company
        company = company[0] if company else None
        city = el.xpath('/div/span[@class="t3"]/text()')  # work location
        city = city[0] if city else None
        salary = el.xpath('/div/span[@class="t4"]/text()')  # salary
        salary = salary[0] if salary else None
        time = el.xpath('/div/span[@class="t5"]/text()')  # posting date
        time = time[0] if time else None
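        exp = degree = fuli = None  # stay None when the detail page is unavailable
        child_str = get_html(link, 'gbk') if link else None  # fetch the detail page only when a link was found
        if child_str:
            child_ = etree.HTML(child_str)
            child = etree.ElementTree(child_)  # element tree of the detail page, queried with XPath
            exp = child.xpath('//div[@class="jtag inbox"]/div/span/em[@class="i1"]/parent::span/text()')  # work experience
            exp = exp[0] if exp else None
            degree = child.xpath('//div[@class="jtag inbox"]/div/span/em[@class="i2"]/parent::span/text()')  # education
            degree = degree[0] if degree else None
            fuli = child.xpath('//div[@class="jtag inbox"]/p/span/text()')  # benefits
            fuli = ', '.join(fuli) if fuli else None  # benefits come back as a list, so join them into one string
        row = (title, company, city, salary, time, exp, degree, fuli)  # pack one row of data into a tuple
        # print(row)
        rows.append(row)  # add this posting's fields to the list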
    return rows


def csv_write(filename, mode, content):                  # writes content to a CSV file
    with open(filename, mode, newline="", encoding='utf-8') as job:   # open filename in the given mode with an explicit encoding
        file = csv.writer(job)
        if mode == 'w':                                  # write mode: overwrites the file; content is a single row
            file.writerow(content)
        if mode == 'a':                                  # append mode: does not overwrite; content is a list of rows
            file.writerows(content)


def crawl_manypage(keyword, start, end):              # scrape several pages: keyword, first page, last page
    head = ('Job title', 'Company', 'Location', 'Salary', 'Posted', 'Experience', 'Education', 'Benefits')   # header row
    csv_write('{}.csv'.format(keyword), 'w', head)                     # call csv_write() defined above
    for page in range(start, end+1):                                    # page is the page number
        url1 = 'https://search.51job.com/list/010000,000000,0000,00,9,99,{},2,{}.html?' \
               'lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99' \
               '&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=' \
               '&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='.format(keyword,page)
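        html_str = get_html(url1,'gbk')                   # fetch the HTML string, decoded as gbk
        if html_str is None:                              # skip this page if the request failed
            continue
        rows = crawl_onepage(html_str)                    # scrape one page of data
        csv_write('{}.csv'.format(keyword), 'a', rows)    # append the rows to the CSV file
# Entry point: scrape result pages 1 to 3 for the keyword "python"
crawl_manypage('python',1,3)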
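
The User-Agent rotation and the order of the except clauses in get_html() can be checked without touching the network. The following is a minimal sketch rather than part of the original tool: build_request() and the two-entry USER_AGENTS tuple are names made up for this illustration, and the URL is a placeholder that is never requested.

```python
import random
from urllib import error, request

# Hypothetical shortened pool for this sketch; the tool above uses five strings.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
)


def build_request(url):
    """Return a Request carrying one randomly chosen User-Agent (nothing is sent)."""
    req = request.Request(url)
    req.add_header("User-Agent", random.choice(USER_AGENTS))
    return req


req = build_request("https://example.com/")  # placeholder URL, never opened
# urllib stores header names capitalized, so the key is "User-agent"
print(req.get_header("User-agent") in USER_AGENTS)

# HTTPError derives from URLError, so its except clause has to be listed first
print(issubclass(error.HTTPError, error.URLError))
```

Keeping the request construction separate from the urlopen() call makes it possible to test the header logic offline before any page is fetched.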

Then open the python.csv file in PyCharm, right-click it, choose File Encoding and select gbk; after that the file can be opened and read in Excel.
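
The manual re-encoding step can probably be avoided by writing the file with a byte order mark. Here is a minimal sketch, assuming Excel detects UTF-8 from the BOM as it usually does: the helper name csv_write_bom and the sample rows are made up for this illustration, and only the encoding differs from csv_write() above.

```python
import csv
import os
import tempfile


def csv_write_bom(filename, rows):
    """Write rows to filename as UTF-8 with a leading byte order mark."""
    # The "utf-8-sig" codec prepends the BOM bytes EF BB BF when writing
    with open(filename, "w", newline="", encoding="utf-8-sig") as job:
        csv.writer(job).writerows(rows)


# Illustrative rows only, not scraped results
sample = [("職位", "公司"), ("Python開發", "示例公司")]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
csv_write_bom(path, sample)

# Check that the file really starts with the BOM
with open(path, "rb") as raw:
    has_bom = raw.read(3) == b"\xef\xbb\xbf"
print(has_bom)

# Reading back with the same codec strips the BOM again
with open(path, newline="", encoding="utf-8-sig") as job:
    restored = [tuple(row) for row in csv.reader(job)]
print(restored == sample)
```

Whether this fits depends on who else reads the file: tools that expect plain UTF-8 may treat the BOM as part of the first header cell.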

The next post will optimize this crawler and give a brief analysis of the scraped results.
