Scraping 51job Job Listings (Part 1)

阿新 • Published: 2021-08-16

Goal: scrape the job listings from the search result pages, handle pagination, and save the results as a CSV file.

import os
from concurrent.futures.thread import ThreadPoolExecutor
from threading import Thread

import requests
from re import findall
from json import loads
import time
import pymysql
from multiprocessing import Queue

import csv
import pandas as pd  # required for pd.DataFrame further down

# Fetch one page of search results and return its list of job records
# Note: city_code is accepted but not used yet; the URL below hardcodes 000000
def get_one_page(page, city_code='000000'):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    url = f'https://search.51job.com/list/000000,000000,0000,00,9,99,資料分析,2,{page}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        # The results are embedded in the page as a JSON object assigned to window.__SEARCH_RESULT__
        json_data = findall(r'window\.__SEARCH_RESULT__\s*=\s*(\{.+?\})</script>', response.text)[0]
        return loads(json_data)['engine_search_result']
    else:
        # Falls through and returns None, which the caller treats as "no data"
        print('Request failed!')
# Number of pages to fetch (at most 10)
start_page = 1
ts = []
for _ in range(10):
    result = get_one_page(start_page)
    if not result:
        print('No more data')
        break
    ts.append(result)
    start_page += 1
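
The loop above stops as soon as a page comes back empty. A minimal sketch of the same pattern as a reusable function follows; `collect_pages` and the stub `fake_fetch` are hypothetical names introduced here so the loop can be exercised without network access.

```python
def collect_pages(fetch_page, max_pages=10, start_page=1):
    """Call fetch_page(page) for consecutive pages until one returns nothing."""
    pages = []
    for page in range(start_page, start_page + max_pages):
        result = fetch_page(page)
        if not result:  # None (failed request) or [] (no more listings)
            break
        pages.append(result)
    return pages

# Stub standing in for get_one_page: three pages of data, then nothing.
def fake_fetch(page):
    return [{'job_name': f'job-{page}'}] if page <= 3 else []

pages = collect_pages(fake_fetch)
print(len(pages))
```

Passing the fetch function as an argument keeps the paging logic separate from the HTTP request, so it can be checked with a stub like the one here.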

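# data_1 = get_one_page(1)  # try storing a single page first
data_1 = []  # empty list that collects the records from all pages


for page_results in ts:
    # A page normally holds 50 records; extend() also copes with a shorter last page
    data_1.extend(page_results)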

# Fields to keep for each job

jobs = []
for job in data_1:
    job_info = [job.get('job_name'),
               job.get('providesalary_text'),
               job.get('company_name'),
               job.get('companytype_text'),
               job.get('workarea_text'),
               '-'.join(job.get('attribute_text', ['-', '-', '-', '-', '-'])),
               job.get('jobwelf')
               ]
    jobs.append(job_info)

name = ['job_name', 'providesalary_text', 'company_name', 'companytype_text', 'workarea_text', 'attribute_text', 'jobwelf']
test = pd.DataFrame(columns=name, data=jobs)
test.to_csv("testcsv.csv")  # save as CSV

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   job_name            500 non-null    object
 1   providesalary_text  500 non-null    object
 2   company_name        500 non-null    object
 3   companytype_text    500 non-null    object
 4   workarea_text       500 non-null    object
 5   attribute_text      500 non-null    object
 6   jobwelf             500 non-null    object
dtypes: object(7)
memory usage: 27.5+ KB
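
pandas is not strictly needed just to write the file. As an alternative sketch, the standard-library `csv` module (already imported at the top) can write the same kind of rows. The sample row and the in-memory buffer are invented for illustration, and only three of the seven columns are shown.

```python
import csv
import io

header = ['job_name', 'providesalary_text', 'company_name']
rows = [['Data Analyst', '10-15k/month', 'Example Co.']]  # made-up sample row

# In-memory buffer for the demo; for a real file, use something like
# open('testcsv.csv', 'w', newline='', encoding='utf-8-sig')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Unlike `DataFrame.to_csv` with its defaults, this writes no index column.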


Key reference: https://gitee.com/wenhaha8/job51_analysis
