python爬取拉勾網之selenium
阿新 • • 發佈:2018-12-01
重點程式碼解釋:
1.呼叫lxml的etree實現xpath方法呼叫,xpath相對正則比較簡單,可以不在使用Beauitfulsoup定位
from lxml import etree
2.介面可視化與否:設定無介面(headless)模式可以減少執行時的資源消耗
opt=webdriver.ChromeOptions()
# 把chrome設定成無介面模式,不論windows還是linux都可以,自動適配對應引數
opt.add_argument('--headless')#無介面模式(舊寫法 opt.set_headless() 在 Selenium 4 中已被移除)
self.driver=webdriver.Chrome(options=opt)
3.載入資料時到xpath定位的位置進行爬取
#此句話大致意思:執行 driver 時最多等待 20 秒,什麼時候載入到 xpath 定位的位置,什麼時候停止等待並開始執行頁面內容爬取
WebDriverWait(driver=self.driver,timeout=20).until(EC.presence_of_all_elements_located((By.XPATH,'//*[@id="s_position_list"]/div[2]/div/a[6]')))
4.python 中join()函式strip() 函式和 split() 函式的詳解及例項詳細內容請到我的另一篇部落格
檢視
content = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
‘’’
語法: ‘sep’.join(seq)
引數說明
sep:分隔符。可以為空
seq:要連線的元素序列、字串、元組、字典
上面的語法即:以sep作為分隔符,將seq所有的元素合併成一個新的字串
返回值:返回一個以分隔符sep連線各個元素後生成的字串
‘’’
"""Crawl Python job postings from Lagou (lagou.com) with Selenium.

The spider pages through the job-list pages, opens each posting in a new
browser tab, extracts the job fields with lxml/XPath, and can persist
them to MySQL through ``MySQLPipeline``.
"""

from selenium import webdriver
import lxml
from lxml import etree
import re
import time
import pymysql
import urllib.request
import requests
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


class LagouSpider(object):
    """Selenium-driven spider for the Lagou Python job listings."""

    # XPath of the "next page" button on the job-list page (used both to
    # detect that the page has rendered and to advance the pagination).
    NEXT_BTN_XPATH = '//*[@id="s_position_list"]/div[2]/div/a[6]'

    def __init__(self):
        opt = webdriver.ChromeOptions()
        # Headless mode works on both Windows and Linux and saves display
        # resources.  NOTE: set_headless() was removed in Selenium 4;
        # add_argument("--headless") works on Selenium 3 and 4 alike.
        opt.add_argument("--headless")
        self.driver = webdriver.Chrome(options=opt)
        self.url = "https://www.lagou.com/zhaopin/Python/"

    def run(self):
        """Open the list page and walk every result page until the last one."""
        self.driver.get(self.url)
        while True:
            source = self.driver.page_source
            # Wait (at most 20 s) until the pager button is present, i.e.
            # the list page has finished rendering.
            # BUG FIX: presence_of_all_elements_located takes a single
            # (By, value) locator TUPLE, not two positional arguments --
            # the original call raised TypeError at runtime.
            WebDriverWait(driver=self.driver, timeout=20).until(
                EC.presence_of_all_elements_located(
                    (By.XPATH, self.NEXT_BTN_XPATH)))
            self.parse_list_page(source)
            # Advance to the next page; stop once the button is disabled.
            # (find_element_by_xpath was removed in Selenium 4; the
            # find_element(By.XPATH, ...) form works in 3 and 4.)
            next_btn = self.driver.find_element(By.XPATH, self.NEXT_BTN_XPATH)
            if "pager_next_disabled" in next_btn.get_attribute("class"):
                break
            next_btn.click()
            time.sleep(1)

    def parse_list_page(self, source):
        """Extract every job-detail URL from a list page and crawl each one.

        :param source: raw HTML of one job-list page.
        """
        html = etree.HTML(source)
        links = html.xpath(
            '//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/@href')
        for link in links:
            self.request_detail_page(link)

    def request_detail_page(self, url):
        """Open *url* in a new tab, scrape it, then return to the list tab.

        :param url: absolute URL of one job-detail page.
        """
        self.driver.execute_script("window.open('%s')" % url)
        # Switch the driver handle into the freshly opened tab.
        self.driver.switch_to.window(self.driver.window_handles[1])
        # Wait until the job title has rendered before reading the page.
        WebDriverWait(driver=self.driver, timeout=20).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='job-name']/span[@class='name']")))
        source = self.driver.page_source
        self.parse_detail_page(source)
        # Close the detail tab and switch back to the list tab.
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        """Parse one job-detail page and pull out the individual fields.

        :param source: raw HTML of one job-detail page.
        """
        html = etree.HTML(source)
        positionName = html.xpath(
            "//div[@class='position-head']/div/div[1]/div/span/text()")[0]
        job_request_spans = html.xpath(
            "//div[@class='position-head']/div/div[1]/dd/p[1]/span")
        salary = job_request_spans[0].xpath(".//text()")[0].strip()
        # Lagou renders these fields with surrounding whitespace and "/"
        # separators; strip both so only the value remains.
        city = job_request_spans[1].xpath(".//text()")[0].strip()
        city = re.sub(r"[\s/]", "", city)
        work_years = job_request_spans[2].xpath(".//text()")[0].strip()
        work_years = re.sub(r"[\s/]", "", work_years)
        education = job_request_spans[3].xpath(".//text()")[0].strip()
        education = re.sub(r"[\s/]", "", education)
        # Join all text nodes of the description block into one string.
        content = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
        # NOTE(review): persistence is disabled, matching the original --
        # uncomment to store each record in MySQL.
        # mysql = MySQLPipeline()
        # mysql.process_item(positionName, salary, city, work_years,
        #                    education, content)


class MySQLPipeline(object):
    """Persist scraped job records into the local ``lagou`` MySQL database."""

    def __init__(self):
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="root", db="lagou",
                                    charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, positionName, salary, city, work_years,
                     education, content):
        """Insert one job record (parameterized SQL avoids injection)."""
        insert_sql = '''
        insert into lagou_table(positionName,salary,city,work_years,education,content)
        values(%s,%s,%s,%s,%s,%s)
        '''
        self.cursor.execute(insert_sql, (positionName, salary, city,
                                         work_years, education, content))
        self.conn.commit()

    def close_spider(self, spider=None):
        """Release the cursor and the connection.

        *spider* is accepted (and ignored) for Scrapy-style callers; a
        default of ``None`` keeps direct no-argument calls working too.
        """
        self.cursor.close()
        self.conn.close()


if __name__ == "__main__":
    spider = LagouSpider()
    spider.run()