Scraping and Analyzing Job Data with Python: Data-Analysis Positions on Lagou

阿新 • Published: 2020-12-17

Tags: scraping and analyzing job data with Python

Lagou's anti-scraping measures are especially aggressive.

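JavaScript encryption, and fake data returned when you request the JSON endpoint directly, are only the start. The harshest measure is also the simplest: Lagou limits how many requests you can make in a short period. If the crawler runs even slightly too fast, you are asked to pass a verification check. In addition, when you are not logged in, you are asked to log in after every 10 pages you visit, whether you are a crawler or an ordinary visitor.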

There are generally two ways to scrape the site: one uses requests, the other uses selenium.

Reverse engineering with requests

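The requests approach requires reverse engineering, that is, working out how the source URLs are constructed. For Lagou this is fairly simple: the XHR requests show that a search request passes three fields: first, pn (page number), and kd (search keyword). Sending these three fields with the request returns the data as JSON.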
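As a minimal sketch of the three fields described above, the helper below builds the form data for one search request. The function name `build_search_payload` is hypothetical, and the rule that `first` is "true" only on page 1 is an assumption; the text only says that the field is sent.

```python
from typing import Dict


def build_search_payload(keyword: str, page: int) -> Dict[str, str]:
    """Build the form fields for one search request (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        # Assumption: `first` flags the first page of a search.
        "first": "true" if page == 1 else "false",
        "pn": str(page),   # page number
        "kd": keyword,     # search keyword
    }


if __name__ == "__main__":
    print(build_search_payload("数据分析", 1))
```

A dictionary like this could be passed as the `data=` argument of `requests.post`, together with whatever headers and cookies the site expects; those are not covered here.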

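There is one remaining problem, though: the returned JSON is structured in an unusual way. Lagou has another trick here that keeps the data in the JSON dictionary from being extracted, and I have not worked out how to parse it yet, so this time we use the other approach.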

selenium

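selenium is simpler than reverse engineering the site: it mimics browser behavior, so no reverse engineering is needed. Scraping with selenium runs into the following issues:

1. A freshly opened job search results page shows a red-envelope pop-up. Use an explicit wait until the relevant element has loaded, then click to dismiss it.
2. Lagou's job list shows at most 30 pages (some cities have fewer), with 15 entries per page, so the crawler has to detect when it has reached the last page.
3. To keep Lagou from showing the login page after every 10 pages opened, the scraped links need to be split into groups of 10.
4. selenium launches the browser over and over during the crawl, so without the '--headless' option you cannot use your computer while it runs, because browser windows keep popping up.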
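Points 2 and 3 come down to simple arithmetic and list slicing. The sketch below is only an illustration: the names `max_listings` and `chunk_links` are hypothetical, and the example.com URLs are placeholders, not real Lagou links.

```python
from typing import List

MAX_PAGES = 30       # upper bound on result pages, per the text
ITEMS_PER_PAGE = 15  # entries per result page, per the text


def max_listings() -> int:
    """Upper bound on the number of detail links one search can yield."""
    return MAX_PAGES * ITEMS_PER_PAGE


def chunk_links(links: List[str], size: int = 10) -> List[List[str]]:
    """Split links into consecutive groups of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [links[i:i + size] for i in range(0, len(links), size)]


if __name__ == "__main__":
    demo = [f"https://example.com/jobs/{n}.html" for n in range(23)]
    print([len(group) for group in chunk_links(demo)])  # [10, 10, 3]
    print(max_listings())  # 450
```

Each group can then be handled by a fresh browser session, so that no single session opens more than 10 detail pages.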

The source code is as follows:

# Header metadata (author etc.) was garbled in the source; surviving version string: 72.0.3626.81
# Import the required modules (install any that are missing with `pip install xxx`)
import csv
import random
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# NOTE: the attribute tests inside the [@] predicates of the XPath expressions
# below were lost in the source and are left as-is; they must be filled in
# before the spider can run.
class LagouSpider():
    def __init__(self):
        option = ChromeOptions()
        option.add_argument('--headless')
        # Launch Chrome when the class is instantiated (drop --headless to watch the run)
        self.driver = webdriver.Chrome(options=option)
        self.driver.set_window_size(1920, 1080)
        # URL of the search page
        self.url = "https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/p-city_6?px=default#filterBox"
        # URLs of all job detail pages
        self.all_links = []

    def run2(self, ten_links):
        '''
        Request and parse up to 10 job detail URLs, save the job details, then quit the browser.
        :param ten_links: list of up to 10 job detail page URLs
        :return:
        '''
        # Visit each detail URL
        for link in ten_links:
            # Request and parse the page via request_detail_page
            self.request_detail_page(link)
            # Random 5-9 s pause to avoid triggering anti-scraping
            time.sleep(random.randint(5, 9))
        # Quit the browser once this group of jobs has been fetched
        self.driver.quit()

    def run1(self):
        '''
        Open the search page, page through to the last page, and parse the HTML to collect all detail links.
        :return: de-duplicated list of job detail page URLs
        '''
        # Load the page in the current browser
        self.driver.get(self.url)
        # Explicitly wait (up to 10 s) for the pop-up's button, then click it to dismiss the pop-up
        button = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'body-btn'))
        )
        button.click()
        # Tracks the current page number
        count_page = 1
        # Keep paging until the last page
        while True:
            # Get the HTML source of the current page
            source = self.driver.page_source
            # Parse source with XPath and store the detail links in self.all_links
            self.get_all_detail_links(source)
            print('Fetched page %s.' % str(count_page))
            # Locate the node that holds the "next page" button
            next_btn = self.driver.find_element(By.XPATH, '//div[@]/span[last()]')
            # Check whether the "next page" button is enabled
            if "pager_next_disabled" in next_btn.get_attribute("class"):
                # A disabled button means this is the last page: quit the browser
                self.driver.quit()
                # Return the de-duplicated list of detail page URLs
                return list(set(self.all_links))
            else:
                # The button is enabled: click it to go to the next page
                next_btn.click()
                count_page += 1
                time.sleep(random.randint(2, 4))
            time.sleep(random.randint(3, 5))

    def get_all_detail_links(self, source):
        '''
        Parse source with XPath and append the detail links to self.all_links.
        :param source: HTML source of the page
        :return:
        '''
        html = etree.HTML(source)
        links = html.xpath('//a[@]/@href')
        self.all_links += links

    def request_detail_page(self, url):
        '''
        Request a job detail page and pass it to parse_detail_page.
        :param url: job detail page URL
        :return:
        Modifying this part is not recommended.
        '''
        # Run JavaScript synchronously in the current window to open the URL in a new window
        self.driver.execute_script("window.open('%s')" % url)
        # The new window adds one more entry to the window handles.
        # driver.switch_to.window: move focus to the given window
        # driver.window_handles: handles of all windows in the current session
        self.driver.switch_to.window(self.driver.window_handles[1])  # the new window is the 2nd one (index 1)
        source = self.driver.page_source
        self.parse_detail_page(source)
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])  # back to the main window (otherwise no new window can be opened)

    def parse_detail_page(self, source):
        '''
        Parse a detail page, extract the job details with XPath, and save them.
        :param source: HTML source of the job detail page
        :return:
        '''
        # Parse source into an lxml HTML tree
        html = etree.HTML(source)
        # Text of the node holding the job title, i.e. position_name
        position_name = html.xpath("//h1[@class='name']/text()")[0]
        # Node holding the job id, from which position_id can be extracted
        # position_id = html.xpath("//link[@rel='canonical']/@href")[0].split('/')[-1].replace('.html', '')
        # Job tags, in order: salary, city, years of experience, education, full-time or part-time
        job_request_spans = html.xpath('//dd[@]//span')
        salary = job_request_spans[0].xpath('.//text()')[0].strip()  # list index 0 == 1st XPath node
        city = job_request_spans[1].xpath('.//text()')[0].strip().replace("/", "").strip()
        work_year = job_request_spans[2].xpath('.//text()')[0].strip("/").strip()
        education = job_request_spans[3].xpath('.//text()')[0].strip("/").strip()
        work_full = job_request_spans[4].xpath('.//text()')[0]
        # Company block: company_short_name
        company_short_name = html.xpath('//dl[@]//em/text()')[0].replace("\n", "").strip()
        # Company block: industry_field, finance_stage, and scale
        company_infos = html.xpath('//dl[@]//li')  # note: text() items 0 and 2 under this node are empty
        industry_field = company_infos[0].xpath('.//h4[@]/text()')[0]
        finance_stage = company_infos[1].xpath('.//h4[@]/text()')[0]
        scale = company_infos[2].xpath('.//h4[@]/text()')[0]
        # District of the work address
        district = html.xpath('//div[@]/a[2]/text()')[0].strip()
        # Job perks, i.e. position_advantage
        position_advantage = html.xpath('//dd[@]//p/text()')[0].strip("/").strip().replace("，", ",")
        # Job description
        job_des = html.xpath('//div[@]/p/text()')
        # Append the row to the CSV file
        with open('test.csv', 'a', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow([position_name, salary, city, work_year, education, work_full,
                             company_short_name, industry_field, finance_stage, scale,
                             district, position_advantage, job_des])


if __name__ == "__main__":
    # Record the start time
    start_time = time.time()
    # Instantiate LagouSpider and call run1 to collect all job detail URLs
    needed_all_links = LagouSpider().run1()
    # Split the detail URLs into a nested list, 10 per group
    nested_all_links = [needed_all_links[i:i + 10] for i in range(0, len(needed_all_links), 10)]
    count = 0
    # Requesting 10 detail pages in a row triggers the login page, so restart the browser every 10
    for ten_links in nested_all_links:
        # One browser session per group of 10: run2 saves the job details
        LagouSpider().run2(ten_links)
        count += len(ten_links)
        # Lengthen the pause as count grows; too many requests trigger the login page
        time.sleep(random.randint(6, 12) * (count // 100 + 1))
        print('-------------------------')
        print('Have fetched %s positions.' % str(count))
    # Record the end time
    end_time = time.time()
    print('\n[Done]\n[Total time: %.2f minutes]' % ((end_time - start_time) / 60))

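A simple improvement would be to pass the city code into the function as a parameter and loop over cities. A further step would be multithreading to speed up the crawl.

References:

1. Lagou Python crawler: Selenium + XPath to get around anti-scraping and fetch all job details without logging in

2. Python crawler: scraping and analyzing Lagou job listings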
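Returning to the city-code idea, here is a minimal sketch of building the search URL from a keyword and a city code. It assumes the URL pattern is exactly the one hard-coded in the spider (`list_<keyword>/p-city_<code>`); the helper name `build_search_url` is hypothetical, and the only city code known from the source is 6.

```python
from urllib.parse import quote


def build_search_url(keyword: str, city_code: int) -> str:
    """Build a search-page URL following the pattern hard-coded in the spider."""
    # quote() percent-encodes the keyword as UTF-8, as seen in the original URL.
    return (
        f"https://www.lagou.com/jobs/list_{quote(keyword)}"
        f"/p-city_{city_code}?px=default#filterBox"
    )


if __name__ == "__main__":
    # 6 is the code used in the original URL; other codes would have to be
    # looked up on the site and added to this list.
    city_codes = [6]
    for code in city_codes:
        print(build_search_url("数据分析", code))
```

The spider could accept such a URL as a constructor argument instead of hard-coding `self.url`, and the main block could loop over the city codes.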
