Batch-downloading papers with a Python crawler

阿新 • Published: 2021-06-30

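I have been reading NeurIPS papers lately and wanted to download a good number of them to read whenever I have time, but downloading them one at a time is tedious.
That reminded me of the Python crawlers I kept hearing about, so I decided to try writing one. Here is the final program first:

Final program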

import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://proceedings.neurips.cc/'
SAVE_DIR = os.path.join(os.getcwd(), 'essay')


def makeFilename(title):
    # ':' and '?' are not allowed in file names, so strip them;
    # '/' would be read as a path separator, so keep only the part after it
    title = title.replace(':', '')
    title = title.replace('?', '')
    return (title + '.pdf').split('/')[-1]


def openAndDownload(url, title):
    # Open the paper's detail page and pick out the link to the PDF
    str_subhtml = requests.get(url)
    soup1 = BeautifulSoup(str_subhtml.text, 'lxml')
    subdata = soup1.select('body > div.container-fluid > div > div > a:nth-child(4)')
    # print('subdata:', subdata)
    downloadUrl = urljoin(BASE_URL, subdata[0].get('href'))
    print(downloadUrl)
    getFile(downloadUrl, title)


def getFile(url, title):
    urlretrieve(url, os.path.join(SAVE_DIR, makeFilename(title)))
    print("Successfully downloaded " + title)


# Collect the title and detail-page link of every paper in the 2020 list
url = 'https://proceedings.neurips.cc/paper/2020'
strhtml = requests.get(url)
soup = BeautifulSoup(strhtml.text, 'lxml')
data = soup.select('body > div.container-fluid > div > ul > li > a')

rows = []
for item in data:
    rows.append([item.get_text(), item.get('href')])

name = ['title', 'link']
test = pd.DataFrame(columns=name, data=rows)
print(test)
test.to_csv('./essaylist.csv')

# Check which papers have already been downloaded
os.makedirs(SAVE_DIR, exist_ok=True)  # urlretrieve does not create the folder
downloaded_list = os.listdir(SAVE_DIR)

for et, el in zip(test[name[0]], test[name[1]]):
    essay_url = urljoin(BASE_URL, el)
    checkname = makeFilename(et)
    if checkname in downloaded_list:
        print(checkname + ' has been downloaded! ')
    else:
        openAndDownload(essay_url, et)

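I followed this introductory Python crawler tutorial: http://c.biancheng.net/view/2011.html

The program uses the requests, BeautifulSoup, and urllib.request packages, plus pandas for saving the list as CSV and lxml as the HTML parser.

Designing the batch download program for the NeurIPS site

Analyzing the web page

The goal is to download papers from the NeurIPS 2020 conference list.
Start by analyzing how the page is built: open the page and press F12 to bring up the developer tools, where you can see the page's source.
Hovering over different places in the Elements panel on the right highlights different parts of the page on the left, which is how you locate the entry for a single paper.

[Figure: browser developer tools with one paper's entry highlighted]

In Python, fetch the same page with requests and parse it with BeautifulSoup: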

import requests  
from bs4 import BeautifulSoup

url = 'https://proceedings.neurips.cc/paper/2020'
strhtml = requests.get(url)
soup = BeautifulSoup(strhtml.text, 'lxml')
print(soup)

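The output shows that the papers are laid out neatly as a list.
BeautifulSoup's select function can pick out the relevant elements. The path it needs can be copied from the developer tools with Copy selector, which gives:

body > div.container-fluid > div > ul > li:nth-child(1) > a

Here li:nth-child(1) refers to one particular item. To get the whole list, select plain li instead: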

data = soup.select('body > div.container-fluid > div > ul > li > a')
print(data)
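
To make the difference between the two selectors concrete, here is a small sketch that runs both against a hand-written HTML fragment. The fragment, its titles, and its links are made up for illustration and only imitate the list structure described above. It uses html.parser rather than lxml so that it needs nothing beyond BeautifulSoup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure of the paper list page
SAMPLE_HTML = """
<html><body>
<div class="container-fluid"><div>
<ul>
  <li><a href="/paper/2020/hash/aaa-Abstract.html">First Paper</a></li>
  <li><a href="/paper/2020/hash/bbb-Abstract.html">Second Paper</a></li>
  <li><a href="/paper/2020/hash/ccc-Abstract.html">Third Paper</a></li>
</ul>
</div></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# With li:nth-child(1), only the first list item matches
first_only = soup.select('body > div.container-fluid > div > ul > li:nth-child(1) > a')

# With plain li, every list item matches
all_items = soup.select('body > div.container-fluid > div > ul > li > a')

print(len(first_only), len(all_items))
for a in all_items:
    print(a.get_text(), a.get('href'))
```

Both calls return a list, which is why the single-item case later in this post still indexes the result with [0].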

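Each element in the result holds a paper's title and link, which is exactly what we need.
The title is the text of the <a> tag and is read with get_text(); the link is in the tag's href attribute and is read with get('href').
Extract all of them and save them as a CSV file for later lookup: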

import pandas as pd

rows = []
for item in data:
    rows.append([item.get_text(), item.get('href')])

name = ['title', 'link']
test = pd.DataFrame(columns=name, data=rows)
test.to_csv('./essaylist.csv')

This writes essaylist.csv, with one row per paper holding its title and link.
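
As one possible way to use the saved list for lookups later, the sketch below searches the CSV for titles containing a keyword. It deliberately uses the standard csv module instead of pandas, and the function name find_papers is my own, not part of the original program. It assumes the file has the title and link columns written above.

```python
import csv


def find_papers(csv_path, keyword):
    """Return (title, link) pairs whose title contains keyword, ignoring case."""
    matches = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        # DictReader uses the header row, so columns are read by name
        for row in csv.DictReader(f):
            if keyword.lower() in row['title'].lower():
                matches.append((row['title'], row['link']))
    return matches
```

For example, find_papers('./essaylist.csv', 'graph') would return every saved paper with "graph" in its title.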

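Downloading a single file

The hyperlinks on the list page are not download links; each one opens the paper's detail page instead.

Other information, such as the abstract, could be scraped from this page, but for now we only want the paper itself. Using the same Copy selector approach as before, pick out the path of the file's download link, which gives:

body > div.container-fluid > div > div > a:nth-child(4)

This time nth-child(4) stays in, because this one item is all we need. Once we have the link, it still has to be combined with the site's base address to form a complete URL: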

from urllib.parse import urljoin

BASE_URL = 'https://proceedings.neurips.cc/'
essay_url = urljoin(BASE_URL, test['link'][0])
str_subhtml = requests.get(essay_url)
soup1 = BeautifulSoup(str_subhtml.text, 'lxml')
subdata = soup1.select('body > div.container-fluid > div > div > a:nth-child(4)')
downloadUrl = urljoin(BASE_URL, subdata[0].get('href'))  # join into a complete URL
print(downloadUrl)
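
When the base address and an href are joined, it matters whether the href starts with a slash. The short sketch below compares plain string concatenation with urllib.parse.urljoin. The two href values are hypothetical stand-ins; the real ones come from the scraped <a> tags, and I am not assuming which form the site uses.

```python
from urllib.parse import urljoin

BASE_URL = 'https://proceedings.neurips.cc/'

# Hypothetical href values; the real ones come from the scraped <a> tags
with_slash = '/paper/2020/file/abc-Paper.pdf'
without_slash = 'paper/2020/file/abc-Paper.pdf'

# Plain concatenation keeps both slashes when the href starts with '/'
concatenated = BASE_URL + with_slash
print(concatenated)

# urljoin resolves the href against the base, so both forms give the same URL
joined_a = urljoin(BASE_URL, with_slash)
joined_b = urljoin(BASE_URL, without_slash)
print(joined_a)
print(joined_b)
```

Many servers tolerate the doubled slash, but urljoin avoids relying on that.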

The printed downloadUrl is the complete address of the PDF. Next, download it with urlretrieve:

from urllib.request import urlretrieve

filename = test['title'][0] + '.pdf'  # add the file extension
urlretrieve(downloadUrl, './%s' % filename.split('/')[-1])
print("Successfully downloaded: " + test['title'][0])

The file is then downloaded successfully.
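
urlretrieve writes straight to the final file name, so a run that is interrupted halfway can leave a truncated PDF behind. One possible way around this, sketched below with only the standard library, is to download to a temporary name and rename the file once the copy has finished. The function name download_atomic and the .part suffix are my own choices, not part of the original program.

```python
import os
import shutil
from urllib.request import urlopen


def download_atomic(url, dest_path, timeout=60):
    """Download url to dest_path without ever leaving a partial file there."""
    part_path = dest_path + '.part'  # temporary name used while downloading
    try:
        with urlopen(url, timeout=timeout) as response, open(part_path, 'wb') as out:
            shutil.copyfileobj(response, out)
        os.replace(part_path, dest_path)  # rename only after the copy finished
    except BaseException:
        # Clean up the temporary file on any failure, including Ctrl+C
        if os.path.exists(part_path):
            os.remove(part_path)
        raise
    return dest_path
```

With this approach, any file that exists under its final name is a complete download, which matters once the program starts skipping files it finds on disk.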

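Downloading all files and fixing errors

To download everything, just wrap the steps above in a loop; see the final program at the top of this post.

A few problems came up while running it:

File naming
During the download, some files ended up with only the first few words of the title as their name, and the files were incomplete.
On inspection, the failures were all papers whose titles contain ':' or '?'. These characters are not allowed in file names, so the program strips them out.
Repeated downloads
There are a lot of papers, and one run may not get through all of them (there may well be a more efficient way to download in bulk).
So I changed the program to read the local download folder for the list of papers already downloaded and to use checkname in downloaded_list to decide whether a paper can be skipped. See the final program at the top for the implementation.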
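
The two fixes above can be generalized a little. The sketch below is one possible variant, not the program's actual code: it strips every character that Windows rejects in file names rather than only ':' and '?', and it works out which titles still need downloading. The names safe_filename and pending_titles are hypothetical.

```python
import os
import re

# Characters that Windows does not allow in file names
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, ext='.pdf'):
    """Turn a paper title into a file name by dropping illegal characters."""
    cleaned = _ILLEGAL.sub('', title)
    cleaned = ' '.join(cleaned.split())  # collapse runs of whitespace
    return cleaned.rstrip('. ') + ext


def pending_titles(titles, folder):
    """Return the titles whose PDF is not yet present in folder."""
    existing = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    return [t for t in titles if safe_filename(t) not in existing]
```

Because the same function builds the name when saving and when checking, the two can never disagree about what a paper's file is called.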

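To be added and improved

This is my first crawler, so it probably does some unnecessary work, and both the download method and the progress output could still be improved.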
