Web Scraping: Using Jupyter, the requests Module, Regular Expressions, and bs4

阿新 • Published: 2019-02-27


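I. Web scraping

1. Introduction

A web crawler is a program that imitates a browser's requests and fetches data from the internet automatically.

2. Categories

General-purpose crawlers: a core component of the "crawling system" behind search engines (Baidu, Google, Yahoo, and so on).

Focused crawlers: fetch only the specific data that a stated requirement calls for.

3. Anti-scraping

Websites use policies and technical measures to keep crawlers from scraping their data.

4. Anti-anti-scraping

Crawlers use their own strategies and techniques to get around a site's anti-scraping measures and obtain the data anyway.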

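5. The robots.txt protocol

If a site owner does not want certain pages to be crawled, they can publish a robots.txt file that states which paths crawlers may and may not fetch. Compliance is voluntary: the file declares the rules but does not enforce them.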
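A well-behaved crawler can check these rules before fetching a page. The sketch below uses the standard library's urllib.robotparser; the robots.txt content, the crawler name "my-crawler", and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served at
# https://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/index.html")
private_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html")
print(public_ok, private_ok)
```

For a live site, RobotFileParser can also download the file itself through set_url() and read().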

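II. Using Jupyter Notebook

1. Introduction

Jupyter Notebook opens as a web page. You write and run code directly in the page, and each result appears right below its code cell. Notes and documentation can be written in the same page, which makes it easy to explain the code as you go.

2. Components

Web application

Documents

3. Features

Syntax highlighting, indentation, and tab completion while coding.
Code runs directly from the browser, with the result shown below each cell.
Markdown is supported for notes and documentation alongside the code.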

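4. Installation

Install with pip (the Anaconda distribution also ships with Jupyter):

pip install jupyter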

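5. Running

Start on the default port

Enter the following command in a terminal:

jupyter notebook

Start on a specific port

jupyter notebook --port <port_number>

Start the server without opening a browser

jupyter notebook --no-browser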

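6. Keyboard shortcuts

The single-letter shortcuts work in command mode (press Esc first).

Insert a cell above: a
Insert a cell below: b
Delete the cell: x (this cuts the cell; dd deletes it outright)
Switch a code cell to Markdown: m
Switch a Markdown cell to code: y
Run the cell: shift+enter
Show the documentation: shift+tab
Autocomplete: tab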

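III. The requests module

The workflow for writing code with requests:

1. Specify the URL
2. Send the request
3. Read the data from the response object
4. Persist the data

Example

Fetch the Sogou home page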

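import requests
# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. Send the request
response = requests.get(url=url)
# 3. Read the data from the response object
page_text = response.text
# 4. Persist the data to disk
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)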

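# Task: fetch the Sogou results page for a given search term
import requests
url = 'https://www.sogou.com/web'
# Wrap the query parameters in a dict
wd = input('enter a word:')
param = {
    'query': wd
}
response = requests.get(url=url, params=param)

# .content holds the raw bytes, so the file is opened in binary mode
page_text = response.content
fileName = wd + '.html'
with open(fileName, 'wb') as fp:
    fp.write(page_text)
    print('over')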
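To see what the params argument does, it can help to build the request without sending it. This is a small sketch: the fixed word "python" stands in for the value read with input(), and nothing goes over the network.

```python
import requests

word = "python"  # stands in for the user's input

# Request(...).prepare() builds the final request but does not send it,
# so the encoded URL can be inspected offline.
prepared = requests.Request(
    "GET",
    "https://www.sogou.com/web",
    params={"query": word},
).prepare()

print(prepared.url)
```

The dict passed as params is encoded into the query string after the "?", so the URL never has to be assembled by hand.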

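# Fetch suggestions from Baidu Translate
url = 'https://fanyi.baidu.com/sug'
wd = input('enter a word:')
# Form fields sent in the POST body
data = {
    'kw': wd
}
response = requests.post(url=url, data=data)

print(response.json())

# response.text    : the body decoded to a str
# response.content : the raw body as bytes
# response.json()  : the body parsed from JSON into Python objects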
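The three views of a response body are related in a simple way, which the sketch below imitates with the standard library alone. The JSON payload is made up, and the variable names only mirror the response attributes; no real response object is involved.

```python
import json

# Simulated response body: the raw bytes a server might send back.
content = '{"word": "dog", "count": 2}'.encode("utf-8")  # like response.content

text = content.decode("utf-8")  # like response.text: bytes decoded to str
data = json.loads(text)         # like response.json(): str parsed to objects

print(type(content).__name__, type(text).__name__, type(data).__name__)
```

In practice: use .content for binary data such as images, .text for HTML, and .json() when the server returns JSON.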

# Fetch movie details from the Douban category chart at https://movie.douban.com/
url = 'https://movie.douban.com/j/chart/top_list'
param = {
    "type": "5",
    "interval_id": "100:90",
    "action": '',
    "start": "60",
    "limit": "100",
}

# The endpoint returns JSON, so parse it straight into Python objects
movie_data = requests.get(url=url, params=param).json()

print(movie_data)

The User-Agent request header, UA for short, identifies the client that is sending the request.
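A site that checks the UA can tell a script from a browser because requests announces itself by default. The sketch below prints that default and then attaches a browser-style UA to a request that is prepared but never sent; the UA string is the Chrome one used elsewhere in this post.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
    )
}

# Prepare (but do not send) a request carrying the browser-style UA.
prepared = requests.Request(
    "GET", "https://www.sogou.com/", headers=browser_headers
).prepare()
print(prepared.headers["User-Agent"])
```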

# Task: scrape cosmetics production license data from the national drug regulator's site (國家藥品監督管理總局) at http://125.35.6.84:81/xk/
# Anti-scraping measure: UA detection  --> counter it with UA spoofing
import requests
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# Step 1: collect the license IDs from the first 10 list pages
id_list = []
for page in range(1, 11):
    data = {
        "on": "true",
        "page": str(page),
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": "",
    }
    json_data = requests.post(url=url, data=data, headers=headers).json()
    for dic in json_data['list']:
        id_list.append(dic['ID'])

# Step 2: request the detail record for each ID
detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
for license_id in id_list:  # renamed from id to avoid shadowing the builtin
    detail_data = {
        'id': license_id
    }
    detail_json = requests.post(url=detail_url, data=detail_data, headers=headers).json()
    print(detail_json)

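urllib is a library bundled with Python for making HTTP requests; it lets code send requests the way a browser would. In Python 3 the commonly used submodules are urllib.request and urllib.parse; in Python 2 the equivalents were the urllib and urllib2 modules.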
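As a small illustration of urllib.parse, the sketch below builds a query string and then takes the URL apart again. The "page" parameter is hypothetical and only there to show several fields being encoded; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlparse

base_url = "https://www.sogou.com/web"

# Encode a dict of parameters into a query string.
query_string = urlencode({"query": "python", "page": 2})
full_url = base_url + "?" + query_string
print(full_url)

# Split the URL back into its components and decode the query.
parts = urlparse(full_url)
params = parse_qs(parts.query)
print(parts.netloc, parts.path, params)
```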

# Download an image (images are binary, so read .content and write in 'wb' mode)
url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=806201715,3137077445&fm=26&gp=0.jpg'
img_data = requests.get(url=url, headers=headers).content
with open('./xiaohua.jpg', 'wb') as fp:
    fp.write(img_data)

The same download with urllib

# "import urllib" alone does not load the request submodule
import urllib.request
url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=806201715,3137077445&fm=26&gp=0.jpg'
urllib.request.urlretrieve(url=url, filename='./123.jpg')

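IV. Regular expressions

Single characters:
        . : any character except a newline
        [] : [aoe] [a-w] matches any one character in the set
        \d : a digit, [0-9]
        \D : a non-digit
        \w : digits, letters, underscore, and Chinese characters
        \W : anything \w does not match
        \s : any whitespace character, including spaces, tabs, and form feeds; equivalent to [ \f\n\r\t\v]
        \S : any non-whitespace character
    Quantifiers:
        * : any number of times, >=0
        + : at least once, >=1
        ? : optional, 0 or 1 times
        {m} : exactly m times, e.g. hello{3}
        {m,} : at least m times, e.g. hello{3,}
        {m,n} : between m and n times
    Anchors:
        $ : ends with
        ^ : starts with
    Grouping:
        (ab)
    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I : ignore case
    re.M : multi-line mode (^ and $ match at every line)
    re.S : single-line mode (. also matches newlines)

    re.sub(pattern, replacement, string)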
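A few of the entries above in action. This is a minimal sketch on made-up strings, meant to show the difference between greedy and lazy matching, a fixed repeat count, and re.sub.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy .* runs to the LAST </b>; lazy .*? stops at the first one.
greedy = re.findall(r"<b>(.*)</b>", html)
lazy = re.findall(r"<b>(.*?)</b>", html)
print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']

# {3} applies to the preceding character only: "hell" + "ooo".
exactly_three = re.fullmatch(r"hello{3}", "hellooo") is not None
print(exactly_three)  # True

# \d+ grabs runs of digits; re.sub replaces every match.
digits = re.findall(r"\d+", "tel: 010-1234")
masked = re.sub(r"\d", "*", "id 42")
print(digits)  # ['010', '1234']
print(masked)  # id **
```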

import re
string = '''fall in love with you
i love you very much
i love she
i love her'''

# re.M makes ^ match at the start of every line, so each line starting with "i" is returned
re.findall('^i.*', string, re.M)

import re

# Match across all lines: re.S lets . match newlines, so .* spans the whole string
string1 = """細思極恐
你的隊友在看書
你的敵人在磨刀
你的閨蜜在減肥
隔壁老王在練腰
"""
re.findall('.*', string1, re.S)
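The two flags are easy to mix up, so here is the same pattern run three ways on one made-up string: with no flag, with re.M, and with re.S.

```python
import re

text = "i love you\nfall in love\ni love her"

# No flag: ^ matches only at the very start, and . stops at a newline.
no_flag = re.findall(r"^i.*", text)

# re.M: ^ matches at the start of every line.
multi_line = re.findall(r"^i.*", text, re.M)

# re.S: . also matches newlines, so .* runs to the end of the string.
dot_all = re.findall(r"^i.*", text, re.S)

print(no_flag)     # ['i love you']
print(multi_line)  # ['i love you', 'i love her']
print(dot_all)     # ['i love you\nfall in love\ni love her']
```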

Download and save all the images from Qiushibaike (糗事百科)

import requests
import re
import urllib.request  # "import urllib" alone does not load urlretrieve
import os

url = 'https://www.qiushibaike.com/pic/page/%d/?s=5170552'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# Create the output directory if it does not exist yet
if not os.path.exists('./qiutu'):
    os.mkdir('./qiutu')

start_page = int(input('enter a start pageNum:'))
end_page = int(input('enter a end pageNum:'))

for page in range(start_page, end_page + 1):
    # Fill the page number into the URL template
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text
    # re.S lets .*? span the newlines inside each thumb div
    img_url_list = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>', page_text, re.S)
    for img_url in img_url_list:
        # The src values are protocol-relative (//...), so prepend the scheme
        img_url = 'https:' + img_url
        imgName = img_url.split('/')[-1]
        imgPath = 'qiutu/' + imgName
        urllib.request.urlretrieve(url=img_url, filename=imgPath)
        print(imgPath, 'downloaded!')

print('over!!!')
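The extraction step can be tried offline before pointing the script at a live site. The sketch below runs the same pattern over a made-up HTML fragment shaped like the markup the pattern expects; the pic.example.com addresses are placeholders, not real image hosts.

```python
import re

# Made-up fragment shaped like the markup the pattern expects.
sample_html = '''
<div class="thumb">
<a href="/article/1"><img src="//pic.example.com/a/1.jpg" alt="first"></a>
</div>
<div class="thumb">
<a href="/article/2"><img src="//pic.example.com/b/2.jpg" alt="second"></a>
</div>
'''

# re.S lets .*? cross the newlines between <div> and <img>.
pattern = re.compile(r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>', re.S)

# Prepend the scheme, then take the last path segment as the file name.
img_urls = ["https:" + src for src in pattern.findall(sample_html)]
file_names = [img_url.split("/")[-1] for img_url in img_urls]

print(img_urls)
print(file_names)
```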

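V. bs4

Installation:

1.pip install bs4 
2.pip install lxml

How parsing works:

1. Load the page source to be parsed into a BeautifulSoup object

2. Call the object's methods or attributes to locate the relevant tags in the source

3. Extract the text or attribute values from the located tags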
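The three steps can be tried on an inline HTML string first. This is a minimal sketch: the markup and chapter paths are made up, and it uses Python's built-in "html.parser" so that lxml is not required to run it.

```python
from bs4 import BeautifulSoup

# Made-up markup with the same class name as the table of contents
# targeted by the selector.
sample_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
"""

# Step 1: load the source into a BeautifulSoup object.
soup = BeautifulSoup(sample_html, "html.parser")

# Step 2: locate tags, here with a CSS selector.
links = soup.select(".book-mulu > ul > li > a")

# Step 3: read the text and the attribute values of the located tags.
titles = [a.get_text() for a in links]
hrefs = [a["href"] for a in links]

print(titles)
print(hrefs)
```

soup.find('div', class_='...') does the same job for a single tag when a CSS selector is more than is needed.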

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# Load the table-of-contents page into a BeautifulSoup object
soup = BeautifulSoup(page_text, 'lxml')

# Locate every chapter link in the table of contents
a_list = soup.select('.book-mulu > ul > li > a')

# The with block closes the file even if a request fails midway
with open('sanguo.txt', 'w', encoding='utf-8') as fp:
    for a in a_list:
        title = a.string
        detail_url = 'http://www.shicimingju.com' + a['href']
        detail_page_text = requests.get(url=detail_url, headers=headers).text

        # Parse the chapter page and extract the chapter text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        content = detail_soup.find('div', class_='chapter_content').text

        fp.write(title + '\n' + content)
        print(title, 'downloaded')
print('over')
