Web Scraping Study Notes 3: BeautifulSoup

阿新 • Published: 2018-06-04

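Getting started with Python web scraping

No preamble, let's get straight to it: create a new Python file and type along with the examples below.
Add print calls wherever you want to see what each statement returns.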

from bs4 import BeautifulSoup
# HTML source used for the tests
html_doc = """    
<html>
        <head>
                <title>The Dormouse's story</title>
        </head>
        <body>
    <p class="story">Once upon a time there were many children and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <a href="http://example.com/Bob" class="sister" id="link4">Bob</a>;
    <a href="http://example.com/King" class="brother" id="link5">King</a>;
    <a href="http://example.com/Mary" class="sister" id="link6">Mary</a>;
    <a href="http://example.com/Target" class="sister" id="link7">Target said：“I am very strong”</a>;
    <a href="http://example.com/Jack" class="brother" id="link8">Jack</a>;
    and they lived at the bottom of a well.
    </p>

                <p class="story">...</p>
"""

bs = BeautifulSoup(html_doc, 'lxml')   # parse the HTML with the lxml parser

doc = bs.prettify()    # return the markup as a neatly indented string

doc = bs.title.string   # get the text of the title tag
doc = bs.title.text    # also gets the text of the title tag
doc = bs.a.text    # get the text of the first a tag
doc = bs.a.string  # also gets the text of the first a tag
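The two attributes above return the same thing only when a tag holds a single piece of text. A minimal sketch of the difference, assuming a shortened, hypothetical snippet and the built-in html.parser in place of lxml so it runs without extra packages:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened snippet used only for this illustration
snippet = '<p class="story">Once <a href="http://example.com/elsie" id="link1">Elsie</a> lived</p>'
soup = BeautifulSoup(snippet, 'html.parser')

a_string = soup.a.string   # the a tag has exactly one text child, so .string returns it
p_string = soup.p.string   # the p tag has several children, so .string is None
p_text = soup.p.text       # .text joins every piece of text inside the tag

print(a_string)
print(p_string)
print(p_text)
```

When a tag may contain nested tags, .text (or get_text()) is usually the safer choice.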

doc = bs.title   # get the title tag itself

doc = bs.head   # get the head tag

doc = bs.body.a  # nested selection: the a tag under body, but only the first match is returned

doc = bs.p['class']   # get the class attribute of the first p tag (returned as a list)

doc = bs.find_all('a')   # get all a tags

doc = bs.find('a')   # find an a tag; only the first match is returned

doc = bs.a.parent   # get the parent of the first a tag

doc = bs.a.parents   # get the ancestors of the first a tag
#print(type(doc))   # parents is a generator, so print it with a for loop
#for item in doc:
#    print(item)

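doc = bs.a.next_sibling  # get the next sibling node of the first a tag (here a text node, not a tag)
#print(doc)
#for item in bs.a.next_siblings:   # to get every following sibling, loop over next_siblings
#    print(item)

#find_next_siblings()  returns all sibling tags that come after
#find_previous_siblings()  returns all sibling tags that come before
#find_next_sibling()  returns the first sibling tag that comes after
#find_previous_sibling()  returns the first sibling tag that comes before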
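The sibling helpers listed above can be tried on a small fragment. This is only a sketch, assuming a hypothetical three-link snippet and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three links separated by plain text
snippet = ('<p><a id="link1">Elsie</a>, '
           '<a id="link2">Lacie</a> and '
           '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(snippet, 'html.parser')

first = soup.find('a', id='link1')
last = soup.find('a', id='link3')

raw_next = first.next_sibling             # the very next node, which is the text ', '
next_tag = first.find_next_sibling('a')   # the next sibling that is an a tag
later_ids = [tag['id'] for tag in first.find_next_siblings('a')]
earlier_ids = [tag['id'] for tag in last.find_previous_siblings('a')]

print(repr(raw_next), next_tag['id'], later_ids, earlier_ids)
```

Note that next_sibling walks over every node, including text, while the find_* methods only return tags; the previous-sibling results start with the nearest one.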

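doc = bs.find_all('a')  # find all a tags

doc = bs.find_all(attrs={'id':'link1'})   # find all tags by attribute, passed as a dict
doc = bs.find_all(attrs={'id':'link3'})   # find all tags by attribute, passed as a dict

doc = bs.find_all(id='link3')   # search by id directly with a keyword argument instead of a dict
doc = bs.find_all(class_='brother')   # note the trailing underscore: class is a reserved word in Python

doc = bs.find_all(text='Target') # search by text; the text must match exactly, so this finds nothing (newer bs4 releases name this argument string=)
doc = bs.find_all(text='Bob')  # this one is found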
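If an exact match is too strict, a compiled regular expression can be passed instead. The sketch below assumes a hypothetical two-link snippet, the built-in html.parser, and the string= keyword, which is the current name of the text= argument in bs4 4.4.0 and later:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with one short and one long link text
snippet = ('<p><a id="link4">Bob</a>'
           '<a id="link7">Target said: I am very strong</a></p>')
soup = BeautifulSoup(snippet, 'html.parser')

exact = soup.find_all(string='Target')                   # exact match only, so nothing is found
partial = soup.find_all(string=re.compile('Target'))     # text nodes that contain 'Target'
tags = soup.find_all('a', string=re.compile('Target'))   # a tags whose text contains 'Target'

print(exact, partial, [tag['id'] for tag in tags])
```

Searching with string= alone returns text nodes; adding a tag name returns the enclosing tags instead.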

#find_all_next()  returns every matching node after the current node
#find_next()   returns the first matching node after the current node

doc = bs.select('#link3')  # select is the CSS selector support built into bs4; this selects by id
doc = bs.select('.brother')  # select by class
doc = bs.select('p a')  # get all a tags under p tags

doc = bs.select('a')   # get all a tags; loop over them to print each href attribute
#for item in doc:
#    print(item['href'])

doc = bs.select('a')   # get all a tags; loop over them to print each tag's text
#for item in doc:
#    print(item.text)
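The selector calls above can be combined with list comprehensions to collect attributes in one step. A minimal sketch, assuming a hypothetical three-link snippet and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing three links from the test page
snippet = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/King" class="brother" id="link5">King</a>
<a href="http://example.com/Jack" class="brother" id="link8">Jack</a>
</p>
'''
soup = BeautifulSoup(snippet, 'html.parser')

hrefs = [a['href'] for a in soup.select('p a')]          # href of every a tag under p
brothers = [a.text for a in soup.select('a.brother')]    # text of a tags with class brother
first_brother = soup.select_one('a.brother')             # first match only
missing = soup.select_one('#no-such-id')                 # None when nothing matches

print(hrefs)
print(brothers, first_brother['id'], missing)
```

select always returns a list, while select_one returns a single tag or None, so check for None before indexing into its result.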

Hands-on: scraping the information for a given movie from Douban:

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote
import re

def write_info(head, body):   # append the collected information to a text file
    with open('moveinfo.txt', 'a', encoding='utf-8') as f:
        f.write(head + body + '\n\n')   # the with block closes the file, so no f.close() is needed

def get_info(kind, url, name):   # fetch the details: cast, release date, and so on
    response = requests.get(url).text
    doc = BeautifulSoup(response, 'lxml')
    info = doc.select('#info')       # find the element whose id is info
    for item in info:
        head = kind + ":" + name
        print('Loading', kind, ":", "《" + name + "》", 'info......')
        print(head)
        write_info(head, item.text)

def get_url(search):
    url = 'https://www.douban.com/search?cat=1002&q=' + quote(search)  # URL-encode the search term
    response = requests.get(url)   # send the request
    doc = BeautifulSoup(response.text, 'lxml')  # parse the page with bs4
    doc = doc.find_all('h3')
    doc = str(doc)
    # raw string so that \[ and \] reach the regex engine unchanged
    pattern = re.compile(r'<h3>.*?<span>\[(.*?)\]</spa.*?href="(.*?)".*?target.*?>(.*?)</a>', re.S)
    result = re.findall(pattern, doc)
    for item in result:
        print('====================================================')
        get_info(item[0], item[1], item[2])   # category, detail-page URL, title

if __name__ == '__main__':
    search = input('Enter the movie to collect information about: ')
    get_url(search)
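The search results can also be pulled apart with BeautifulSoup alone, without converting the tags to a string and running a regular expression over them. The sketch below is not the script's method: the sample markup, the example.com URLs, and the parse_results helper are all hypothetical, shaped only by what the regular expression in get_url expects (an h3 holding a bracketed span and a link), and the real Douban page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled only on what the regex in get_url looks for
sample = '''
<div class="result">
  <h3><span>[Movie]</span> <a href="https://example.com/movie/1" target="_blank">Movie A</a></h3>
</div>
<div class="result">
  <h3><span>[Movie]</span> <a href="https://example.com/movie/2" target="_blank">Movie B</a></h3>
</div>
'''

def parse_results(html):
    """Return (category, url, name) tuples taken from the h3 headings."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for h3 in soup.find_all('h3'):
        span = h3.find('span')
        link = h3.find('a', href=True)
        if span is None or link is None:
            continue   # skip headings that do not have the expected shape
        category = span.get_text(strip=True).strip('[]')   # drop the surrounding brackets
        results.append((category, link['href'], link.get_text(strip=True)))
    return results

results = parse_results(sample)
for category, url, name in results:
    print(category, url, name)
```

Each tuple has the same order as the regex groups, so it could be passed to get_info in the same way.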
