1. 程式人生 > 實用技巧 > requests , bs4 和 lxml庫 鞏固

requests , bs4 和 lxml庫 鞏固

請求頭

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.58'
}

request_params = '''
requests 方法 請求引數

• url 請求的URL地址
• params GET請求引數
• data POST請求引數
• json 同樣是POST請求引數,要求服務端接收json格式的資料
• headers 請求頭字典
• cookies cookies資訊(字典或CookieJar)
• files 上傳檔案
• auth HTTP鑑權資訊
• timeout 等待響應時間,單位秒
• allow_redirects 是否允許重定向
• proxies 代理資訊
• verify 是否校驗證書
• stream 如果為False,則響應內容將直接全部下載
• cert 客戶端證書地址

'''

Response = '''

欄位

• cookies 返回CookieJar物件
• encoding 報文的編碼
• headers 響應頭
• history 重定向的歷史記錄
• status_code 響應狀態碼,如200
• elapsed 傳送請求到接收響應耗時
• text 解碼後的報文主體
• content 位元組碼,可能在raw的基礎上解壓

方法

• json() 解析json格式的響應
• iter_content() 需配置stream=True,指定chunk_size大小
• iter_lines() 需配置stream=True,每次返回一行
• raise_for_status() 狀態碼為 4xx 或 5xx(400-599)時將丟擲異常
• close()

'''

soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>

.next_sibling 屬性獲取了該節點的下一個兄弟節點,
.previous_sibling 屬性獲取了該節點的上一個兄弟節點,
如果節點不存在,則返回 None
注:
因為空白或者換行也可以被視作一個節點,
所以得到的結果可能是空白或者換行。

lxml_roles = '''
 標籤名   選取此節點的所有子節點
 
 /       從當前節點選取直接子節點
 
 //      從當前節點選取子孫節點
 
 .      選取當前節點
 
 ..     選取當前節點的父節點
 
 @      選取屬性
 
 *      萬用字元,選擇所有元素節點與元素名
 
 @*     選取所有屬性

[@attrib] 選取具有給定屬性的所有元素

[@attrib='value'] 選取給定屬性具有給定值的所有元素

[tag] 選取所有具有指定元素的直接子節點

[tag='text'] 選取所有具有指定元素並且文字內容是 text 節點

'''

lxml_operators = '''

or 或

and 與

mod 取餘

| 取兩個節點的集合

+ 加 , - 減 , * 乘 , div 除

= 等於 , != 不等於 , < 小於 

<= 小於或等於 , > 大於 , >= 大於或等於

'''


由於 jupyter 複製過來文字會亂,以上為 jupyter 檔案轉 html 截圖

下面為 以上三種庫的文字形式
requests 庫
import requests
In[2]:
requests.get('https://httpbin.org/get')
# 傳送 get 請求
Out[2]:
<Response [200]>
In[3]:
# 帶有引數 , 使用 params 引數
data = {
    'key':'value'
}
requests.get('https://httpbin.org/get',params = data)
Out[3]:
<Response [200]>
In[4]:
# 傳送 post 請求
data = {
    'key':'value'
}
requests.post('https://httpbin.org/get',data = data)
# 405 表示請求的方式不對
Out[4]:
<Response [405]>
In[5]:
# 定義請求頭
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.58'
}
In[6]:
response = requests.get('https://httpbin.org/get',headers = headers)
In[7]:
# .content 響應內容的位元組碼,一般處理二進位制檔案
response.content
Out[7]:
b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.58", \n    "X-Amzn-Trace-Id": "Root=1-5fa0cbda-7feeafcb2b0b5f78242e2d3e"\n  }, \n  "origin": "111.43.128.132", \n  "url": "https://httpbin.org/get"\n}\n'
In[8]:
# 自動選擇適當的編碼,對 .content解碼
response.text
Out[8]:
'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.58", \n    "X-Amzn-Trace-Id": "Root=1-5fa0cbda-7feeafcb2b0b5f78242e2d3e"\n  }, \n  "origin": "111.43.128.132", \n  "url": "https://httpbin.org/get"\n}\n'
In[9]:
eval(response.text)['origin']
# 使用 eval 將字串轉換為字典 , 提取資料(注意:對外部資料使用 eval 不安全,建議改用 json.loads 或 response.json())
Out[9]:
'111.43.128.132'
In[10]:
response.json()
# 解析json格式的資料,如果無法解析,則丟擲異常
Out[10]:
{'args': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 Edg/86.0.622.58',
  'X-Amzn-Trace-Id': 'Root=1-5fa0cbda-7feeafcb2b0b5f78242e2d3e'},
 'origin': '111.43.128.132',
 'url': 'https://httpbin.org/get'}
In[11]:
response.json()['url']
Out[11]:
'https://httpbin.org/get'
In[12]:
request_params = '''
requests 方法 請求引數

• url 請求的URL地址
• params GET請求引數
• data POST請求引數
• json 同樣是POST請求引數,要求服務端接收json格式的資料
• headers 請求頭字典
• cookies cookies資訊(字典或CookieJar)
• files 上傳檔案
• auth HTTP鑑權資訊
• timeout 等待響應時間,單位秒
• allow_redirects 是否允許重定向
• proxies 代理資訊
• verify 是否校驗證書
• stream 如果為False,則響應內容將直接全部下載
• cert 客戶端證書地址

'''
In[13]:
Session = '''
Session可以持久化請求過程中的引數,以及cookie
需要登入的網頁,使用session可以避免每次的登入操作
'''
s = requests.Session()
s.cookies
Out[13]:
<RequestsCookieJar[]>
In[14]:
s.cookies = requests.cookies.cookiejar_from_dict({'key': 'value'})
# 修改 cookie 的資訊
s.cookies
Out[14]:
<RequestsCookieJar[Cookie(version=0, name='key', value='value', port=None, port_specified=False, domain='', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>
In[15]:
r = s.get('https://httpbin.org/cookies')
r.text
Out[15]:
'{\n  "cookies": {\n    "key": "value"\n  }\n}\n'
In[16]:
'''
Session 提供預設值

'''
s = requests.Session()
s.headers.update(
    {'h1':'val1',
    'h2':'val2'}
)

r = s.get('https://httpbin.org/headers', headers={'h2': 'val2_modify'})
r.text
Out[16]:
'{\n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "H1": "val1", \n    "H2": "val2_modify", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.24.0", \n    "X-Amzn-Trace-Id": "Root=1-5fa0cbde-38199df23237b30c6c65df0c"\n  }\n}\n'
In[17]:
Response = '''

欄位

• cookies 返回CookieJar物件
• encoding 報文的編碼
• headers 響應頭
• history 重定向的歷史記錄
• status_code 響應狀態碼,如200
• elapsed 傳送請求到接收響應耗時
• text 解碼後的報文主體
• content 位元組碼,可能在raw的基礎上解壓

方法

• json() 解析json格式的響應
• iter_content() 需配置stream=True,指定chunk_size大小
• iter_lines() 需配置stream=True,每次返回一行
• raise_for_status() 狀態碼為 4xx 或 5xx(400-599)時將丟擲異常
• close()

'''



bs4 庫







from bs4 import BeautifulSoup,element
# 匯入 BeautifulSoup
In[2]:
import lxml
import requests
In[3]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
In[4]:
soup = BeautifulSoup(html_doc,'lxml')  #建立 beautifulsoup 物件
In[5]:
soup1 = BeautifulSoup(open('index.html'))
In[6]:
soup.prettify()#列印 soup 物件的內容,格式化輸出
Out[6]:
'<html>\n <head>\n  <title>\n   The Dormouse\'s story\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The Dormouse\'s story\n   </b>\n  </p>\n  <p class="story">\n   Once upon a time there were three little sisters; and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">\n    Elsie\n   </a>\n   ,\n   <a class="sister" href="http://example.com/lacie" id="link2">\n    Lacie\n   </a>\n   and\n   <a class="sister" href="http://example.com/tillie" id="link3">\n    Tillie\n   </a>\n   ;\nand they lived at the bottom of a well.\n  </p>\n  <p class="story">\n   ...\n  </p>\n </body>\n</html>'
In[7]:
# Beautiful Soup 所有物件可以歸納為4種:
# • Tag
# • NavigableString
# • BeautifulSoup
# • Comment
In[8]:
soup.title # 獲取標題資訊
Out[8]:
<title>The Dormouse's story</title>
In[9]:
soup.head # 獲取頭
Out[9]:
<head><title>The Dormouse's story</title></head>
In[10]:
soup.a # 獲取第一個 a 連結
Out[10]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In[11]:
soup.p # 獲取第一個 p 段落
Out[11]:
<p class="title"><b>The Dormouse's story</b></p>
In[12]:
soup.name
Out[12]:
'[document]'
In[13]:
soup.a.attrs # 第一個a標籤的屬性
Out[13]:
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
In[14]:
soup.p.attrs
Out[14]:
{'class': ['title']}
In[15]:
soup.a.get('href') # 單獨獲取某一個屬性
Out[15]:
'http://example.com/elsie'
In[16]:
soup.a['href']
Out[16]:
'http://example.com/elsie'
In[17]:
soup.a['href'] = 'https://www.cnblogs.com/hany-postq473111315/'
# 對屬性進行修改
In[18]:
del soup.a['href'] # 刪除屬性
In[19]:
soup.p.string # 使用 string 獲取內容
Out[19]:
"The Dormouse's story"
In[20]:
soup.a.string # 輸出 a 的內容
Out[20]:
'Elsie'
In[21]:
'''
.string 輸出的內容,已經把註釋符號去掉了,可能會帶來麻煩
'''
print(type(soup.a.string))
if type(soup.a.string)==element.Comment:
    print(soup.a.string)
<class 'bs4.element.NavigableString'>
In[22]:
soup
Out[22]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[23]:
soup.head.contents # 將tag的子節點以列表的方式輸出
Out[23]:
[<title>The Dormouse's story</title>]
In[24]:
soup.head.contents[0] # 列表方式取值
Out[24]:
<title>The Dormouse's story</title>
In[25]:
soup.head.children # list 生成器物件
Out[25]:
<list_iterator at 0x292935d4fc8>
In[26]:
for item in soup.head.children:
    print(item)
    # 通過迴圈輸出
<title>The Dormouse's story</title>
In[27]:
'''
.contents 和 .children 屬性僅包含tag的直接子節點,
.descendants 屬性可以對所有tag的子孫節點進行遞迴迴圈

'''
for item in soup.descendants:
    print(item)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


In[28]:
soup.head.string # 檢視內容
Out[28]:
"The Dormouse's story"
In[29]:
soup.title.string
Out[29]:
"The Dormouse's story"
In[30]:
soup.strings
Out[30]:
<generator object Tag._all_strings at 0x00000292931AD548>
In[31]:
for string in soup.strings:
    # soup.strings 為 soup 內的所有內容
    print(string)
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...


In[32]:
# 使用 .stripped_strings 可以去除多餘空白內容
for string in soup.stripped_strings:
    print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
In[33]:
soup
Out[33]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[34]:
soup.p.parent.name # 父標籤的名稱
Out[34]:
'body'
In[35]:
soup.head.title.string.parent.name
Out[35]:
'title'
In[36]:
'''
通過元素的 .parents 屬性可以遞迴得到元素的所有父輩節點
'''
for parent in soup.head.title.string.parents:
#     print(parent)
    print(parent.name)
title
head
html
[document]
In[37]:
'''
.next_sibling 屬性獲取了該節點的下一個兄弟節點,
.previous_sibling 屬性獲取了該節點的上一個兄弟節點,
如果節點不存在,則返回 None
注:
因為空白或者換行也可以被視作一個節點,
所以得到的結果可能是空白或者換行。
'''
soup.p.next_sibling
Out[37]:
'\n'
In[38]:
soup.p.previous_sibling
Out[38]:
'\n'
In[39]:
soup.p.next_sibling.next_sibling
Out[39]:
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
In[40]:
'''
.next_siblings 和 .previous_siblings 
 可以對當前節點的兄弟節點迭代
'''
for sibling in soup.a.next_siblings:
    print(sibling)
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.
In[41]:
soup.head.next_element # 後一個節點
Out[41]:
<title>The Dormouse's story</title>
In[42]:
soup.head.previous_element # 前一個節點
Out[42]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[43]:
'''
通過 .next_elements 和 .previous_elements 的迭代器
可以向前或向後訪問文件的解析內容
'''

for element in soup.a.next_elements:
    print(element)
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


In[44]:
'''
find_all() 方法搜尋當前tag的所有tag子節點,
並判斷是否符合過濾器的條件
'''
soup.find_all('b')
Out[44]:
[<b>The Dormouse's story</b>]
In[45]:
import re 
for tag in soup.find_all(re.compile('^b')):
    # 通過傳入正則表示式,進行查詢
    print(tag)
    print(tag.name)
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
body
<b>The Dormouse's story</b>
b
In[46]:
soup.find_all(['a','b'])
# 傳遞列表,查詢元素
Out[46]:
[<b>The Dormouse's story</b>,
 <a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[47]:
soup.find_all(['a','b'])[2]['href']
# 查詢指定元素
Out[47]:
'http://example.com/lacie'
In[48]:
for tag in soup.find_all(True):
    # 查詢所有的 tag,不會返回字串節點
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
In[49]:
# 傳遞方法
def has_href(tag):
    """Predicate for find_all(): keep only tags that carry an 'href' attribute."""
    return tag.has_attr('href')
soup.find_all(has_href)
Out[49]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[50]:
soup.find_all(id = 'link2')
# 尋找指定的屬性值
Out[50]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[51]:
soup.find_all(href = re.compile('tillie'))
Out[51]:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[52]:
# 使用多個指定名字的引數可以同時過濾tag的多個屬性
soup.find_all(href=re.compile("tillie"), id='link3')
Out[52]:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[53]:
# class_ 代替 class 進行查詢
soup.find_all('a',class_ = 'sister')
Out[53]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[54]:
'''
通過 find_all() 方法的 attrs 引數定義一個字典引數來搜尋包含特殊屬性的tag
'''
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(attrs = {'data-foo':'value'})
# attrs = {'data-foo':'value'} 進行篩選
Out[54]:
[<div data-foo="value">foo!</div>]
In[55]:
'''
通過 text 引數可以搜尋文件中的字串內容
text 引數接受 字串 , 正則表示式 , 列表, True
'''
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
Out[55]:
['Elsie', 'Lacie', 'Tillie']
In[56]:
soup.find_all(text="Tillie")
Out[56]:
['Tillie']
In[57]:
soup.find_all(text=re.compile("Dormouse"))
Out[57]:
["The Dormouse's story", "The Dormouse's story"]
In[58]:
# 使用 limit 引數限制返回結果的數量
soup.find_all('a',limit = 2)
Out[58]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[59]:
'''
呼叫tag的 find_all() 方法時,Beautiful Soup會檢索當前tag的所有子孫節點
如果只想搜尋tag的直接子節點,可以使用引數 recursive=False
'''
soup.html.find_all('title',recursive=False)
Out[59]:
[]
In[60]:
soup.html.find_all('title',recursive=True)
Out[60]:
[<title>The Dormouse's story</title>]
In[61]:
'''
CSS選擇器
標籤名不加任何修飾,類名前加點,id名前加 #
'''
soup.select('title')
Out[61]:
[<title>The Dormouse's story</title>]
In[62]:
soup.select('a')
Out[62]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[63]:
soup.select('.sister')
# 通過類名查詢
Out[63]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[64]:
soup.select('#link2')
Out[64]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[65]:
'''
查詢 p 標籤中,id 等於 link1的內容,二者需要用空格分開
一定注意是 p 標籤下的
'''
soup.select("p #link1")
Out[65]:
[<a class="sister" id="link1">Elsie</a>]
In[66]:
soup.select('head > title')
Out[66]:
[<title>The Dormouse's story</title>]
In[67]:
soup.select('a[class="sister"]')
# 查詢時還可以加入屬性元素,屬性需要用中括號括起來
Out[67]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[68]:
'''
select 選擇後,使用 get_text() 方法獲取內容
'''
soup.select('title')
Out[68]:
[<title>The Dormouse's story</title>]
In[69]:
soup.select('title')[0].get_text()
Out[69]:
"The Dormouse's story"
In[70]:
soup.select('title')[0].string
Out[70]:
"The Dormouse's story"
In[71]:
for title in soup.select('p .sister'):
    print(title.get_text())
Elsie
Lacie
Tillie
In[]:
 

lxml 庫
import lxml
In[2]:
lxml_roles = '''
 標籤名   選取此節點的所有子節點

 /       從當前節點選取直接子節點

 //      從當前節點選取子孫節點

 .      選取當前節點

 ..     選取當前節點的父節點

 @      選取屬性

 *      萬用字元,選擇所有元素節點與元素名

 @*     選取所有屬性

[@attrib] 選取具有給定屬性的所有元素

[@attrib='value'] 選取給定屬性具有給定值的所有元素

[tag] 選取所有具有指定元素的直接子節點

[tag='text'] 選取所有具有指定元素並且文字內容是 text 節點

'''
In[3]:
from lxml import etree

text='''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">第一個</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0"><a href="link5.html">a屬性</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
# html -> <Element html at 0x207bc230e08>
etree.tostring(html,encoding='utf-8').decode('utf-8')
# etree.tostring 將節點序列化成位元組,再解碼為字串
Out[3]:
'<html><body><div>\n    <ul>\n         <li class="item-0"><a href="link1.html">第一個</a></li>\n         <li class="item-1"><a href="link2.html">second item</a></li>\n         <li class="item-0"><a href="link5.html">a屬性</a>\n     </li></ul>\n </div>\n</body></html>'
In[4]:
etree.tostringlist(html)
# 解析成列表
Out[4]:
[b'<html><body><div>\n    <ul>\n         <li class="item-0"><a href="link1.html">&#31532;&#19968;&#20010;</a></li>\n         <li class="item-1"><a href="link2.html">second item</a></li>\n         <li class="item-0"><a href="link5.html">a&#23646;&#24615;</a>\n     </li></ul>\n </div>\n</body></html>']
In[5]:
html.xpath('//li/a') 
# li 標籤下的 a 標籤
Out[5]:
[<Element a at 0x17cbf6d1308>,
 <Element a at 0x17cbf6d1348>,
 <Element a at 0x17cbf6d1388>]
In[6]:
html.xpath('//li/a') [0].text
Out[6]:
'第一個'
In[7]:
html.xpath('//li[@class="item-1"]')
# li 標籤下 class 屬性為 item-1 的
Out[7]:
[<Element li at 0x17cbf6d8648>]
In[8]:
# 使用 text 獲取節點的文字
html.xpath('//li[@class="item-1"]/a/text()')
# 獲取a節點下的內容
Out[8]:
['second item']
In[9]:
from lxml import etree
from lxml.etree import HTMLParser

text='''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">第一個</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''
html = etree.HTML(text,etree.HTMLParser())
html.xpath('//a[@href="link2.html"]/../@class')
# .. 父節點 , @ 取屬性
Out[9]:
['item-1']
In[10]:
html.xpath('//a[@href="link2.html"]/parent::*/@class')
# 使用 parent::* 來獲取父節點
Out[10]:
['item-1']
In[11]:
html.xpath('//li//text()') 
#獲取li下所有子孫節點的內容
Out[11]:
['第一個', 'second item']
In[12]:
# 使用 @ 符號即可獲取節點的屬性
html.xpath('//li/a/@href')
Out[12]:
['link1.html', 'link2.html']
In[13]:
text1='''
<div>
    <ul>
         <li class="aaa item-0"><a href="link1.html">第一個</a></li>
         <li class="bbb item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''
html=etree.HTML(text1,etree.HTMLParser())
# 使用 contains(屬性,值) 進行獲取
html.xpath('//li[contains(@class,"aaa")]/a/text()')
Out[13]:
['第一個']
In[14]:
text1='''
<div>
    <ul>
         <li class="aaa" name="item"><a href="link1.html">第一個</a></li>
         <li class="aaa" name="fore"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''
html = etree.HTML(text1,etree.HTMLParser())
In[15]:
html.xpath('//li[@class="aaa" and @name="fore"]/a/text()')
Out[15]:
['second item']
In[16]:
html.xpath('//li[contains(@class,"aaa") and contains(@name,"fore")]/a/text()')
Out[16]:
['second item']
In[17]:
html.xpath('//li[contains(@class,"aaa") and @name="fore"]/a/text()')
Out[17]:
['second item']
In[18]:
lxml_operators = '''

or 或

and 與

mod 取餘

| 取兩個節點的集合

+ 加 , - 減 , * 乘 , div 除

= 等於 , != 不等於 , < 小於 

<= 小於或等於 , > 大於 , >= 大於或等於

'''
In[19]:
# 利用中括號引入索引的方法獲取特定次序的節點
text1='''
<div>
    <ul>
         <li class="aaa" name="item"><a href="link1.html">第一個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第二個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第三個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第四個</a></li> 
     </ul>
 </div>
'''
html = etree.HTML(text1,etree.HTMLParser())
In[20]:
#獲取所有 li 節點下 a 節點的內容
html.xpath('//li[contains(@class,"aaa")]/a/text()')
Out[20]:
['第一個', '第二個', '第三個', '第四個']
In[21]:
#獲取第一個
html.xpath('//li[1][contains(@class,"aaa")]/a/text()')
Out[21]:
['第一個']
In[22]:
#獲取最後一個
html.xpath('//li[last()][contains(@class,"aaa")]/a/text()')
Out[22]:
['第四個']
In[23]:
#獲取第三個
html.xpath('//li[position()>2 and position()<4][contains(@class,"aaa")]/a/text()') 
Out[23]:
['第三個']
In[24]:
#獲取倒數第三個
html.xpath('//li[last()-2][contains(@class,"aaa")]/a/text()') 
Out[24]:
['第二個']
In[25]:
#獲取所有祖先節點
html.xpath('//li[1]/ancestor::*')
Out[25]:
[<Element html at 0x17cbf6e9c08>,
 <Element body at 0x17cbf6f4b48>,
 <Element div at 0x17cbf6f9188>,
 <Element ul at 0x17cbf6f9948>]
In[26]:
# 獲取 div 祖先節點
html.xpath('//li[1]/ancestor::div')
Out[26]:
[<Element div at 0x17cbf6f9188>]
In[27]:
# 獲取所有屬性值
html.xpath('//li[1]/attribute::*')
Out[27]:
['aaa', 'item']
In[28]:
# 獲取所有直接子節點
html.xpath('//li[1]/child::*')
Out[28]:
[<Element a at 0x17cbf6f07c8>]
In[29]:
# 獲取所有子孫節點的 a 節點
html.xpath('//li[1]/descendant::a')
Out[29]:
[<Element a at 0x17cbf6f07c8>]
In[30]:
# 獲取當前子節點之後的所有節點
html.xpath('//li[1]/following::*')
Out[30]:
[<Element li at 0x17cbf6fedc8>,
 <Element a at 0x17cbf6f0d48>,
 <Element li at 0x17cbf6fee08>,
 <Element a at 0x17cbf6f0d88>,
 <Element li at 0x17cbf6fee48>,
 <Element a at 0x17cbf6f0dc8>]
In[31]:
# 獲取當前節點的所有同級節點
html.xpath('//li[1]/following-sibling::*')
Out[31]:
[<Element li at 0x17cbf6fedc8>,
 <Element li at 0x17cbf6fee08>,
 <Element li at 0x17cbf6fee48>]

2021-01-09

from bs4 import BeautifulSoup,element
# 匯入 BeautifulSoup
In[2]:
import lxml
import requests
In[3]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
In[4]:
soup = BeautifulSoup(html_doc,'lxml')  #建立 beautifulsoup 物件
In[5]:
soup1 = BeautifulSoup(open('index.html'))
In[6]:
soup.prettify()#列印 soup 物件的內容,格式化輸出
Out[6]:
'<html>\n <head>\n  <title>\n   The Dormouse\'s story\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The Dormouse\'s story\n   </b>\n  </p>\n  <p class="story">\n   Once upon a time there were three little sisters; and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">\n    Elsie\n   </a>\n   ,\n   <a class="sister" href="http://example.com/lacie" id="link2">\n    Lacie\n   </a>\n   and\n   <a class="sister" href="http://example.com/tillie" id="link3">\n    Tillie\n   </a>\n   ;\nand they lived at the bottom of a well.\n  </p>\n  <p class="story">\n   ...\n  </p>\n </body>\n</html>'
In[7]:
# Beautiful Soup 所有物件可以歸納為4種:
# • Tag
# • NavigableString
# • BeautifulSoup
# • Comment
In[8]:
soup.title # 獲取標題資訊
Out[8]:
<title>The Dormouse's story</title>
In[9]:
soup.head # 獲取頭
Out[9]:
<head><title>The Dormouse's story</title></head>
In[10]:
soup.a # 獲取第一個 a 連結
Out[10]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In[11]:
soup.p # 獲取第一個 p 段落
Out[11]:
<p class="title"><b>The Dormouse's story</b></p>
In[12]:
soup.name
Out[12]:
'[document]'
In[13]:
soup.a.attrs # 第一個a標籤的屬性
Out[13]:
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
In[14]:
soup.p.attrs
Out[14]:
{'class': ['title']}
In[15]:
soup.a.get('href') # 單獨獲取某一個屬性
Out[15]:
'http://example.com/elsie'
In[16]:
soup.a['href']
Out[16]:
'http://example.com/elsie'
In[17]:
soup.a['href'] = 'https://www.cnblogs.com/hany-postq473111315/'
# 對屬性進行修改
In[18]:
del soup.a['href'] # 刪除屬性
In[19]:
soup.p.string # 使用 string 獲取內容
Out[19]:
"The Dormouse's story"
In[20]:
soup.a.string # 輸出 a 的內容
Out[20]:
'Elsie'
In[21]:
'''
.string 輸出的內容,已經把註釋符號去掉了,可能會帶來麻煩
'''
print(type(soup.a.string))
if type(soup.a.string)==element.Comment:
    print(soup.a.string)
<class 'bs4.element.NavigableString'>
In[22]:
soup
Out[22]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[23]:
soup.head.contents # 將tag的子節點以列表的方式輸出
Out[23]:
[<title>The Dormouse's story</title>]
In[24]:
soup.head.contents[0] # 列表方式取值
Out[24]:
<title>The Dormouse's story</title>
In[25]:
soup.head.children # list 生成器物件
Out[25]:
<list_iterator at 0x292935d4fc8>
In[26]:
for item in soup.head.children:
    print(item)
    # 通過迴圈輸出
<title>The Dormouse's story</title>
In[27]:
'''
.contents 和 .children 屬性僅包含tag的直接子節點,
.descendants 屬性可以對所有tag的子孫節點進行遞迴迴圈

'''
for item in soup.descendants:
    print(item)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


In[28]:
soup.head.string # 檢視內容
Out[28]:
"The Dormouse's story"
In[29]:
soup.title.string
Out[29]:
"The Dormouse's story"
In[30]:
soup.strings
Out[30]:
<generator object Tag._all_strings at 0x00000292931AD548>
In[31]:
for string in soup.strings:
    # soup.strings 為 soup 內的所有內容
    print(string)
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...


In[32]:
# 使用 .stripped_strings 可以去除多餘空白內容
for string in soup.stripped_strings:
    print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
In[33]:
soup
Out[33]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[34]:
soup.p.parent.name # 父標籤的名稱
Out[34]:
'body'
In[35]:
soup.head.title.string.parent.name
Out[35]:
'title'
In[36]:
'''
通過元素的 .parents 屬性可以遞迴得到元素的所有父輩節點
'''
for parent in soup.head.title.string.parents:
#     print(parent)
    print(parent.name)
title
head
html
[document]
In[37]:
'''
.next_sibling 屬性獲取了該節點的下一個兄弟節點,
.previous_sibling 屬性獲取了該節點的上一個兄弟節點,
如果節點不存在,則返回 None
注:
因為空白或者換行也可以被視作一個節點,
所以得到的結果可能是空白或者換行。
'''
soup.p.next_sibling
Out[37]:
'\n'
In[38]:
soup.p.previous_sibling
Out[38]:
'\n'
In[39]:
soup.p.next_sibling.next_sibling
Out[39]:
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
In[40]:
'''
.next_siblings 和 .previous_siblings 
 可以對當前節點的兄弟節點迭代
'''
for sibling in soup.a.next_siblings:
    print(sibling)
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.
In[41]:
soup.head.next_element # 後一個節點
Out[41]:
<title>The Dormouse's story</title>
In[42]:
soup.head.previous_element # 前一個節點
Out[42]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In[43]:
'''
通過 .next_elements 和 .previous_elements 的迭代器
可以向前或向後訪問文件的解析內容
'''

for element in soup.a.next_elements:
    print(element)
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


In[44]:
'''
find_all() 方法搜尋當前tag的所有tag子節點,
並判斷是否符合過濾器的條件
'''
soup.find_all('b')
Out[44]:
[<b>The Dormouse's story</b>]
In[45]:
import re 
for tag in soup.find_all(re.compile('^b')):
    # 通過傳入正則表示式,進行查詢
    print(tag)
    print(tag.name)
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
body
<b>The Dormouse's story</b>
b
In[46]:
soup.find_all(['a','b'])
# 傳遞列表,查詢元素
Out[46]:
[<b>The Dormouse's story</b>,
 <a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[47]:
soup.find_all(['a','b'])[2]['href']
# 查詢指定元素
Out[47]:
'http://example.com/lacie'
In[48]:
for tag in soup.find_all(True):
    # 查詢所有的 tag,不會返回字串節點
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
In[49]:
# 傳遞方法
def has_href(tag):
    """Filter callable for ``find_all``: match tags that define an ``href`` attribute.

    Returns True when *tag* carries an ``href`` attribute, False otherwise.
    """
    result = tag.has_attr('href')
    return result
soup.find_all(has_href)
Out[49]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[50]:
soup.find_all(id = 'link2')
# 尋找指定的屬性值
Out[50]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[51]:
soup.find_all(href = re.compile('tillie'))
Out[51]:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[52]:
# 使用多個指定名字的引數可以同時過濾tag的多個屬性
soup.find_all(href=re.compile("tillie"), id='link3')
Out[52]:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[53]:
# class_ 代替 class 進行查詢
soup.find_all('a',class_ = 'sister')
Out[53]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[54]:
'''
通過 find_all() 方法的 attrs 引數定義一個字典引數來搜尋包含特殊屬性的tag
'''
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(attrs = {'data-foo':'value'})
# attrs = {'data-foo':'value'} 進行篩選
Out[54]:
[<div data-foo="value">foo!</div>]
In[55]:
'''
通過 text 引數可以搜尋文件中的字串內容
text 引數接受 字串 , 正則表示式 , 列表, True
(注: Beautiful Soup 4.4+ 中 text 引數已更名為 string, 兩者行為相同, 新程式碼建議使用 string)
'''
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
Out[55]:
['Elsie', 'Lacie', 'Tillie']
In[56]:
soup.find_all(text="Tillie")
Out[56]:
['Tillie']
In[57]:
soup.find_all(text=re.compile("Dormouse"))
Out[57]:
["The Dormouse's story", "The Dormouse's story"]
In[58]:
# 使用 limit 引數限制返回結果的數量
soup.find_all('a',limit = 2)
Out[58]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[59]:
'''
呼叫tag的 find_all() 方法時,Beautiful Soup會檢索當前tag的所有子孫節點
如果只想搜尋tag的直接子節點,可以使用引數 recursive=False
'''
soup.html.find_all('title',recursive=False)
Out[59]:
[]
In[60]:
soup.html.find_all('title',recursive=True)
Out[60]:
[<title>The Dormouse's story</title>]
In[61]:
'''
CSS選擇器
標籤名不加任何修飾,類名前加點,id名前加 #
'''
soup.select('title')
Out[61]:
[<title>The Dormouse's story</title>]
In[62]:
soup.select('a')
Out[62]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[63]:
soup.select('.sister')
# 通過類名查詢
Out[63]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[64]:
soup.select('#link2')
Out[64]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In[65]:
'''
查詢 p 標籤中,id 等於 link1的內容,二者需要用空格分開
一定注意是 p 標籤下的
'''
soup.select("p #link1")
Out[65]:
[<a class="sister" id="link1">Elsie</a>]
In[66]:
soup.select('head > title')
Out[66]:
[<title>The Dormouse's story</title>]
In[67]:
soup.select('a[class="sister"]')
# 查詢時還可以加入屬性元素,屬性需要用中括號括起來
Out[67]:
[<a class="sister" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In[68]:
'''
select 選擇後,使用 get_text() 方法獲取內容
'''
soup.select('title')
Out[68]:
[<title>The Dormouse's story</title>]
In[69]:
soup.select('title')[0].get_text()
Out[69]:
"The Dormouse's story"
In[70]:
soup.select('title')[0].string
Out[70]:
"The Dormouse's story"
In[71]:
for title in soup.select('p .sister'):
    print(title.get_text())
Elsie
Lacie
Tillie
In[]: