
Targeted Crawler: requests-bs4 <2>

Installing Beautiful Soup 4

$ pip install beautifulsoup4

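Basic usage of Beautiful Soup

The BeautifulSoup constructor accepts either an open file handle, as in BeautifulSoup(open("index.html")), or a string of markup, as in BeautifulSoup("data"). The session below parses a string.

>>> from bs4 import BeautifulSoup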
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
...     <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
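
The two ways of constructing a soup mentioned above can be compared side by side. This is a minimal sketch: the `markup` string is made up for illustration, and `io.StringIO` stands in for a real `open("index.html")` handle so the example runs without a file on disk.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse directly from a string.
soup_from_str = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; StringIO plays the role of open("index.html").
with io.StringIO(markup) as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_str.title.string)   # Demo
print(soup_from_file.p.string)      # Hello
```

Both calls should yield the same tree, since Beautiful Soup reads a file-like object into a string before parsing.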

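Elements of the BeautifulSoup class

BeautifulSoup converts an HTML document into a tree of tags. The basic elements are:

**Tag:** a tag
**Name:** the tag's name, accessed as .name
**Attributes:** the tag's attributes, accessed as .attrs
**NavigableString:** the non-attribute string inside a tag, accessed as .string
**Comment:** the comment portion of a string inside a tag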

# Any tag <tag> that exists in the HTML can be obtained with soup.<tag>.
# When the document contains several <tag> elements, soup.<tag> returns the first one.
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# The name of each <tag> is available as <tag>.name
>>> soup.a.name
'a'

# A <tag> can have zero or more attributes; .attrs returns them as a dict
>>> t=soup.a
>>> t.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
>>> t.attrs['href']
'http://example.com/elsie'

# The non-attribute string inside a <tag> is available as <tag>.string
>>> soup.a.string
'Elsie'
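
The element list above mentions Comment, but the session does not show one. The following sketch, using a made-up `snippet` string, illustrates how a comment and ordinary text both come back from .string and how a type check tells them apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one tag holding a comment, one holding plain text.
snippet = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(snippet, "html.parser")

comment_node = soup.b.string   # a Comment, which is a NavigableString subclass
text_node = soup.p.string      # a plain NavigableString

# .string drops the <!-- --> markers, so the type is the only reliable signal.
print(type(comment_node).__name__, repr(str(comment_node)))
print(type(text_node).__name__, repr(str(text_node)))

is_comment = isinstance(comment_node, Comment)
is_plain_text = isinstance(text_node, NavigableString) and not isinstance(text_node, Comment)
```

Because the comment markers are stripped, code that collects text from .string may want to check for Comment before treating the value as page content.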

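Traversing the document tree

Downward traversal

.contents: returns the tag's child nodes as a list
.children: an iterator over the tag's direct child nodes
.descendants: an iterator over all of the tag's descendant nodes, visited recursively
.string: if the tag has exactly one NavigableString child, .string returns it; otherwise, iterate with .strings or .stripped_strings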

>>> soup.body.contents
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
>>> for i in soup.body.children:
...     print(i)
...


<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>> for i in soup.body.descendants:
...     print(i)
...


<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


>>> for i in soup.body.strings:
...     print(i)
...


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

# Strings yielded by .strings may include many spaces or blank lines; .stripped_strings removes the extra whitespace
>>> for i in soup.body.stripped_strings:
...     print(i)
...
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
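
As the output above shows, .children and .descendants yield the newline strings between tags as well as the tags themselves. A common way to keep only the tags is an isinstance check against Tag. The sketch below uses a shortened, made-up document rather than html_doc.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup for illustration.
html = """<body>
<p class="title"><b>Title</b></p>
<p class="story">Text <a href="#1">one</a>, <a href="#2">two</a></p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# Direct children only: keep Tag objects, skip the newline strings between them.
child_tags = [node.name for node in soup.body.children if isinstance(node, Tag)]

# Every tag at any depth below <body>, in document order.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

# Text fragments with surrounding whitespace removed and empty ones dropped.
texts = list(soup.body.stripped_strings)

print(child_tags)       # ['p', 'p']
print(descendant_tags)  # ['p', 'b', 'p', 'a', 'a']
print(texts)            # ['Title', 'Text', 'one', ',', 'two']
```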

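Upward traversal

.parent: returns an element's parent node
.parents: an iterator over all of an element's ancestors, from the nearest parent up to the document root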

>>> soup.a.parent
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
>>> for i in soup.a.parents:
...     print(i)
...
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

>>> for i in soup.a.parents:
...     print(i.name)
...
p
body
html
[document]

>>> soup.a.string.parent
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> for i in soup.a.string.parents:
...     print(i.name)
...
a
p
body
html
[document]
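
One practical use of .parents is building a path that shows where a node sits in the tree. The sketch below is only an illustration: `ancestor_path` is a hypothetical helper, not part of bs4, and the markup is a shortened stand-in for html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for illustration.
html = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

def ancestor_path(node):
    """Join the names of a node's ancestors, root first (hypothetical helper)."""
    names = [parent.name for parent in node.parents]  # nearest parent first
    return " > ".join(reversed(names))

print(ancestor_path(soup.a))         # [document] > html > body > p
print(ancestor_path(soup.a.string))  # [document] > html > body > p > a
```

The last ancestor is always the BeautifulSoup object itself, whose name is '[document]'.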

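Sideways traversal

.next_sibling: the next sibling node in HTML document order
.previous_sibling: the previous sibling node in HTML document order
.next_siblings: an iterator over all following sibling nodes in HTML document order
.previous_siblings: an iterator over all preceding sibling nodes in HTML document order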

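# You might expect .next_sibling of the first <a> tag to be the second <a> tag, but it is not:
# the actual result is the comma and newline between the first and second <a> tags.
>>> soup.a.next_sibling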
',\n'

>>> soup.a.previous_sibling
'Once upon a time there were three little sisters; and their names were\n'

>>> for sibling in soup.p.next_siblings:
...     print(sibling)
...


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>> for sibling in soup.p.previous_siblings:
...     print(sibling)
...



>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Once upon a time there were three little sisters; and their names were
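
Since .next_sibling often returns the text between tags, reaching the next tag usually means skipping strings. A minimal sketch follows, assuming a shortened, made-up paragraph; `next_tag_sibling` is a hypothetical helper written for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup for illustration.
html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a>;</p>'
soup = BeautifulSoup(html, "html.parser")

def next_tag_sibling(node):
    """Return the first following sibling that is a Tag, or None (hypothetical helper)."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None

first = soup.a
print(repr(first.next_sibling))       # ',\n'
print(next_tag_sibling(first)["id"])  # link2
```

Beautiful Soup also provides a built-in find_next_sibling() method for this kind of lookup; the loop here is written out only to show how .next_siblings behaves.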

The prettify() method of the bs4 library

>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
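
prettify() can also be called on a single tag, and it returns an ordinary string. The sketch below, using one made-up paragraph, contrasts it with str(), which keeps the markup on one line.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph markup for illustration.
soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")

compact = str(soup.p)        # markup as a single line
pretty = soup.p.prettify()   # same markup, one node per line, indented by depth

print(compact)
print(pretty)
```

The pretty form is meant for reading. It adds whitespace around text nodes, so the compact form is the safer choice when the output will be parsed again.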