Targeted Web Crawling with requests and bs4 <2>
阿新 • Published: 2018-12-15
Installing Beautiful Soup 4
pip install beautifulsoup4
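Basic Beautiful Soup usage
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")   # parse an open file handle
soup = BeautifulSoup("data", "html.parser")                # parse a string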
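As a rough sketch of the two entry points above, the snippet below parses the same markup once from a string and once from a file handle. The markup, the temporary `index.html` file, and all variable names are made up for illustration; `html.parser` is assumed as the parser.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>demo</title></head><body><p>data</p></body></html>"

# 1) Parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Parse from an open file handle; a temporary index.html stands in for a real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(markup)
    with open(path, encoding="utf-8") as handle:
        soup_from_file = BeautifulSoup(handle, "html.parser")

# Both objects describe the same tree.
print(soup_from_string.title.string)
print(soup_from_file.title.string)
```

Using `with open(...)` rather than a bare `open(...)` closes the file once parsing is done.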
>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... """
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
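Elements of the BeautifulSoup class
BeautifulSoup converts an HTML document into a tree of tags.
**Tag:** a tag
**Name:** the tag's name, accessed with .name
**Attributes:** the tag's attributes, accessed with .attrs
**NavigableString:** the non-attribute string inside a tag, accessed with .string
**Comment:** the comment portion of a string inside a tag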
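# Any <tag> that exists in the HTML can be accessed as soup.<tag>.
# When the document contains several matching <tag> elements, soup.<tag> returns the first one.
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# The name of each <tag> is available as <tag>.name
>>> soup.a.name
'a'
# A <tag> can have zero or more attributes, stored as a dictionary
>>> t = soup.a
>>> t.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
>>> t.attrs['href']
'http://example.com/elsie'
# The non-attribute string inside a <tag> is available as <tag>.string
>>> soup.a.string
'Elsie'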
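The Comment type listed above has no example of its own, so here is a minimal sketch. It assumes a made-up snippet in which one tag holds only a comment and another holds plain text; the variable names are illustrative. Because `.string` returns the comment text without the `<!-- -->` markers, checking the type is one way to tell the two apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: <b> contains only a comment, <p> contains plain text.
snippet = "<b><!--This is a comment--></b><p>This is not a comment</p>"
new_soup = BeautifulSoup(snippet, "html.parser")

comment_node = new_soup.b.string
text_node = new_soup.p.string

# Comment is a subclass of NavigableString, so test for Comment first.
is_comment = isinstance(comment_node, Comment)
is_plain_text = isinstance(text_node, NavigableString) and not isinstance(text_node, Comment)

print(type(comment_node).__name__, repr(str(comment_node)))
print(type(text_node).__name__, repr(str(text_node)))
```

When extracting visible text from a page, a type check like this is a simple way to skip comments.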
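Traversing the document tree
Downward traversal
.contents  a list of the tag's child nodes
.children  an iterator over the tag's child nodes, for looping over its direct children
.descendants  an iterator over all of the tag's descendant nodes, for looping over children, grandchildren, and so on
.string  if the tag has exactly one NavigableString child, .string returns that child; otherwise, loop over .strings or .stripped_strings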
>>> soup.body.contents
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
>>> for i in soup.body.children:
...     print(i)
...
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
>>> for i in soup.body.descendants:
...     print(i)
...
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.
<p class="story">...</p>
...
>>> for i in soup.body.strings:
...     print(i)
...
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
# Strings produced by .strings may carry extra spaces or blank lines; .stripped_strings removes the surplus whitespace
>>> for i in soup.body.stripped_strings:
...     print(i)
...
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
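The difference between .strings and .stripped_strings is easier to see when the values are collected into lists, because the whitespace then shows up in each string's repr. The following is a small sketch on a shortened, made-up fragment (not the full document above); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened fragment used only for this comparison.
fragment = """<p class="story">Once upon a time
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>"""
p = BeautifulSoup(fragment, "html.parser").p

# .contents holds direct children only; .descendants also walks into the <a> tags.
direct_children = len(p.contents)
all_descendants = len(list(p.descendants))

# .strings keeps whitespace exactly as parsed; .stripped_strings trims each
# string and skips the ones that are whitespace only.
raw = list(p.strings)
clean = list(p.stripped_strings)

# With more than one child, .string has nothing unambiguous to return.
single_string = p.string

print(direct_children, all_descendants)
print(raw)
print(clean)
print(single_string)
```

Here `raw` still contains the trailing `'\n'` pieces, while `clean` holds only the trimmed text, and `p.string` is `None` because the tag has several children.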
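Upward traversal
.parent  the parent node of an element
.parents  an iterator over all of an element's ancestor nodes, for looping from the parent up to the document root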
>>> soup.a.parent
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
>>> for i in soup.a.parents:
...     print(i)
...
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>> for i in soup.a.parents:
...     print(i.name)
...
p
body
html
[document]
>>> soup.a.string.parent
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> for i in soup.a.string.parents:
...     print(i.name)
...
a
p
body
html
[document]
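One possible use of .parents is to build a readable path from the document root down to a node. The sketch below assumes a tiny made-up document and a hypothetical helper named `ancestor_path`; neither comes from the original session.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document for illustration.
html = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
soup_small = BeautifulSoup(html, "html.parser")


def ancestor_path(node):
    """Join the names of node's ancestors, ordered from the root downward."""
    names = [parent.name for parent in node.parents]  # nearest ancestor first
    return " > ".join(reversed(names))


path_from_tag = ancestor_path(soup_small.a)
path_from_string = ancestor_path(soup_small.a.string)

# The BeautifulSoup object sits at the top of the tree and has no parent.
root_parent = soup_small.parent

print(path_from_tag)
print(path_from_string)
print(root_parent)
```

As in the session above, the walk ends at the BeautifulSoup object itself, whose name is `[document]`.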
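Sibling traversal
.next_sibling  the next sibling node in HTML document order
.previous_sibling  the previous sibling node in HTML document order
.next_siblings  an iterator over all following sibling nodes in HTML document order
.previous_siblings  an iterator over all preceding sibling nodes in HTML document order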
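# If you expect the .next_sibling of the first <a> tag to be the second <a> tag, you would be wrong:
# the actual result is the comma and newline between the first and second <a> tags.
>>> soup.a.next_sibling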
',\n'
>>> soup.a.previous_sibling
'Once upon a time there were three little sisters; and their names were\n'
>>> for sibling in soup.p.next_siblings:
...     print(sibling)
...
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
>>> for sibling in soup.p.previous_siblings:
...     print(sibling)
...
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Once upon a time there were three little sisters; and their names were
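Because siblings include the text between tags, a common need is to step to the next sibling that is actually a tag. The sketch below shows two ways to do that on a shortened, made-up fragment: filtering .next_siblings by type, and calling `find_next_sibling`. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened fragment modeled on the three-sisters paragraph.
html = """<p>names:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;</p>"""
first = BeautifulSoup(html, "html.parser").a

# The immediate sibling is the text between the two tags, not the next <a>.
raw_next = first.next_sibling

# Option 1: keep only the siblings that are Tag objects.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: let Beautiful Soup search for the next sibling with a given tag name.
next_tag = first.find_next_sibling("a")

print(repr(raw_next))
print(following_ids)
print(next_tag["id"])
```

The type filter is handy when all following tags are wanted; `find_next_sibling` is shorter when only the nearest one matters.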
The prettify() method of the bs4 library
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
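prettify() returns an ordinary string, and it can be called on a single tag as well as on the whole soup. The following is a minimal sketch on a made-up one-line snippet; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet for illustration.
small_soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")

# prettify() on the whole document: one tag or string per line, indented by depth.
pretty_document = small_soup.prettify()

# prettify() on a single tag formats only that subtree.
pretty_tag = small_soup.b.prettify()

print(pretty_document)
print(pretty_tag)
```

Since the result is plain text meant for reading, it is usually better to keep working with the soup object and call prettify() only for display or debugging.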