The Beautiful Soup Module

阿新 • Published: 2018-10-30

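Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.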

Quick start, using the following HTML as an example.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this document with BeautifulSoup gives a BeautifulSoup object, which can print the document as a nested structure with standard indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

A few simple ways to navigate the structured data:

soup.title
<title>The Dormouse's story</title>

soup.title.name
'title'

soup.title.string
"The Dormouse's story"

soup.title.strings
<generator object _all_strings at 0x0000025B5572A780>

soup.title.parent.name
'head'

soup.p
<p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
['title']

soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
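
The soup.title.strings call above only shows a generator object, because the strings are produced lazily. The sketch below shows one way to see the values by wrapping the generator in list(). The shortened html_doc and the variable names title_strings and story_strings are made up for this illustration; .stripped_strings is the companion attribute that trims whitespace and skips whitespace-only strings.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example document, so this block runs on its own.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# .strings is a lazy generator; list() materializes its values.
title_strings = list(soup.title.strings)

# .stripped_strings trims each string and skips whitespace-only ones.
story_strings = list(soup.p.stripped_strings)

print(title_strings)
print(story_strings)
```

For a tag with a single text child, such as <title>, .string is usually the simpler choice; .strings is more useful on tags like <p> that mix text and child tags.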

Extract the link URL of every <a> tag in the document:

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
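
The loop above uses link.get('href') rather than link['href'], which matters when an <a> tag has no href attribute. The following is a minimal sketch of that difference, assuming a small made-up fragment with one anchor that lacks an href (the id "no-href" and the variable names links and missing are invented for this example).

```python
from bs4 import BeautifulSoup

# Made-up fragment: the third <a> deliberately has no href attribute.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="no-href">Anchor without a link</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute,
# so indexing with a["href"] is safe inside the comprehension.
links = {a["id"]: a["href"] for a in soup.find_all("a", href=True)}

# .get() returns None for a missing attribute instead of raising KeyError.
missing = soup.find(id="no-href").get("href")

print(links)
print(missing)
```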

Extract all the text from the document:

print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
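
get_text() keeps the line breaks of the source markup, which is why the output above spans several lines. When a single cleaned-up string is wanted, get_text() also accepts a separator and a strip flag. The sketch below uses a shortened, made-up version of the example document; the variable name text is arbitrary.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document, so this block runs on its own.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator is placed between the remaining pieces.
text = soup.get_text(separator=" ", strip=True)

print(text)
```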
