Python 3 Web Scraping 03 (find_all usage and more)

阿新 • Published: 2018-12-05

# Contents of the read1.html file
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p></body></html>
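
The markup above can also be parsed straight from a string, which makes it easy to try the navigation attributes without creating read1.html first. The following is a minimal sketch, not part of the original script: the names HTML_DOC, first_link, parent_names, next_node and next_link are chosen here for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Same markup as read1.html, inlined so the sketch runs without the file.
HTML_DOC = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(HTML_DOC, features="html.parser")

# find() returns a single Tag, which supports the navigation attributes.
first_link = soup.find("a", class_="sister")

# Walk upwards from the link to the document root.
parent_names = [parent.name for parent in first_link.parents]

# The node right after the first link is a text node (the comma), not the next <a>.
next_node = first_link.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
next_link = first_link.find_next_sibling("a")

print(parent_names)
print(repr(next_node))
print(next_link["id"])
```

The point of the sketch is the difference between find(), which returns one Tag, and find_all(), which returns a list-like ResultSet: attributes such as .parents and .next_sibling are only available on a single Tag.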


#!/usr/bin/env python
# -*- coding:UTF-8 -*-

import os
import re
import requests
from bs4 import NavigableString
from bs4 import BeautifulSoup

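curpath=os.path.dirname(os.path.realpath(__file__))
htmlpath=os.path.join(curpath,'read1.html')

# requests.get() only accepts URLs (http/https), not local paths, so read the file directly
with open(htmlpath,"r",encoding="utf-8") as html_file:
    soup=BeautifulSoup(html_file,features="html.parser")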

# print every text node with surrounding whitespace stripped (avoid shadowing the built-in str)
for text in soup.stripped_strings:
    print(repr(text))

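links=soup.find_all(class_="sister")
# find_all() returns a list-like ResultSet; .parents and .next_sibling
# exist only on a single Tag, so navigate from the first match
first_link=links[0]
for parent in first_link.parents:
    print(parent.name)

print(first_link.next_sibling)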

for link in links:
    print(link.next_element)
# after the loop, link still refers to the last match
print(link.next_sibling)

print(link.previous_element)
print(link.previous_sibling)

def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

def not_tillie(href):
    return href and not re.compile("tillie").search(href)

# note: despite its name, this filter rejects ids containing "link2" (Lacie's link)
def not_tillie1(id):
    return id and not re.compile("link2").search(id)

# the lxml parser requires the lxml package to be installed
with open("soup.html","r",encoding="utf-8") as file:
    soup=BeautifulSoup(file,features="lxml")

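# find_all usage
# each call below overwrites tags, so only the result of the last call is printed
tags=soup.find_all(re.compile('^b'))   # regex: tag names starting with "b"
tags=soup.find_all('b')                # exact tag name
tags=soup.find_all(['a','b'])          # any tag name in the list
tags=soup.find_all(has_class_no_id)    # function filter applied to each tag
tags=soup.find_all(True)               # every tag, but no text strings
tags=soup.find_all(href=not_lacie)     # function filter applied to the href value
for tag in tags:
    print(tag.name)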

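# tag filter: keeps tags whose neighbours on both sides are text nodes (defined here but not used below)
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))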

tags=soup.find_all(id=not_tillie1)
for tag in tags:
    print(tag)

tags=soup.find_all(attrs={"id":"link3"})
for tag in tags:
    print(tag)

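soup.find_all(recursive=False)                  # direct children only; the result is discarded here
# CSS selectors with select(); each call overwrites tags, so only the last result is printed
tags=soup.select("body a")                      # <a> anywhere under <body>
tags=soup.select("p > a")                       # <a> that is a direct child of <p>
tags=soup.select("p > #link1")                  # id link1 as a direct child of <p>
tags=soup.select("html head title")             # nested descendants
tags=soup.select(".sister")                     # by CSS class
tags=soup.select("[class~=sister]")             # class attribute contains the word sister
tags=soup.select("#link1 + .sister")            # .sister element immediately following #link1
tags=soup.select("#link1")                      # by id
tags=soup.select("a#link1")                     # <a> with id link1
tags=soup.select("a[href]")                     # <a> that has an href attribute
tags=soup.select('a[href^="http://example"]')   # href starts with the given prefix
tags=soup.select('a[href$="tillie"]')           # href ends with the given suffix
tags=soup.select('a[href*=".com/el"]')          # href contains the given substring
for tag in tags:
    print(tag)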
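
Because the script above overwrites tags on every call, only the last query is ever printed. The sketch below keeps a few of the same find_all and select queries side by side so their results can be compared. It is an illustration under assumptions, not the original author's code: the sample markup is inlined as HTML_DOC, html.parser is used instead of lxml, and the helper ids() and the result variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Same markup as read1.html, inlined so the sketch is self-contained.
HTML_DOC = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(HTML_DOC, features="html.parser")


def ids(tags):
    """Hypothetical helper: collect the id attribute of each tag."""
    return [tag.get("id") for tag in tags]


def not_lacie(href):
    """Attribute filter: accept tags that have an href not containing 'lacie'."""
    return href and not re.search("lacie", href)


# Regex filter on the tag name.
by_regex_name = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Function filter on an attribute value; tags without href receive None.
by_function = ids(soup.find_all(href=not_lacie))

# Exact attribute match via the attrs dictionary.
by_attrs = ids(soup.find_all(attrs={"id": "link3"}))

# CSS equivalents using select().
by_css_suffix = ids(soup.select('a[href$="tillie"]'))
adjacent = ids(soup.select("#link1 + .sister"))

print(by_regex_name)
print(by_function)
print(by_attrs)
print(by_css_suffix)
print(adjacent)
```

As a rough guide, find_all is convenient when the condition is easiest to express in Python (a regex or a function), while select is more compact for structural conditions such as child and sibling relationships.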
