Python Web Scraping: Extracting Data with Regular Expressions

阿新 • Published: 2018-11-14

Related article: Regular Expressions in Linux

Examples:

Matching tags

Matching the title tag

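This matches the page's <title></title> element, i.e. the page title. The pattern .*? matches zero or more characters non-greedily (as few as possible), so it also matches an empty title; use .+? if at least one character is required. Wrapping it in parentheses, (.*?), makes it a capture group, and re.findall then returns only the captured text.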

import re
import requests

resp=requests.get("http://www.baidu.com")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text
title=re.findall(r'<title>.*?</title>',html)  # match the whole <title></title> element
for t in title:
    print(t)
title_value=re.findall(r'<title>(.*?)</title>',html)  # capture only the text inside <title></title>
for t in title_value:
    print(t)
#####################################################################
<title>百度一下，你就知道</title>
百度一下，你就知道
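
The difference between the lazy .*? and the greedy .*, and the fact that .*? can match an empty string, can be checked without any network access. The following is a minimal sketch; the sample string is made up for illustration and is not taken from the Baidu page.

```python
import re

# Hypothetical sample HTML, used only for illustration.
sample = "<title>first</title><title></title><title>third</title>"

# Lazy: stops at the nearest </title>, so each element is matched separately.
lazy = re.findall(r"<title>(.*?)</title>", sample)

# Greedy: runs on to the last </title>, swallowing everything in between.
greedy = re.findall(r"<title>(.*)</title>", sample)

print(lazy)    # ['first', '', 'third']
print(greedy)  # ['first</title><title></title><title>third']
```

The empty string in the lazy result comes from the empty <title></title> element, which shows that .*? accepts zero characters.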

The a tag

Match <a href=""></a> elements and extract the text inside each a tag.

import re
import requests

resp=requests.get("http://www.baidu.com")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text

urls = re.findall(r"<a.*?>.*?</a>", html)   # match every a element (the / needs no escaping)
for u in urls:
    print(u)

texts = re.findall(r"<a.*?>(.*?)</a>", html)   # capture the text between <a> and </a>
for t in texts:
    print(t)
#######################################################################################
<a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a>
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>
<a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a>
<a href=http://v.baidu.com name=tj_trvideo class=mnav>視訊</a>
<a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a>
<a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登入</a>
<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a>
<a href=http://home.baidu.com>關於百度</a>
<a href=http://ir.baidu.com>About Baidu</a>
<a href=http://www.baidu.com/duty/>使用百度前必讀</a>
<a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>
新聞
hao123
地圖
視訊
貼吧
登入
登入
更多產品
關於百度
About Baidu
使用百度前必讀
意見反饋
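
One caveat with <a.*?> is that it also matches any other tag whose name starts with "a", such as <abbr>. A possible refinement, sketched below on a made-up sample string (not the live Baidu page), adds a word boundary after the tag name and captures the attributes and the link text in one pass.

```python
import re

# Hypothetical sample HTML modelled on two lines of the output above,
# plus an <abbr> element that should not be treated as a link.
sample = (
    '<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>'
    '<abbr title="x">not a link</abbr>'
    '<a href=http://ir.baidu.com>About Baidu</a>'
)

# \b after "a" keeps tags such as <abbr> from matching; the two groups
# capture the attribute string and the link text.
pattern = re.compile(r"<a\b([^>]*)>(.*?)</a>", re.I | re.S)

links = pattern.findall(sample)
for attrs, text in links:
    print(text, "->", attrs.strip())
```

With two capture groups, re.findall returns a list of (attributes, text) tuples instead of plain strings.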

The table tag

Extract the contents of a <table></table> element.

Suppose we have the following web page:

<html>
<table class="table">    
            <tr>
                <th>姓名</th>
                <th>性別</th>
            </tr>
            <tr>
                <td>小謝</td>
                <td>男</td>
            </tr>
            <tr>
                <td>小紅</td>
                <td>女</td>
            </tr>
</table>
</html>

Matching code:

import re
import requests

resp=requests.get("http://127.0.0.1/1.html")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text

# match the whole table element
tables=re.findall(r"<table.*?>.*?</table>",html,re.M|re.S)
for table in tables:
    print(table)

# match the content between <tr> and </tr>
# <tr> elements usually span several lines, so re.S is needed to let . match newlines
# (re.M only changes ^ and $, so it has no effect on this pattern)
trs=re.findall(r"<tr>(.*?)</tr>",html,re.S|re.M)
for tr in trs:
    print(tr)

# match the content between <th> and </th>
for row in trs:
    ths=re.findall(r"<th>(.*?)</th>",row,re.S|re.M)
    for th in ths:
        print(th)

# match the content between <td> and </td>
for row in trs:
    tds=re.findall(r"<td>(.*?)</td>",row,re.S|re.M)
    for td in tds:
        print(td)
##################################################################################
<table class="table">    
            <tr>
                <th>姓名</th>
                <th>性別</th>
            </tr>
            <tr>
                <td>小謝</td>
                <td>男</td>
            </tr>
            <tr>
                <td>小紅</td>
                <td>女</td>
            </tr>
</table>

                <th>姓名</th>
                <th>性別</th>
            

                <td>小謝</td>
                <td>男</td>
            

                <td>小紅</td>
                <td>女</td>
            
姓名
性別

小謝
男
小紅
女
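
Printing cells one by one loses the row structure. A minimal sketch of collecting the table into a list of rows follows; the page content is a hypothetical stand-in with the same layout as the sample above, inlined so the code runs without a local web server.

```python
import re

# Hypothetical page with the same structure as the sample above.
html = """
<table class="table">
    <tr>
        <th>Name</th>
        <th>Gender</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>F</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>M</td>
    </tr>
</table>
"""

rows = []
for tr in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.S | re.I):
    # One pattern covers both header (<th>) and data (<td>) cells;
    # the backreference \1 makes the closing tag match the opening one.
    cells = re.findall(r"<(t[hd])\b[^>]*>(.*?)</\1>", tr, re.S | re.I)
    rows.append([text.strip() for _tag, text in cells])

header, *records = rows
print(header)   # ['Name', 'Gender']
print(records)  # [['Alice', 'F'], ['Bob', 'M']]
```

Regular expressions cope with a flat table like this one, but nested tables or cells containing further markup are usually easier to handle with an HTML parser.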

Matching attributes inside tags

Matching the URL in an a tag

Suppose we have this web page:

<html>
	<a href="http://www.baidu.com">百度一下，你就知道</a>
	<a href="http://www.mi.com">小米官網</a>
</html>

import re
import requests

resp=requests.get("http://127.0.0.1/1.html")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text

# lookbehind/lookahead: match only the value inside href="..." or href='...'
urls=re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')",html,re.I|re.S|re.M)
for url in urls:
    print(url)
###################################################################################
http://www.baidu.com
http://www.mi.com
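
The lookaround pattern above repeats the same expression once per quote style. An alternative, sketched here on a made-up sample string, captures the opening quote and uses a backreference to require the same closing quote; it also tolerates spaces around the equals sign.

```python
import re

# Hypothetical sample mixing double-quoted, single-quoted and uppercase attributes.
html = """
<a href="http://www.baidu.com">Baidu</a>
<a HREF='http://www.mi.com'>Xiaomi</a>
"""

# Group 1 captures the opening quote, \1 requires the same closing quote,
# and group 2 is the URL itself.
pairs = re.findall(r"""href\s*=\s*(["'])(.*?)\1""", html, re.I)
urls = [url for _quote, url in pairs]

print(urls)  # ['http://www.baidu.com', 'http://www.mi.com']
```

Neither version handles unquoted values such as href=http://news.baidu.com, which appear in the Baidu output earlier in this article.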

Matching the src in an img tag

Suppose we have this web page:

<html>
	<img src="http://t1.27270.com/uploads/tu/201811/310/f3e9db6b68.jpg"  name="美女"/>
	<img src="http://t1.27270.com/uploads/tu/201811/229/ea7fda100e.jpg" />
</html>

Matching code:

import re
import requests
resp=requests.get("http://127.0.0.1/1.html")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text

srcs=re.findall(r'src="(.*?)"',html,re.I|re.S|re.M)
for src in srcs:
    print(src)
##################################################################
http://t1.27270.com/uploads/tu/201811/310/f3e9db6b68.jpg
http://t1.27270.com/uploads/tu/201811/229/ea7fda100e.jpg


# To get only the image file name, i.e. f3e9db6b68.jpg or ea7fda100e.jpg above:

import re
import requests
resp=requests.get("http://127.0.0.1/1.html")
resp.encoding="utf-8"  # set the response encoding to utf-8
html=resp.text

srcs=re.findall(r'src="(.*?)"',html,re.I|re.S|re.M)
for src in srcs:
    name=src.split("/")[-1]
    print(name)
##################################################################
f3e9db6b68.jpg
ea7fda100e.jpg
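
Splitting on "/" works for the URLs above, but a URL with a query string would give a name like photo.png?size=large&v=2. A slightly more robust sketch uses the standard library's urllib.parse and posixpath; the second URL and the restriction of the pattern to img tags are assumptions added for illustration.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Hypothetical sample; the second URL carries a query string.
html = """
<img src="http://t1.27270.com/uploads/tu/201811/310/f3e9db6b68.jpg" />
<img src="http://example.com/images/photo.png?size=large&v=2" />
"""

names = []
# Only look at src attributes that sit inside an <img ...> tag.
for src in re.findall(r'<img\b[^>]*?\bsrc="(.*?)"', html, re.I | re.S):
    path = urlsplit(src).path              # drops "?size=large&v=2"
    names.append(posixpath.basename(path))

print(names)  # ['f3e9db6b68.jpg', 'photo.png']
```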
