
Selenium3 + Python3 Automation (27): Scraping the Page Source (page_source)

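Preface

Sometimes an element on the page is hard to locate by its attributes. In that case you can pull the information you want out of the page source instead. Selenium's page_source property returns the page source.

What scraping the page source is useful for: for example, you can extract every URL on the page and then request those URLs in batch to check for errors such as 404.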
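The batch check mentioned above could be sketched as follows. This is a minimal sketch, not the author's code: the names fetch_status and find_broken are hypothetical, the example uses the standard-library urllib rather than Selenium, and it assumes the server answers HEAD requests (some servers reject them, in which case a GET would be needed).

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404
        return err.code
    except (OSError, ValueError):
        # Network failure, timeout, unsupported scheme, or malformed URL
        return None


def find_broken(urls, get_status=fetch_status):
    """Return (url, status) pairs whose status is missing or >= 400."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

Passing the status lookup in as get_status keeps the filtering logic testable without network access; in real use, find_broken(url) with the default argument would issue one request per URL.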

I. page_source

1. Selenium's page_source property returns the page source directly as a string.

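II. Non-greedy matching with re

1. The re module needs to be imported here.

2. Match with a regular expression in non-greedy mode.

3. findall returns a list.

4. Some of the matches turn out not to be URL links, so they can be filtered afterwards.

findall finds every substring of the string that matches the regular expression and returns them as a list. If the pattern contains one capturing group, the list holds the captured text of that group. If nothing matches, it returns an empty list.

Syntax: re.findall(pattern, string, flags=0)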
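To make the difference between greedy and non-greedy matching concrete, here is a small sketch on a hand-written HTML snippet. The snippet and the variable names are made up for illustration; only re.findall and the href pattern come from the article.

```python
import re

# A made-up fragment with two links on one line
html = '<a href="/a.html" class="x">A</a> <a href="https://example.com/b">B</a>'

# Greedy: .+ runs on to the LAST double quote in the string
greedy = re.findall(r'href="(.+)"', html)

# Non-greedy: .+? stops at the FIRST double quote after each href="
non_greedy = re.findall(r'href="(.+?)"', html)

# No match at all gives an empty list
nothing = re.findall(r'href="(.+?)"', "no links here")

print(len(greedy), greedy)
print(len(non_greedy), non_greedy)
print(nothing)
```

The greedy pattern yields a single oversized match that swallows the markup between the two links, while the non-greedy pattern yields one entry per href attribute.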

Reference code:

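from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# print(type(driver.page_source))  # page_source is already a str
# encode() gives bytes; str() on bytes gives its repr ("b'...'"), which is still searchable text
rs = driver.page_source.encode("utf-8")
print(type(rs), type(str(rs)))
# Non-greedy capture: stop at the first double quote after href="
aurl = re.findall(r'href="(.+?)"', str(rs))
print(aurl)
driver.quit()  # close the browser when done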

Output:

<class 'bytes'> <class 'str'>
['//common.cnblogs.com/favicon.ico?v=20200522', '/css/blog-common.min.css?v=7Pwqzj5EBy4dBv4DJNI181rFKP8_OF0hT7jO3o8jAa0', '/skins/book/bundle-book-2.min.css', '/skins/book/bundle-book-mobile.min.css?v=XFoR99E4sMNWcYA_LxWBPY7uXp4-8NCPb1RnsUN1Mwo', 'https://www.cnblogs.com/canglongdao/rss', 'https://www.cnblogs.com/canglongdao/rsd.xml', 'https://www.cnblogs.com/canglongdao/wlwmanifest.xml', 'https://www.cnblogs.com/canglongdao/', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13595372', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594914', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594459', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590722', 'https://www.cnblogs.com/canglongdao/archive/2020/08/31.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590348', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13589720', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587969', 'https://www.cnblogs.com/canglongdao/archive/2020/08/30.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587061', 
'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13586938', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13585477', 'https://www.cnblogs.com/canglongdao/default.html?page=2', 'https://www.cnblogs.com/', 'javascript:void(0);', 'javascript:void(0);', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/', 'https://www.cnblogs.com/canglongdao/', 'https://i.cnblogs.com/EditPosts.aspx?opt=1', 'https://msg.cnblogs.com/send/%E6%98%9F%E7%A9%BA6', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/rss/', 'https://i.cnblogs.com/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/followers/', 'https://home.cnblogs.com/u/canglongdao/followees/', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/p/', 'https://www.cnblogs.com/canglongdao/MyComments.html', 'https://www.cnblogs.com/canglongdao/OtherPosts.html', 'https://www.cnblogs.com/canglongdao/RecentComments.html', 'https://www.cnblogs.com/canglongdao/tag/', 'https://www.cnblogs.com/canglongdao/category/1593317.html', 'https://www.cnblogs.com/canglongdao/category/1694849.html', 'https://www.cnblogs.com/canglongdao/category/1633461.html', 'https://www.cnblogs.com/canglongdao/category/1616592.html', 'https://www.cnblogs.com/canglongdao/category/1609028.html', 'https://www.cnblogs.com/canglongdao/category/1633189.html', 'https://www.cnblogs.com/canglongdao/category/1750002.html', 'https://www.cnblogs.com/canglongdao/category/1566249.html', 'https://www.cnblogs.com/canglongdao/category/1606140.html', 'https://www.cnblogs.com/canglongdao/category/1629226.html', 'https://www.cnblogs.com/canglongdao/category/1588735.html', 'https://www.cnblogs.com/canglongdao/category/1815562.html', 
'https://www.cnblogs.com/canglongdao/category/1588084.html', 'https://www.cnblogs.com/canglongdao/category/1589277.html', 'https://www.cnblogs.com/canglongdao/category/1834572.html', 'https://www.cnblogs.com/canglongdao/category/1611757.html', 'https://www.cnblogs.com/canglongdao/category/1589392.html', 'https://www.cnblogs.com/canglongdao/category/1627263.html', 'https://www.cnblogs.com/canglongdao/category/1619655.html', 'https://www.cnblogs.com/canglongdao/category/1657195.html', 'https://www.cnblogs.com/canglongdao/category/1612257.html', 'https://www.cnblogs.com/canglongdao/category/1769926.html', 'https://www.cnblogs.com/canglongdao/category/1635972.html', 'https://www.cnblogs.com/canglongdao/category/1630667.html', 'https://www.cnblogs.com/canglongdao/archive/2020/09.html', 'https://www.cnblogs.com/canglongdao/archive/2020/08.html', 'https://www.cnblogs.com/canglongdao/archive/2020/07.html', 'https://www.cnblogs.com/canglongdao/archive/2020/06.html', 'https://www.cnblogs.com/canglongdao/archive/2020/05.html', 'https://www.cnblogs.com/canglongdao/archive/2020/04.html', 'https://www.cnblogs.com/canglongdao/archive/2020/03.html', 'https://www.cnblogs.com/canglongdao/archive/2020/02.html', 'https://www.cnblogs.com/canglongdao/archive/2020/01.html', 'https://www.cnblogs.com/canglongdao/archive/2019/12.html', 'https://www.cnblogs.com/canglongdao/archive/2019/11.html', 'https://www.cnblogs.com/canglongdao/archive/2019/10.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/12722846.html', 'https://www.cnblogs.com/canglongdao/p/12606952.html', 'https://www.cnblogs.com/canglongdao/p/12019714.html', 'https://www.cnblogs.com/canglongdao/p/12436272.html', 'https://www.cnblogs.com/canglongdao/p/12726642.html', 
'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12067902.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12601894.html', 'https://www.cnblogs.com/canglongdao/p/13414829.html']
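One caveat about the reference code: calling str() on a bytes object does not decode it, it returns the bytes repr. For the ASCII-only links in the output above this makes no visible difference, but non-ASCII characters in an href would come back as escape text. The sketch below shows the effect on a made-up snippet (the html string is an assumption, not taken from the page); since page_source is already a str, running findall on it directly avoids the issue.

```python
import re

# Made-up snippet with a non-ASCII path
html = '<a href="/tag/星空">tags</a>'

# What the reference code does: encode, then str() on the bytes
via_bytes = str(html.encode("utf-8"))

direct = re.findall(r'href="(.+?)"', html)
roundtrip = re.findall(r'href="(.+?)"', via_bytes)

print(via_bytes[:2])  # the repr prefix of a bytes object
print(direct)
print(roundtrip)
```

Here direct keeps the original characters, while roundtrip contains literal backslash-x escape sequences in their place.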

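III. Filtering out the URLs

1. Add an if statement: if 'http' appears in the string, treat it as a normal URL. This drops entries such as 'javascript:void(0);' and the relative paths seen in the output above.

2. Collect all of the URLs into one list; that is the result we want.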
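As a variant of the substring check described above, relative and protocol-relative links can be resolved against the page URL instead of being discarded. This is a sketch under assumptions, not the author's method: the function name absolute_http_links is hypothetical, and the removal of duplicates is an extra step the article does not perform.

```python
from urllib.parse import urljoin, urlsplit


def absolute_http_links(hrefs, base_url):
    """Resolve hrefs against base_url, keep http/https only, drop repeats."""
    links = []
    seen = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Skip javascript:, mailto:, and other non-HTTP schemes
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# A few entries copied from the output shown earlier
sample = [
    "//common.cnblogs.com/favicon.ico?v=20200522",
    "/skins/book/bundle-book-2.min.css",
    "https://www.cnblogs.com/canglongdao/rss",
    "javascript:void(0);",
    "https://www.cnblogs.com/canglongdao/rss",
]
result = absolute_http_links(sample, "https://www.cnblogs.com/canglongdao")
print(len(result), result)
```

Compared with testing for 'http' anywhere in the string, checking the parsed scheme also avoids accepting a relative path that merely happens to contain the text "http".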

Reference code:

# coding:utf-8
from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# print(type(driver.page_source))
rs = driver.page_source.encode("utf-8")
# print(type(rs), type(str(rs)))
# Non-greedy capture of every href value in the page source
aurl = re.findall(r'href="(.+?)"', str(rs))
print(aurl)
# Keep only the entries that contain 'http'
url = []
for i in aurl:
    if 'http' in i:
        url.append(i)
# Final list of URLs
print(len(url), url)
driver.quit()  # close the browser when done

Output: