
Scraping Weibo Check-in Pages (Part 1): Pinning Down the Underlying Scraping Logic

Tags: crawler, big data, selenium, python, chrome

I use Python with Selenium WebDriver to scrape the dynamically generated links.

Testing

1. Connectivity test

from bs4 import BeautifulSoup
import lxml
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
html = browser.page_source
print(html)
soup = BeautifulSoup(html, 'html.parser')
datalist = []

Test result: the page source was retrieved successfully.

2. Extraction test

During extraction I kept getting an empty list [].
To isolate the cause, I scraped Baidu's site as a control and compared the behaviour.
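One likely cause worth checking first: m.weibo.cn builds its feed cards with JavaScript, so any HTML grabbed before rendering finishes (or fetched with plain requests) is just a shell page with no card nodes to match. Below is a minimal diagnostic sketch along those lines; it reuses the class names noted in the comments later in this post (treat the exact selectors as assumptions) and waits for the rendered DOM before parsing.

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=options, executable_path=driver_path)
browser.get(url)

# Wait up to 10 s for at least one feed card to appear in the rendered DOM
# ("card" is an assumed class name taken from the notes later in this post).
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'card'))
)

html = etree.HTML(browser.page_source)
names = html.xpath('//h3[contains(@class, "m-text-cut")]/text()')
print(names)  # should be non-empty once the page has actually rendered
browser.quit()

If this still comes back empty, the problem is more likely access restrictions than parsing, which is what the proxy experiments below try to rule out.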

3. Baidu crawl test

from bs4 import BeautifulSoup
import requests
from lxml import etree
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://www.baidu.com/'
#'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
ht = browser.page_source
#print(ht)
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
soup = BeautifulSoup(ht, 'html.parser')
response = requests.get(url=url, headers=ua)
response.encoding = 'utf-8'
html = etree.HTML(response.text)
a = html.xpath('//*[@id="s-top-left"]/a[1]/text()')
#a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
print(a)
#addr = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]').extract_first()
#name = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/header/div/div/a/h3').extract()
#content = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]').extract()
'''h3 = soup.find_all(name='div', class_='card-list')
for n in h3:
    name = n.find('h3', 'm-text-cut').text
    addr = n.find('span', 'surl-text').text
    content = n.find('div', 'weibo-text').text
'''
'''datalist = []
datalist.append([name, addr, content])
print(datalist)'''
# candidate class names on the Weibo page: card m-panel card9 weibo-member / card-list / m-text-cut

Figure: XPath extraction result from the Baidu test.
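The Baidu XPath succeeds here because that markup is already present in the raw HTML response, whereas the Weibo mobile page only gains its content after JavaScript runs. A quick, hedged way to confirm which case you are in (the 'card-list' marker string is an assumption taken from the class names noted above) is to compare the raw response with the browser-rendered source:

import requests
from selenium import webdriver

weibo_url = ('https://m.weibo.cn/p/index?containerid=23065700428008611000000000000'
             '&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC')
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                    '(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}

# Does the marker appear in the raw HTTP response at all?
raw = requests.get(weibo_url, headers=ua).text
print('card-list in raw response:', 'card-list' in raw)          # likely False

# And does it appear once the browser has executed the page's JavaScript?
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'D:\\scrapy\\chromedriver.exe')
browser.get(weibo_url)
print('card-list in rendered source:', 'card-list' in browser.page_source)
browser.quit()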

4. Building the CSV output format
import pandas as pd  # needed for the DataFrame/CSV step

# name, addr and content come from the extraction step above
datalist = []
datalist.append([name, addr, content])
print(datalist)
lie = ['name', 'addr', 'content']  # column order must match the append order above
test = pd.DataFrame(datalist, columns=lie)
test.to_csv('./test1.csv', index=False)
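One thing to watch once this does return data: each html.xpath(...) call yields a list of strings, so appending [name, addr, content] produces a single row whose cells are whole lists. A minimal sketch (with placeholder values, assuming the three lists line up one-to-one per post) that writes one CSV row per post instead:

import pandas as pd

# Hypothetical extracted lists; in the real script these come from html.xpath(...)
names = ['user1', 'user2']
addrs = ['place A', 'place B']
contents = ['text A', 'text B']

# zip pairs the i-th name/addr/content together, giving one row per post
rows = [[n, a, c] for n, a, c in zip(names, addrs, contents)]
df = pd.DataFrame(rows, columns=['name', 'addr', 'content'])
df.to_csv('./test1.csv', index=False)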

Since no data was extracted, the CSV comes out empty.

The output pipeline itself works, so the problem is probably the lack of proxy configuration.
Next step: buy IP proxies (I got mine from a Taobao seller) and retry the scrape through them.

5. Adding IP proxies

After a first pass at adding proxies, the code looks like this:

from bs4 import BeautifulSoup
import requests
from lxml import etree
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
import pandas as pd
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
#'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
# NOTE: a requests `proxies` dict is keyed by URL scheme, so repeating the
# "http" key means only the last entry below actually takes effect. Real
# rotation needs a list of proxies instead (see the sketch further down).
proxies = {
    "http": "http://111.127.119.230:13456",
    "http": "http://221.230.216.211:13456",
    "http": "http://111.75.125.219:13456",
    "http": "http://.38.241.103:13456",
    "http": "http://223.214.217.45:13456",
    "http": "http://183.4.22.247:13456",
    "http": "http://125.87.93.115:13456",
    "http": "http://114.233.51.125:13456",
    "http": "http://182.38.172.166:13456",
    "http": "http://222.189.191.29:13456",
    "http": "http://121.233.207.136:13456",
    "http": "http://60.184.199.19:13456",
    "http": "http://115.226.128.29:13456",
    "http": "http://121.233.206.155:13456",
    "http": "http://117.91.248.87:13456",
    "http": "http://115.152.230.162:13456",
    "http": "http://115.152.231.172:13456",
    "http": "http://115.196.198.11:13456",
    "http": "http://61.130.131.105:13456",
}
#proxies = "http://"+proxy
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
# NOTE: this Chrome session is not configured with any proxy; only the
# requests call further down actually uses the `proxies` dict.
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
ht = browser.page_source
#print(html)
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
soup = BeautifulSoup(ht,'html.parser')
response = requests.get(url=url, headers=ua, proxies=proxies)
response.encoding = 'utf-8'
html = etree.HTML(response.text)
a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
time.sleep(3)
#a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
print(a)
addr = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
name = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/header/div/div/a/h3/text()')
content = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/text()')
'''h3 = soup.find_all(name='div', class_='card-list')
for n in h3:
    name = n.find('h3','m-text-cut').text
    addr = n.find('span','surl-text').text
    content= n.find('div','weibo-text').text
'''

datalist=[]
datalist.append([name,addr,content])
print(datalist)
lie = ['name', 'addr', 'content']  # column order must match the append order above
test = pd.DataFrame(datalist,columns=lie)
test.to_csv('./test1.csv',index=False)
#card m-panel card9 weibo-member
# card-list
#m-text-cut

But the result is still an empty list. Once more then, nothing to fear!
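Part of the problem is structural: the proxies dict above only affects the requests call, the headless Chrome session never goes through any proxy, and with every key spelled "http" the dict collapses to its last entry anyway. A hedged sketch of wiring one proxy into both paths (the pool entries are just addresses from above; Chrome's --proxy-server flag and the requests proxies format are standard):

import random
import requests
from selenium import webdriver

# Placeholder pool; substitute the addresses bought from the provider.
proxy_pool = [
    '111.127.119.230:13456',
    '221.230.216.211:13456',
]
proxy = random.choice(proxy_pool)

# Route the headless Chrome session through the proxy via Chrome's own flag.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'D:\\scrapy\\chromedriver.exe')

# Route the requests call through the same proxy; one entry per scheme.
proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
response = requests.get('https://m.weibo.cn/', proxies=proxies, timeout=10)
print(response.status_code)
browser.quit()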

6. Improvements

I spent a little money on a one-day pass for an IP proxy service.
Customer support told me that each IP is only valid for about five minutes:
you can hit the provider's extraction URL as often as once a second, and each scrape should use the most recently returned proxy address.
That calls for Redis next, as sketched below.
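A minimal sketch of that rotation, assuming the provider exposes a plain-text extraction URL (the PROXY_API address and its response format here are hypothetical) and caching the current proxy in Redis with a TTL a bit shorter than its five-minute lifetime:

import time
import redis
import requests

# Hypothetical extraction endpoint from the proxy provider; the real URL and
# response format come from the seller's dashboard.
PROXY_API = 'http://proxy-provider.example.com/get?count=1'

r = redis.Redis(host='localhost', port=6379, db=0)

def current_proxy():
    """Return a live proxy, refreshing from the provider when the cached one expires."""
    cached = r.get('weibo:proxy')
    if cached:
        return cached.decode('utf-8')
    proxy = requests.get(PROXY_API, timeout=5).text.strip()  # e.g. "1.2.3.4:13456"
    # Keep the TTL under the 5-minute lifetime so a dead proxy is never reused.
    r.set('weibo:proxy', proxy, ex=270)
    return proxy

if __name__ == '__main__':
    for _ in range(3):
        proxy = current_proxy()
        proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
        resp = requests.get('https://m.weibo.cn/', proxies=proxies, timeout=10)
        print(proxy, resp.status_code)
        time.sleep(1)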

The follow-up steps are covered in Part 2; more to come.