
Scraping Weibo Check-in Pages (Part 1): Pinning Down the Underlying Scraping Logic

Tags: crawler, big data, selenium, python, chrome

I use Python with Selenium WebDriver to scrape the dynamically generated links.

Testing

1. Connectivity test

from bs4 import BeautifulSoup
import lxml
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
html = browser.page_source
print(html)
soup = BeautifulSoup(html, 'html.parser')
datalist = []

Test result: the page source was retrieved successfully.

2. Extraction test

During extraction I kept getting an empty list [].
To isolate the cause, I scraped Baidu's site as a control and compared the behaviour.
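One likely cause worth checking first: m.weibo.cn builds its feed cards with JavaScript, so any HTML grabbed before rendering finishes (or fetched with plain requests) is just a shell page with no card nodes to match. Below is a minimal diagnostic sketch along those lines; it reuses the class names noted in the comments later in this post (treat the exact selectors as assumptions) and waits for the rendered DOM before parsing.

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=options, executable_path=driver_path)
browser.get(url)

# Wait up to 10 s for at least one feed card to appear in the rendered DOM
# ("card" is an assumed class name taken from the notes later in this post).
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'card'))
)

html = etree.HTML(browser.page_source)
names = html.xpath('//h3[contains(@class, "m-text-cut")]/text()')
print(names)  # should be non-empty once the page has actually rendered
browser.quit()

If this still comes back empty, the problem is more likely access restrictions than parsing, which is what the proxy experiments below try to rule out.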

3. Baidu crawl test

from bs4 import BeautifulSoup
import requests
from lxml import etree
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://www.baidu.com/'
#'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
ht = browser.page_source
#print(ht)
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
soup = BeautifulSoup(ht, 'html.parser')
response = requests.get(url=url, headers=ua)
response.encoding = 'utf-8'
html = etree.HTML(response.text)
a = html.xpath('//*[@id="s-top-left"]/a[1]/text()')
#a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
print(a)
#addr = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]').extract_first()
#name = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/header/div/div/a/h3').extract()
#content = response.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]').extract()
'''h3 = soup.find_all(name='div', class_='card-list')
for n in h3:
    name = n.find('h3', 'm-text-cut').text
    addr = n.find('span', 'surl-text').text
    content = n.find('div', 'weibo-text').text
'''
'''datalist = []
datalist.append([name, addr, content])
print(datalist)'''
# candidate class names on the Weibo page: card m-panel card9 weibo-member / card-list / m-text-cut

Figure: XPath extraction result from the Baidu test.
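The Baidu XPath succeeds here because that markup is already present in the raw HTML response, whereas the Weibo mobile page only gains its content after JavaScript runs. A quick, hedged way to confirm which case you are in (the 'card-list' marker string is an assumption taken from the class names noted above) is to compare the raw response with the browser-rendered source:

import requests
from selenium import webdriver

weibo_url = ('https://m.weibo.cn/p/index?containerid=23065700428008611000000000000'
             '&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC')
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                    '(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}

# Does the marker appear in the raw HTTP response at all?
raw = requests.get(weibo_url, headers=ua).text
print('card-list in raw response:', 'card-list' in raw)          # likely False

# And does it appear once the browser has executed the page's JavaScript?
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'D:\\scrapy\\chromedriver.exe')
browser.get(weibo_url)
print('card-list in rendered source:', 'card-list' in browser.page_source)
browser.quit()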

4. Building the CSV output format
import pandas as pd  # needed for the DataFrame/CSV step

# name, addr and content come from the extraction step above
datalist = []
datalist.append([name, addr, content])
print(datalist)
lie = ['name', 'addr', 'content']  # column order must match the append order above
test = pd.DataFrame(datalist, columns=lie)
test.to_csv('./test1.csv', index=False)
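One thing to watch once this does return data: each html.xpath(...) call yields a list of strings, so appending [name, addr, content] produces a single row whose cells are whole lists. A minimal sketch (with placeholder values, assuming the three lists line up one-to-one per post) that writes one CSV row per post instead:

import pandas as pd

# Hypothetical extracted lists; in the real script these come from html.xpath(...)
names = ['user1', 'user2']
addrs = ['place A', 'place B']
contents = ['text A', 'text B']

# zip pairs the i-th name/addr/content together, giving one row per post
rows = [[n, a, c] for n, a, c in zip(names, addrs, contents)]
df = pd.DataFrame(rows, columns=['name', 'addr', 'content'])
df.to_csv('./test1.csv', index=False)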

Since no data was extracted, the CSV comes out empty.

The output pipeline itself works, so the problem is probably the lack of proxy configuration.
Next step: buy IP proxies (I got mine from a Taobao seller) and retry the scrape through them.

5. Adding IP proxies

After a first pass at adding proxies, the code looks like this:

from bs4 import BeautifulSoup
import requests
from lxml import etree
import time
from selenium.webdriver import ActionChains
from selenium import webdriver
import pandas as pd
driver_path = r'D:\\scrapy\\chromedriver.exe'
url = 'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
#'https://m.weibo.cn/p/index?containerid=23065700428008611000000000000&luicode=10000011&lfid=100103type%3D1%26q%3D%E5%8C%97%E4%BA%AC'
# NOTE: a requests `proxies` dict is keyed by URL scheme, so repeating the
# "http" key means only the last entry below actually takes effect. Real
# rotation needs a list of proxies instead (see the sketch further down).
proxies = {
    "http": "http://111.127.119.230:13456",
    "http": "http://221.230.216.211:13456",
    "http": "http://111.75.125.219:13456",
    "http": "http://.38.241.103:13456",
    "http": "http://223.214.217.45:13456",
    "http": "http://183.4.22.247:13456",
    "http": "http://125.87.93.115:13456",
    "http": "http://114.233.51.125:13456",
    "http": "http://182.38.172.166:13456",
    "http": "http://222.189.191.29:13456",
    "http": "http://121.233.207.136:13456",
    "http": "http://60.184.199.19:13456",
    "http": "http://115.226.128.29:13456",
    "http": "http://121.233.206.155:13456",
    "http": "http://117.91.248.87:13456",
    "http": "http://115.152.230.162:13456",
    "http": "http://115.152.231.172:13456",
    "http": "http://115.196.198.11:13456",
    "http": "http://61.130.131.105:13456",
}
#proxies = "http://"+proxy
chrome_options = webdriver.ChromeOptions()  # build the options object
chrome_options.add_argument('--headless')   # run Chrome without a visible window
# NOTE: this Chrome session is not configured with any proxy; only the
# requests call further down actually uses the `proxies` dict.
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=driver_path)
browser.get(url)
ht = browser.page_source
#print(html)
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
soup = BeautifulSoup(ht,'html.parser')
response = requests.get(url=url, headers=ua, proxies=proxies)
response.encoding = 'utf-8'
html = etree.HTML(response.text)
a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
time.sleep(3)
#a = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
print(a)
addr = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/a/span[2]/text()')
name = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/header/div/div/a/h3/text()')
content = html.xpath('//*[@id="app"]/div[1]/div[2]/div/div/div/div/div/div/article/div/div[1]/text()')
'''h3 = soup.find_all(name='div', class_='card-list')
for n in h3:
    name = n.find('h3','m-text-cut').text
    addr = n.find('span','surl-text').text
    content= n.find('div','weibo-text').text
'''

datalist=[]
datalist.append([name,addr,content])
print(datalist)
lie = ['name', 'addr', 'content']  # column order must match the append order above
test = pd.DataFrame(datalist,columns=lie)
test.to_csv('./test1.csv',index=False)
#card m-panel card9 weibo-member
# card-list
#m-text-cut

But the result is still an empty list. Once more then, nothing to fear!
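Part of the problem is structural: the proxies dict above only affects the requests call, the headless Chrome session never goes through any proxy, and with every key spelled "http" the dict collapses to its last entry anyway. A hedged sketch of wiring one proxy into both paths (the pool entries are just addresses from above; Chrome's --proxy-server flag and the requests proxies format are standard):

import random
import requests
from selenium import webdriver

# Placeholder pool; substitute the addresses bought from the provider.
proxy_pool = [
    '111.127.119.230:13456',
    '221.230.216.211:13456',
]
proxy = random.choice(proxy_pool)

# Route the headless Chrome session through the proxy via Chrome's own flag.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'D:\\scrapy\\chromedriver.exe')

# Route the requests call through the same proxy; one entry per scheme.
proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
response = requests.get('https://m.weibo.cn/', proxies=proxies, timeout=10)
print(response.status_code)
browser.quit()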

6. Improvements

I spent a little money on a one-day pass for an IP proxy service.
Customer support told me that each IP is only valid for about five minutes:
you can hit the provider's extraction URL as often as once a second, and each scrape should use the most recently returned proxy address.
That calls for Redis next, as sketched below.
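A minimal sketch of that rotation, assuming the provider exposes a plain-text extraction URL (the PROXY_API address and its response format here are hypothetical) and caching the current proxy in Redis with a TTL a bit shorter than its five-minute lifetime:

import time
import redis
import requests

# Hypothetical extraction endpoint from the proxy provider; the real URL and
# response format come from the seller's dashboard.
PROXY_API = 'http://proxy-provider.example.com/get?count=1'

r = redis.Redis(host='localhost', port=6379, db=0)

def current_proxy():
    """Return a live proxy, refreshing from the provider when the cached one expires."""
    cached = r.get('weibo:proxy')
    if cached:
        return cached.decode('utf-8')
    proxy = requests.get(PROXY_API, timeout=5).text.strip()  # e.g. "1.2.3.4:13456"
    # Keep the TTL under the 5-minute lifetime so a dead proxy is never reused.
    r.set('weibo:proxy', proxy, ex=270)
    return proxy

if __name__ == '__main__':
    for _ in range(3):
        proxy = current_proxy()
        proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
        resp = requests.get('https://m.weibo.cn/', proxies=proxies, timeout=10)
        print(proxy, resp.status_code)
        time.sleep(1)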

The follow-up steps are covered in Part 2; more to come.