Scraping Baidu's Real-Time Hot Topics Ranking with Python

阿新 • Published: 2018-12-31

Today's target is Baidu's real-time hot topics ranking.

As usual, I first download the page to a local file:

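import requests

def downhtml():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Pass the url variable, not the literal string 'url'
    r = requests.get(url, headers=headers)
    # Save the raw bytes so the page can be analysed offline
    with open('C:/Code/info_baidu.html', 'wb') as f:
        f.write(r.content)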

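I'm in the habit of saving the whole page locally before analysing the data, hence this step; the code that fetches and parses in one go is posted further down.

Now to analyse the data:

I want three values: the rank, the keyword, and the search index.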

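Opening the page source shows that each topic's fields are td cells wrapped in a single tr tag, one tr per topic (note that the tr tags of the top three topics have class='hideline', while the later ones do not).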

Rank: the first td, class='first'

Keyword: the second td, class='keyword'

Search index: the last td, class='last'
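To make this layout concrete, here is a minimal sketch that walks a made-up table with the same shape. It uses only Python's standard-library html.parser as a stand-in for the BeautifulSoup used in this post, and everything in it is an assumption for illustration: the SAMPLE markup is invented (it is not real Baidu data), the row without the three classes is there only to show what a tr lacking them looks like, and RowCollector is a hypothetical name.

```python
from html.parser import HTMLParser

# Hypothetical sample mirroring the layout described above (not real Baidu data).
SAMPLE = """
<table>
  <tr><th class="first">rank</th><th>keyword</th><th class="last">index</th></tr>
  <tr class="hideline">
    <td class="first">1</td><td class="keyword">topic A</td><td class="last">100</td>
  </tr>
  <tr><td colspan="3">extra detail row</td></tr>
  <tr>
    <td class="first">2</td><td class="keyword">topic B</td><td class="last">90</td>
  </tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect, for each <tr>, the text of its <td> cells keyed by class name."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one dict per <tr>
        self._cls = None   # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "td" and self.rows:
            self._cls = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cls = None

    def handle_data(self, data):
        # Only text inside a classed <td> is kept; <th> cells are ignored.
        if self._cls and self.rows:
            row = self.rows[-1]
            row[self._cls] = row.get(self._cls, "") + data.strip()


parser = RowCollector()
parser.feed(SAMPLE)
for row in parser.rows:
    print(row)
```

The header row and the row without the three classes come back as empty dicts, while each topic row carries the keys first, keyword and last.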

With the locations of the data I need pinned down, I can start writing code.

First, a function that opens the local HTML and returns it for BeautifulSoup to use:

def send_html():  # Hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    # The with block closes the file even if read() raises
    with open(path, 'r') as htmlfile:
        return htmlfile.read()

This way I can test against the local HTML in the code below instead of requesting Baidu's server every time.
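The download-once-then-reuse habit can also be sketched as a small cache helper. This is only an illustration, not part of the original code: load_cached and the fetch callable are hypothetical names, and in practice fetch would wrap the requests.get call.

```python
from pathlib import Path


def load_cached(path, fetch):
    """Return the page bytes from `path`, calling `fetch()` only if the file is missing.

    `fetch` is any zero-argument callable that returns the page as bytes
    (hypothetical; for example a wrapper around requests.get(...).content).
    """
    cache = Path(path)
    if not cache.exists():
        # First run: hit the network once and keep a local copy.
        cache.write_bytes(fetch())
    # Later runs: read the saved copy instead of requesting the server again.
    return cache.read_bytes()
```

A call might look like load_cached('C:/Code/info_baidu.html', lambda: requests.get(url, headers=headers).content). Deleting the file forces a fresh download on the next run.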

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]  # slice off the first tr

The slice is there because the first tr holds this:

<tr>
        <th width="50" class="first">排名</th>
        <th>關鍵詞</th>
        <th width="30%" class="tc">相關連結</th>
        <th width="20%" class="last">搜尋指數</th>
    </tr>

It is the header row, not the top-ranked topic, so I filter it out with a slice.

Then assign each value in turn:

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last').get_text()  # search index
        topic_rank = each_topic.find('td',class_='first').get_text()  # rank
        topic_name = each_topic.find('td',class_='keyword').get_text()  # title
        print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

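In theory this should print the results, but Baidu wanted to make things a little harder for me.

Several problems showed up:

1: AttributeError: 'NoneType' object has no attribute 'get_text'

2: The output format

3: Only one value is printed

As usual, the first problem probably means that some of the lookups returned something that is not a Tag, so let's test: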

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last')  # search index
        print(type(topic_times))

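The output shows NoneType mixed in among the first few values: find() returns None when a tr has no td with that class. (I looked at the page source and couldn't tell what causes those rows; I'll come back once I know!)

So all we need to do is filter out the None values.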
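One way to fold the None check and the text extraction into a single step is a small helper. This is a sketch rather than the original code: cell_text and the stand-in FakeCell class are hypothetical, and FakeCell imitates only the get_text() method of a BeautifulSoup Tag so that the example runs on its own.

```python
class FakeCell:
    """Stand-in for a bs4 Tag; only get_text() is imitated (hypothetical)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


def cell_text(cell):
    """Return the cell's text, or None when the lookup found nothing."""
    if cell is None:
        return None
    return cell.get_text()


# Each tuple plays the role of the (rank, keyword, index) lookups for one <tr>.
rows = [
    (FakeCell("1"), FakeCell("topic A"), FakeCell("100")),
    (None, None, None),  # a row where every lookup missed
    (FakeCell("2"), FakeCell("topic B"), FakeCell("90")),
]

kept = []
for cells in rows:
    texts = [cell_text(c) for c in cells]
    if None in texts:
        continue  # skip rows that lack any of the three cells
    kept.append(tuple(texts))

print(kept)
```

Testing with "is None" / "is not None" rather than "!= None" is the form PEP 8 recommends for comparisons with None.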

def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics=soup.find_all('tr')[1:]
    for each_topic in all_topics:
        #print(each_topic)
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Only rows where all three cells were found are real topics
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text()
            topic_name = topic_name.get_text()
            topic_times = topic_times.get_text()
            print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

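That solves the first problem: the output now works, and the third problem is solved along with it.

The second problem remains, though. This crappy formatting bothers me; my guess is that get_text() also returns the spaces and newlines around the text.

So replace() should take care of it.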
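Chained replace() calls remove only ordinary spaces and newlines. As a hedged alternative, str.split() with no arguments splits on any run of whitespace (tabs and carriage returns included), so joining the pieces back together gives a more general cleanup. The helper names squash and tidy and the sample string are made up for this sketch.

```python
def squash(text):
    """Remove every run of whitespace (spaces, tabs, newlines, ...)."""
    return "".join(text.split())


def tidy(text):
    """Trim both ends and collapse inner whitespace runs to a single space."""
    return " ".join(text.split())


# Invented sample resembling what get_text() returns for an indented cell.
raw = "\n    hot   topic\tA \r\n"
print(repr(squash(raw)))
print(repr(tidy(raw)))
```

squash matches the effect of the replace() chain (all spaces gone, including those inside a keyword), while tidy keeps one space between words, which may be preferable for keywords that contain spaces.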

        if topic_rank is not None and topic_name is not None and topic_times is not None:
            # Strip the spaces and newlines that get_text() returns with the text
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank,topic_name,topic_times))

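Oh ho, now the output looks pretty good.

But the obsessive-compulsive in me is still uncomfortable: the heat (search index) column is too ragged.

After some searching (the power of fellow netizens is strong, hahaha), a fix turned up right away.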

        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # {1:{3}^15} centres the title in 15 characters, padded with argument 3,
            # the full-width space chr(12288)
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))

With this output, the obsessive-compulsive in me is finally satisfied, haha.

Here is the complete code:
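The alignment trick rests on Python's format specification: in {1:{3}^15}, the caret centres field 1 in 15 characters and the nested {3} supplies the fill character, here chr(12288), the full-width space U+3000. Chinese characters are usually drawn wider than an ASCII space, so padding with a full-width space keeps the columns lined up. A minimal sketch of the same idea, where center_cjk is a hypothetical helper name and the sample strings are arbitrary:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def center_cjk(text, width):
    """Centre `text` in `width` characters, padding with full-width spaces."""
    # Nested fields: {1} is the fill character, {2} is the width.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


for name in ["熱點", "實時熱點排行"]:
    ascii_pad = "{0:^8}".format(name)  # padded with ordinary spaces
    cjk_pad = center_cjk(name, 8)      # padded with U+3000
    print("|" + ascii_pad + "|")
    print("|" + cjk_pad + "|")
```

Both variants produce strings of the same length; in a terminal with a CJK-aware monospace font, the closing bars of the full-width-padded lines should line up while the ASCII-padded ones drift.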

import requests
from bs4 import BeautifulSoup


def send_html():  # Hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    with open(path, 'r') as htmlfile:
        return htmlfile.read()


def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics = soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Skip rows where any of the three cells is missing
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # Pad the title with full-width spaces so the columns line up
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))


if __name__ == '__main__':
    get_pages(send_html())


And here is the complete code that scrapes directly, without downloading the page first:

import requests
from bs4 import BeautifulSoup


def get_html(url,headers):
    r = requests.get(url,headers=headers)
    # Let requests guess the encoding from the body before decoding to text
    r.encoding = r.apparent_encoding
    return r.text


def get_pages(html):
    soup = BeautifulSoup(html,'html.parser')
    all_topics = soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td',class_='last')  # search index
        topic_rank = each_topic.find('td',class_='first')  # rank
        topic_name = each_topic.find('td',class_='keyword')  # title
        # Skip rows where any of the three cells is missing
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ','').replace('\n','')
            topic_name = topic_name.get_text().replace(' ','').replace('\n','')
            topic_times = topic_times.get_text().replace(' ','').replace('\n','')
            # Pad the title with full-width spaces so the columns line up
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^8}"
            print(tplt.format(topic_rank,topic_name,topic_times,chr(12288)))


def main():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent':'Mozilla/5.0'}
    html = get_html(url,headers)
    get_pages(html)


if __name__ == '__main__':
    main()

That's it. Task complete; enjoy life!
