Scraping a news list with the requests and BeautifulSoup4 libraries
阿新 · Published: 2017-09-28
Plot and display the word cloud:
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the text, segment it with jieba, and join the words with spaces
txt = open("zuihou.txt", "r", encoding='utf-8').read()
wordlist = jieba.lcut(txt)
wl_split = " ".join(wordlist)

# Generate and display the word cloud
mywc = WordCloud().generate(wl_split)
plt.imshow(mywc)
plt.axis("off")
plt.show()
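Note: with Chinese text, WordCloud's default font usually cannot render the characters and the cloud comes out as empty boxes. A minimal sketch of the fix, assuming a Chinese-capable font file such as simhei.ttf is available locally (the font path here is an assumption; adjust it to your system):

# Assumption: simhei.ttf (or any Chinese-capable font) exists at this path
mywc = WordCloud(font_path="simhei.ttf",      # hypothetical font path; adjust as needed
                 background_color="white",
                 width=800, height=600).generate(wl_split)
plt.imshow(mywc)
plt.axis("off")
plt.show()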
Result:
Use the requests and BeautifulSoup4 libraries to scrape the campus news list: time, title, link, source, and full content.
Crawling the page information:
import requests
from bs4 import BeautifulSoup

gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Each news item is an <li>; only those containing .news-list-title are real entries
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        url = news.select('a')[0]['href']
        print(title, url)
Result:
Add the time:
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        url = news.select('a')[0]['href']
        # The first child of .news-list-info is the publication time
        time = news.select('.news-list-info')[0].contents[0].text
        print(time, title, url)
Output:
- Convert the time string into a datetime object (see the short sketch after this list).
- Wrap the code that fetches the full article content into a function.
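A minimal sketch of the str-to-datetime conversion on its own, using a hypothetical date string in the same '%Y-%m-%d' format the news list uses:

from datetime import datetime

# Hypothetical example string in the list's date format
s = '2017-09-28'
dt = datetime.strptime(s, '%Y-%m-%d')   # parse str -> datetime
print(dt, type(dt))                     # 2017-09-28 00:00:00 <class 'datetime.datetime'>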
import requests
from bs4 import BeautifulSoup
from datetime import datetime

gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Fetch one news page and return its full content (.show-content)
def getdetail(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    return soupd.select('.show-content')[0].text

for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        url = news.select('a')[0]['href']
        time = news.select('.news-list-info')[0].contents[0].text
        # Convert the time string into a datetime object
        dt = datetime.strptime(time, '%Y-%m-%d')
        # The second child of .news-list-info is the source
        source = news.select('.news-list-info')[0].contents[1].text
        detail = getdetail(url)
        print(dt, title, url, source, detail)
Result:
- Pick a topic you are interested in and do a similar scrape, in preparation for the later "scrape web data and analyze the text" exercise.
import requests
from bs4 import BeautifulSoup
from datetime import datetime

gzccurl = 'http://www.lbldy.com/tag/gqdy/'
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Helper carried over from the previous exercise (not used in the loop below)
def getdetail(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    return soupd.select('.show-content')[0].text

# On this site each entry is an <h4>
for news in soup.select('h4'):
    print(news)
Result:
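The loop above only prints the raw <h4> elements. As a next step toward text analysis, here is a minimal sketch for pulling the title and link out of each entry, assuming each <h4> on this page wraps an <a> tag (the selector and page structure are assumptions about the site's markup):

for news in soup.select('h4'):
    a = news.select('a')
    if len(a) > 0:                 # skip <h4> tags without a link
        title = a[0].text
        url = a[0]['href']
        print(title, url)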