Scraping text posts from Qiushibaike (糗事百科) with Python

阿新 • Published: 2018-12-31

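Lately I have been processing a fair amount of data with Python, and along the way I taught myself a little web scraping, mainly with requests and BeautifulSoup. The example below scrapes text posts from Qiushibaike and stores each post as one tab-separated line in the form (user_name, user_picture, qiushi, [good_cmt]); good_cmt may be absent.

The code is as follows: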

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: yefeng
"""

import re

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # URL prefix of Qiushibaike's text-only section
    root = "http://www.qiushibaike.com/text/page/"
    # Pages to fetch: 1-3 here; adjust the range freely (e.g. 1-20).
    url_list = [root + str(i) for i in range(1, 4)]
    cnt = 0  # running count of posts written
    # Output path; utf-8 is set explicitly so Chinese text is written correctly on any platform
    with open("qiushibaike_data.txt", "w", encoding="utf-8") as fout:
        for url in url_list:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "html.parser")
            # Every post on a page has an id starting with qiushi_tag_, so match it with a regex
            cont_list = soup.find_all(id=re.compile(r"qiushi_tag_\d+"))
            for cont in cont_list:
                user_picture = "null"
                user_name = "null"
                # Author block; find() returns None when it is missing, so guard before find_all()
                author = cont.find(class_="author")
                user_info = author.find_all("a") if author is not None else []
                if len(user_info) >= 2:
                    user_picture = user_info[0].find("img")["src"]  # avatar URL
                    user_name = user_info[1].find("h2").text  # nickname
                # Alternative: cont.find(class_="content").find("span").text
                qiushi = cont.find("span").text
                good_cmt = cont.find(class_="indexGodCmt")  # best comment, may be absent
                if good_cmt is not None:
                    good_cmt = good_cmt.find(class_="main-text").get_text("|||", strip=True)
                    fout.write("%s\t%s\t%s\t%s\n" % (user_name, user_picture, qiushi, good_cmt))
                else:
                    fout.write("%s\t%s\t%s\n" % (user_name, user_picture, qiushi))
                cnt = cnt + 1
            print(url, cnt)
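
To show how the stored format can be consumed, here is a minimal sketch for reading records back. It is not part of the original script: QiushiRecord and parse_line are hypothetical helper names, and the sketch assumes every record sits on a single line with no tab or newline inside a field. The script above does not strip such characters from the post text, so real output may need cleaning before this works.

```python
from typing import NamedTuple, Optional


class QiushiRecord(NamedTuple):
    """One stored post (hypothetical helper type, not in the original script)."""
    user_name: str
    user_picture: str
    qiushi: str
    good_cmt: Optional[str]  # None when the post has no best comment


def parse_line(line: str) -> QiushiRecord:
    """Split one tab-separated line into its 3 or 4 fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        # Assumption violated: a field probably contained a tab or newline
        raise ValueError("expected 3 or 4 tab-separated fields, got %d" % len(fields))
    good_cmt = fields[3] if len(fields) == 4 else None
    return QiushiRecord(fields[0], fields[1], fields[2], good_cmt)


def load_records(path: str) -> list:
    """Read every non-empty line of the data file into a QiushiRecord."""
    with open(path, encoding="utf-8") as fin:
        return [parse_line(line) for line in fin if line.strip()]
```

With the file produced above, load_records("qiushibaike_data.txt") would return a list of records. Because the best comment was joined with "|||", record.good_cmt.split("|||") recovers its individual text pieces when good_cmt is not None.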
