Qiushibaike (糗事百科) Example
阿新 · Published 2018-06-17
Crawl the Qiushibaike jokes at http://www.qiushibaike.com/8hr/page/

- Use requests to fetch each page and XPath to extract the data.
- From every post, pull the user's avatar link, username, joke content, upvote count, and comment count (the XPath expressions involved are illustrated in the short sketch after this list).
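To make those XPath expressions concrete, here is a minimal, self-contained sketch that runs them against a made-up HTML fragment shaped like a Qiushibaike post. The fragment and its values are invented for illustration only; the selectors are the same ones used in the full script below.

# -*- coding:utf-8 -*-
from lxml import etree

# Invented fragment that mimics the structure the XPath expressions expect
sample = '''
<div id="qiushi_tag_12345">
  <div>
    <a href="/users/1"><img src="//pic.example.com/avatar.jpg" alt="someuser"/></a>
    <h2>someuser</h2>
  </div>
  <div class="content"><span>A short joke goes here.</span></div>
  <i class="number">123</i>
  <i class="number">45</i>
</div>
'''

html = etree.HTML(sample)
for site in html.xpath('//div[contains(@id, "qiushi_tag")]'):
    print(site.xpath('./div/a/img/@src')[0])                    # avatar link
    print(site.xpath('.//h2')[0].text)                          # username
    print(site.xpath('.//div[@class="content"]/span')[0].text)  # joke content
    print(site.xpath('.//i')[0].text)                           # upvotes
    print(site.xpath('.//i')[1].text)                           # comments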
# -*- coding:utf-8 -*-
import requests
from lxml import etree

def loadPage(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8'}
    try:
        response = requests.get(url, headers=headers)
        resHtml = response.text
        html = etree.HTML(resHtml)
        # Every post lives in a div whose id contains "qiushi_tag"
        result = html.xpath('//div[contains(@id,"qiushi_tag")]')
        for site in result:
            # Avatar link
            imgUrl = site.xpath('./div/a/img/@src')[0].encode('utf-8')
            # Username (also available as .//img/@alt)
            username = site.xpath('.//h2')[0].text
            # Joke content
            content = site.xpath('.//div[@class="content"]/span')[0].text.strip().encode('utf-8')
            # Vote count
            vote = site.xpath('.//i')[0].text
            # Comment count
            comments = site.xpath('.//i')[1].text
            print imgUrl, username, content, vote, comments
    except Exception as e:
        print e


def qiushiSpider(url, beginPage, endPage):
    """
    Spider scheduler: builds the URL of each page and hands it to loadPage.
    url       : the fixed front part of the page URL
    beginPage : first page to crawl
    endPage   : last page to crawl
    """
    for page in range(beginPage, endPage + 1):
        fullurl = url + str(page)
        loadPage(fullurl)


if __name__ == "__main__":
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))
    url = 'http://www.qiushibaike.com/8hr/page/'
    qiushiSpider(url, beginPage, endPage)
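The script above targets Python 2 (print statements, raw_input, str.encode calls). If it has to run under Python 3, a rough, untested adaptation of the same fetch-and-parse flow might look like the sketch below; the encode('utf-8') calls go away because Python 3 strings are already Unicode, and raw_input becomes input.

# -*- coding:utf-8 -*-
# Hedged Python 3 sketch of the same requests + XPath flow (not the original script)
import requests
from lxml import etree

def load_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8'}
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    for site in html.xpath('//div[contains(@id, "qiushi_tag")]'):
        img_url = site.xpath('./div/a/img/@src')[0]
        username = site.xpath('.//h2')[0].text
        content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
        vote = site.xpath('.//i')[0].text
        comments = site.xpath('.//i')[1].text
        print(img_url, username, content, vote, comments)

if __name__ == "__main__":
    begin_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    for page in range(begin_page, end_page + 1):
        load_page('http://www.qiushibaike.com/8hr/page/' + str(page))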
Save the results to a JSON file
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
import json
from lxml import etree


def loadPage(url):
    headers = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()
    # The response is a raw string; parse it into an HTML DOM
    text = etree.HTML(html)
    # contains() does a substring match: the first argument is the attribute to
    # check, the second is the substring to look for; this returns every post node
    node_list = text.xpath('//div[contains(@id, "qiushi_tag")]')

    for node in node_list:
        # xpath returns a list; take the single element by index -- username
        username = node.xpath('.//img/@alt')[0]
        # Image link (may be missing if the post has no picture)
        image = node.xpath('.//div[@class="thumb"]//@src')
        image = image[0] if image else ""
        # Text inside the tag -- joke content
        content = node.xpath('.//div[@class="content"]/span')[0].text
        # Vote count
        zan = node.xpath('.//i')[0].text
        # Comment count
        comments = node.xpath('.//i')[1].text

        items = {
            "username" : username,
            "image" : image,
            "content" : content,
            "zan" : zan,
            "comments" : comments
        }

        # Append one JSON object per line, inside the loop so every post is saved
        with open("qiushi.json", "a") as f:
            f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")


def qiushiSpider(url, beginPage, endPage):
    """
    Spider scheduler: builds the URL of each page and hands it to loadPage.
    url       : the fixed front part of the page URL
    beginPage : first page to crawl
    endPage   : last page to crawl
    """
    for page in range(beginPage, endPage + 1):
        fullurl = url + str(page)
        loadPage(fullurl)


if __name__ == "__main__":
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))
    url = 'http://www.qiushibaike.com/8hr/page/'
    qiushiSpider(url, beginPage, endPage)
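Because every post is appended as its own json.dumps(...) line, qiushi.json ends up holding one JSON object per line rather than a single JSON array. Assuming the file was produced by the script above, a small sketch for reading it back could look like this:

# -*- coding:utf-8 -*-
import io
import json

# qiushi.json holds one UTF-8 encoded JSON object per line
with io.open("qiushi.json", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        print("%s  zan=%s  comments=%s" % (item["username"], item["zan"], item["comments"]))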