Python 3 crawler: scraping meme images from www.doutula.com

阿新 • Published: 2018-10-21


Basic (single-threaded) crawl:

# -*- coding:utf-8 -*-
# author:zxy
# Date:2018-10-21
import os
import re
import time
from urllib import request

import requests
from lxml import etree

# Download directory used in the original post; change it to suit your machine.
SAVE_DIR = 'D:\\我的圖片\\emoticon'


def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Cookie': '__cfduid=ddb28ef1934faef742f7fb8911d7b33bd1540080067; UM_distinctid=16693ece9945b2-0b031da4b19f32-333b5602-1fa400-16693ece9958e4;'
                  ' _ga=GA1.2.1950184368.1540080070; _gid=GA1.2.1249143498.1540080070; _gat=1'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    # Select every image in the list container except the "gif" placeholder images.
    imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
    for img in imgs:
        img_url = img.get('data-original')
        alt = img.get('alt')
        if not img_url or not alt:
            continue  # skip images that lack a source URL or a description
        # Strip punctuation from the description so it can serve as a filename.
        alt = re.sub(r'[\?？\.、!！，,]', '', alt)
        # Keep only the first four characters of the extension (e.g. ".jpg"),
        # which drops anything appended after it in the URL.
        suffix = os.path.splitext(img_url)[1][0:4]
        filename = alt + suffix
        request.urlretrieve(img_url, os.path.join(SAVE_DIR, filename))


def main():
    os.makedirs(SAVE_DIR, exist_ok=True)  # urlretrieve fails if the directory is missing
    for x in range(1, 101):
        url = "http://www.doutula.com/photo/list/?page=%d" % x
        parse_page(url)
        time.sleep(1)  # pause between pages to go easy on the server
    # To test a single page:
    # url = "http://www.doutula.com/photo/list/?page=1"
    # parse_page(url)


if __name__ == '__main__':
    main()
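
The filename handling inside parse_page can be pulled out into a small helper and checked without touching the network. The following is only a sketch: the helper name build_filename, the 'unnamed' fallback, and the extra characters in the pattern (those that Windows forbids in filenames) are assumptions added for illustration, not part of the original script.

```python
import os
import re
from urllib.parse import urlparse

# Punctuation stripped from the alt text. The first part follows the article's
# pattern; the trailing \ / : " < > | are an added assumption, since Windows
# rejects them in filenames.
_PUNCT = re.compile(r'[\?？\.、!！，,\*\\/:"<>|]')


def build_filename(img_url, alt):
    """Build a filename from the image description and the URL's extension."""
    # Fall back to a placeholder name when the description is missing or empty.
    name = _PUNCT.sub('', alt or '').strip() or 'unnamed'
    # Use only the path component, so a query string cannot leak into the suffix.
    path = urlparse(img_url).path
    # Keep the first four characters of the extension, as the article does.
    suffix = os.path.splitext(path)[1][:4]
    return name + suffix


print(build_filename('http://example.com/a/b/pic.jpg!dta', '你好？！'))
```

Truncating the extension to four characters is what turns a suffix such as ".jpg!dta" into ".jpg"; it would also cut ".jpeg" down to ".jpe", so a longer allow-list of extensions may be preferable in practice.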

Multithreaded crawl:

# -*- coding:utf-8 -*-
# author:zxy
# Date:2018-10-21

import os
import re
import threading
from queue import Queue, Empty
from urllib import request

import requests
from lxml import etree

# Download directory used in the original post; change it to suit your machine.
SAVE_DIR = 'D:\\我的圖片\\emoticon'


class Producer(threading.Thread):
    """Takes list-page URLs from page_queue and puts (img_url, filename) pairs on img_queue."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Cookie': '__cfduid=ddb28ef1934faef742f7fb8911d7b33bd1540080067; UM_distinctid=16693ece9945b2-0b031da4b19f32-333b5602-1fa400-16693ece9958e4;'
                  ' _ga=GA1.2.1950184368.1540080070; _gid=GA1.2.1249143498.1540080070; _gat=1'
    }

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between checking empty() and calling get().
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_page(url)

    def parse_page(self, url):
        response = requests.get(url, headers=self.headers)
        text = response.text
        html = etree.HTML(text)
        imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
        for img in imgs:
            img_url = img.get('data-original')
            alt = img.get('alt')
            if not img_url or not alt:
                continue  # skip images that lack a source URL or a description
            alt = re.sub(r'[\?？\.、!！，,\*]', '', alt)
            # Keep only the first four characters of the extension (e.g. ".jpg").
            suffix = os.path.splitext(img_url)[1][0:4]
            filename = alt + suffix
            self.img_queue.put((img_url, filename))


class Consumer(threading.Thread):
    """Takes (img_url, filename) pairs from img_queue and downloads each image."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # An empty img_queue does not mean the producers are finished,
                # so wait for new work (the 10-second timeout is an arbitrary choice).
                img_url, filename = self.img_queue.get(timeout=10)  # tuple unpacking
            except Empty:
                if self.page_queue.empty():
                    break  # no pages left and nothing arrived in time: assume we are done
                continue
            request.urlretrieve(img_url, os.path.join(SAVE_DIR, filename))
            print(filename + " download complete")


def main():
    os.makedirs(SAVE_DIR, exist_ok=True)  # urlretrieve fails if the directory is missing
    page_queue = Queue(100)
    img_queue = Queue(1000)
    for x in range(1, 101):
        url = "http://www.doutula.com/photo/list/?page=%d" % x
        page_queue.put(url)

    for x in range(5):
        t = Producer(page_queue, img_queue)
        t.start()

    for x in range(5):
        t = Consumer(page_queue, img_queue)
        t.start()


if __name__ == '__main__':
    main()
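
Shutting the worker threads down is the delicate part of a producer/consumer crawler, because an empty queue does not by itself prove that the work is finished. One common pattern is to push a sentinel value onto the queue once per worker. The sketch below shows the idea without any network access: the names SENTINEL, producer, consumer and run_pipeline are hypothetical, and the inner loop that emits three fake items per page merely stands in for the real page parsing.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker meaning "no more work"


def producer(page_queue, img_queue):
    """Turn each page into image tasks until a sentinel arrives."""
    while True:
        page = page_queue.get()
        if page is SENTINEL:
            break
        # Stand-in for parsing a page: pretend every page yields three images.
        for i in range(3):
            img_queue.put((page, i))


def consumer(img_queue, results, lock):
    """Record each image task until a sentinel arrives."""
    while True:
        item = img_queue.get()
        if item is SENTINEL:
            break
        with lock:  # guard the shared results list
            results.append(item)


def run_pipeline(pages, n_producers=2, n_consumers=2):
    page_queue, img_queue = Queue(), Queue()
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, img_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(img_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for page in pages:
        page_queue.put(page)
    # One sentinel per producer, then wait until every page has been processed.
    for _ in producers:
        page_queue.put(SENTINEL)
    for t in producers:
        t.join()
    # Only now is it safe to tell the consumers that nothing more will arrive.
    for _ in consumers:
        img_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return results


print(len(run_pipeline(range(1, 6))))
```

Because the consumers receive their sentinels only after every producer has been joined, no consumer can exit while image tasks are still being generated, and no thread is left blocked on get() at the end.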

The crawl results are shown below:

[Image: crawl results]
