Learning Multithreaded Web Crawling in Python

阿新 • Published: 2018-12-10

This code covers the following topics:

Encapsulating a program in a class
Fetching and parsing web pages
The retry decorator and multithreading
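
To make the retry idea concrete before reading the crawler, here is a minimal sketch of what a retry decorator does, written with the standard library only. The name `simple_retry` and its parameters are hypothetical and exist only for illustration; this is not the implementation of the `retry` package imported below.

```python
import functools
import time


def simple_retry(tries=3, delay=0.0, exceptions=(Exception,)):
    """Hypothetical stand-in for a retry decorator: call the function up to `tries` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise          # out of attempts: re-raise the last error
                    time.sleep(delay)  # wait before the next attempt
        return wrapper
    return decorator


calls = []


@simple_retry(tries=3)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


result = flaky()
```

A network request that fails intermittently is the typical use: the caller only sees an exception if every attempt fails.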

Anime code

# -*- coding: utf-8 -*-
import requests
import demjson
from retry import retry
from config import *


class DongMan:
    def __init__(self):
        """
        Read the search keywords and the number of result pages to download.
        """

        self.dongman_list = input('input content you want to scrape, use , to separate  >')
        self.dongman_list = self.dongman_list.split(',')
        self.page_list = input('input pages:>')

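    # The first positional argument of retry() is the exception type, so pass tries by keyword
    @retry(tries=3)
    def index(self):
        """
        Fetch each result page for every keyword and pass its image entries to details().
        """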

        for j in self.dongman_list:
            self.j = j
            for i in range(1, int(self.page_list) + 1):
                url = ('http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%s&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word=%s&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&pn=%s&rn=30&gsm=10e&1539825490240='
                       % (str(j), str(j), str(30 * i)))
                headers = {
                    'Referer': 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1539824775271_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&hs=2&word=%E6%89%8B',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
                }
                response = requests.get(url=url, headers=headers)
                # Round-trip through latin-1 as in the original; ASCII fields such as URLs are unaffected
                img_res = response.text.encode('utf-8').decode('latin-1')
                img_res = demjson.decode(img_res)    # parse the JSON response
                all_img = img_res['data']   # list of image entries
                self.details(all_img)

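    # The first positional argument of retry() is the exception type, so pass tries by keyword
    @retry(tries=3)
    def details(self, all_img):
        """
        Download every image in the list.
        :param all_img: list of image entries from the search response
        :return: None
        """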
        for each_url in all_img:
            each_img_url = each_url.get('thumbURL')
            print(each_img_url)
            if each_img_url:
                headers = {
                    'Referer': 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1539824775271_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&hs=2&word=%E6%89%8B',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
                }
                try:
                    response = requests.get(url=each_img_url, headers=headers, timeout=10)
                except requests.RequestException as e:
                    print(e)
                    continue   # skip this image; there is no response to save
                img_res = response.content   # obtain the image bytes
                img_save(img_res)
                # insertDB((None, each_img_url, md5(each_img_url.encode('utf-8')).hexdigest(), self.j), 'image_test', 'img_source')


if __name__ == '__main__':
    dong_man = DongMan()
    # index() runs the whole crawl synchronously and returns None,
    # so its return value is not something a thread pool can schedule
    dong_man.index()
    print('FINISH!!!')
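
The script above downloads images one after another inside `details()`. As a rough sketch of how the per-image work could be spread over several threads, the example below uses the standard-library `concurrent.futures` module rather than the third-party `threadpool` package from this post. The names `fetch_all` and `fake_fetch` are hypothetical; `fake_fetch` stands in for a real `requests.get(url).content` call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, max_workers=5):
    """Run fetch_one(url) for each url on a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url):
    # Stand-in for requests.get(url).content so the sketch needs no network access
    return ('bytes of ' + url).encode('utf-8')


results = fetch_all(
    ['http://example.com/a.jpg', 'http://example.com/b.jpg'],
    fake_fetch,
)
```

In the crawler, the list of `thumbURL` values collected by `index()` would play the role of `urls`, and the function passed as `fetch_one` would download and save a single image.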

config code

# -*- coding: utf-8 -*-
import sys
import random
import threadpool
import pytesseract
from PIL import Image
from io import BytesIO


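def use_threadpool(url_down, url_all):
    """
    Run url_down(url) for every url in url_all on a pool of 5 threads.
    :param url_down: function that downloads a single url
    :param url_all: list of urls to download
    :return: None
    """
    task_pool = threadpool.ThreadPool(5)  # use 5 threads
    # One work request per url: with 3000 urls, requests_pool holds 3000 work requests
    requests_pool = threadpool.makeRequests(url_down, url_all)
    print(len(url_all))  # e.g. 3000
    # Put every work request into the pool. The 5 threads take tasks from it; when a thread
    # finishes one task it takes the next, until the pool is empty.
    for req in requests_pool:
        task_pool.putRequest(req)
    task_pool.wait()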


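def get_captch(captcha_content):
    """
    Recognize the text of a captcha image with OCR.
    :param captcha_content: captcha image as bytes
    :return: recognized text, stripped and upper-cased
    """
    image = Image.open(BytesIO(captcha_content))
    # convert to grayscale
    imgry = image.convert('L')
    # threshold table: pixel values below 140 map to 0, the rest to 1
    table = [0 if i < 140 else 1 for i in range(256)]
    # binarize so the characters stand out
    out = imgry.point(table, '1')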
    # out.show()
    captcha = pytesseract.image_to_string(out)
    captcha = captcha.strip()
    captcha = captcha.upper()
    return captcha


def img_save(img_content):
    r"""
    Save an image to ./image/ (the directory must already exist).
    :param img_content: image bytes, e.g. \xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x0...
    """
    # Load the bytes into an in-memory buffer and let Image.open parse them
    image = Image.open(BytesIO(img_content))

    # len() gives the payload size in bytes; sys.getsizeof() would measure the Python object instead
    if len(img_content) >= 11:
        with open('./image/image%s.jpg' % random.random(), 'wb') as f:
            f.write(img_content)
    else:
        print('this pic less than 11 bytes')
        image.show()
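
The comments in `use_threadpool` describe a fixed number of threads taking tasks from a shared pool until none are left. The mechanism can be sketched with the standard-library `queue` and `threading` modules. This is an illustration of the idea under that assumption, not how the `threadpool` package is implemented, and `run_pool` is a hypothetical helper name.

```python
import queue
import threading


def run_pool(worker_func, items, num_threads=5):
    """Hypothetical helper: num_threads workers pull items from a shared queue until it is empty."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = tasks.get_nowait()   # take the next task, if any
            except queue.Empty:
                return                      # no tasks left: this worker is done
            value = worker_func(item)
            with lock:                      # protect the shared results list
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # plays the role of task_pool.wait()
    return results


squares = run_pool(lambda n: n * n, range(10))
```

Results arrive in completion order, not input order, so the caller should sort them or carry an index with each task if order matters.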
