# Requests、Bs4、多執行緒爬取全站圖片
#!/usr/bin/env python
# coding=utf-8
# author:Charles
# datetime:2021/03/23/0004 11:26
# software: meizitu
import requests, os, shutil
from bs4 import BeautifulSoup
from multiprocessing import Pool
# 封裝get方法
# Wrapped HTTP GET helper.
def geta(url, params=None, header=None, timeout=30):
    """Fetch *url* and report the outcome as a dict.

    Returns {'success': False} on any failure, or
    {'success': True, 'content': <raw response body bytes>} on success.

    params  -- optional query parameters applied to the session.
    header  -- optional HTTP headers applied to the session.
    timeout -- seconds before the request is abandoned (new parameter,
               defaulted so existing callers are unaffected; previously
               a stalled server could hang the worker forever).
    """
    session = requests.session()
    ret = {'success': False}
    try:
        if params:
            session.params = params
        if header:
            session.headers = header
        # Timeout prevents an unresponsive host from blocking a pool worker.
        msg = session.get(url, timeout=timeout)
        # Explicit form of the original truthiness test: requests Response
        # objects are truthy exactly when status_code < 400.
        if msg.status_code < 400:
            ret['success'] = True
            ret['content'] = msg.content
    except Exception as e:
        # BUG FIX: `except Exception, e` and `e.message` are Python-2-only
        # and deprecated even there; this form works on Python 2 and 3.
        print(e)
    finally:
        session.close()
    return ret
# 主頁面
# Listing-page crawler: walks the category index pages and fans out
# one pool task per gallery found.
def meizitu(kind, page):
    """Crawl *page* listing screens of category *kind*.

    kind -- URL path fragment selecting the category (see `category` map).
    page -- number of listing screens to walk (string or int).
    Returns False if a listing page cannot be fetched, otherwise None.
    Side effect: dispatches detail() jobs to a process pool.
    """
    # 10 worker processes download galleries concurrently.
    pool = Pool(10)
    try:
        for p in xrange(1, int(page) + 1):
            pg = '/page/%s/' % p
            url = 'mzitu%s%s' % (kind, pg)
            header = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
            }
            ret = geta(url=url, header=header)
            if not ret['success']:
                # BUG FIX: the original returned here without closing the
                # pool, leaking worker processes; finally below handles it.
                return False
            soup = BeautifulSoup(ret['content'], 'lxml')
            # <ul id="pins"> holds the gallery grid; find_all never yields
            # None, the guard is kept only for parity with the original.
            for pins in soup.find_all('ul', {"id": "pins"}):
                if pins is None:
                    continue
                spans = BeautifulSoup(str(pins), 'lxml').find_all('span')
                for g in BeautifulSoup(str(spans), 'lxml').find_all('a'):
                    href = g['href']                          # gallery link
                    title = g.text.decode('unicode_escape')   # gallery title
                    # Asynchronous (non-blocking) dispatch; the synchronous
                    # alternative would be: detail(href, title)
                    pool.apply_async(detail, args=(href, title))
            print('*********************啦啦啦,已爬取%s螢幕啦*********************' % p)
        print('需要爬取的全站圖片寫入完成!!')
    finally:
        # Shut the pool down and wait for the workers so the main process
        # does not exit before they finish.
        pool.close()
        pool.join()
# 詳細頁面
# Gallery downloader: fetches every image page of one gallery.
def detail(url, titles):
    """Download all images of the gallery at *url* into D:/meizitu/<title>.

    url    -- gallery base URL; image pages are url/1 .. url/<max_page>.
    titles -- raw gallery title; characters illegal in Windows folder
              names are stripped before use.
    Returns False when the page count or a listing page cannot be fetched.
    """
    pages = max_page(url)
    if pages is False:
        # BUG FIX: max_page returns False on failure and int(False) == 0
        # used to make the loop silently empty; fail explicitly instead.
        return False
    num = int(pages)
    # Strip characters Windows does not allow in directory names.
    title = titles.strip().replace('?', '').replace(':', '').replace(',', '').replace('@', '')
    path = 'D:/meizitu/'
    print(u'檔案存放地址: ' + path + title)
    if os.path.exists(path + title):
        raw_input('資料夾已經存在,按任意鍵刪除此資料夾!!!')
        shutil.rmtree(path + title)
        raw_input('資料夾已經刪除,按任意鍵執行爬取!!!')
    os.makedirs(path + title)
    os.chdir(path + title)
    # BUG FIX: xrange(1, num) skipped the final page; pages run
    # 1..num inclusive, so the upper bound must be num + 1.
    for i in xrange(1, num + 1):
        urls = url + '/' + str(i)
        ret = geta(url=urls)
        if not ret['success']:
            return False
        soup = BeautifulSoup(ret['content'], 'lxml')
        main_image = soup.find('div', {'class': 'main-image'})
        img = BeautifulSoup(str(main_image), 'lxml').find('img')
        detail_href = img['src']  # direct image URL
        header = {
            # Referer presumably required by the image host to serve the
            # file — TODO confirm against the site.
            'Referer': 'mzitu/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        }
        ret = geta(url=detail_href, header=header)
        if ret['success']:
            # url[21:] slices the gallery id off a fixed-length URL prefix.
            with open('%s-%s.jpg' % (url[21:], i), 'wb') as f:
                f.write(ret['content'])
            print('已爬取完成編號:%s----第%s張' % (url[21:], i))
    print('編號為:%s===================>已經爬取完成!!!' % url[21:])
# 詳細頁面最大張數
# Maximum number of image pages in one gallery.
def max_page(url):
    """Return the gallery's highest page number (as scraped text).

    The pagination bar's <span> texts end with the last page number in
    second-to-last position. Returns False when the page cannot be
    fetched or the markup lacks enough spans (previously an IndexError).
    """
    ret = geta(url=url)
    if not ret['success']:
        return False
    soup = BeautifulSoup(ret['content'], 'lxml')
    nav = soup.find('div', {'class': 'pagenavi'})
    spans = BeautifulSoup(str(nav), 'lxml').find_all('span')
    # Renamed from `list`, which shadowed the builtin.
    pages = [s.text for s in spans]
    if len(pages) < 2:
        # ROBUSTNESS: unexpected markup used to raise IndexError here.
        return False
    return pages[-2]
if __name__ == '__main__':
    # Report the host platform (the download path D:/meizitu/ is
    # Windows-style, so this is a hint to the user).
    if os.name == 'nt':
        print(u'你正在使用win平臺')
    else:
        print(u'你正在使用linux平臺')
    # Menu key -> category URL path fragment on the site.
    category = {'1': '', '2': '/xinggan/', '3': '/japan/', '4': '/taiwan/', '5': '/mm/'}
    # BUG FIX: the prompt literal contained a raw line break (a syntax
    # error); the trailing newline is now properly escaped.
    num = raw_input('請選擇您要爬取的妹子圖種類: 1.Index 2.Sex 3.Japan 4.TaiWan 5.Pure\n')
    if num in ('1', '2', '3', '4', '5'):
        page = raw_input('輸入爬取幾螢幕:')
        if page.isdigit():
            meizitu(category[num], page)
        else:
            raw_input('輸入錯誤!按任意鍵退出!!!')
    else:
        raw_input('輸入錯誤!按任意鍵退出!!!')