Scraping 9,000 Meme Images with Python in 3 Minutes

First, a look at what I scraped:


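Steps demonstrated in this video:

  1. Use requests to fetch 200 list pages
  2. Use BeautifulSoup to parse out each image's title and URL
  3. Download the images to a local directory

For detailed usage of these two libraries, see my other video courses.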

import requests
from bs4 import BeautifulSoup
import re

1. Download the HTML of all 200 pages

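def download_all_htmls():
    """
    Download the HTML of every list page for later parsing
    """
    htmls = []
    for idx in range(200):
        # Pages are numbered from 1, so shift the zero-based index
        url = f"https://fabiaoqing.com/biaoqing/lists/page/{idx+1}.html"
        print("craw html:", url)
        r = requests.get(url)
        if r.status_code != 200:
            # Stop at the first failed page and report which one it was
            raise Exception(f"unexpected status {r.status_code} for {url}")
        htmls.append(r.text)
    print("success")
    return htmls
# Run the crawl
htmls = download_all_htmls()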
craw html: https://fabiaoqing.com/biaoqing/lists/page/1.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/2.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/3.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/4.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/188.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/189.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/190.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/191.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/192.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/193.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/194.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/195.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/196.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/197.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/198.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/199.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/200.html
success
htmls[0][:1000]
'<html>\n\n<head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n    <title>熱門表情_發表情,表情包大全fabiaoqing.com</title>\n    <meta name="Keywords" content="熱門表情,表情包,聊天表情,微信表情包,QQ表情包,發表情,表情包大全,表情包下載,表情下載,表情包大戰,貼吧表情包,表情包集中營,鬥圖">\n    <meta name="Description" content="全網熱門表情。發表情,最大最全的表情包網站,分享最新最熱的表情包、聊天表情、微信表情包、QQ表情包、金館長表情包、蘑菇頭表情包等各類表情。">\n    <meta name="referrer" content="no-referrer" />\n    <link rel="stylesheet" type="text/css" href="//lib.baomitu.com/semantic-ui/2.2.2/semantic.min.css" />\n    <link rel="stylesheet" type="text/css" href="/Public/css/fbq.css?v=2018" />\n    <script data-ad-client="ca-pub-5486123269162001" async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n    <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.'
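
The download loop above raises on the first non-200 response, which throws away every page fetched so far. One possible way to soften that is a small retry wrapper. The sketch below is only an illustration: `fetch_with_retry` and its `fetch`, `retries`, and `delay` parameters are hypothetical names, not part of the original code or of requests.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns a response with status_code 200.

    `fetch` is any callable that returns an object with a `status_code`
    attribute (hypothetical helper, not part of requests itself).
    Returns the response, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            print(f"attempt {attempt}: status {response.status_code} for {url}")
        except Exception as exc:  # network errors, timeouts, and so on
            print(f"attempt {attempt}: {exc!r} for {url}")
        if attempt < retries:
            # Pause between attempts so the site is not hammered
            time.sleep(delay)
    return None
```

Inside `download_all_htmls`, this could be called as `fetch_with_retry(lambda u: requests.get(u, timeout=10), url)`, skipping the page when the result is `None` instead of aborting the whole crawl.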

2. Parse the HTML to get every image title and URL

def parse_single_html(html):
    """
    Parse a single HTML page and extract its image data
    @return list((img_title, img_url))
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Each image on the list page sits inside a div with class "tagbqppdiv"
    img_divs = soup.find_all("div", class_="tagbqppdiv")
    datas = []
    for img_div in img_divs:
        img_node = img_div.find("img")
        if not img_node:
            continue
        # The image URL is stored in the data-original attribute
        datas.append((img_node["title"], img_node["data-original"]))
    return datas
import pprint
pprint.pprint(parse_single_html(htmls[0])[:10])
[('阿彌陀佛,施主放下罵圖,立地成佛!',
  'http://ww2.sinaimg.cn/bmiddle/9150e4e5gy1g6qlfb10avj20d70f7gmf.jpg'),
 ('看見你就煩(草莓果醬ox白眼 GIF 動圖表情包)',
  'http://wx1.sinaimg.cn/bmiddle/006APoFYly1g68tiftpbmg30bh0bh4o5.gif'),
 ('我在哭', 'http://wx3.sinaimg.cn/bmiddle/006qir4ogy1g54eoes2q2j309q09cdgh.jpg'),
 ('我的人生只要這樣躺著混日子就很幸福了',
  'http://ww4.sinaimg.cn/bmiddle/9150e4e5gy1g6qm7x6fiuj20mw0mmt9y.jpg'),
 ('草莓果醬ox動圖表情包',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g64664qyc0g20bf0br4jn.gif'),
 ('噗呲 放屁(沙雕羊駝動圖表情包)',
  'http://wx1.sinaimg.cn/bmiddle/78b88159gy1g69cze2hkkg20bp0bpx0y.gif'),
 ('來群裡轉轉(熊貓頭旋轉 GIF 動圖)',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g68tzab8xng207608wwou.gif'),
 ('我愛你(草莓果醬oxGIF 動圖表情包)',
  'http://wx2.sinaimg.cn/bmiddle/006APoFYly1g68uwg8djlg30b60b6e57.gif'),
 ('鎖屏 帶薪拉屎',
  'http://wx3.sinaimg.cn/bmiddle/ceeb653ely1g654hwdsjkg20dc0avgm4.gif'),
 ('我要可愛死你(草莓果醬ox表情包)',
  'http://wx2.sinaimg.cn/bmiddle/bf976b12gy1g68hx2gtleg208c08bk8q.gif')]
# Parse every downloaded HTML page
all_imgs = []
for html in htmls:
    all_imgs.extend(parse_single_html(html))
all_imgs[:10]
[('阿彌陀佛,施主放下罵圖,立地成佛!',
  'http://ww2.sinaimg.cn/bmiddle/9150e4e5gy1g6qlfb10avj20d70f7gmf.jpg'),
 ('看見你就煩(草莓果醬ox白眼 GIF 動圖表情包)',
  'http://wx1.sinaimg.cn/bmiddle/006APoFYly1g68tiftpbmg30bh0bh4o5.gif'),
 ('我在哭', 'http://wx3.sinaimg.cn/bmiddle/006qir4ogy1g54eoes2q2j309q09cdgh.jpg'),
 ('我的人生只要這樣躺著混日子就很幸福了',
  'http://ww4.sinaimg.cn/bmiddle/9150e4e5gy1g6qm7x6fiuj20mw0mmt9y.jpg'),
 ('草莓果醬ox動圖表情包',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g64664qyc0g20bf0br4jn.gif'),
 ('噗呲 放屁(沙雕羊駝動圖表情包)',
  'http://wx1.sinaimg.cn/bmiddle/78b88159gy1g69cze2hkkg20bp0bpx0y.gif'),
 ('來群裡轉轉(熊貓頭旋轉 GIF 動圖)',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g68tzab8xng207608wwou.gif'),
 ('我愛你(草莓果醬oxGIF 動圖表情包)',
  'http://wx2.sinaimg.cn/bmiddle/006APoFYly1g68uwg8djlg30b60b6e57.gif'),
 ('鎖屏 帶薪拉屎',
  'http://wx3.sinaimg.cn/bmiddle/ceeb653ely1g654hwdsjkg20dc0avgm4.gif'),
 ('我要可愛死你(草莓果醬ox表情包)',
  'http://wx2.sinaimg.cn/bmiddle/bf976b12gy1g68hx2gtleg208c08bk8q.gif')]
len(all_imgs)
8999
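
`parse_single_html` indexes `img_node["title"]` and `img_node["data-original"]` directly, so an `<img>` tag missing either attribute would raise a `KeyError` and stop the whole parse. A more forgiving variant might look like the sketch below. It assumes that skipping incomplete entries is acceptable; `parse_single_html_safe` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_single_html_safe(html):
    """Like parse_single_html, but skips images with missing attributes."""
    soup = BeautifulSoup(html, "html.parser")
    datas = []
    for img_div in soup.find_all("div", class_="tagbqppdiv"):
        img_node = img_div.find("img")
        if img_node is None:
            continue
        # Tag.get returns None instead of raising when an attribute is absent
        title = img_node.get("title")
        img_url = img_node.get("data-original")
        if not title or not img_url:
            continue
        datas.append((title, img_url))
    return datas


# Hypothetical sample markup: one complete entry and two incomplete ones
sample = """
<div class="tagbqppdiv"><img title="a" data-original="http://example.com/a.jpg"></div>
<div class="tagbqppdiv"><img title="no url"></div>
<div class="tagbqppdiv"><span>no image</span></div>
"""
print(parse_single_html_safe(sample))
```

Only the first sample entry survives, because the other two lack either the image tag or the `data-original` attribute.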

3. Download the images to a local directory

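for idx, (title, img_url) in enumerate(all_imgs):
    # Strip punctuation: keep only Chinese characters, ASCII letters, and digits
    reg = "[^0-9A-Za-z\u4e00-\u9fa5]"
    title = re.sub(reg, '', title)

    # Some titles are very long, so truncate them to 10 characters
    if len(title) > 10:
        title = title[:10]

    # Take the extension (jpg or gif) from the last three characters of the URL
    post_fix = img_url[-3:]
    # Note: the ./output directory must already exist
    filename = f"./output/{title}.{post_fix}"

    print(idx, filename)
    img_data = requests.get(img_url)
    with open(filename, "wb") as f:
        f.write(img_data.content)

print("success")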
0 ./output/阿彌陀佛施主放下罵圖.jpg
1 ./output/看見你就煩草莓果醬o.gif
2 ./output/我在哭.jpg
3 ./output/我的人生只要這樣躺著.jpg
4 ./output/草莓果醬ox動圖表情.gif
5 ./output/噗呲放屁沙雕羊駝動圖.gif
6 ./output/來群裡轉轉熊貓頭旋轉.gif
7 ./output/我愛你草莓果醬oxG.gif
8 ./output/鎖屏帶薪拉屎.gif
9 ./output/我要可愛死你草莓果醬.gif
10 ./output/我尼瑪傻了都.jpg
11 ./output/你今天表現蠻好10分.gif
12 ./output/真煩人得找個理由做她.gif
13 ./output/哇哦草莓果醬ox表情.jpg
14 ./output/哥哥又說笑了喬碧蘿表.gif
15 ./output/鎖屏帶薪拉屎.gif
16 ./output/我簡直難上加難麻將表.jpg
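
In the output above, items 8 and 15 resolve to the same path (`./output/鎖屏帶薪拉屎.gif`), so the later download overwrites the earlier file. One way to avoid such collisions is to put the index into the filename. The sketch below also reads the extension from the URL path instead of its last three characters. `build_filename`, its parameters, and the `.jpg` fallback are assumptions made for illustration, not part of the original script.

```python
import os
import re
from urllib.parse import urlparse


def build_filename(idx, title, img_url, out_dir="./output", max_len=10):
    """Build a collision-free output path for one image (hypothetical helper)."""
    # Same cleanup as the loop above: keep Chinese characters, letters, digits
    clean = re.sub(r"[^0-9A-Za-z\u4e00-\u9fa5]", "", title)[:max_len]
    # Read the extension from the URL path, ignoring any query string;
    # fall back to .jpg (an assumption) when the path has no extension
    ext = os.path.splitext(urlparse(img_url).path)[1].lower() or ".jpg"
    # The zero-padded index keeps names unique even when titles repeat
    return f"{out_dir}/{idx:05d}_{clean}{ext}"


url = "http://wx3.sinaimg.cn/bmiddle/ceeb653ely1g654hwdsjkg20dc0avgm4.gif"
print(build_filename(8, "鎖屏 帶薪拉屎", url))
print(build_filename(15, "鎖屏 帶薪拉屎", url))
```

Calling `os.makedirs("./output", exist_ok=True)` once before the download loop would also remove the need to create the directory by hand.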