
Scraping 50,000 Gorgeous Wallpapers (Love It, Love It)

  Recently I was handed a task: scrape every wallpaper on sale in a certain app store. It stunned me at first, because my superior had not sorted out the risk-control side. What if the vendor came after us for scraping resources that are on sale? What if we caused trouble on their servers? When I asked, the answers were evasive. Well, we all know how that goes. So why did I take the job anyway? Because after flipping through a few pages I found piles of pretty-girl wallpapers, the drop-dead gorgeous wavy-hair kind, you know the type. Truly wallpaper-grade beauties.

  If you want the 50,000 wallpapers, contact me directly~

  Resources aside, here is the actual implementation. No Scrapy this time; everything is done with requests~

1. Login

  These resources are only available after logging in, so login is the first hurdle.

  At first I had no idea the target ran into the tens of thousands. I was told to "automate the operation", which I took to mean UI automation, so I used a selenium-based library to simulate the clicks.

import time
from common.utils import GetDriver
from page import SumsungPage
from selenium.common.exceptions import NoSuchElementException

page: SumsungPage


def verify(func):
    def wrapper(*args):
        func(*args)
        try:
            if page.log_out.is_displayed():
                return True
        except NoSuchElementException:
            return False
    return wrapper


class SumsungLogin:
    """
    ** app store login: try cookie login first; if cookie login fails, fall back to the account login.
    """

    def __init__(self):
        global page
        page = SumsungPage(GetDriver().driver())

    @verify
    def __account_login__(self):
        """Log in with the account and password."""
        page.get(page.url)
        page.entry.click()
        page.phone.send_keys(page.u)
        page.password.send_keys(page.p)
        page.login.click()
        page.not_now.click()
        # refresh the saved cookies
        time.sleep(2)
        with open("./cookie.json", 'w+') as f:
            f.write(str(page.get_cookies()))

    @verify
    def __cookie_login__(self):
        """Log in with the saved cookies."""
        page.get(page.url)
        with open("./cookie.json", 'r+') as f:
            cookie = eval(f.read())
        page.add_cookies(cookie)
        time.sleep(2)
        page.driver.refresh()

    def trigger(self):
        flag = self.__cookie_login__()
        flag = flag if flag else self.__account_login__()
        print('Login status:', flag)
        return page if flag else "login failed"

  When I realized I had been told the wrong thing, I was speechless at first; after all, they are not technical people. With simulated clicks, even at one second per image, 50,000 images would take 50,000 seconds. I would be scraping until the end of time!

  So I dropped that approach without hesitation: this has to be done through the API.

  There are two ways to go through the API: use requests.session, or carry the cookie directly. Since I was given a dedicated account, I simply copied the cookie over from the browser~

headers = {
    'Content-Type': "application/x-www-form-urlencoded",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    'Referer': "https://seller.samsungapps.com/content/common/summaryContentList.as",
    'Accept-Language': "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7",
    'Cookie': "SCOUTER=x7o8fo083la90t; _hjid=cd3eb857-86e9-40e5-bba4-22b101dd0552; api_auth_sub=N; sellerLocale=zh_CN; _hjTLDTest=1; SLRJSESSIONID=SLm48RCYo2WqmgbfoOt5aBjlN1zbEFw10hrQrS2WFkkXrrDaJ1E5mCL2uKcxL99E.YXBwc19kb21haW4vc2VsbGVyMTE=; iPlanetDirectoryPro=NTbL8ooVcqrDnHlwgDxcABJZuzyR4xsNS9RrqQranfGKUkPTiZR/f8kop4j6ELBPc+TaRMW88Y3hUrTscvFk5t7je/dtPLfsDoVvZZn6oau1CY4NKsdQlNlSzkj3hdp1Lj7hRz2a3Q6BENPmpVewwbT3nqyI0Pb4/ZEyHVoXTaAGDzi7NzXwM54qizpIDXc8hkXAaXhtLcvra+DyVc72Undh9U31LyYlXO53LavOIFOYgPRvA9O5b4ed6xUxxa2etz9lpMwzlIayMvrOg8hdwDE5evVwpayZjXUv1cW4lMCbJhHaVhEucux23kcOuGbBgjraYYteAQ1ndWEvm0ipsg==; gss_auth_token=V/RK27AqVE6LV2BHAoKit6cWK+EWYKc8ERpYh2Wm7ESfGLw+oU/fgZmO2w1Oo1B+A/eYhyJu/XI9EagBTF+7Jg/5vRxPyCJW9hxMvNl4dhIRrWqNCSt6IrRQb0ZWPau8xMero/9FKGi7M3lrkuOMo8aUbTnl127zw1kj0Mgbfxw=; api_auth_token=Scri8yrXRz9kYmjyPxxyDOT8lKbIfKt2TQcwPwfO14U16J0AU1vAUMsqF/LtCfg7CtYw4hQIuO6SK1mCOJKPagKvmCyt7EjBEr1TTfIhjJPWbJkJ6pey9+Sj0CWEm/EuD/rjwUPc5THhiM7vpOkpWSqJLQYDmZ6AEy5cdtsOFIZSoZmbmZnMT8CIyig0GdE8FOPNJHpXrHxTgGSRQtsgyqYiXlFGKzfmjueeMC+x02ubCLNFdhReKZ2sfBzvFc8gxPyathHVXSYN+WCW/CTheiFwTP7D8N93agwEopZeLP79UvgwOyZi1VNwNl5GGe/le8qdiv10IhWjHMbNkt4lcA==; api_server_url=cn-auth2.samsungosp.com.cn; auth_server_url=cn-auth2.samsungosp.com.cn; _hjIncludedInPageviewSample=1; _hjAbsoluteSessionInProgress=1",
}

  The cookie goes into the request headers, along with the other basic header fields. That is a good habit to keep when writing crawlers.
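
  For reference, the requests.Session route would look roughly like the sketch below. It is only a sketch under assumptions: the login endpoint and the form field names ("user", "pass") are hypothetical, since the real login form is not shown here. The point is that the session object keeps whatever cookies the server returns and reuses them on later requests.

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

# Hypothetical login endpoint and form field names -- replace with the real ones.
login_url = "https://seller.example.com/login"
session.post(login_url, data={"user": "my_account", "pass": "my_password"})

# Cookies from the login response are stored on the session, so this request
# is sent as the logged-in user without copying any cookie by hand.
resp = session.get("https://seller.example.com/content/list")
print(resp.status_code)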

2. Feature Extraction

  What we need to scrape are the image IDs and the image URLs; the images themselves are then downloaded from those URLs, so there are two scraping passes in total. Open the Chrome DevTools, locate an image on the page, and look for patterns.

  It turns out the IDs all sit inside <td> tags, every ID is 12 characters long, and every ID I could see starts with "00", so the information I need is inside the <td> tag. The URLs all start with the same prefix, and the value I need is in the src attribute of the <img> tag.

  With those characteristics pinned down, the XPath expressions come out as:

img_id = s.xpath("//td[string-length(text())=12 and starts-with(text(), '00')]/text()")
img_url = s.xpath("//td/img[contains(@src,'https://img.**apps.com/content')]/@src")
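
  These expressions can be sanity-checked against a hand-written fragment before touching the real site. The markup below is made up to mimic the structure described above, not copied from the actual page:

from lxml import etree

# made-up markup that mimics the <td> id / <img src> structure described above
html = """
<table>
  <tr>
    <td>000123456789</td>
    <td><img src="https://img.samsungapps.com/content/fake/preview.jpg"></td>
  </tr>
</table>
"""
s = etree.HTML(html)
print(s.xpath("//td[string-length(text())=12 and starts-with(text(), '00')]/text()"))
# ['000123456789']
print(s.xpath("//td/img[contains(@src,'https://img.samsungapps.com/content')]/@src"))
# ['https://img.samsungapps.com/content/fake/preview.jpg']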

  The pagination parameter is not in the address bar; I noticed it lives in the query string of the request, so I made the payload dynamic.

payload = "statusTab=all&pageNo=" + str(page_number) + "&isOpenTheme=true&isSticker=false&hidContentType=all&serviceStatus=sale&serviceStatusSub=forSale&contentType=all&ctntId=&contentName="

3. Scraping the IDs and URLs

  With the ID and URL characteristics plus the pagination trick, collecting all the IDs and URLs is straightforward.

  Scraping one page at a time is too slow, so let's go multi-threaded~

import threading
from data import url, payload_pre, payload_bac, headers
import time
from lxml import etree
import requests

thread_max = threading.BoundedSemaphore(10)  # allow at most 10 requests in flight at once


def send_req(page):
    with thread_max:
        page = str(page)
        # splice the page number into the POST payload
        payload = payload_pre + page + payload_bac
        response = requests.request("POST", url, data=payload, headers=headers)
        s = etree.HTML(response.text)
        # 12-character IDs starting with "00", and image URLs from the img src attribute
        img_id = s.xpath("//td[string-length(text())=12 and starts-with(text(), '00')]/text()")
        img_url = s.xpath("//td/img[contains(@src,'https://img.samsungapps.com/content')]/@src")
        a = len(img_id)
        b = len(img_url)
        s_ = page + " " + str(a) + " " + str(b)

        # append "id url" pairs to the intermediate file
        with open("1.txt", "a") as f:
            for c, d in zip(img_id, img_url):
                f.write(c + " " + d + "\n")
        # a page is only "ok" when IDs were found and their count matches the URL count
        if a and a == b:
            print("ok " + s_)
        else:
            print("not ok " + s_ + 60 * "!")


def start_work(s, e):
    thread_list = []
    for i in range(s, e):
        thread = threading.Thread(target=send_req, args=[i])
        thread.start()
        thread_list.append(thread)
    for thread in thread_list:
        thread.join()


if __name__ == '__main__':
    start, end = 1, 1001  # scrape pages 1 through 1000
    t1 = time.time()
    start_work(start, end)
    print("[INFO]: using time %f seconds" % (time.time() - t1))

4. Downloading by URL

  If everything above made sense, this part follows naturally~

  Read the URL file line by line; each line is like a torrent seed (veterans will know), then batch-download~

import os
import threading
import urllib.request as ur

thread_max = threading.BoundedSemaphore(10)  # allow at most 10 concurrent downloads


def get_inf():
    """Read the 'id url' file produced by the previous step."""
    ids = []
    urls = []
    with open("img_1.txt", "r") as f:
        while True:
            con = f.readline()
            if con:
                ids.append(con[:12])     # the first 12 characters are the image id
                urls.append(con[13:-1])  # the rest of the line (minus '\n') is the url
            else:
                break
    print(len(ids), len(urls))
    return ids, urls


def down_pic(id_, url_):
    with thread_max:
        try:
            # save each image as <id>.jpg under ./img_1/
            ur.urlretrieve(url_, "./img_1/" + id_ + ".jpg")
        except Exception as e:
            print(e)
            print(id_, url_)


def start_work(id_, url_):
    thread_list = []
    for i, j in zip(id_, url_):
        thread = threading.Thread(target=down_pic, args=[i, j])
        thread.start()
        thread_list.append(thread)
    for thread in thread_list:
        thread.join()


if __name__ == '__main__':
    os.makedirs("./img_1", exist_ok=True)  # make sure the output directory exists
    i_, u_ = get_inf()
    start_work(i_, u_)

5. Notes

  At first I did not throttle the request rate properly, and the site showed this warning:

Even with the rate under control, once I crossed their threshold it showed this:

Cap the number of threads with threading.BoundedSemaphore(), coordinate with the relevant team to handle risk control when things get dicey, and scrape responsibly~
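
If ten threads still trip the server's frequency threshold, an extra brake is to lower the semaphore and sleep a little before each request. This is a sketch that is not part of the original script, and the 1-3 second range is an arbitrary example value:

import random
import threading
import time

thread_max = threading.BoundedSemaphore(5)  # fewer concurrent workers than before


def polite_send_req(page):
    with thread_max:
        # pause 1-3 seconds so bursts of requests stay under the server's threshold
        time.sleep(random.uniform(1, 3))
        # ... then send the request exactly as send_req() does above ...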