
An Introduction to Crawling the Jinri Toutiao (今日頭條) App

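I have been busy with work lately, mostly crawling news content.

The targets are mainly large sites: Jinri Toutiao, ifeng (鳳凰), NetEase (網易), and Tencent (騰訊). Here is a summary of the approach: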

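1. You must be comfortable configuring a packet-capture tool for the phone; only then can you reliably capture the app's interfaces.

2. Look for patterns in the captured interfaces.

3. Be clear about which content you need.

4. Write the crawler.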

From the captured traffic I found the interface that returns all categories:

classify_url = 'https://is.snssdk.com/article/category/get_subscribed/v4/?iid=45032656046&device_id=43306941482&ac=wifi&channel=update&aid=13&app_name=news_article&version_code=693&version_name=6.9.3&device_platform=android&ab_version=425531%2C511489%2C512527%2C421244%2C486953%2C494121%2C513028%2C519225%2C239095%2C500091%2C467914%2C170988%2C493249%2C398175%2C519895%2C442127%2C374116%2C437000%2C478532%2C517767%2C489317%2C501961%2C519804%2C276206%2C519509%2C459645%2C500387%2C416055%2C510641%2C392461%2C470730%2C495896%2C378451%2C471406%2C510754%2C519795%2C516760%2C509305%2C512393%2C512914%2C468954%2C271178%2C424178%2C326524%2C326532%2C496389%2C508197%2C345191%2C519949%2C516309%2C518639%2C515800%2C489801%2C510935%2C455646%2C424176%2C214069%2C497615%2C507003%2C482355%2C510710%2C519295%2C442255%2C519259%2C519017%2C520601%2C512958%2C489514%2C280447%2C520688%2C281294%2C513401%2C325616%2C515839%2C498551%2C520553%2C386888%2C520089%2C498375%2C516137%2C513578%2C467513%2C515673%2C513283%2C444465%2C304488%2C261581%2C403270%2C484178%2C457480%2C502680%2C512027%2C510536&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_group=94570%2C102754%2C181429&ab_feature=94570%2C102754&abflag=3&ssmix=a&device_type=NX563J&device_brand=nubia&language=zh&os_api=25&os_version=7.1.1&uuid=864460031530349&openudid=f1082e56b1908c9c&manifest_version_code=692&resolution=1080*1920&dpi=480&update_version_code=69305&_rticket=1538042842567&fp=GSTqFS4MLrx7FlPZc2U1Flx7P24M&tma_jssdk_version=1.3.0.1&pos=5r_-9Onkv6e_eBEKeScxeCUfv7G_8fLz-vTp6Pn4v6esrKuzr6WpqKSxv_H86fTp6Pn4v6eupLOlrqmtqqSxv_zw_O3e9Onkv6e_eBEKeScxeCUfv7G__PD87dHy8_r06ej5-L-nrKyrs6mkrKWoqrG__PD87dH86fTp6Pn4v6eupLOkrKmqpKTg&rom_version=25&plugin=26894&ts=1538042842&as=a2d5ea8a7aed3bfbec7259&mas=00f531ef9a8037a65e770c80d5e613fbf128caa4888a605ed5'

Then find the list-page interface:

base_url = 'https://is.snssdk.com/api/news/feed/v88/?list_count=17&category={}&refer=1&refresh_reason=5&session_refresh_idx=1&count=20&min_behot_time=1537635643&last_refresh_sub_entrance_interval=1538041336&loc_mode=0&loc_time=1537701890&latitude=39.834079&longitude=116.28459&city=%E5%8C%97%E4%BA%AC%E5%B8%82&tt_from=enter_auto&lac=4282&cid=7752303&plugin_enable=3&iid=45032656046&device_id=43306941482&ac=wifi&channel=update&aid=13&app_name=news_article&version_code=693&version_name=6.9.3&device_platform=android&ab_version=425531%2C511489%2C512527%2C421244%2C486953%2C494121%2C513028%2C519225%2C239095%2C500091%2C467914%2C170988%2C493249%2C398175%2C519895%2C442127%2C374116%2C437000%2C478532%2C517767%2C489317%2C501961%2C519804%2C276206%2C519509%2C459645%2C500387%2C416055%2C510641%2C392461%2C470730%2C495896%2C378451%2C471406%2C510754%2C519795%2C516760%2C509305%2C512393%2C512914%2C468954%2C271178%2C424178%2C326524%2C326532%2C496389%2C508197%2C345191%2C519949%2C516309%2C518639%2C515800%2C489801%2C510935%2C455646%2C424176%2C214069%2C497615%2C507003%2C482355%2C510710%2C519295%2C442255%2C519259%2C519017%2C520601%2C512958%2C489514%2C280447%2C520688%2C281294%2C513401%2C325616%2C515839%2C498551%2C520553%2C386888%2C520089%2C498375%2C516137%2C513578%2C467513%2C515673%2C513283%2C444465%2C510536%2C304488%2C261581%2C403270%2C484178%2C457480%2C502680%2C512027&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_group=94570%2C102754%2C181429&ab_feature=94570%2C102754&abflag=3&ssmix=a&device_type=NX563J&device_brand=nubia&language=zh&os_api=25&os_version=7.1.1&uuid=864460031530349&openudid=f1082e56b1908c9c&manifest_version_code=692&resolution=1080*1920&dpi=480&update_version_code=69305&_rticket=1538041336618&fp=GSTqFS4MLrx7FlPZc2U1Flx7P24M&tma_jssdk_version=1.3.0.1&pos=5r_-9Onkv6e_eBEKeScxeCUfv7G_8fLz-vTp6Pn4v6esrKuzr6WpqKSxv_H86fTp6Pn4v6eupLOlrqmtqqSxv_zw_O3e9Onkv6e_eBEKeScxeCUfv7G__PD87dHy8_r06ej5-L-nrKyrs6mkrKWoqrG__PD87dH86fTp6Pn4v6eupLOkrKmqpKTg&rom_version=25&plugin=26894&ts=1538041336&as=a2d5
6aba88bfab35ec7222&mas=00b339523bce59cab47cb99ee6d66e76d36864a4888a8080da&cp=58b0a9cfaa5f8q1'

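Note: the {} in category={} is a placeholder for the category you want.

The value to put in category can be read from the category interface.

The code for extracting the field is: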

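        import json
        import requests
        # Fetch the category list (classify_url is defined above)
        res = requests.get(classify_url)
        html = json.loads(res.text)
        datas = html['data']['data']
        print(len(datas))
        for data in datas:
            # Column name shown in the app
            column = data['name']
            print(column)
            # Category key used in the list-page URL
            category = data['category']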

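Splicing this field into the list-page URL gives the list page for each category. Once you have the list page, the next step is to get the addresses of the detail pages.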
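The splicing step can be sketched roughly as follows. This is only an illustration: FEED_TEMPLATE is a shortened, hypothetical stand-in for base_url (the captured URL carries many more device and signature parameters), build_feed_urls is a helper name made up for this example, and the sample payload is invented. It assumes the category response has the same data -> data -> name/category shape as in the code above.

```python
from urllib.parse import quote

# Hypothetical, shortened stand-in for base_url; the captured URL carries
# many more device and signature parameters.
FEED_TEMPLATE = "https://is.snssdk.com/api/news/feed/v88/?list_count=17&category={}&count=20"


def build_feed_urls(category_response, template=FEED_TEMPLATE):
    """Return {column name: list-page URL} for every category in the response."""
    urls = {}
    for item in category_response["data"]["data"]:
        category = item.get("category")
        if not category:
            continue  # skip entries without a usable category field
        # Percent-encode the category before placing it in the query string.
        urls[item.get("name", category)] = template.format(quote(category, safe=""))
    return urls


# Invented sample with the same shape as the category interface response.
sample = {"data": {"data": [
    {"name": "Hot", "category": "news_hot"},
    {"name": "Tech", "category": "news_tech"},
]}}

for name, url in build_feed_urls(sample).items():
    print(name, url)
```

Keeping the URL building in a pure function makes it easy to check the result without sending any requests.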

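The detail-page address also has to be found through an interface.

This part is much simpler. There are several workable approaches; I will describe just one here.

I found the interface with the packet-capture tool: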

text_url = "http://a3.pstatp.com/article/content/21/1/{}/{}/1/0/?iid=37457543399&device_id=55215909025&ac=wifi&channel=tengxun2&aid=13&app_name=news_article&version_code=682&version_name=6.8.2&device_platform=android&ab_version=261581%2C403271%2C197606%2C293032%2C405731%2C418881%2C413287%2C271178%2C357705%2C377637%2C326524%2C326532%2C405403%2C415915%2C409847%2C416819%2C402597%2C369470%2C239096%2C170988%2C416198%2C390549%2C404717%2C374117%2C416708%2C416648%2C265169%2C415090%2C330633%2C297058%2C410260%2C276203%2C413705%2C320832%2C397738%2C381405%2C416055%2C416153%2C401106%2C392484%2C385726%2C376443%2C378451%2C401138%2C392717%2C323233%2C401589%2C391817%2C346557%2C415482%2C414664%2C406427%2C411774%2C345191%2C417119%2C377633%2C413565%2C414156%2C214069%2C31211%2C414225%2C411334%2C415564%2C388526%2C280449%2C281297%2C325614%2C324092%2C357402%2C414393%2C386890%2C411663%2C361348%2C406418%2C252782%2C376993%2C418024&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_feature=102749%2C94563&abflag=3&ssmix=a&device_type=MI+3C&device_brand=Xiaomi&language=zh&os_api=19&os_version=4.4.4&uuid=99000549116036&openudid=efcc6d4284c6c458&manifest_version_code=682&resolution=1080*1920&dpi=480&update_version_code=68210&_rticket=1532142082952&rom_version=miui_v7_5.12.4&plugin=32&pos=5r_88Pzt0fzp9Ono-fi_p66ps6-oraylqrG__PD87d706eS_p794Iw14KgN4JR-_sb_88Pzt0fLz-vTp6Pn4v6esrKqzrKSvqq6k4A%3D%3D&fp=z2T_L2mOLSxbFlHIPlU1FYweFzKe&ts=1532142082&as=a255cac5b2208bd2a23862&mas=00e35bc961329fe4e2da0242394f32b692264a2c00d8a582a8"
       

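Note: the two {} placeholders also have to be filled in, and the values come from the list page. The list page sometimes contains this field and sometimes does not, so I use exception handling.

The code for getting this field is: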

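            # base_url still contains the category={} placeholder, so fill it in
            # with a category from the previous step before requesting
            res = requests.get(base_url.format(category), headers=self.headers)
            print(res.status_code, '-------')
            html = json.loads(res.text)
            datas = html['data']
            for data in datas:
                try:
                    # ID of the detail page, nested in the JSON string under "content"
                    group_id = json.loads(data["content"])["group_id"]
                except (KeyError, TypeError, ValueError):
                    # Some entries have no group_id
                    group_id = 0
                if group_id != 0:
                    print(group_id)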

The next step is to splice together the detail-page address.
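A minimal sketch of that splicing step, under some loud assumptions. The post only shows group_id being extracted, so filling the second placeholder with an item_id field from the same content JSON (falling back to group_id) is a guess on my part and should be checked against your own captured traffic. DETAIL_TEMPLATE is a shortened stand-in for text_url, and build_detail_url is a hypothetical helper name.

```python
import json

# Shortened stand-in for text_url; the captured URL has a long query string
# after this path.
DETAIL_TEMPLATE = "http://a3.pstatp.com/article/content/21/1/{}/{}/1/0/"


def build_detail_url(feed_item, template=DETAIL_TEMPLATE):
    """Return the detail-page URL for one list-page entry, or None."""
    try:
        content = json.loads(feed_item["content"])
        group_id = content["group_id"]
    except (KeyError, TypeError, ValueError):
        return None  # this entry carries no usable group_id
    # Assumption: the second slot is item_id, falling back to group_id.
    item_id = content.get("item_id", group_id)
    return template.format(group_id, item_id)


# Invented list-page entries: one complete, one without IDs, one empty.
sample_items = [
    {"content": json.dumps({"group_id": 111, "item_id": 222})},
    {"content": json.dumps({"title": "no id here"})},
    {},
]

for item in sample_items:
    print(build_detail_url(item))
```

Returning None for entries without an ID lets the caller skip them with a simple check instead of carrying a sentinel value such as 0.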

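After that, it is just a matter of extracting the title and the body. I will not go into it here, since there is nothing technically difficult about it.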

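If you want the source code, you can contact me, but I hope you will use a packet-capture tool to find the interfaces yourself and then follow the approach described here. The app's anti-crawling measures mainly target how often the interfaces are accessed, so you also need to rotate the User-Agent and the IP address. I will cover other news apps in later posts. Thanks for following!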
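One possible way to handle those three points (request rate, User-Agent, IP) is sketched below using only the standard library. Everything concrete here is a placeholder: the User-Agent strings and proxy addresses are invented, and prepare_request and polite_fetch are hypothetical helper names, not part of the original crawler.

```python
import itertools
import random
import time
import urllib.request

# Placeholder values; substitute real User-Agent strings and your own proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Linux; Android 7.1.1; NX563J) ExampleBrowser/1.0",
    "Mozilla/5.0 (Linux; Android 4.4.4; MI 3C) ExampleBrowser/1.0",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

_proxy_cycle = itertools.cycle(PROXIES)


def prepare_request(url):
    """Build a request with a random User-Agent and the next proxy in turn."""
    request = urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
    proxy = next(_proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return request, opener, proxy


def polite_fetch(url, min_delay=1.0, max_delay=3.0, timeout=10):
    """Sleep a random interval, then fetch the URL through the chosen proxy."""
    time.sleep(random.uniform(min_delay, max_delay))
    request, opener, _ = prepare_request(url)
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

The random delay keeps the access rate down, while the random User-Agent and round-robin proxy vary the other two signals on every request. Suitable delay values depend on the interface and have to be found by experiment.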