python爬取知乎話題的精華問題下的使用者資訊
阿新 • 發佈:2019-02-07
今天試著用自己的爬蟲程式碼爬取了知乎【同性戀】話題下的所有精華問題的使用者位置資訊
程式碼:
__author__ = 'yang'
# -*- coding: utf-8 -*-
import configparser
import requests
import time
import re
import string
def curTime():
    """Return the current local time wrapped as an HTML comment marker.

    Format: '\n<!--YYYY-MM-DD HH:MM:SS-->', used to timestamp the
    scraped output files.
    """
    stamp = time.strftime('%Y-%m-%d %H:%M:%S')
    return '\n<!--' + stamp + '-->'
def loginInfo():
    """Read the Zhihu credentials and browser cookies from test.ini.

    Returns:
        (username, password, cookies): two strings and a dict built from
        the [COOKIES] section.
    """
    parser = configparser.ConfigParser()
    # test.ini holds the Zhihu account, password and browser cookies.
    parser.read('test.ini')
    name = parser.get("USER", "username")
    secret = parser.get("USER", "password")
    return name, secret, dict(parser.items('COOKIES'))
def create_session():
    """Log in to Zhihu and return a ready-to-use session.

    First attempts an email/password POST; when the server rejects it
    (``r.json()['r'] == 1``) falls back to the browser cookies read from
    test.ini.

    Returns:
        (requests.Session, dict): the session and the cookie dict to pass
        to later requests.

    Raises:
        ValueError: if the password login fails and no cookies are filled in.
    """
    username, password, cookies = loginInfo()
    session = requests.session()
    login_data = {'email': username, 'password': password}
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36' ,
        'Host': 'www.zhihu.com',
        'Referer': 'http://www.zhihu.com/'
    }
    r = session.post('http://www.zhihu.com/login/email', data=login_data, headers=header)
    if r.json()['r'] == 1:
        # Password login rejected: print the server's reasons, then retry
        # with the configured cookies.
        print('Login Failed, reason is:', end=' ')
        for m in r.json()['data']:
            print(r.json()['data'][m])
        print('Use cookies to login...')
        # '__name__' is the artifact key configparser adds to a section dict.
        has_cookies = any(
            key != '__name__' and cookies[key] != '' for key in cookies
        )
        if has_cookies is False:
            raise ValueError('請填寫config.ini檔案中的cookies項')
        r = session.get('http://www.zhihu.com/login/email', cookies=cookies)
        # r.content is bytes -- open in binary mode so the write works on Python 3.
        with open('login.html', 'wb') as fp:
            fp.write(r.content)
    return session, cookies
def writeFile(name, content):
    """Replace the contents of the file *name* with *content*."""
    with open(name, 'w') as handle:
        handle.write(content)
if __name__ == '__main__':
    requests_session, requests_cookies = create_session()

    # Dump every page of the topic's top answers into tong.html.
    with open('tong.html', 'w') as fp:
        fp.write(curTime())
        for page in range(0, 49):
            url = 'https://www.zhihu.com/topic/19552984/top-answers?' + str(page)
            # .text decodes the response; writing bytes (.content) to a
            # text-mode file fails on Python 3.
            fp.write(requests_session.get(url, cookies=requests_cookies).text)

    # Extract the question links.
    # NOTE(review): url.html is only produced by the commented-out code in the
    # original -- it must already exist from a previous run; verify before use.
    # The pattern variable must NOT be named 'str'/'file' as in the original:
    # that shadowed the builtins and made the later str(...)/file(...) calls crash.
    question_link_re = re.compile(r'<a class="question_link.*?href="(.*?)">')
    with open('url.html') as fp:
        questionLinks = question_link_re.findall(fp.read())
    print(questionLinks)
    with open('resultLink.html', 'w') as fp:
        fp.write('\n'.join(questionLinks))

    # Fetch every question page and collect the answer-author links.
    usrRegex = re.compile(r'<a class="author-link.*?href="(.*?)">')
    with open('usrLinks.html', 'a+') as out:
        for link in questionLinks:
            url = 'https://www.zhihu.com' + link.strip()
            page_text = requests_session.get(url, cookies=requests_cookies).text
            out.write('\n'.join(usrRegex.findall(page_text)))

    # Re-read the accumulated author links and de-duplicate them.
    with open('usrLinks.html') as fp:
        links = list({line.strip() for line in fp})

    # Visit each user profile and record the location field.
    locationRegex = re.compile(r'<span class="location item.*?title="(.*?)"')
    with open('locations.html', 'a+') as out:
        for link in links:
            url = 'https://www.zhihu.com' + link
            page_text = requests_session.get(url, cookies=requests_cookies).text
            location = locationRegex.findall(page_text)
            # findall returns a list; the original appended '\n' to the list
            # itself (TypeError). Join the matches and terminate the line.
            if location:
                out.write('\n'.join(location) + '\n')