My first small crawler program (carrying the cookie after login)

阿新 • Published: 2019-01-02

import requests


class TiebaSpider:
    def __init__(self, tieba_name):
        """Store the initial parameters and basic configuration."""
        self.tieba_name = tieba_name
        self.url_base = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
        self.headers = {"User-Agent": "WSF"}

    def make_url_lists(self):
        """Build the list of URLs to download."""
        return [self.url_base.format(i) for i in range(1, 11)]

    def download_url(self, url_str):
        """Fetch the page with requests.get and return the raw response body (bytes)."""
        result = requests.get(url_str, headers=self.headers)
        return result.content

    def save(self, result, page_num):
        """Save the downloaded content to an HTML file."""
        file_path = "{}-page-{}.html".format(self.tieba_name, page_num)
        with open(file_path, "wb") as f:
            f.write(result)

    def run(self):
        """Main download routine: build the URLs, fetch each page, save it."""
        url_lists = self.make_url_lists()
        # enumerate() numbers the pages from 1 without a list lookup per URL
        for p_num, url_str in enumerate(url_lists, start=1):
            result_str = self.download_url(url_str)
            self.save(result_str, p_num)


if __name__ == '__main__':
    tieba_spider = TiebaSpider("薛之謙")
    tieba_spider.run()

Understanding session and cookie

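session: When a user visits an HTTP server, the server generates a sessionID (a unique identifier) that stays valid for a limited period. While the user browses, this record is stored in a cookie, so the next visit carries it and the server can find the earlier record.

In other words, a session is a string generated on the server side and kept as the unique identifier of one user. It uniquely identifies that client's visits, much like a membership card at a fitness center.

cookie: Data stored on the client machine. It contains the sessionID, and sending it to the server tells the server who the user is.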
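The relationship between the two can be sketched offline, without any real web server. The `ToyServer` class and its `login` and `whoami` methods below are hypothetical names invented for this illustration. Only `secrets.token_hex` and `http.cookies.SimpleCookie` come from the Python standard library.

```python
import secrets
from http.cookies import SimpleCookie


class ToyServer:
    """Hypothetical in-memory 'server', for illustration only."""

    def __init__(self):
        # Server-side session store: sessionID -> user name
        self.sessions = {}

    def login(self, user):
        """Create a session and return the cookie text the client should keep."""
        session_id = secrets.token_hex(16)  # random, hard-to-guess identifier
        self.sessions[session_id] = user
        return "sessionID=" + session_id

    def whoami(self, cookie_header):
        """Look up the user from the cookie text the client sends back."""
        cookie = SimpleCookie()
        cookie.load(cookie_header)
        morsel = cookie.get("sessionID")
        if morsel is None:
            return None
        # Unknown or expired IDs are simply not in the store
        return self.sessions.get(morsel.value)


if __name__ == "__main__":
    server = ToyServer()
    client_cookie = server.login("alice")   # the client stores this string
    print(server.whoami(client_cookie))     # the server recognises the client
    print(server.whoami("sessionID=bogus")) # an unknown ID is rejected
```

The user name never travels in the cookie. The client only holds the identifier, and the server keeps the mapping, which is why a copied cookie is enough to act as the logged-in user.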

import lxml.html  # cssselect() below also requires the cssselect package
import requests


def parse_form(html):
    """Return the name/value pairs of every <input> inside a <form>."""
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


def get_cookie():
    login_url = 'http://example.webscraping.com/places/default/user/login?_next=/places/default/index'
    # A session object keeps cookies between requests
    s = requests.session()
    # Fetch the login page first to collect the form fields, including hidden ones
    result = s.get(login_url)
    post_data = parse_form(result.text)
    print(s.cookies.get_dict())
    post_data['email'] = '[email protected]'
    post_data['password'] = '2336517498'
    # Log in; the session keeps any cookie the server sets in its response
    s.post(login_url, data=post_data)
    # Request again; the session sends the stored cookie automatically
    rs = s.post(login_url)

    with open('login1.html', 'w+', encoding='utf-8') as f:
        f.write(rs.text)


if __name__ == '__main__':
    get_cookie()
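To see what a form parser like `parse_form` returns without touching the network, here is a minimal sketch that swaps in the standard library's `html.parser` instead of lxml. The `FormInputParser` class, the `parse_form_stdlib` function, and the sample HTML (including the `_formkey` field) are all made up for this illustration. The real login page may contain different fields.

```python
from html.parser import HTMLParser


class FormInputParser(HTMLParser):
    """Collect name/value pairs of <input> tags that sit inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0  # greater than 0 while inside a <form>
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            attrs = dict(attrs)
            if attrs.get("name"):  # skip inputs without a name, e.g. submit buttons
                self.data[attrs["name"]] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def parse_form_stdlib(html):
    """Standard-library stand-in for an lxml-based form parser."""
    parser = FormInputParser()
    parser.feed(html)
    return parser.data


# Made-up login form; the hidden field stands for a server-issued token
SAMPLE_HTML = """
<form action="#" method="post">
  <input name="email" type="text" value="">
  <input name="password" type="password">
  <input name="_formkey" type="hidden" value="abc123">
  <input type="submit" value="Log In">
</form>
<input name="outside" value="ignored">
"""

if __name__ == "__main__":
    print(parse_form_stdlib(SAMPLE_HTML))
```

Hidden fields are the reason for parsing the form before posting: a login form may carry a server-issued token, and a POST that omits it can be rejected even when the e-mail and password are correct.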
