To take my dream girl to the movies this weekend, I used Python to scrape the rankings of tens of thousands of movies

阿新 • Published: 2021-12-14

I. Preface

My dream girl finally asked me to go see a movie, but then she asked what I was planning to watch, so I had better come prepared~

II. Preparation

1. Software used

Python 3.8 (open source and free)
PyCharm (the GOAT: the best editor for Python, no rebuttals accepted…)

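2. Modules used

requests >>> HTTP request module; install with pip install requests
parsel >>> data parsing module; install with pip install parsel
csv >>> built-in module for writing tabular data; no installation needed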

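3. Crawler workflow

Whatever site you scrape, you can follow the same few steps.

Step 1. Define the requirement

Decide what we want to scrape:

https://movie.douban.com/top250

Then work out where the data we want comes from (data source analysis).

Use the browser's developer tools to capture and inspect the network traffic.

Static page: the data you see on the page is all present in the page source.

Dynamic site: the data is loaded by separate requests, so you have to find those requests through packet capture.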

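Step 2. Send the request (see Headers in the developer tools)

Decide which URL to request, which request method to use, and which request headers to send along.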
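To make "which URL, which method, which headers" concrete, here is a minimal sketch that builds a request object without sending it. It swaps in the standard library's urllib.request in place of the requests module used in this article, purely so that nothing touches the network; the User-Agent string is the one used later in the code.

```python
import urllib.request

URL = 'https://movie.douban.com/top250?start=0&filter='
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
)

# Build the request only; nothing is sent until urlopen() is called.
req = urllib.request.Request(URL, headers={'User-Agent': USER_AGENT})

print(req.get_method())              # request method (GET when no body is attached)
print(req.full_url)                  # the URL the request targets
print(req.get_header('User-agent'))  # urllib stores header names capitalized this way
```

With requests, the same three decisions map onto requests.get(url=..., headers=...).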

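Step 3. Get the data

Get the content the server returns, and check what format it is in and which part of it is the data we want:

Text data: response.text

JSON data as a Python dict: response.json()

Binary data: response.content. Videos, audio, images and other files in specific formats are all saved from binary data.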
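The three forms are closely related. As a rough sketch that uses plain bytes in place of a real response object (the payload is invented for illustration), text is the raw bytes decoded into a string, and JSON is that text parsed into Python objects:

```python
import json

# Stand-in for response.content: the raw bytes a server might send back.
# This payload is hypothetical and made up for the example.
raw_bytes = '{"title": "Example Movie", "rating": "9.0"}'.encode('utf-8')

# Roughly what response.text gives you: the bytes decoded into a str.
text = raw_bytes.decode('utf-8')

# Roughly what response.json() gives you: the text parsed as JSON.
data = json.loads(text)

print(type(raw_bytes).__name__, type(text).__name__, type(data).__name__)
print(data['title'])
```

For an HTML page such as the Douban list, only the text form is useful; response.json() applies when the server actually returns JSON.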

Step 4. Parse the data

Extract the data we want.

Step 5. Save the data

Save it locally.

III. Code walkthrough

1. Import modules

Import the modules we need.

import requests  # HTTP request module: pip install requests
import parsel  # data parsing module: pip install parsel
import csv  # writes tabular data (built in)

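2. Create the file

Open the CSV file in append mode and create a DictWriter whose fieldnames become the column headers.

# newline='' keeps the csv module from inserting blank lines between rows on Windows
f = open('豆瓣資料.csv', mode='a', encoding='utf-8', newline='')

csv_writer = csv.DictWriter(f, fieldnames=[
    '電影名字',  # movie title
    '導演',  # director
    '主演',  # lead actors
    '年份',  # year
    '國家',  # country
    '電影型別',  # genre
    '評論人數',  # number of ratings
    '評分',  # score
    '概述',  # one-line summary
    '詳情頁',  # detail page URL
])
csv_writer.writeheader()  # write the header row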
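To see what DictWriter actually produces, here is a small sketch that writes into an in-memory buffer (io.StringIO) in place of a file on disk. The two column names and the sample row are hypothetical, chosen only to keep the output short.

```python
import csv
import io

fieldnames = ['title', 'rating']  # hypothetical column names for the demo

buffer = io.StringIO()  # in-memory stand-in for the opened CSV file
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()  # first line: the column names
writer.writerow({'title': 'Example Movie', 'rating': '9.0'})  # keys must match fieldnames

csv_text = buffer.getvalue()
print(csv_text)
```

One thing to watch: because the real file is opened with mode='a', every run of the script appends another header row. Opening with mode='w' while testing avoids that.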

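3. Get the data

for page in range(0, 250, 25):
    # start = 0, 25, ..., 225: ten pages of 25 movies each
    url = f'https://movie.douban.com/top250?start={page}&filter='
    # headers: request headers that disguise the Python script so the server is less likely to identify it as a crawler
    # User-Agent: the browser's basic identity string, copied straight from the developer tools
    # A wolf in sheep's clothing: crawler >>> wolf, headers >>> sheepskin, server data >>> sheepfold
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    # Send a GET request
    # requests.get() sends the request to url with the headers attached; response receives what the server returns
    response = requests.get(url=url, headers=headers)
    # Get the data
    # print(response.text)
    # Parse the data: regular expressions (re), CSS selectors or XPath; use whichever is most convenient
    # (JSON data would be read by key instead)
    # Convert the HTML string in response.text into a Selector object
    selector = parsel.Selector(response.text)
    # <Selector xpath=None data='<html lang="zh-CN" class="ua-windows ...'>
    # CSS selectors extract data by tag and attribute
    # First pass: get every li tag
    lis = selector.css('.grid_view li')  # CSS selector syntax
    # selector.xpath('//*[@class="grid_view"]/li')  # XPath equivalent
    # lis behaves like a list; to take its elements one by one, loop over it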
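The paging logic above can be checked on its own, without sending any requests. This sketch builds the same ten URLs with urllib.parse.urlencode; the helper name build_page_urls is made up for the example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def build_page_urls(total=250, page_size=25):
    """Return one list-page URL per block of page_size movies."""
    urls = []
    for start in range(0, total, page_size):
        # start is the offset of the first movie on the page: 0, 25, ..., 225
        query = urlencode({'start': start, 'filter': ''})
        urls.append(f'{BASE_URL}?{query}')
    return urls


page_urls = build_page_urls()
print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```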

4. Parse the data

    # Still inside the page loop: go through the li tags one by one
    for li in lis:
        try:
            # span:nth-child(1) selects the first span tag; ::text takes the tag's text
            title = li.css('.hd a span:nth-child(1)::text').get()
            href = li.css('.hd a::attr(href)').get()  # detail page URL
            # li.xpath('.//*[@class="hd"]/a/span[1]/text()').get()  # XPath equivalent
            # get() returns the first match as a string; getall() returns all matches as a list
            move_info = li.css('.bd p::text').getall()
            actor_list = move_info[0].strip().split('   ')  # first text line: director and actors
            # print(actor_list)
            date_list = move_info[1].strip().split('/')  # second text line: year / country / genre
            # Douban serves Simplified Chinese, so the literals matched here must be Simplified too
            director = actor_list[0].replace('导演: ', '').strip()  # director
            actor = actor_list[1].replace('主演: ', '').replace('/', '').replace('...', '')  # actors
            date = date_list[0].strip()  # year
            country = date_list[1].strip()  # country
            move_type = date_list[2].strip()  # genre
            comment = li.css('.star span:nth-child(4)::text').get().replace('人评价', '')  # number of ratings
            star = li.css('.star span:nth-child(2)::text').get()  # score
            world = li.css('.inq::text').get()  # one-line summary
            # Common string methods:
            # replace() substitutes text, strip() removes whitespace at both ends, split() cuts the string into a list
            # print(title, actor_list, date_list)
            dit = {
                '電影名字': title,
                '導演': director,
                '主演': actor,
                '年份': date,
                '國家': country,
                '電影型別': move_type,
                '評論人數': comment,
                '評分': star,
                '概述': world,
                '詳情頁': href,
            }
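The string handling in this step is easier to follow on a small sample. The sketch below applies the same strip / split / replace sequence to two made-up text lines; the helper name parse_info and the sample values are hypothetical, and the three-space separator is an assumption carried over from the code above (the live page may use other whitespace, such as non-breaking spaces).

```python
def parse_info(lines):
    """Split the two text lines of a movie entry into named fields."""
    # Line 1 holds "导演: ..." and "主演: ...", separated by three spaces (assumed)
    people = lines[0].strip().split('   ')
    # Line 2 holds "year / country / genre"
    facts = [part.strip() for part in lines[1].strip().split('/')]
    director = people[0].replace('导演: ', '').strip()
    # Some entries may list no actors, so guard the second element
    actor = people[1].replace('主演: ', '').strip() if len(people) > 1 else ''
    return {
        'director': director,
        'actor': actor,
        'year': facts[0],
        'country': facts[1],
        'genre': facts[2],
    }


# Invented sample in the same shape as the page text, not real scraped data
sample_lines = [
    '\n    导演: Director A   主演: Actor B / Actor C\n    ',
    '\n    1994 / Country X / Genre Y\n    ',
]
info = parse_info(sample_lines)
print(info)
```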

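5. Write the data

            csv_writer.writerow(dit)  # write one row; the dict keys must match fieldnames

6. Print the data

            print(title, director, actor, date, country, move_type, comment, star, world, href, sep=' | ')
        except Exception as e:
            # Skip an entry that is missing a field, but say so instead of failing silently
            print(f'skipped one entry: {e!r}')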
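When an entry is missing a field, the try block skips it. A minimal sketch of one way to keep that behavior while still knowing what was skipped: catch only the exceptions you expect and record them. The helper first_two and the sample inputs are hypothetical.

```python
def first_two(items):
    """Return the first two elements; raises if there are fewer, or if items is None."""
    return items[0], items[1]


samples = [['a', 'b'], ['only one'], None]  # made-up inputs: one good, two broken

parsed = []
skipped = []
for sample in samples:
    try:
        parsed.append(first_two(sample))
    except (IndexError, TypeError) as exc:
        # IndexError: the list was too short; TypeError: there was no list at all
        skipped.append((sample, type(exc).__name__))

print(parsed)
print(skipped)
```

In the scraper, the same two situations show up as a split() result with too few parts and a get() call that returned None.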

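Full code

I left the full code out of the table of contents as a reward for those who read all the way through; otherwise some people would spot it in the contents and jump straight here, haha~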

import requests
import parsel
import csv

f = open('豆瓣資料.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '電影名字',
    '導演',
    '主演',
    '年份',
    '國家',
    '電影型別',
    '評論人數',
    '評分',
    '概述',
    '詳情頁',
])
csv_writer.writeheader()

for page in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }

    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    lis = selector.css('.grid_view li')  # CSS selector syntax

    for li in lis:
        try:
            title = li.css('.hd a span:nth-child(1)::text').get()
            href = li.css('.hd a::attr(href)').get()  # detail page URL

            move_info = li.css('.bd p::text').getall()
            actor_list = move_info[0].strip().split('   ')  # director and actors

            date_list = move_info[1].strip().split('/')  # year / country / genre
            # Douban serves Simplified Chinese, so the literals matched here must be Simplified too
            director = actor_list[0].replace('导演: ', '').strip()  # director
            actor = actor_list[1].replace('主演: ', '').replace('/', '').replace('...', '')  # actors
            date = date_list[0].strip()  # year
            country = date_list[1].strip()  # country
            move_type = date_list[2].strip()  # genre
            comment = li.css('.star span:nth-child(4)::text').get().replace('人评价', '')  # number of ratings
            star = li.css('.star span:nth-child(2)::text').get()  # score
            world = li.css('.inq::text').get()  # one-line summary

            dit = {
                '電影名字': title,
                '導演': director,
                '主演': actor,
                '年份': date,
                '國家': country,
                '電影型別': move_type,
                '評論人數': comment,
                '評分': star,
                '概述': world,
                '詳情頁': href,
            }
            csv_writer.writerow(dit)

            print(title, director, actor, date, country, move_type, comment, star, world, href, sep=' | ')
        except Exception as e:
            # Skip an entry that is missing a field, but say so instead of failing silently
            print(f'skipped one entry: {e!r}')

f.close()  # flush and close the CSV file