Scraping Website Article Data with Python and Saving It to a Database

A-Xin • Published: 2021-07-16

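2. Preparing the database table (MySQL)
1) Decide which fields to store:
This article collects article data from the mama.cn site (媽媽網), so each record needs the article title (title), article link (href), article content (content), and content image (imgs).

2) Create the database table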

CREATE TABLE `mamawang_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `href` varchar(255) DEFAULT NULL,
  `content` text,
  `imgs` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=627 DEFAULT CHARSET=utf8;
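
The title, href, and imgs columns are all varchar(255), so a longer value is either rejected or truncated, depending on the MySQL SQL mode. Below is a minimal sketch of a check that could run before each insert. The names `MAX_LENGTHS` and `oversized_fields` and the sample rows are assumptions for illustration, not part of the original script.

```python
# Column limits taken from the CREATE TABLE statement above (varchar(255) columns)
MAX_LENGTHS = {"title": 255, "href": 255, "imgs": 255}


def oversized_fields(row):
    """Return the names of fields whose text is longer than the column allows."""
    return [
        name
        for name, limit in MAX_LENGTHS.items()
        if row.get(name) is not None and len(row[name]) > limit
    ]


# Made-up example rows
ok_row = {"title": "Short title", "href": "http://example.com/a", "content": "text", "imgs": "a.jpg"}
long_row = dict(ok_row, href="http://example.com/" + "x" * 300)

print(oversized_fields(ok_row))
print(oversized_fields(long_row))
```

A row that fails the check could be skipped or logged rather than sent to the database.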

3) Connect to the database: first install Python's pymysql package (for example with pip install pymysql)

import pymysql.cursors
connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='admin',
    db='baby_info',
    charset='utf8'
)
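
The connection above hard-codes the password in the source file. One possible alternative, sketched here under assumptions, is to read the settings from environment variables. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) and the helpers `db_settings` and `open_connection` are hypothetical and do not appear in the original script.

```python
import os


def db_settings(env=os.environ):
    """Build pymysql connection keyword arguments from environment variables."""
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "3306")),
        "user": env.get("DB_USER", "root"),
        "password": env.get("DB_PASSWORD", ""),
        "database": env.get("DB_NAME", "baby_info"),
        "charset": "utf8",
    }


def open_connection():
    """Open a connection using the settings above (requires a running MySQL server)."""
    import pymysql  # imported here so db_settings() works even without the driver
    return pymysql.connect(**db_settings())


# Example with an explicit mapping instead of the real environment
settings = db_settings({"DB_PASSWORD": "admin"})
print(settings["host"], settings["port"], settings["database"])
```

Calling `open_connection()` would then replace the hard-coded `pymysql.Connect(...)` call.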

3. Scraping the website data
1) Decide which website data to scrape

2) Study the page structure

url = 'http://www.mama.cn/z/t1183/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
div = soup.find(class_='list-left')
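
To see what this lookup yields without touching the network, the same calls can be tried on a small inline HTML string. This is only a sketch: the sample markup is made up and merely mimics the layout the snippet above assumes, with a `list-left` container holding `<li><a>` entries.

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the assumed layout of the list page
sample_html = """
<div class="list-left">
  <ul>
    <li><a href="http://example.com/article-1">First article</a></li>
    <li><a href="http://example.com/article-2">Second article</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find(class_="list-left")

articles = []
if div is not None:  # find() returns None when no element has that class
    for li in div.find_all("li"):
        link = li.find("a")
        if link is not None and link.has_attr("href"):
            # Collect (title, href) pairs, the two fields taken from the list page
            articles.append((link.get_text(strip=True), link["href"]))

print(articles)
```

Checking for `None` avoids an AttributeError if the site changes its markup.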

　　

3) Write the Python code that scrapes the page content (full code)

import datetime
import os
import time

import pymysql.cursors
import requests
from bs4 import BeautifulSoup

# Connect to the database
connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='admin',
    db='baby_info',
    charset='utf8'
)
# Get a cursor
cursor = connect.cursor()


def get_one_page():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }
    # Start time
    start_time = datetime.datetime.now()
    url = 'http://www.mama.cn/z/t1183/'
    # Directory where images are saved
    root = "D://reptile//images//"
    # Create the directory (and any missing parent directories) if it does not exist
    os.makedirs(root, exist_ok=True)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    div = soup.find(class_='list-left')
    items = div.find_all('li')
    for item in items:
        link = item.find('a')
        title = link.string
        href = link['href']
        time.sleep(1)
        # Fetch the article page through the article URL
        page = requests.get(href, headers=headers)
        web_text = BeautifulSoup(page.text, "html.parser")
        contents = web_text.find_all('p')
        # When stored as-is, every article started with "退出" (log out) and ended with
        # "None", so the first and last <p> are skipped. get_text() is used because
        # .string is None for a <p> that contains nested tags.
        content = ''.join(p.get_text() for p in contents[1:-1])
        # Download the article's main image once per article; path stays empty if none is found
        path = ''
        try:
            div_imgs = web_text.find('div', class_='detail-mainImg')
            imgs = div_imgs.find('img')['src']
            path = root + imgs.split("/")[-1]
            with open(path, "wb") as f:  # "wb" writes the file in binary mode
                f.write(requests.get('http:' + imgs).content)
        except Exception:
            print("Sorry, no image found")
        inset_spec_code(title, href, content, path)
    end_time = datetime.datetime.now()
    # Elapsed time in seconds
    print((end_time - start_time).seconds)


def inset_spec_code(title, href, content, imgs):
    try:
        # Insert one row; the values are passed separately so the driver escapes them
        sql = "INSERT INTO mamawang_info(title,href,content,imgs) VALUES (%s,%s,%s,%s)"
        data = (title, href, content, imgs)
        cursor.execute(sql, data)
        connect.commit()
        print('Inserted', cursor.rowcount, 'row(s)')
    except Exception as e:
        connect.rollback()
        print("Insert failed:", e)


if __name__ == '__main__':
    get_one_page()
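
Scraped titles and article text often contain quotation marks, so it is safer to hand the values to `execute` as parameters than to splice them into the SQL string. Here is a minimal sketch of that idea. It uses the standard library's sqlite3 purely as a stand-in so that it runs without a MySQL server; sqlite3 uses `?` placeholders where pymysql uses `%s`. The simplified table, the helper `insert_article`, and the sample row are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mamawang_info ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, href TEXT, content TEXT, imgs TEXT)"
)


def insert_article(conn, title, href, content, imgs):
    """Insert one row, passing the values separately from the SQL text."""
    sql = "INSERT INTO mamawang_info(title, href, content, imgs) VALUES (?, ?, ?, ?)"
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(sql, (title, href, content, imgs))
    return cur.rowcount


# Made-up sample row; the single quote in the title needs no manual escaping
inserted = insert_article(conn, "Baby's first food", "http://example.com/a1", "Some text", "a1.jpg")
stored_title = conn.execute("SELECT title FROM mamawang_info").fetchone()[0]
print(inserted, stored_title)
```

With pymysql the call has the same shape: `cursor.execute(sql, data)`, with `%s` placeholders and no quotes around them.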

4. Running the Python file
1) Open a cmd prompt in the same directory as the Python file and enter: python mamawang.py

2) Results
Image download results

Database results (626 rows)
