Scraping Articles from a School News Website

阿新 • Published: 2021-01-05

Tags: python


Approach
Problems encountered

Approach

Step 1: use requests to fetch the HTML source of the news listing page.

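import requests

# Assumption: the original does not show `headers`; a browser-like User-Agent is typical.
headers = {"User-Agent": "Mozilla/5.0"}

def get_page(url):   # return the page source, or None on failure
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        print("Failed to get page")
        return None

offset = 1  # page number of the listing (assumption: the original does not show its value)
url = "http://news.fzu.edu.cn/html/fdyw/" + str(offset) + ".html"
html = get_page(url)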
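
Because the listing URL is built from a page number, automatic paging can be layered on top of get_page. The following is a minimal sketch rather than the author's code: list_url and iter_list_pages are hypothetical helper names, and the fetch function is passed in as a parameter so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional

BASE_URL = "http://news.fzu.edu.cn/html/fdyw/"


def list_url(offset: int) -> str:
    """Build the URL of listing page number `offset`."""
    return f"{BASE_URL}{offset}.html"


def iter_list_pages(
    fetch: Callable[[str], Optional[str]],
    start: int = 1,
    max_pages: int = 5,
) -> Iterator[str]:
    """Yield the HTML of successive listing pages.

    Stops early when `fetch` returns None (for example, on a failed
    request), or after `max_pages` pages.
    """
    for offset in range(start, start + max_pages):
        html = fetch(list_url(offset))
        if html is None:
            break
        yield html
```

In the real crawler, get_page could be passed as fetch, for example `for html in iter_list_pages(get_page): ...`.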

Step 2: get the URL of each article, extracting its date and title first.

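from pyquery import PyQuery as pq

def get_articles(html, new_list):
    doc = pq(html)
    articles = doc('.list_main_content li')  # one <li> per article in the listing
    # (the per-article handling follows in step 3)

new_list = []  # collects one dict per article
get_articles(html, new_list)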

Step 3: restrict the crawl range by date, and send a GET request to the URL of each news item.

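# Fragment from inside the loop over articles in get_articles()
if new["date"][:4] == "2020":   # only crawl articles from 2020
    new["title"] = article('a').text()  # title
    url = 'http://news.fzu.edu.cn' + article('a').attr('href')
    html_new = get_page(url)
    get_other_data(html_new, new)
    new_list.append(new)
elif new["date"][:4] == "2021":
    continue
else:
    # any other year: set the flag and leave the function
    global flag
    flag = 1
    return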
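
The date check above can also be expressed without a global flag. The sketch below is an alternative, not the author's implementation: classify_by_year and filter_by_year are hypothetical names, and it assumes (as the original branching implies) that the listing is ordered newest first and that dates start with a four-digit year.

```python
from typing import Iterable, List


def classify_by_year(date: str, target_year: str = "2020") -> str:
    """Decide what to do with an article based on the year in its date.

    Returns "keep" for the target year, "skip" for newer articles,
    and "stop" for older ones (nothing further down can match).
    """
    year = date[:4]
    if year == target_year:
        return "keep"
    if year > target_year:  # four-digit years compare correctly as strings
        return "skip"
    return "stop"


def filter_by_year(dates: Iterable[str], target_year: str = "2020") -> List[str]:
    """Collect dates from the target year, stopping at the first older one."""
    kept = []
    for date in dates:
        action = classify_by_year(date, target_year)
        if action == "keep":
            kept.append(date)
        elif action == "stop":
            break
    return kept
```

Returning an explicit action keeps the stop condition local to the loop, so the caller does not need to inspect shared state to know when to stop paging.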

Step 4: extract the remaining information from each article page's source: the author, the body text, and the view count.

def get_other_data(html, new):
    doc = pq(html)
    data = doc('.detail_main_content')

    author = data('#author').text()  # author
    new["author"] = author

    # View count: it is not in the HTML itself. An inline script requests it,
    # so cut the request URL out of the script text (between "url" and "timeout").
    page_views_str = data('script').text()
    a1 = page_views_str.find("url")
    a2 = page_views_str.find("timeout")
    page_views_url = page_views_str[a1 + 5:a2 - 2]
    page_views_url = "http://news.fzu.edu.cn" + page_views_url
    page_views = requests.post(page_views_url).text
    new["page_views"] = page_views

    content = ""    # body text, one paragraph per line
    paragraphs = doc('#news_content_display')
    for p in paragraphs('p').items():
        content += p.text() + "\n"
    new["content"] = content
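
The fixed offsets (a1 + 5, a2 - 2) depend on the exact spacing and quoting of the inline script. A regular expression is one way to make the extraction less brittle. This is a sketch under assumptions: the sample script text and the path in it are invented for illustration (the real script is not shown in this post), and extract_ajax_url is a hypothetical helper name.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE = "http://news.fzu.edu.cn"

# Matches  url: '...'  or  url: "..."  inside an inline script.
URL_PATTERN = re.compile(r'\burl\s*:\s*["\']([^"\']+)["\']')


def extract_ajax_url(script_text: str, base: str = BASE) -> Optional[str]:
    """Return the absolute URL requested by the inline script, or None."""
    match = URL_PATTERN.search(script_text)
    if match is None:
        return None
    return urljoin(base, match.group(1))


# Hypothetical script text, for illustration only.
sample = "$.ajax({type: 'post', url: '/example/readCount.do?id=123', timeout: 2000})"
print(extract_ajax_url(sample))
```

Returning None when no URL is found lets the caller skip the view count instead of posting to a malformed address.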

Step 5: store the results in the database.

import pymysql

db = pymysql.connect(host='localhost', user='root', password='beli3579', port=3306, db='fzu_new')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS news")  # start from an empty table on every run
sql = '''create table news(
        date varchar(20),
        title varchar(70),
        author varchar(50),
        page_views varchar(20),
        content varchar(3000)
    )'''
cursor.execute(sql)
for new in new_list:
    # The values must be in the same order as the columns; this relies on the
    # keys of `new` having been set in the order date, title, author, page_views, content.
    sql = 'insert into news(date,title,author,page_views,content) values(%s,%s,%s,%s,%s)'
    try:
        if cursor.execute(sql, tuple(new.values())):
            print('Saved to the database')
            db.commit()
    except Exception:
        print('Failed to save to the database')
        db.rollback()
db.close()
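
The insert above works only if every dict was filled in column order. The sketch below shows one way to make the order explicit. To stay self-contained it uses the standard-library sqlite3 module as a stand-in for MySQL, so placeholders are `?` rather than `%s`; save_news and COLUMNS are hypothetical names, not part of the original code.

```python
import sqlite3
from typing import Dict, List

COLUMNS = ("date", "title", "author", "page_views", "content")


def save_news(conn: sqlite3.Connection, news_list: List[Dict[str, str]]) -> int:
    """Insert all articles in one transaction and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news "
        "(date TEXT, title TEXT, author TEXT, page_views TEXT, content TEXT)"
    )
    # Pick values by column name, so the key order of each dict does not matter.
    rows = [tuple(new[col] for col in COLUMNS) for new in news_list]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO news (date, title, author, page_views, content) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

The same idea carries over to pymysql by building each tuple from COLUMNS before passing it to cursor.execute.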

Problems encountered

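In Chrome's Inspect tool the view count of a news item is displayed, but it could not be scraped from the page source.
It turned out that the count is loaded by an Ajax request.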
