1. 程式人生 > 廣州商學院新聞獲取

廣州商學院新聞獲取

AR start -c sts htm getc href __main__ hit

import re
import xlwt
import time
import pandas
import requests
from multiprocessing import Process,Pool
from bs4 import BeautifulSoup


def getClickCount(newUrl):

    """
    Fetch the click (hit) count of one news article.

    :param newUrl: URL of the article detail page, e.g.
        http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html
    :return: int click count
    """
    # Extract the numeric article id from the URL path
    # ('..._0404/9183.html' -> '0404/9183' -> '9183').
    # Escape the dot so '.html' is matched literally.
    new_id = re.findall(r'_(.*)\.html', newUrl)
    new_id = new_id[0].split('/')[1]
    # The hit counter is served by a separate API endpoint.
    url = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(new_id)
    content = requests.get(url)
    # The API returns a JS snippet like $('#hits').html('123');
    clickCount = int(re.search(r"hits'\).html\('(.*)'\);", content.text).group(1))
    return clickCount

def getNewDetail(newsUrl):

    """
    Fetch the details of one GZCC news article.

    :param newsUrl: URL of the article detail page
    :return: dict with body text, metadata fields, click count,
        publish time (as time.struct_time) and the source link
    """
    content = ''
    web = requests.get(newsUrl)
    web.encoding = 'utf-8'
    soup = BeautifulSoup(web.text, 'html.parser')
    structure = soup.find('div', {'class': 'show-content'})  # article body
    for string in structure.stripped_strings:
        content = content + string

    fields = []  # (key, value) pairs; avoid shadowing the builtin `list`
    info = soup.find('div', {'class': 'show-info'})
    # NOTE(review): splitting on the literal character 'n' looks like a
    # mangled '\n'; it only works because the metadata text contains no
    # Latin 'n' — confirm against the live page.
    info = info.text.replace('\xa0', 'n').split('n')  # metadata fields
    for string in info:
        if len(string) > 3:
            if string.find('發布時間') != -1:
                # Normalise the full-width colon after the publish-time label.
                string = string.replace(':', ':', 1)
                string = string.strip()
            if string.find('次') != -1:
                # Replace the static hit text with the live counter value.
                string = '點擊:{}次'.format(getClickCount(newsUrl))

            fields.append(string.split(':'))
    details = dict(fields)
    details['鏈接'] = newsUrl
    details['正文'] = content
    details['發布時間'] = time.strptime(details['發布時間'], '%Y-%m-%d %H:%M:%S')
    return details
def getNewsUrl(url):

    """
    Collect the links of every article on one news listing page.

    :param url: URL of a news listing page
    :return: list of article URLs
    """

    newsList = []
    web = requests.get(url)
    web.encoding = 'utf-8'

    soup = BeautifulSoup(web.text, 'html.parser')
    soup = soup.find('ul', {'class': 'news-list'})
    for child in soup.children:
        # Skip bare whitespace text nodes between the <li> elements.
        if len(child) > 1:
            newsList.append(child.a['href'])
    return newsList

def getPage(url):

    """
    Work out how many listing pages the news section has.

    :param url: URL of the first news listing page
    :return: int number of listing pages
    """
    web = requests.get(url)
    web.encoding = 'utf-8'

    soup = BeautifulSoup(web.text, 'html.parser')
    # The <a class="a1"> element shows the total article count, e.g.
    # "9183條"; drop the trailing unit character.
    soup = soup.find('a', {'class': 'a1'}).string[:-1]

    # Ten articles are shown per listing page.
    page = int(soup) // 10 + 1

    return page

def getnews(url):
    """
    Scrape every article on one listing page into the global ``news`` list.

    :param url: URL of a news listing page
    """
    print('start in %s' % url[39:])
    newsurllist = getNewsUrl(url)
    # Use a distinct loop name so the listing-page URL is not shadowed.
    for newsurl in newsurllist:
        news.append(getNewDetail(newsurl))
    print(' end ', end='')

if __name__ == '__main__':

    # Shared accumulator that getnews() appends article dicts to.
    news = []

    url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    # (Removed an unused getNewsUrl(url) call that wasted a network request.)
    page = getPage(url)
    for i in range(1, page + 1):
        # Page 1 has no numeric suffix; later pages are "<n>.html".
        if i == 1:
            url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
        else:
            url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
        getnews(url)
    # Persist every scraped article to an Excel workbook.
    df = pandas.DataFrame(news)
    df.to_excel('gzccnews.xls')

  

廣州商學院新聞獲取