Scraping a university ranking website with requests and BeautifulSoup

阿新 • Published: 2018-12-17

1. Fetch the page content from a URL

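import requests

def getHTMLText(url):
    """Return the page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)  # time out after 30 seconds
        r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""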

2. Use BeautifulSoup to locate the tags that hold the ranking, and store the results in a list

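import bs4
from bs4 import BeautifulSoup

def fillUnivList(ulist, html):
    # BeautifulSoup parses the HTML into objects that Python code can navigate
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty page, or no ranking table in it
        return
    for tr in tbody.children:
        # .children also yields the strings between tags, so keep only Tag nodes
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # same as tr.find_all('td'): every td under this tr
            if len(tds) >= 3:
                ulist.append([tds[0].string, tds[1].string, tds[2].string])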
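
To see why the isinstance check matters, here is a minimal sketch run on a small hand-written table. The markup and the university names are made up for illustration and are not taken from the real site. It shows that the whitespace between <tr> tags appears among the children as strings, and that calling a tag gives the same result as find_all().

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a ranking table; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>Example University A</td><td>95.9</td></tr>
    <tr><td>2</td><td>Example University B</td><td>82.6</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
children = list(soup.find("tbody").children)

# Newlines and indentation between the <tr> tags are NavigableString children.
rows = [c for c in children if isinstance(c, bs4.element.Tag)]
strings = [c for c in children if isinstance(c, bs4.element.NavigableString)]

# Calling a tag is shorthand for calling its find_all() method.
first_cells = [td.string for td in rows[0]("td")]
same_cells = [td.string for td in rows[0].find_all("td")]

print(len(children), "children,", len(rows), "of them are tags")
print(first_cells)
```

Without the filter, the loop would try to call a plain string as if it were a tag and fail on the first whitespace node.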

3. Print the list with str.format

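def printUnivList(ulist, num):
    num = min(num, len(ulist))  # never print more rows than were scraped
    # {3} takes the argument at index 3 (the fourth one) as the fill character;
    # chr(12288) is the full-width space, which keeps Chinese names aligned
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # header: rank, school name, total score
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))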
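
The nested {3} field is easy to misread, so here is a small standalone sketch of what it does. The row values are invented for illustration. The school name column is padded with the full-width space U+3000 instead of the ASCII space, while the other two columns keep the default fill.

```python
# chr(12288) is U+3000, the ideographic (full-width) space.
FULLWIDTH_SPACE = chr(12288)

template = "{0:^10}\t{1:{3}^10}\t{2:^10}"

# Hypothetical row, for illustration only.
line = template.format("1", "清華大學", "95.9", FULLWIDTH_SPACE)
rank_field, name_field, score_field = line.split("\t")

# Each field is centred in a width of 10 characters.
print(repr(rank_field))
print(repr(name_field))
print(repr(score_field))
```

Because a full-width space is as wide as a CJK character in most monospaced fonts, padding Chinese text with it keeps the columns visually aligned.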

Full code:

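import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Return the page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)  # time out after 30 seconds
        r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] for every table row found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty page, or no ranking table in it
        return
    for tr in tbody.children:
        # .children also yields the strings between tags, so keep only Tag nodes
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # same as tr.find_all('td')
            if len(tds) >= 3:
                ulist.append([tds[0].string, tds[1].string, tds[2].string])


def printUnivList(ulist, num):
    """Print the first num rows as an aligned table."""
    num = min(num, len(ulist))  # never print more rows than were scraped
    # {3} takes the argument at index 3 (the fourth one) as the fill character
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # header: rank, school name, total score
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = "http://www.gaokaopai.com/paihang-otype-2.html"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


if __name__ == "__main__":
    main()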

Output: [screenshot omitted]

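Note that requests with BeautifulSoup can only scrape content that is embedded in the HTML the server returns. Anything the page loads afterwards through JavaScript cannot be retrieved this way.

In this example, only 25 universities are embedded in the page itself and the rest are loaded by JavaScript, so only 25 rankings can be scraped.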
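
This limit can be reproduced offline with a minimal sketch. The page below is hypothetical and only imitates the situation described: one row is written in the HTML, and a script would add a second row in a browser. BeautifulSoup parses the markup but never runs the script, so it only finds the first row.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one row is in the HTML, the script would add another
# row only when a browser executes it.
SAMPLE_PAGE = """
<table><tbody id="rank">
<tr><td>1</td><td>Example University A</td><td>95.9</td></tr>
</tbody></table>
<script>
  var row = document.getElementById("rank").insertRow();
  row.insertCell().textContent = "2";
  row.insertCell().textContent = "Example University B";
  row.insertCell().textContent = "82.6";
</script>
"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")
rows = soup.find("tbody").find_all("tr")

# Only the row that exists in the static markup is visible to the parser.
names = [row("td")[1].string for row in rows]
print(names)
```

Getting the remaining rows would need either the request the script itself makes or a tool that executes JavaScript, which is outside what this post covers.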

Full functionality to be continued...
