Scraping Data with Python: Primary and Secondary School Information Nationwide

阿新 • Published: 2019-01-09

Technical approach: requests + BeautifulSoup

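This site's anti-scraping measures seem fairly strong: requests are often answered with an automatic redirect to a 139 site, so you may have to try switching IPs.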
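One way to notice such a redirect is to compare the host of the URL you asked for with the host of the URL you ended up on. The helper below is a minimal sketch and is not part of the original script: the name `was_redirected_away` is hypothetical, and it only uses the standard library.

```python
from urllib.parse import urlparse


def was_redirected_away(requested_url, final_url):
    """Return True when the final URL's host differs from the requested host.

    Hypothetical helper for illustration; the original script does not
    perform this check.
    """
    requested_host = urlparse(requested_url).hostname
    final_host = urlparse(final_url).hostname
    return requested_host != final_host
```

With requests, the final URL after redirects is available as `r.url`. If the check returns True, the page is probably not a school listing, and the request could be retried after a pause or through a different proxy (the `proxies` argument of `requests.get`).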

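You need to prepare an Excel file containing the pinyin of Chinese city names, laid out as: first column, city name; second column, pinyin. City level is sufficient.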

List whichever cities you want to scrape; to scrape the whole country, list every city name.
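The pinyin matters because it becomes the subdomain of each listing URL. The sketch below mirrors how the script assembles its URLs (city subdomain, school-type path, then `pnN.html` for later pages). The function name `build_list_url` is hypothetical, and `beijing` in the usage note is only an assumed example value.

```python
def build_list_url(city_pinyin, school_type_path, page=1):
    """Build the listing URL for one city, school type and page number.

    school_type_path is one of the paths used by the script,
    e.g. '/xiaoxue/', '/chuzhong/' or '/gaozhong/'.
    """
    base = 'http://' + city_pinyin.lower() + '.xuexiaodaquan.com' + school_type_path
    if page <= 1:
        # The first page has no 'pnN.html' suffix.
        return base
    return base + 'pn' + str(page) + '.html'
```

For instance, `build_list_url('beijing', '/chuzhong/', 3)` yields `http://beijing.xuexiaodaquan.com/chuzhong/pn3.html`.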


The output is an Excel file with the following columns:

City

Type

School name

Address

Phone

Website


Here is the code:

from bs4 import BeautifulSoup
import requests
import re
import sys
import xlwt
import xlrd
from xlutils.copy import copy

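# Fetch a page and return its HTML text (an empty string on failure)
def getHtmlText(url, code="GBK"):
    try:
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36'}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = code  # decode with the given encoding (GBK by default)
        return r.text
    except requests.RequestException as e:
        print("getHtmlText failed:", url, e)
        return ""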
# Parse the regions and return a dict mapping region name to link
# (disabled: the function is wrapped in a string literal and never runs)
'''
def getGroundList(htext):
    try:
        grounddict = {}
        soup = BeautifulSoup(htext, "html.parser")
        gdname = soup.find('dl', attrs={'class':'nobackground'})
        keyList = gdname.find_all('a')
        for i in range(1,len(keyList)):
            key = keyList[i].text
            val = keyList[i].get('href')
            grounddict[key] = val
        return grounddict
    except:
        print("getGroundList failed")
'''
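# Parse the total page count from the "last page" link; returns 0 when there is none
def getPageCode(htext, typeitem):
    try:
        soup = BeautifulSoup(htext, "html.parser")
        s1 = soup.find('a', attrs={'class': 'last'})
        if s1 and s1.get('href'):
            # typeitem is a path such as '/xiaoxue/'; escape it and the dot in '.html'
            pat = re.compile(re.escape(typeitem) + r'pn([0-9]+)\.html')
            code = pat.search(s1.get('href'))
            if code:
                return int(code.group(1))
        # No "last" link, or its href does not match: treat as a single page
        return 0
    except Exception as e:
        print("getPageCode failed:", e)
        return 0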

# Parse each school's name, address, phone and website, and append it to the Excel file
def getSchoolList(htext, fileAddress, cityitem, typeitem):
    soup = BeautifulSoup(htext, "html.parser")
    # Entries sit in two columns: <dl class="left"> and <dl class="right">
    sclist1 = soup.find_all('dl', attrs={'class': 'left'})
    sclist2 = soup.find_all('dl', attrs={'class': 'right'})
    sclist = sclist1 + sclist2
    for item in sclist:
        try:
            schoolDict = {}
            schoolDict['城市'] = cityitem
            schoolDict['型別'] = typeitem
            schoolDict['學習名稱'] = item.find('p').text
            sl = item.find_all('li')
            schoolDict['地址'] = sl[0].text
            schoolDict['電話'] = sl[1].text
            schoolDict['網址'] = sl[2].text
            savefile(schoolDict, fileAddress)
        except Exception as e:
            # Skip a malformed entry instead of dropping the rest of the page
            print("getSchoolList failed:", e)

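# Append one row to the Excel file (the .xls file must already exist and contain a sheet)
def savefile(schoolDict, fileAddress):
    # xlrd only reads; its second positional argument is a log file, not a file mode
    workbook = xlrd.open_workbook(fileAddress)
    sheet = workbook.sheet_by_index(0)
    wb = copy(workbook)   # xlutils: convert the read-only book into a writable one
    ws = wb.get_sheet(0)
    rowNum = sheet.nrows  # index of the first empty row
    ws.write(rowNum, 0, schoolDict['城市'])
    ws.write(rowNum, 1, schoolDict['型別'])
    ws.write(rowNum, 2, schoolDict['學習名稱'])
    ws.write(rowNum, 3, schoolDict['地址'])
    ws.write(rowNum, 4, schoolDict['電話'])
    ws.write(rowNum, 5, schoolDict['網址'])
    wb.save(fileAddress)
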
# Load the city list from the Excel file: column 1 is the city name, column 2 its pinyin
def getCityList():
    try:
        cityFileAddress = r'D:\中國省市名稱拼音.xls'
        file = xlrd.open_workbook(cityFileAddress)
        sheet = file.sheet_by_name('city')
        cityDic = {}
        for i in range(sheet.nrows):
            key = sheet.cell_value(i, 0)
            value = sheet.cell_value(i, 1).lower()
            cityDic[key] = value
        return cityDic
    except Exception as e:
        print("getCityList failed:", e)
        return {}

def main():
    cityList = getCityList() or {}
    typeList = {'小學':'/xiaoxue/','初中':'/chuzhong/','高中':'/gaozhong/'}
    fileAddress = 'D:/school.xls'
    for cityitem in cityList:
        searchUrl = 'http://' + cityList[cityitem] + '.xuexiaodaquan.com'
        for typeitem in typeList:
            # First page of this city and school type
            htext = getHtmlText(searchUrl + typeList[typeitem])
            getSchoolList(htext, fileAddress, cityitem, typeitem)
            # Remaining pages, if the listing is paginated
            pagecode = int(getPageCode(htext, typeList[typeitem]) or 0)
            for i in range(2, pagecode + 1):
                h1text = getHtmlText(searchUrl + typeList[typeitem] + 'pn' + str(i) + '.html')
                getSchoolList(h1text, fileAddress, cityitem, typeitem)

if __name__ == '__main__':
    main()
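
The pagination step can be tried in isolation, without any network access. The sketch below applies the same idea as `getPageCode` to a bare `href` string: it looks for `pnN.html` after the school-type path and falls back to 0 when nothing matches. The name `parse_last_page` and the sample href values are assumptions for illustration, not taken from the site.

```python
import re


def parse_last_page(href, type_path):
    """Extract the page number from a 'last page' href, or 0 if absent.

    type_path is the school-type path, e.g. '/xiaoxue/'.
    """
    # Escape the path and the dot so they are matched literally.
    pattern = re.compile(re.escape(type_path) + r'pn([0-9]+)\.html')
    match = pattern.search(href or '')
    return int(match.group(1)) if match else 0
```

Returning an integer in every case means the caller can pass the result straight to `range` without an extra conversion or a `None` check.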
