Python爬取考研必備單詞

阿新 • • 發佈：2018-12-20

參考連結：(https://blog.csdn.net/OnlyloveCuracao/art0icle/details/80768334)
原博主的程式碼可能因為單詞發音的音訊爬取有問題，導致無法將單詞存入資料庫，不過也非常感謝原博主，我根據原始碼做了一定刪減和更改，下次可能會根據使用者的需求，輸入需求，爬取相應的單詞。
主要步驟：
1.連線資料庫
2.建立word單詞表
3.獲取網頁主介面HTML程式碼
4.獲取class（課程型別）介面HTML程式碼，如“考研課程”
5.獲取course（課時內容，一門課程有多節課時）介面HTML，在course中得到word（單詞）、pronunciation（發音）、translation（翻譯）
ps：發音音訊暫不考慮，好吧，博主我時間來不及了，老師在催了。。。
6.將資料存入資料庫，over！
7.最後放主函式

ps：博主是個菜鳥，若有錯歡迎指正。如果還有比博主還菜的同學，不知道某些變數是啥子，就print，程式碼裡應該也有，只是註釋掉了

1連線資料庫：
輸入相關資料，我資料庫沒有密碼所以沒寫，db中的只是建立資料庫時單詞寫錯了，尬了

def conn():  # 連線資料庫
    db = pymysql.connect(host='localhost', user='root', password='', db='phython', port=3306)
    print('已連線資料庫')
    return db

2建立word表

def create_table(db):  # 建立一個單詞表
    cursor = db.cursor()  # 建立遊標
    sql = 'CREATE TABLE IF NOT EXISTS word (id VARCHAR(255) NOT NULL,word VARCHAR(255) NOT NULL,' \
          'trans VARCHAR(255) NOT NULL,word_type VARCHAR(255) NOT NULL, PRIMARY KEY (id))'
    cursor.execute(sql)  # 執行sql語句
    print('建表完成！')
    db.close()

3獲取網頁主介面HTML
首先加一個請求頭，有些網站會遮蔽爬蟲的請求，所以加一個請求頭起偽裝作用
然後就看註釋吧
主頁的HTML結構

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'
}

def get_html(url):  # 爬取主頁
    try:
        html = requests.get(url, headers=header)  # 使用requests庫爬取
        if html.status_code == 200:  # 如果狀態碼是200，則表示爬取成功
            print(url + '獲取成功')
            return html.text  # 返回H5程式碼
        else:  # 否則返回空
            print('獲取失敗')
            return None
    except:  # 發生異常返回空
        print('獲取失敗')
        return None

4獲取class的HTML，有效資訊為class_id
class_id在div（類名‘main_l’）下的ul（類名‘cl’）標籤裡面的li中
ps：一門課程可能有多個class_id, 比如大學英語四級，就分為‘四級必備詞彙’和‘四級救命詞彙’，它們的class_id不同

def get_url(html):  # 解析首頁得到所有的網頁和課程id
    class_list = []  # 定義存放class_id的列表
    class_info = BeautifulSoup(html, "html.parser")
    class_div = class_info.find('div', {'class': 'main_l'})  # 找到存放class_id的div
    class_li = class_div.find('ul', {'class': 'cl'}).find_all('li')  # 找到div下的ul標籤內的所有li
    for class_id in class_li:
        class_list.append(class_id.get('class_id'))  # 得到class_id
    return class_list

5獲取course的HTML，有效資訊為：單詞、發音、翻譯
字典後面會有簡略介紹，看註釋或者百度一手嘛

def get_info(word_html, type_name):  # 爬取所有的單詞、發音、翻譯
    word_all = {}  # 字典，存放詞彙所有相關內容
    word_info = BeautifulSoup(word_html, "html.parser")
    word_div = word_info.find_all('div', class_="word_main_list_w")  # 單詞div內容
    pronunce_div = word_info.find_all('div', {'class': 'word_main_list_y'})  # 發音div內容
    trans_div = word_info.find_all('div', {'class': 'word_main_list_s'})  # 翻譯div內容
    for i in range(1, len(word_div)):
        key = word_div[i].span.get('title')  # 獲取單詞
        pronunce = pronunce_div[i].strong.string.split()  # 獲取發音
        trans = trans_div[i].span.get('title')  # 獲取翻譯
        if len(pronunce) < 1:  # 無發音則跳過本次迴圈
            continue
        word_all[key] = [pronunce[0], trans, type_name]  # 字典結構:字典名={'key': ['value_1','value2_',....,'value_n'],}
    print('建立資料成功')
    return word_all

6將資料存入資料庫
註釋裡有單詞儲存到字典的結構：字典名={key：[value1，value2，value3]}, key在本專案中就是word單詞，value1為發音，value2為翻譯，value3為詞彙型別。如果要獲取字典中的value，則為：字典名【key】【下標】，如word_dict【key】【0】就是value1的值（部落格裡不能連用2個英文方括號，便用中文的代替了）
構造sql語句，Python裡佔位符為%s

def insert_words(word_dict, db):  # 爬取資料到資料庫
    cursor = db.cursor()  # 建立一個遊標
    # print(word_dict)
    # word_dict是一個字典，模型：{'distant': ['[ˈdistənt]', 'adj. 遠的；遙遠的；疏遠的；不親近的', '考研必備詞彙']}
    for key in word_dict:
        sql = 'INSERT INTO word(word, pronunciation, trans, word_type) values(%s, %s, %s, %s)'  # 構造sql語句
        try:
            cursor.execute(sql, (key, word_dict[key][0], word_dict[key][1], word_dict[key][2]))
            # key就是單詞，word_dict[key][0]就是發音...
            db.commit()  # 插入資料
        except:
            db.rollback()  # 回滾
    print('資料插入成功')
    db.close()  # 關閉資料庫
    print('資料庫成功關閉')

7主函式：
for迴圈中if語句可以改的，13是考研必備詞彙，可以換成其他id，但是六級id的好像不可以，我也不知道什麼問題，希望有大神能指教一下
就看註釋吧

def main():
    db = conn()
    create_table(db)  # 建立一個表
    base_url = 'http://word.iciba.com/'  # 主頁網址
    base_html = get_html(base_url)  # 得到首頁的H5程式碼
    class_id = get_url(base_html)  # 得到所有class_id值
    #print(class_id)
    print('爬取主頁')

    for id in class_id:  # word_all為class_id所有可能的取值
        # print(id)
        if id == '13':  # 考研詞彙class_id
            class_url = 'http://word.iciba.com/?action=courses&classid=' + str(id)  # 利用字串拼接起來，得到URL網址
            html = get_html(class_url)
            class_info = BeautifulSoup(html, "html.parser")  # 課程資訊

            # 獲取課程中所有課時，其中li的長度就是課時的數量
            course_li = class_info.find('ul', {'class': 'study-speed-m cl'}).find_all('li')
            name_info = class_info.find('div', {'class': 'word_h2'})  # 得到顯示單詞型別的div內容
            # print(name)
            r = re.compile(".*?</div>(.*?)</div>")  # 從div中匹配單詞型別
            name = re.findall(r, str(name_info))  # 得到單詞型別：六級必備詞彙, 並存入name列表
            name = name[0]  # 由於列表的值都相同，所以取第一個就好啦
            print('開始爬取' + name)

            # 課時的數量即li標籤的數量，就是course的值，考研詞彙為1到274
            for course_id in range(1, len(course_li) + 1):
                # 拼接單詞的URL
                course_url = 'http://word.iciba.com/?action=words&class=' + str(id) + '&course=' + str(course_id)
                word_html = get_html(course_url)
                print('開始爬取資料')
                word_dict = get_info(word_html, name)  # 得到資料字典
                print('開始儲存資料')
                db = conn()
                insert_words(word_dict, db)  # 儲存資料
                #print(word_dict)


if __name__ == '__main__':
    main()

歡迎指正！“能用程式碼解決的問題，怎麼能直接用軟體喃”，裝b多好！over！

Python爬取考研必備單詞

Python爬取考研必備單詞

python爬取20000個單詞音訊

python requests做爬蟲爬取oxford詞典單詞音標

用Python爬取了考研吧1000條帖子，原來他們都在討論這些！

python爬取網易雲音樂歌單音樂

python 爬取qidian某一頁全部小說

Python爬取今日頭條段子

利用python爬取龍虎榜數據及後續分析

python爬取豆瓣小組700+話題加回復啦啦啦python open file with a variable name

Node.js/Python爬取網上漫畫

python爬取百度搜索圖片

Python爬取百度貼吧數據

python爬取七星彩的開獎歷史記錄

python爬取網頁圖片

python爬取百度搜索結果ur匯總

python 爬取微博信息

Python爬取貓眼top100排行榜數據【含多線程】

python 爬取京東手機圖

Python 爬取數據入庫mysql

python爬取煎蛋妹子圖（老司機養成之路）

Python爬取考研必備單詞

相關推薦