Pytorch-中文文字分類
阿新 • • 發佈:2020-08-20
1. 爬蟲
JD.py
import requests
from urllib.parse import quote, urlencode
from lxml import etree
import logging
import json
import time


class JDSpider:
    """Scrape JD.com product reviews for one product category.

    Construct with a category keyword (e.g. "手機"); the constructor fetches
    the JD search page and collects product ids. Call ``getData`` to crawl
    reviews of one sentiment and append them to a tab-separated file under
    ./data/ (the directory must already exist).
    """

    def __init__(self, categlory):
        # JD search page for the (URL-encoded) category keyword.
        self.startUrl = "https://search.jd.com/Search?keyword=%s&enc=utf-8" % (quote(categlory))
        # AJAX endpoint that serves paginated review data as JSON.
        self.commentBaseUrl = "https://club.jd.com/comment/productPageComments.action?"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
        }
        # Product ids scraped from the search results page.
        self.productsId = self.getId()
        # Internal sentiment code -> label used in the output file name.
        # (Typo fix: "nagetive" -> "negative".)
        self.comtype = {0: "negative", 1: "medium", 2: "positive"}
        self.categlory = categlory

    def getParamUrl(self, productid, page, score):
        """Build the query parameters and full URL for one review page.

        ``score`` is the JD API sentiment code: 1 = negative, 2 = medium,
        3 = positive. Every parameter below is required — omitting any of
        them makes JD detect the crawler and return no data.
        """
        params = {
            "productId": "%s" % (productid),
            "score": "%s" % (score),
            "page": "%s" % (page),
            "sortType": "5",
            "pageSize": "10",
            "isShadowSku": "0",
            "rid": "0",
            "fold": "1"
        }
        url = self.commentBaseUrl + urlencode(params)
        return params, url

    def getHeaders(self, productid):
        """Per-product headers: unlike ``self.headers``, these carry a
        Referer pointing at the product page, which JD expects."""
        header = {
            "Referer": "https://item.jd.com/%s.html" % (productid),
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
        }
        return header

    def getId(self):
        """Fetch the search page and return the list of product ids
        (``data-sku`` attributes). Returns [] on a non-200 response
        instead of parsing the error page (bug fix)."""
        response = requests.get(self.startUrl, headers=self.headers)
        if response.status_code != 200:
            logging.warning("狀態碼錯誤,爬蟲異常!")
            return []
        html = etree.HTML(response.text)
        return html.xpath('//li[@class="gl-item"]/@data-sku')

    def getData(self, maxPage, score):
        """Crawl up to ``maxPage - 1`` review pages (10 reviews each) for
        every collected product and append them to a TSV file.

        score: internal sentiment code — 0 negative, 1 medium, 2 positive.
        Negative and positive reviews usually have very different page
        counts (positive >> negative > medium); the loop stops early for a
        product once a page comes back empty.
        """
        # Bug fix: JD's API encodes sentiment as 1/2/3 (neg/med/pos) while
        # this class uses 0/1/2. The raw internal code used to be sent, so
        # e.g. "positive" actually fetched medium reviews. Shift by one.
        api_score = score + 1
        comments = []
        scores = []
        for j, pid in enumerate(self.productsId):
            header = self.getHeaders(pid)
            for page in range(1, maxPage):
                param, url = self.getParamUrl(pid, page, api_score)
                print(">>>>>>>>>>>>>>>>第:%d 個,第 %d 頁" % (j, page))
                try:
                    response = requests.get(url, headers=header, params=param)
                except Exception as e:
                    # Network failure: give up on this product, move on.
                    logging.warning(e)
                    break
                if response.status_code != 200:
                    logging.warning("狀態碼錯誤,爬蟲連線異常")
                    continue
                time.sleep(2)  # throttle so JD does not block the crawler
                if response.text == '':
                    logging.warning("未爬取到資訊")
                    continue
                try:
                    res_json = json.loads(response.text)
                except Exception as e:
                    logging.warning(e)
                    continue
                if len(res_json['comments']) == 0:
                    # Past the last page for this product.
                    logging.warning("頁面次數已到:%d,超出範圍" % (page))
                    break
                logging.info("正在爬取%s %s 第 %d" % (self.categlory, self.comtype[score], page))
                for cdit in res_json['comments']:
                    # Flatten newlines so each review stays on one TSV row.
                    comment = cdit['content'].replace("\n", ' ').replace('\r', ' ')
                    comments.append(comment)
                    scores.append(cdit['score'])
                    print(comment)
        savepath = './data/' + self.categlory + '_' + self.comtype[score] + '.csv'
        logging.warning("已爬取%d 條 %s 評價資訊" % (len(comments), self.comtype[score]))
        # Append mode: repeated runs accumulate rows (row index restarts at 0).
        with open(savepath, 'a+', encoding='utf8') as f:
            for idx, (star, text) in enumerate(zip(scores, comments)):
                f.write("%d\t%s\t%s\n" % (idx, star, text))
        logging.warning("資料已儲存在 %s" % (savepath))


if __name__ == '__main__':
    # Renamed from "list" to avoid shadowing the builtin.
    categories = ['電腦', '手機', '耳機']
    for item in categories:
        spider = JDSpider(item)
        spider.getData(10, 2)  # 好評 (positive)
        spider.getData(10, 1)  # 中評 (medium)
        spider.getData(10, 0)  # 差評 (negative)