
Mining product review data for review tags (product attribute + opinion) and user-group information

Tags: Python, natural language processing, opinion extraction, review data, LTP, dependency parsing

This post mines product reviews to extract review tags (a product attribute paired with an opinion word) and user-group information, in three steps:

Step 1: Preprocess the text, segment it into words, and perform semantic role labeling

```python
# -*- coding:utf-8 -*-
import os
import heapq
import re
import emoji
import pandas as pd
import numpy as np
from pyltp import Segmentor, Postagger, Parser, NamedEntityRecognizer, SementicRoleLabeller
from gensim.models import Word2Vec


class Sentence_Parser:
    def __init__(self):
        LTP_DIR = './ltp_data_v3.4.0'
        # Word segmentation
        self.segmentor = Segmentor()
        self.segmentor.load(os.path.join(LTP_DIR, 'cws.model'))
        # Part-of-speech tagging
        self.postagger = Postagger()
        self.postagger.load(os.path.join(LTP_DIR, 'pos.model'))
        # Dependency parsing
        self.parser = Parser()
        self.parser.load(os.path.join(LTP_DIR, 'parser.model'))
        # Named entity recognition (person, place, organization, etc.)
        self.recognizer = NamedEntityRecognizer()
        self.recognizer.load(os.path.join(LTP_DIR, 'ner.model'))
        # Semantic role labeling (agent, patient, time, place)
        self.labeller = SementicRoleLabeller()
        self.labeller.load(os.path.join(LTP_DIR, 'pisrl_win.model'))

    def format_labelrole(self, words, postags):
        """Semantic role labeling."""
        arcs = self.parser.parse(words, postags)
        roles = self.labeller.label(words, postags, arcs)
        roles_dict = {}
        for role in roles:
            roles_dict[role.index] = {arg.name: [arg.name, arg.range.start, arg.range.end]
                                      for arg in role.arguments}
        return roles_dict

    def build_parser_child_dict(self, words, postags, arcs):
        """Dependency parsing: keep, for every word, a dict of its dependent child nodes."""
        child_dict_list = []
        format_parse_list = []
        for index in range(len(words)):
            child_dict = dict()
            for arc_index in range(len(arcs)):
                # arc heads are 1-based; 0 is the virtual root
                if arcs[arc_index].head == index + 1:
                    if arcs[arc_index].relation not in child_dict:
                        child_dict[arcs[arc_index].relation] = []
                    child_dict[arcs[arc_index].relation].append(arc_index)
            child_dict_list.append(child_dict)
        rely_id = [arc.head for arc in arcs]
        relation = [arc.relation for arc in arcs]
        heads = ['Root' if rid == 0 else words[rid - 1] for rid in rely_id]
        for i in range(len(words)):
            format_parse_list.append([relation[i], words[i], i, postags[i],
                                      heads[i], rely_id[i] - 1, postags[rely_id[i] - 1]])
        return child_dict_list, format_parse_list

    def parser_main(self, sentence):
        """Main entry point for parsing one sentence."""
        words = list(self.segmentor.segment(sentence))
        postags = list(self.postagger.postag(words))
        arcs = self.parser.parse(words, postags)
        child_dict_list, format_parse_list = self.build_parser_child_dict(words, postags, arcs)
        roles_dict = self.format_labelrole(words, postags)
        return words, postags, child_dict_list, roles_dict, format_parse_list

    def select(self, words, postags):
        """Filter out the candidate nouns and adjectives."""
        co_model = Word2Vec.load('coseg_text.model')
        n_list0, a_list = [], []
        for i in range(len(postags)):
            if postags[i] == 'n' and len(words[i]) >= 2:
                n_list0.append(words[i])
            if postags[i] == 'a':
                a_list.append(words[i])
        n_list0 = list(set(n_list0))
        a_list = list(set(a_list))
        si_p = []
        for n in n_list0:
            try:
                si_p.append(co_model.similarity(n, '手機'))
            except Exception:
                si_p.append(0)
        # Keep the 80% of nouns most related to '手機' (mobile phone)
        index_list = list(map(si_p.index, heapq.nlargest(int(0.8 * len(si_p)), si_p)))
        n_list = [n_list0[index] for index in index_list]
        return n_list, a_list

    def simlarity(self, n_list0, a_list):
        """Forward and backward matching by similarity to find the best noun-adjective pairs."""
        co_model = Word2Vec.load('coseg_text.model')
        si_p = []
        for n in n_list0:
            try:
                si_p.append(co_model.similarity(n, '手機'))
            except Exception:
                si_p.append(0)
        index_list = list(map(si_p.index, heapq.nlargest(int(0.8 * len(si_p)), si_p)))
        n_list = [n_list0[index] for index in index_list]

        # Forward matching: for each noun, the adjective most similar to it
        comment1_df = pd.DataFrame(columns=['comment_tag', 'similarity'], index=np.arange(100))
        for i in range(len(n_list)):
            f_si, comment_tag = 0, None
            for j in range(len(a_list)):
                try:
                    si = co_model.similarity(n_list[i], a_list[j])
                    if si >= f_si:
                        f_si = si
                        comment_tag = n_list[i] + a_list[j]
                except Exception as e:
                    print('word not in corpus', e)
            comment1_df.loc[i] = [comment_tag, f_si]
        comment1_df = comment1_df.sort_values(by='similarity', ascending=False, ignore_index=True)
        comment1_df.dropna(subset=['comment_tag'], inplace=True)

        # Backward matching: for each adjective, the noun most similar to it
        comment2_df = pd.DataFrame(columns=['comment_tag', 'similarity'], index=np.arange(100))
        for i in range(len(a_list)):
            f_si, comment_tag = 0, None
            for j in range(len(n_list)):
                try:
                    si = co_model.similarity(n_list[j], a_list[i])
                    if si >= f_si:
                        f_si = si
                        comment_tag = n_list[j] + a_list[i]
                except Exception as e:
                    print('word not in corpus', e)
            comment2_df.loc[i] = [comment_tag, f_si]
        comment2_df = comment2_df.sort_values(by='similarity', ascending=False, ignore_index=True)
        comment2_df.dropna(subset=['comment_tag'], inplace=True)

        # Keep only the pairs agreed on by both directions
        comment_df = pd.merge(comment1_df, comment2_df, on='comment_tag', how='inner')
        comment_df.dropna(subset=['comment_tag'], inplace=True)
        return comment_df

    def cleandata(self, x):
        """Clean a comment: normalize irregular punctuation and strip emoji."""
        pat = re.compile('[^\u4e00-\u9fa5.a-zA-Z0-9]')  # keep only Chinese, letters, digits and dots
        x = x.replace(' ', ',')
        x = emoji.demojize(x)  # turn emoji into :name: codes so the regex can drop them
        x = re.sub(pat, ',', x)
        return x
```
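
A minimal usage sketch for this class (it assumes the LTP models under `./ltp_data_v3.4.0` and the Word2Vec file `coseg_text.model` are in place; the sample review is made up):

```python
# Minimal usage sketch; assumes the LTP models and coseg_text.model exist.
parser = Sentence_Parser()
sentence = parser.cleandata('手機螢幕很清晰,電池也耐用')
words, postags, child_dict_list, roles_dict, format_parse_list = parser.parser_main(sentence)
n_list, a_list = parser.select(words, postags)  # candidate nouns and adjectives
print(parser.simlarity(n_list, a_list))         # best noun-adjective pairings
```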

Step 2: Extract entities and related entity information


```python
# -*- coding:utf-8 -*-
import os
from pyltp import Segmentor, Postagger, Parser, NamedEntityRecognizer, SementicRoleLabeller
from gensim.models import Word2Vec
from cixing import Sentence_Parser  # the parser class from Step 1
import pandas as pd
import numpy as np
import heapq
import re
import emoji

class Extractor:
    def __init__(self):
        self.co_model = Word2Vec.load('coseg_text.model')
        self.parser = Sentence_Parser()

    def get_seginfo(self, comment_list):
        for c in range(len(comment_list)):
            # Truncate overly long comments to 200 characters
            sentence = comment_list[c][0:200]
            if sentence != '':
                sentence = self.parser.cleandata(sentence)
                words, postags, child_dict_list, roles_dict, format_parse_list = self.parser.parser_main(sentence)
                n_list, a_list = self.parser.select(words, postags)

                tags = []
                for j in range(len(a_list)):
                    p = words.index(a_list[j])
                    if child_dict_list[p]:
                        # The adjective is the predicate of a subject-verb (SBV) relation
                        if 'SBV' in child_dict_list[p]:
                            si_p = []
                            for po in child_dict_list[p]['SBV']:
                                try:
                                    si_p.append(self.co_model.similarity(words[po], '手機'))
                                except Exception:
                                    si_p.append(0)
                            # Index of the SBV child noun most related to '手機'
                            id = list(map(si_p.index, heapq.nlargest(1, si_p)))

                            s = child_dict_list[p]['SBV'][id[0]]
                            w1 = words[s] + a_list[j]
                            # Prefix the subject's attributive (ATT) noun, if any
                            if child_dict_list[s]:
                                if 'ATT' in child_dict_list[s]:
                                    if postags[child_dict_list[s]['ATT'][0]] == 'n':
                                        w2 = words[child_dict_list[s]['ATT'][0]] + w1
                                        tags.append(w2)
                                    else:
                                        tags.append(w1)
                            else:
                                tags.append(w1)

                        # The adjective modifies a noun through an attribute (ATT) relation
                        if 'ATT' in child_dict_list[p]:
                            s = child_dict_list[p]['ATT'][0]
                            if 'SBV' in child_dict_list[s]:
                                w3 = words[child_dict_list[s]['SBV'][0]]
                                w4 = w3 + a_list[j]
                                id1 = words.index(w3)
                                if child_dict_list[id1]:
                                    if 'ATT' in child_dict_list[id1]:
                                        if postags[child_dict_list[id1]['ATT'][0]] == 'n':
                                            w5 = words[child_dict_list[id1]['ATT'][0]] + w4
                                            tags.append(w5)
                                else:
                                    tags.append(w4)

                # 'with' closes the file automatically; one space-separated tag list per comment
                with open(r'F:\pycharm project data\taobao\phone\tags.txt', 'a') as t:
                    t.write(' '.join(tags) + '\n')
                print(tags)


                # Collect the related nouns and infer the user group
                n_list = list(set(n_list))
                if n_list:
                    with open(r'F:\pycharm project data\taobao\phone\noun.txt', 'a') as f:
                        f.write(' '.join(n_list) + '\n')
                si_p = []
                u_list = ['小孩子', '作業', '高中', '初中', '兒童', '學校', '小孩', '老師', '網癮', '中學生', '小學', '女兒', '小學生', '孩子', '閨女', '兒子', '學生', '網課', '小朋友',
                            '同事', '表弟', '親戚', '姐妹', '表哥', '鄰居', '同學', '朋友', '盆友', '連結',
                            '姥姥', '老太太', '老人', '岳母', '父親', '老孃', '小姨', '老丈人', '舅舅', '岳父', '親人', '老媽子', '老頭兒', '婆婆', '老太', '老頭子', '父母', '家婆', '老父親', '老爹', '長輩', '大人', '外爺', '爺爺', '我爸', '老頭', '老媽', '老爺子', '爸媽', '奶奶', '老伴', '老爸', '母親', '老人家', '媽媽', '公公', '爸爸', '丈母孃', '姥爺', '家裡人', '家人',
                            '老奶奶', '小夥子', '阿姨', '娘娘', '小姑子', '姐姐', '老妹', '嬸嬸', '大姐', '外孫', '小屁孩', '孫子', '姨媽', '棉襖', '伯母', '孝心',
                            '媳婦', '妹妹', '男朋友', '物件', '生日', '女朋友', '男票', '老婆', '弟弟', '情人節', '爹媽', '麻麻', '老公', '外甥', '老弟'
                ]
                for n in range(len(n_list)):
                    for u in range(len(u_list)):
                        try:
                            si_p.append(self.co_model.similarity(n_list[n], u_list[u]))
                        except Exception:
                            si_p.append(0)
                # Pick the noun most similar to any of the user-group words
                index_list = list(map(si_p.index, heapq.nlargest(1, si_p)))
                user_list = []
                for index in index_list:
                    index = int(index / len(u_list))  # map the flat index back to n_list
                    user_list.append(n_list[index])
                with open(r'F:\pycharm project data\taobao\phone\user.txt', 'a') as u:
                    u.write(' '.join(user_list) + '\n')
```
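
As an illustration of the SBV branch above: in a review like 「螢幕很清晰」, the adjective 清晰 has an SBV child pointing at the noun 螢幕, so the two are concatenated into the tag 螢幕清晰. A minimal, hypothetical run (the sample reviews are made up; the output paths hard-coded in `get_seginfo` are the author's and may need adjusting):

```python
# Hypothetical two-review run; writes tags.txt, noun.txt and user.txt
# to the paths hard-coded in get_seginfo.
extractor = Extractor()
extractor.get_seginfo(['手機螢幕很清晰,給媽媽買的她很喜歡',
                       '電池不耐用,發熱也嚴重'])
```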

Step 3: Test the data and the model

```python
# -*- coding:utf-8 -*-
import os
from pyltp import Segmentor, Postagger, Parser, NamedEntityRecognizer, SementicRoleLabeller
from gensim.models import Word2Vec
import pandas as pd
import numpy as np
import heapq
import re
import emoji
from extractor import Extractor

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 5000)
pd.set_option('max_colwidth', 30)
pd.set_option('display.width', 1000)
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)

# 1. Data preparation
# Load the data
df = pd.read_csv(r'F:\pycharm project data\taobao\phone\comment1.csv', encoding='utf-8-sig')
# Extract the review text and drop Taobao's default placeholder reviews
co_df = df[['content']]
co_df = co_df.loc[co_df['content'] != '15天內買家未作出評價', ['content']]
co_df = co_df.loc[co_df['content'] != '評價方未及時做出評價,系統預設好評!', ['content']]
comment_list = co_df['content'].tolist()




if __name__ == '__main__':
    myextractor = Extractor()
    myextractor.get_seginfo(comment_list)
```
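
After the extractor has processed the whole comment list, the per-comment tag lists in tags.txt can be aggregated into the most frequent review tags. A minimal sketch, assuming the file layout written in Step 2:

```python
# Minimal sketch, assuming tags.txt was written by Step 2
# (one space-separated list of tags per comment).
from collections import Counter

with open(r'F:\pycharm project data\taobao\phone\tags.txt') as f:
    tags = f.read().split()
print(Counter(tags).most_common(20))  # top-20 review tags
```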