Trie Word-Frequency Counting Example

阿新 • Published: 2019-01-04

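Introduction to the Trie

A trie, also called a prefix tree or dictionary tree, is a commonly used data structure. It is often applied to word-frequency counting,
fast string lookup, longest-prefix matching, and variants of these problems.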

Its structure is shown in the figure below:
[Figure: structure of a trie]

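The root of a trie is an empty node that stores no data. Each node holds an array of child pointers. The array usually has 26 slots, one per English letter; to distinguish upper and lower case it needs 52 slots, and to cover the digits 0-9 as well it needs 62.
You can picture it as a tree with a very large branching factor, so it can consume a lot of memory. Its height, however, never exceeds the length of the longest stored string, so a lookup takes time proportional to the length of the word, no matter how many words are stored. It is a typical trade of space for time.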
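The slot arithmetic above (26, 52, or 62 children) can be sketched as an index function. This is only an illustration: `kAlphabetSize` and `child_index` are hypothetical names that do not appear in the implementation later in this post, which uses the 26-slot variant.

```cpp
// Hypothetical constant: slots per node for a-z, A-Z and 0-9.
const int kAlphabetSize = 62;

// Hypothetical helper: maps a character to its child-array slot.
// Returns 0..25 for 'a'..'z', 26..51 for 'A'..'Z', 52..61 for '0'..'9',
// and -1 for any character outside the 62-symbol alphabet.
int child_index(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return 26 + (c - 'A');
    if (c >= '0' && c <= '9') return 52 + (c - '0');
    return -1;
}
```

With pointers of 8 bytes, a 62-slot node carries roughly 496 bytes of child pointers alone, which is where the memory cost mentioned above comes from.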

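Word Frequencies in the English Bible

The plain-text English Bible is about 4 MB. Word-frequency counting and similar operations on it can be done in many ways.
I can think of the following: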

Python's dict data structure
The sed & awk text-processing tools on Linux
C++ STL map
A trie

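The first three are simple and quick to implement, but writing your own trie is good data-structure practice and lets you see the efficiency a data structure can bring, so why not?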
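For comparison, the C++ STL map approach from the list above might look like the following minimal sketch. The function name `count_words` is hypothetical, and the sketch assumes words are separated by whitespace and already normalized.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical baseline: counts whitespace-separated words with std::map.
std::map<std::string, int> count_words(const std::string& text) {
    std::map<std::string, int> freq;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++freq[word];  // operator[] creates a zero entry on first sight
    }
    return freq;
}
```

Each update costs O(log n) string comparisons in the number of distinct words, whereas a trie update depends only on the length of the word.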

My implementation follows. If you spot any mistakes, please point them out!

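1) Custom header file

WordHash records each distinct word and its number of occurrences.
The TrieTree class is not well encapsulated: to save effort, many attributes such as the line count and the total word count are left in the public section.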

#ifndef WORD_COUNT_H
#define WORD_COUNT_H

#include <stdio.h>
#include <string.h>
#include <string>
#include <fstream>
#include <sstream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <iostream>

using std::string;
using std::vector;

typedef struct tag {
    char word[50];  // a single word
    int show_times; // number of occurrences
}WordHash;

const int child_num = 26;

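// Trie node
typedef struct Trie {
    int count;  // number of times the word ending at this node was inserted
    struct Trie *next_char[child_num];
    bool is_word;

    // Node constructor: zero the counter and clear all child pointers
    Trie(): count(0), is_word(false) {
        memset(next_char,0,sizeof(next_char));
    }
}TrieNode;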

class TrieTree {
 public:
    TrieTree();
    void insert(const char *word);
    bool search(const char *word);
    void deleteTrieTree(TrieNode *root);
    inline void setZero_wordindex(){ word_index = 0; }

    int word_index;
    WordHash *words_count_table; // word-frequency table
    int lines_count;
    int all_words_count; // total number of words
    int distinct_words_count;  // number of distinct words

 private:
    TrieNode *root; // root node of the trie
};

// Text word-frequency statistics class
class WordStatics {
 public:
    void open_file(string filename);
    void write_file();

    void set_open_filename(string input_path);
    string& get_open_filename();

    void getResult();
    void getTopX(int x);

 private:
    vector<string> words;  // all words read from the text
    TrieTree dictionary_tree; // the trie

    vector<WordHash> result_table; // resulting word-frequency table
    string open_filename; // path of the text to process
    string write_filename; // file for the word-frequency results
};



#endif

Member function definitions (.cpp file)
1) Trie constructor

#include <cctype>
#include <iostream>
#include "word_count.h"

using namespace std;


// Trie constructor
TrieTree::TrieTree() {
    root = new TrieNode();
    word_index = 0;
    lines_count = 0;
    all_words_count = 0;
    distinct_words_count = 0;
    // Word-frequency table: records each word and its count (fixed capacity of 30000 entries)
    words_count_table = new WordHash[30000];
}

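2) Read the words from the text and insert them one by one into the trie to build it.
(This only handles text made up entirely of lowercase letters; I preprocessed the Bible file beforehand.)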
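The preprocessing step mentioned above is not shown in this post. One way it might look is sketched below; `normalize_word` is a hypothetical helper, and dropping every non-letter character is an assumption about what "simple processing" means here, not the author's actual procedure.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keeps only letters and lowercases them, so that
// every remaining character maps into the 26-slot child array.
std::string normalize_word(const std::string& raw) {
    std::string out;
    for (std::string::size_type i = 0; i < raw.size(); ++i) {
        // Cast first: passing a negative char to isalpha is undefined.
        unsigned char c = static_cast<unsigned char>(raw[i]);
        if (std::isalpha(c)) {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}
```

Note that this design merges forms such as "God's" and "gods"; whether that is acceptable depends on how the counts will be used.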

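// Build the trie: insert one word (lowercase letters only, shorter than 50 characters)
void TrieTree::insert(const char *word) {
    TrieNode *location = root; // cursor that walks down the trie

    const char *pword = word;  // remember the start of the word for the table

    // Walk the word letter by letter, creating missing nodes
    while( *word ) {
        int index = *word - 'a';
        if ( index < 0 || index >= child_num ) {
            return; // not a lowercase letter: skip the word rather than index out of bounds
        }
        if ( location->next_char[index] == NULL ) {
            location->next_char[index] = new TrieNode();
        }

        location = location->next_char[index];
        word++;
    }
    location->count++;
    location->is_word = true; // reached the end of the word
    if ( location->count ==1 ) { // first occurrence: record the word in the table
        strcpy(this->words_count_table[word_index++].word,pword);
        distinct_words_count++;
    }
}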

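3) Look up a word in the trie to obtain its number of occurrences

// Look up a word in the trie
bool TrieTree::search(const char *word) {
    TrieNode *location = root;

    // Descend while letters remain and the path still exists
    while ( *word && location ) {
        int index = *word - 'a';
        if ( index < 0 || index >= child_num ) {
            return false; // not a lowercase letter, so the word cannot be in the trie
        }
        location = location->next_char[index];
        word++;
    }

    if ( location == NULL || !location->is_word ) {
        return false; // the word is not in the trie
    }
    // Found the word: record its frequency in the word-frequency table
    this->words_count_table[word_index++].show_times = location->count;
    return true;
}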
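To try insertion and lookup together without the rest of the program, a self-contained sketch follows. `MiniNode` and `MiniTrie` are hypothetical names, and the sketch uses `std::unique_ptr` (C++11) so that nodes are released automatically, which differs from the manual deletion used in this post.

```cpp
#include <memory>
#include <string>

// Hypothetical node: a counter plus one owned child per lowercase letter.
struct MiniNode {
    int count = 0;                        // times a word ending here was inserted
    std::unique_ptr<MiniNode> next[26];   // all null by default
};

// Hypothetical mini trie over 'a'..'z'.
class MiniTrie {
 public:
    // Inserts `word`; returns false and changes nothing if the word
    // contains a character outside 'a'..'z'.
    bool insert(const std::string& word) {
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
        }
        MiniNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<MiniNode>& child = node->next[c - 'a'];
            if (!child) child.reset(new MiniNode());
            node = child.get();
        }
        ++node->count;
        return true;
    }

    // Returns how many times `word` was inserted (0 if it is absent
    // or is only a prefix of stored words).
    int count(const std::string& word) const {
        const MiniNode* node = &root_;
        for (char c : word) {
            if (c < 'a' || c > 'z') return 0;
            node = node->next[c - 'a'].get();
            if (node == nullptr) return 0;
        }
        return node->count;
    }

 private:
    MiniNode root_;
};
```

After inserting "the" twice and "then" once, `count("the")` is 2, `count("then")` is 1, and `count("th")` is 0 because "th" is only a prefix.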

4) Delete the trie

// Delete the trie by recursively deleting every node
void TrieTree::deleteTrieTree(TrieNode *root) {
    if ( root == NULL ) { // nothing to delete
        return;
    }
    for( int i=0;i<child_num;i++ ) {
        if ( root->next_char[i] != NULL ) {
            deleteTrieTree(root->next_char[i]);
        }
    }
    delete root;
}
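A quick way to convince yourself that post-order deletion frees every node is to count live nodes. The sketch below is an illustration under assumptions: `CountedNode`, `delete_subtree`, and `add_path` are hypothetical names, and `add_path` expects lowercase letters only.

```cpp
#include <cstddef>

// Hypothetical node type that tracks how many instances are alive.
struct CountedNode {
    static int live;
    CountedNode* next[26];
    CountedNode() {
        for (int i = 0; i < 26; ++i) next[i] = NULL;
        ++live;
    }
    ~CountedNode() { --live; }
};
int CountedNode::live = 0;

// Post-order deletion: free all children first, then the node itself.
void delete_subtree(CountedNode* node) {
    if (node == NULL) return;
    for (int i = 0; i < 26; ++i) {
        delete_subtree(node->next[i]);
    }
    delete node;
}

// Creates the chain of nodes spelling `word` below `root`
// (assumes `word` contains only 'a'..'z').
void add_path(CountedNode* root, const char* word) {
    for (; *word; ++word) {
        int index = *word - 'a';
        if (root->next[index] == NULL) {
            root->next[index] = new CountedNode();
        }
        root = root->next[index];
    }
}
```

Adding "cat" and "car" under a fresh root creates five nodes (root, c, a, t, r) because the prefix "ca" is shared; after `delete_subtree(root)` the live counter returns to zero.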

5) Member functions of the WordStatics class

void WordStatics::set_open_filename(string input_path) {
    this->open_filename = input_path;
}

string& WordStatics::get_open_filename() {
    return this->open_filename;
}

void WordStatics::open_file(string filename) {
    set_open_filename(filename);
    cout<<"Counting word frequencies, please wait..."<<endl;

    ifstream fin(get_open_filename().c_str());
    if ( !fin ) { // the file could not be opened
        cerr<<"Cannot open file: "<<get_open_filename()<<endl;
        return;
    }

    string line,word;
    while ( getline(fin,line) ) { // read the file line by line into the vector
        dictionary_tree.lines_count++;

        istringstream is(line);
        while ( is >> word ) {
            dictionary_tree.all_words_count++;
            words.push_back(word);
        }
    }

    // Build the trie from the words that start with a letter
    vector<string>::iterator it;
    for ( it=words.begin();it != words.end();it++ ) {
        if ( isalpha(static_cast<unsigned char>((*it)[0])) ) {
           dictionary_tree.insert( it->c_str() );
        }
    }

}

void WordStatics::getResult() {
    cout<<"Total lines: "<<dictionary_tree.lines_count<<endl;
    cout<<"Total words: "<<dictionary_tree.all_words_count-1<<endl;
    cout<<"Distinct words: "<<dictionary_tree.distinct_words_count<<endl;

    // Look up each distinct word in the trie to get its number of occurrences
    dictionary_tree.setZero_wordindex();
    for(int i=0;i<dictionary_tree.distinct_words_count;i++) {
        dictionary_tree.search(dictionary_tree.words_count_table[i].word);
        result_table.push_back(dictionary_tree.words_count_table[i]);
    }
}
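The line-and-word reading loop in open_file can be tried in isolation on any input stream. The sketch below is an assumption-laden illustration: `TextStats` and `count_lines_and_words` are hypothetical names. It tests the result of `std::getline` directly, which avoids counting a phantom extra line at end of input.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type: line and word totals for one input.
struct TextStats {
    int lines;
    int words;
};

// Reads `in` line by line and splits each line on whitespace.
TextStats count_lines_and_words(std::istream& in) {
    TextStats stats = {0, 0};
    std::string line, word;
    while (std::getline(in, line)) {   // false once no more lines can be read
        ++stats.lines;
        std::istringstream tokens(line);
        while (tokens >> word) {
            ++stats.words;
        }
    }
    return stats;
}
```

Passing a `std::istringstream` makes the loop easy to check without a file on disk; a `std::ifstream` works the same way.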

6) Sort the results and print the N most frequent words, where N is supplied by the user

bool compare(const WordHash& lhs,const WordHash& rhs) {
    return lhs.show_times > rhs.show_times ;
}

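void WordStatics::getTopX(int x) {
    sort(result_table.begin(),result_table.end(),compare);
    // Never print more entries than the table holds
    if ( x > static_cast<int>(result_table.size()) ) {
        x = static_cast<int>(result_table.size());
    }
    cout<<"The "<<x<<" most frequent words in the text:"<<endl;
    for( int i = 0; i<x; i++) {
        cout<<result_table[i].word<<": "<<result_table[i].show_times<<endl;
    }
}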
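When only the top few entries are needed, sorting the whole table is more work than necessary. As an alternative to the full `sort` above, a sketch using `std::partial_sort` is shown below; `WordCount`, `by_count_desc`, and `top_n` are hypothetical names, and the alphabetical tie-break is an added assumption to make the order deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (word, count) pair.
typedef std::pair<std::string, int> WordCount;

// Higher counts first; equal counts fall back to alphabetical order.
bool by_count_desc(const WordCount& a, const WordCount& b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Returns the n most frequent entries (fewer if the table is smaller).
// The table is taken by value so the caller's copy stays untouched.
std::vector<WordCount> top_n(std::vector<WordCount> table, std::size_t n) {
    n = std::min(n, table.size());
    std::partial_sort(table.begin(), table.begin() + n, table.end(),
                      by_count_desc);
    table.resize(n);
    return table;
}
```

`std::partial_sort` only orders the first n positions, which is roughly O(m log n) for a table of m entries instead of O(m log m).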

Output:

[Figure: screenshot of the program output]

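This is for reference only and records my own learning process.
Many parts are still not well designed and need improvement; I will keep working on my programming skills!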
