Finding the Top N Most Frequent Numbers in a Large File, with Working Code

阿新 • Published: 2019-01-01

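Here is the problem: a very large file (say 20GB), too big to fit in memory, holds a great many numbers (they could just as well be URLs or similar). Find the 3 numbers that occur most often.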

The solution rests on three points:

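1. A Top N problem is naturally solved with a min-heap. For just the Top 3, though, building a heap is more trouble than it is worth; a few lines of comparison code are enough.
2. The file is too large for memory, so it cannot be read in whole and then processed. Every programming language has a way to read a big file line by line or block by block: C++ has getline(), and Python, Perl and others have similar facilities.
3. Build a hash table whose key is the number and whose value is the number of times it occurs, then traverse the whole table to find the Top 3 numbers. Note, however, that the hash table itself may exceed the memory limit.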
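For a general N, point 1 can be sketched with std::priority_queue configured as a min-heap. This is only an illustration, not the code used later in this article; the names top_n_by_count and CountedNumber are made up for the example.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// (count, number): the default pair ordering then compares counts first.
using CountedNumber = std::pair<int, int>;

// Returns up to n (count, number) pairs with the highest counts, highest first.
std::vector<CountedNumber> top_n_by_count(const std::unordered_map<int, int>& counts,
                                          std::size_t n)
{
    // std::greater makes priority_queue a min-heap: top() is the weakest entry kept.
    std::priority_queue<CountedNumber, std::vector<CountedNumber>,
                        std::greater<CountedNumber>> heap;
    for (const auto& kv : counts) {
        heap.push({kv.second, kv.first});
        if (heap.size() > n) {
            heap.pop();  // drop the entry with the smallest count
        }
    }
    // The heap yields entries in ascending order, so fill the result back to front.
    std::vector<CountedNumber> result(heap.size());
    for (std::size_t i = result.size(); i > 0; --i) {
        result[i - 1] = heap.top();
        heap.pop();
    }
    return result;
}
```

The heap never holds more than n + 1 entries, so the extra memory is tiny compared with the hash table, and each of the table's entries costs O(log n) to process.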

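Suppose the numbers range from 0 to 2^31-1. There are then a little over 2.1 billion possible values (that is, 2G), and each number takes 4 bytes. If every value appears, each exactly once, then counting the storage for the value as well, the memory needed is 2G * 4B * 2 = 16GB, and that is before any bookkeeping overhead of the hash table itself. So the hash table can indeed exceed the memory available on an ordinary computer. Of course, for the same amount of input, a high repetition rate makes the hash table much smaller. But who knows what the repetition rate is? If the large file is produced with a library random number function, the repetition rate is fairly low.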

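This is where divide and conquer comes in. If the large file is split into several small files such that a number appearing in one small file never appears in any other, then it is enough to find the top 3 of each small file and compare those to get the top 3 of the whole large file. How do we produce such small files? Take each number modulo the number of files: equal numbers always leave the same remainder, so they always land in the same file.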
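The claim that per-partition winners are enough can be checked on a small in-memory example. The sketch below is a toy stand-in for the file-based version and assumes non-negative values, as rand() produces; the names top_k, top_k_partitioned and keep_top_k are invented for the illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using NumberCount = std::pair<int, int>;  // (number, occurrences)

// Higher count first; ties go to the smaller number so results are deterministic.
static bool more_frequent(const NumberCount& a, const NumberCount& b)
{
    return a.second != b.second ? a.second > b.second : a.first < b.first;
}

// Keeps only the k best entries, best first.
static void keep_top_k(std::vector<NumberCount>& entries, std::size_t k)
{
    k = std::min(k, entries.size());
    std::partial_sort(entries.begin(),
                      entries.begin() + static_cast<std::ptrdiff_t>(k),
                      entries.end(), more_frequent);
    entries.resize(k);
}

// Direct approach: one hash table over all values.
std::vector<NumberCount> top_k(const std::vector<int>& values, std::size_t k)
{
    std::unordered_map<int, int> counts;
    for (int v : values) {
        ++counts[v];
    }
    std::vector<NumberCount> all(counts.begin(), counts.end());
    keep_top_k(all, k);
    return all;
}

// Divide and conquer: split by remainder, rank each bucket, then rank the winners.
std::vector<NumberCount> top_k_partitioned(const std::vector<int>& values,
                                           std::size_t k, int bucket_count)
{
    // Equal numbers share a remainder, so each number lives in exactly one bucket
    // and its count inside that bucket is already its global count.
    std::vector<std::vector<int>> buckets(bucket_count);
    for (int v : values) {
        buckets[v % bucket_count].push_back(v);  // assumes v >= 0
    }
    std::vector<NumberCount> candidates;
    for (const auto& bucket : buckets) {
        for (const auto& winner : top_k(bucket, k)) {
            candidates.push_back(winner);
        }
    }
    keep_top_k(candidates, k);
    return candidates;
}
```

Both functions return the same list for any input, because a number that is not among the top k of its own bucket cannot be among the top k overall.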

The programs below build the large file and then solve the problem.

1. We need to create a very large file with 100 million rows of 20 numbers each, 2 billion numbers in total, which is nearly 2G numbers.

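A side question in arithmetic, unrelated to the main problem: how big will this file be?

A number stored in the file as decimal text does not necessarily take 4 bytes, so how many bytes does each number occupy on disk on average?

We generate the random numbers with C's rand(), so they range from 0 to RAND_MAX. RAND_MAX is defined in cstdlib and its value is implementation-defined; here it is 2^31-1 = 2147483647 (the glibc value). Weighting each digit count by how many values have that many digits gives the average number of digits:

(1*10 + 2*90 + 3*900 + 4*9000 + 5*9e4 + 
 6*9e5 + 7*9e6 + 8*9e7 + 9*9e8 + 
 10*(2147483648-1e9))/2147483648 
= 9.48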
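The weighted average can be checked by summing the digit counts range by range. The function below is a small sketch written for this check (average_decimal_digits is a made-up name) and it assumes values drawn uniformly from 0 to max_value. The 10-digit range runs from 10^9 to 2^31-1, which is 1,147,483,648 values, and for max_value = 2147483647 the function evaluates to about 9.48 digits. Each number in the file is also followed by one space.

```cpp
#include <cstdint>

// Average number of decimal digits of an integer drawn uniformly from [0, max_value].
// Intended for limits such as RAND_MAX; max_value must stay well below 2^63
// so that the range bounds do not overflow.
double average_decimal_digits(std::uint64_t max_value)
{
    std::uint64_t total_digits = 0;
    std::uint64_t low = 0;   // smallest value with the current digit count
    std::uint64_t high = 9;  // largest value with the current digit count
    for (int digits = 1; low <= max_value; ++digits) {
        // Clip the last range at max_value.
        std::uint64_t last = high < max_value ? high : max_value;
        total_digits += digits * (last - low + 1);
        low = high + 1;
        high = high * 10 + 9;
    }
    return static_cast<double>(total_digits) / static_cast<double>(max_value + 1);
}
```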

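In other words, counting the space that follows it, each number occupies roughly 10 bytes in the file on disk, so nearly 2G numbers take up about 20GB of disk space. This estimate agrees with what was actually observed. Here is the C++ program that generates the large file: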

// gen_rand.cpp
// g++ gen_rand.cpp -std=c++11 -o gen

#include <iostream>
#include <string>
#include <sstream>
#include <cstdlib> 
#include <ctime>
#include <vector>
#include <fstream>
using namespace std;

// For convenience, make all configurable variables global
const int total_rows = 1e8;  // 100M rows, as described in the text
const int max_rows_in_buf = 1e4; // 1e4 rows are about 2M bytes
const int max_cols_in_buf = 20;  // 20 * 1e8 = 2e9 numbers
const string file = "./tmp.txt";


void gen_rand_int(ofstream & fout)
{
    srand((unsigned)time(NULL));
    int row_count = 0;
    string line; 
    stringstream ss(line);
    
    while (row_count < total_rows) {
        for (int rows=0; rows < max_rows_in_buf; ++rows) {
            for(int cols = 0; cols < max_cols_in_buf; ++cols) {
                ss << rand() << " ";
            }
            ss << "\n";
        }
        // using stringstream as buffer is 14% faster than using fout directly
        fout << ss.str();  
        ss.str("");  // clear string stream 
        
        row_count += max_rows_in_buf;
        // cout << row_count << " rows has been generated" << endl;
    }
}


int main() 
{  
    ofstream fout(file.c_str(), ios::out);
    if (fout.good()) {
        gen_rand_int(fout);
    }
    fout.close();
    
    return 0; 
}

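To summarize:
a. I also wrote a Python version that does the same thing, purely to compare speed. The C++ version is about 70 times faster than the Python one.
b. The code above uses a stringstream as a buffer of 10000 rows, so it writes to the file once per 200,000 numbers instead of once per generated row. With the buffer, the program runs about 14% faster.
c. The total running time is about 166 seconds.
d. To speed things up through parallelism, several threads can each write their own file, and the files can then be merged with cat file1 file2 file3 > big_file.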

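2. Suppose we want to split the large file into 40 small files. What has to be guaranteed is that the hash table built from any one small file does not exceed the memory limit. There are 2G numbers in total; if they are spread fairly evenly, each of the 40 small files holds 50M numbers, and the memory needed is about 50M*4B*2 = 400MB, counting only the raw keys and values. Even if the distribution is uneven and that figure doubles, it is still only 800MB, which is bearable. Choosing 40 small files is admittedly somewhat arbitrary and probably deserves more careful thought. But since the numbers in part 1 come from the computer's pseudo-random generator, they are distributed fairly evenly, so there is no problem here.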
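The choice of 40 can be reproduced from a memory budget with one ceiling division. This is a rough sizing sketch, not part of the original programs: partitions_needed and its parameters are hypothetical, and the 8 bytes per entry is this article's estimate for a key plus a value. A real unordered_map entry costs more than that, so the per-entry figure is best measured on the target platform.

```cpp
#include <cstdint>

// Smallest number of partitions such that one partition's table fits in the budget,
// assuming the entries spread evenly and each one costs bytes_per_entry.
std::uint64_t partitions_needed(std::uint64_t entries,
                                std::uint64_t bytes_per_entry,
                                std::uint64_t budget_bytes)
{
    std::uint64_t total_bytes = entries * bytes_per_entry;
    return (total_bytes + budget_bytes - 1) / budget_bytes;  // ceiling division
}
```

With 2,000,000,000 entries at 8 bytes each and a budget of 400,000,000 bytes per table, the function returns 40, matching the figures above.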

The code that splits the large file into 40 small files is therefore:

// read_rand.cpp
// g++ read_rand.cpp -std=c++11 -o rr

#include <cstdlib>   // exit()
#include <vector>
#include <utility>
#include <fstream>
#include <sstream>
#include <string>
#include <iostream>
using namespace std;


// For convenience, make all configurable variables global
const int file_num = 40;  // must match the number of files read in the next step
const string infile_name = "./tmp.txt";
const string out_file_prefix = "out_file_";
// The buffer is critical to performance
const int max_num_in_bufline = 1000;


int main()
{
    // Prepare OUT files 
    fstream outfiles[file_num];
    vector<string> vec_outfile_names;
    vec_outfile_names.reserve(file_num);
    
    for (int i=0; i<file_num; ++i) {
        string tmp_str;
        stringstream tmp_ss(tmp_str);
        tmp_ss << out_file_prefix << i << ".txt";
        string file_name = tmp_ss.str();
        vec_outfile_names.push_back(file_name);
        
        // Truncate on open so that a re-run does not append to old output
        outfiles[i].open(file_name.c_str(), ios::out | ios::trunc);
        if (!outfiles[i].good()) {
            cout << "Cannot open " << i << " file\n";
            exit(-1);
        }
    }
    
    stringstream ss_array[file_num];
    // how many numbers are currently buffered for each output file (flushed at max_num_in_bufline); initialized to 0
    vector<int> buf_line_count(file_num);  
    
    
    // Open IN file 
    fstream infile(infile_name, ios::in);
    if (!infile.good()) {
        cout << "Failed to open " << infile_name << endl;
        exit(-1);
    }
    
    // Reading and then Writing
    string line;
    // Loop on getline() itself so that a failed read ends the loop
    while (getline(infile, line)) {
        stringstream ss(line);
        
        int tmp_int; 
        while (ss >> tmp_int) {
            int id = tmp_int % file_num;
            buf_line_count[id] ++;
            ss_array[id] << tmp_int << " ";
            
            if (buf_line_count[id] == max_num_in_bufline) {
                outfiles[id] << ss_array[id].str() << endl;
                // clear
                buf_line_count[id] = 0;
                ss_array[id].str("");
            }
        }
    }
    infile.close();
    
    // Write the last line (the number of figures in this line could be less than max_num_in_bufline)
    for (int i=0; i<file_num; ++i) {
        if (buf_line_count[i] > 0) {
            outfiles[i] << ss_array[i].str() << endl;            
        }
    }
    
    for (int i=0; i<file_num; ++i) {
        outfiles[i].close();
    }

    return 0;
}

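This step takes about 520 seconds. Because it reads one file and writes the others in a single pass, it is not easy to parallelize, though not impossible. That is left out here.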

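3. The last step: read the numbers in each small file, build a separate hash table for each one, and find the 3 most frequent numbers in each small file; finally, find the 3 most frequent numbers of the whole large file. The code is as follows: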

// get_numbers.cpp
// g++ get_numbers.cpp -std=c++11 -o gn

#include <cstdlib>   // exit()
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>   // pair
#include <vector>
using namespace std;


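// Returns the 3 (number, count) pairs with the largest counts, in no particular order.
// T is any container of (number, count) pairs.
template <typename T>
vector<pair<int, int> > get_max_3_numbers(const T& pairs)
{
    vector<pair<int, int> > array; 
    for (int i=0; i<3; ++i) {
        array.push_back({0,0});
    }
    
    // For each item, find the slot holding the smallest count and replace it if the item beats it
    for (const auto& item : pairs) {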
        int index = -1;
        if (array[0].second <= array[1].second) {
            if (array[0].second <= array[2].second) {
                index = 0;
            } else {
                index = 2;
            }
        } else {
            if (array[1].second <= array[2].second) {
                index = 1;
            } else {
                index = 2;
            }
        }
        if (array[index].second < item.second) {
            array[index] = item;
        }
    }
    
    return array;
}

vector<pair<int, int> > get_numbers_from_one_file(const string& infile_name) {
    unordered_map<int, int> db; 
    fstream infile(infile_name.c_str(), ios::in);
    if (!infile.good()) {
        cout << "Failed to read " << infile_name << endl; 
        exit(-1);
    }
    
    string line;
    // Loop on getline() itself so that a failed read ends the loop
    while (getline(infile, line)) {
        stringstream ssline(line);
        int tmp_int;
        while (ssline >> tmp_int) {
            ++db[tmp_int];  // operator[] value-initializes a missing count to 0
        }
    }
    infile.close();
    
    return get_max_3_numbers<unordered_map<int, int> >(db);
}

int main()
{
    const int file_number = 40;
    vector<string> infilenames;
    for (int i=0; i<file_number; ++i) {
        stringstream ss;
        ss << "./out_file_" << i << ".txt";
        infilenames.push_back(ss.str());
    }
    
    vector<pair<int, int> > multi_file_result;
    
    for (auto infile_name : infilenames) {
        vector<pair<int, int> > result = get_numbers_from_one_file(infile_name);
        for (auto item : result) {
            multi_file_result.push_back(item);
        }
        cout << infile_name << " is done..." << endl;
    }
    
    vector<pair<int, int> > final_result = get_max_3_numbers<vector<pair<int, int> >>(multi_file_result);
    for (auto item : final_result) {
        cout << item.first << ": " << item.second << endl;
    }
    return 0;
}

The final result:

1201971198: 12
542861170: 11
49242340: 11

real	61m1.689s
user	60m11.015s
sys	0m48.395s

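To summarize:
a. A template is used, and it is genuinely needed: the function is called once with a vector<pair<int, int> > and once with an unordered_map<int, int>.
b. This step takes a long time, about 1 hour, so it should be done with parallel processing. That will be implemented later.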

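Implementing in real code an interview question I had only ever sketched on paper probably counts as living up to "Show me the code". It is not finished, though: besides thinking about optimization, the parallel version at least still needs to be implemented. That is left for the near future.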

(To be continued)
