comse project notes

阿新 • Published: 2017-09-06


1. 2017/1/9 Should head be a member or a pointer?

Code

typedef struct index_node
{
uint32_t node_pos;//for up level loop
struct index_node *prev_p;//for loop
struct index_node *next_p;//for loop
uint32_t sum_data_num;//to do,can dynamic,if data_num is 0,the index_node equal NULL
uint32_t use_data_num;//for loop
uint32_t *data_p;
}INDEX_NODE;

Before the change

typedef struct index_hash_value
{
uint32_t my_pos;//for lock
uint32_t del_data_num;//for check if shrink
uint32_t use_data_num;//for check if shrink
uint32_t sum_node_num;//for loop,just ++
struct index_node head;//head count 1
}INDEX_HASH_VALUE;

After the change (head is held through a pointer instead of being an embedded member)

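Cause of the case:

When only the head and last node exist, the prev_p and next_p pointers of last turn out not to match head, which triggers a segmentation fault.

Analysis:

1. In the shrink_index function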

1.2.generate new hash_values

index_hash_value new_one_hash_value = {0};//init 0
index_hash_value * p_hash_value = &new_one_hash_value;

1.3.erase old and insert new

{index_map.insert ( std::pair<std::string,index_hash_value>(index_hash_key,new_one_hash_value) );}

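If head is a member variable, index_map.insert copies the value, so the value and the head inside it end up at a different memory address. The last_node of the newly allocated new_one_hash_value still points at the head inside the local variable new_one_hash_value, so every later traversal goes wrong.

Summary:

A pointer implicitly pointed at a local variable, and using it caused a segmentation fault.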
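
The effect can be reproduced in a few lines. This is a minimal sketch and not the comse code: Node and ValueWithMemberHead are hypothetical, simplified stand-ins for index_node and index_hash_value, and the map key is arbitrary. It only shows that std::map::insert stores a copy of the value, so a node linked to the local head keeps pointing at the local object.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for index_node / index_hash_value.
struct Node {
    Node *prev_p;
    Node *next_p;
};

struct ValueWithMemberHead {
    Node head;  // embedded by value: copied whenever the struct is copied
};

// Returns true when, after the insert, the node that was linked to the local
// head still points at the local object and not at the copy in the map.
bool last_points_at_local_head() {
    std::map<std::string, ValueWithMemberHead> index_map;
    Node last = {nullptr, nullptr};
    bool stale = false;
    {
        ValueWithMemberHead local = {{nullptr, nullptr}};
        local.head.next_p = &last;
        local.head.prev_p = &last;
        last.prev_p = &local.head;
        last.next_p = &local.head;

        // insert() copies `local`; the stored head lives at another address.
        index_map.insert(std::make_pair(std::string("key"), local));
        const Node *stored_head = &index_map.find("key")->second.head;
        assert(stored_head != nullptr);

        stale = (last.prev_p == &local.head) && (last.prev_p != stored_head);
    }
    // From here on last.prev_p dangles: `local` is gone, the map copy is not linked.
    return stale;
}
```

Calling last_points_at_local_head() returns true, which is the situation that produced the segmentation fault once the local went out of scope. Holding head through a pointer to a heap node avoids it because copying the value then copies only the pointer.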

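2. 2017/1/10 Code review

2.1 Should _hash_now_num be decremented (--) in shrink and clear?

No. insert increments it (++), and clear and shrink do not start from the last element, so decrementing would produce duplicate my_pos values.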
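
The collision can be illustrated with a toy allocator. This is a sketch under an assumption the notes do not state explicitly: that insert hands out the current counter value as the position and then increments it. PosAllocator and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical slot allocator: every insert takes the current counter value
// as its position and then increments the counter.
struct PosAllocator {
    uint32_t hash_now_num;
    std::set<uint32_t> live_pos;   // positions currently in use

    PosAllocator() : hash_now_num(0) {}

    // Returns false if the position handed out is already in use.
    bool insert() {
        uint32_t pos = hash_now_num++;
        return live_pos.insert(pos).second;
    }

    // Remove an arbitrary entry; optionally decrement the counter as well.
    void erase(uint32_t pos, bool decrement_counter) {
        assert(live_pos.count(pos) == 1);
        live_pos.erase(pos);
        if (decrement_counter) { --hash_now_num; }
    }
};

// Insert three entries (positions 0, 1, 2), erase the first, insert again.
bool reinsertion_is_unique(bool decrement_counter) {
    PosAllocator alloc;
    alloc.insert();
    alloc.insert();
    alloc.insert();
    alloc.erase(0, decrement_counter);
    return alloc.insert();
}
```

With the decrement the counter drops back to 2, which is still in use, so reinsertion_is_unique(true) returns false. Without the decrement the next position is 3 and the call returns true.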

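2.2 Possible memory leak in shrink

if (!hit_hash_key) {return false;}
Could this early return leave the memory created in step 2 (generate new hash_values) unreleased?
No. AutoLock_Mutex auto_lock0(&index_update_lock); guarantees that all update operations on index_map
are serialized, and the earlier check if (query_out.size() <= 0 )
{return false;}
has already passed, so hit_hash_key should be true.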

2.3 Code review

shrink_index
clear_all_index
clear_index
delete_index
insert_index

all_query_index

cross_query_index

3. Memory leak check and multithreaded workload check

3.1 Memory leak check code

int main()
{
    while(1)
    {
        Index_Core idx_core(128,3);

        for (int i = 1 ; i < 100;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                for (int k = 0 ;k < 1000;k++)
                {idx_core.insert_index(str,k);}
                printf("insert %s\n",str.c_str());
            }

        printf("==================\n");

        for (int i = 1 ; i < 60;i++)
            for (char j = 'a';j<='m';j++)
            {
                std::string str(i,j);
                for (int k = i ;k < 300;k++)
                {idx_core.delete_index(str,k);}
                printf("delete %s\n",str.c_str());
            }

        for (int i = 1 ; i < 80;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                idx_core.shrink_index(str);
                printf("shrink %s\n",str.c_str());
            }

        for (int i = 1 ; i < 90;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                idx_core.clear_index(str);
                printf("clear %s\n",str.c_str());
            }

        fflush(stdout);
        sleep(1);

    }

}

Conclusion: after running for 12 hours, memory usage (RSS) stayed constant.

3.2 Multithreaded workload check

#define THREAD_NUM 1

Index_Core idx_core(8,64);

void *myfunc(void *arg)
{
    while(1)
    {
#if 1
        for (int i = 1 ; i < 15;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::vector<uint32_t> query_out;
                std::string str(i,j);

                idx_core.all_query_index(str,query_out);
                if (query_out.size() > 0)
                {
                    printf("%s[%zu]:",str.c_str(),query_out.size());
                    for (size_t k = 0 ; k < query_out.size(); k++)
                    {printf("%u ",query_out[k]);}
                    printf(":%s\n",str.c_str());    
                }
            }
#endif
        fflush(stdout);
        sleep(1);

    }
}


int main(int argc,char *argv[])
{
    pthread_t tid[THREAD_NUM];
    int id[THREAD_NUM] = {0};

    for (int i = 0; i < THREAD_NUM; i++)
    {
        id[i] = i;
        if (pthread_create(&tid[i],NULL,&myfunc,(void*)&id[i]) != 0)
        {
            fprintf(stderr,"thread create failed\n");
            return -1;
        }
    }

    while(1)
    {
        for (int i = 1 ; i < 10;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                for (int k = 0 ;k < 30;k++)
                {idx_core.insert_index(str,k);}
            }

#if 1
        for (int i = 1 ; i < 10;i++)
            for (char j = 'a';j<='m';j++)
            {
                std::string str(i,j);
                for (int k = (i+2);k < 10;k++)
                {idx_core.delete_index(str,k);}
            }
#endif
#if 1
        for (int i = 1 ; i < 10;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                idx_core.shrink_index(str);
            }
#endif
#if 1
        for (int i = 3 ; i < 10;i++)
            for (char j = 'a';j<='z';j++)
            {
                std::string str(i,j);
                idx_core.clear_index(str);
            }
#endif
//        sleep(3);
    }

    for (int i = 0 ;i < THREAD_NUM; i++)
        pthread_join(tid[i],NULL);
}

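Conclusion: after running for at least 12 hours, every printed vector was in ascending order and memory stayed constant.

4. Unit tests

5. To do

1. Use type or view to separate the retrieval data

2. Requests carry the merge count; responses carry the word segmentation result.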

2017/3/3

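Found a possible jsoncpp bug: memory gets corrupted

jsoncpp also uses too much memory

Generating a 187-byte record 1,000,000 times in a loop used 1 GB of memory.

Each record costs about 1073 bytes.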

2017/3/7

aws ec2

rapidjson:

Empty object
$1 = 96

{\"name\":\"json\",\"array\":[{\"cpp\":\"jsoncpp\"},{\"java\":\"jsoninjava\"},{\"php\":\"support\"}]}, 100 bytes, 100,000 copies, 440 MB of memory used.

jsoncpp:

Empty object

$1 = 40

{\"name\":\"json\",\"array\":[{\"cpp\":\"jsoncpp\"},{\"java\":\"jsoninjava\"},{\"php\":\"support\"}]}, 100 bytes, 500,000 copies, 700 MB of memory used.

cjson:

$1 = 64

{\"name\":\"json\",\"array\":[{\"cpp\":\"jsoncpp\"},{\"java\":\"jsoninjava\"},{\"php\":\"support\"}]}, 100 bytes, 500,000 copies, 500 MB of memory used.
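
To compare the three libraries it helps to turn the totals into a per-document cost. The helper below is only arithmetic on the figures in these notes. It assumes that "M" and "G" mean MiB and GiB, which the notes do not say, and bytes_per_doc is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// Average bytes of process memory per generated document.
// Assumption: the totals are in MiB (1 MiB = 1048576 bytes).
double bytes_per_doc(uint64_t total_mib, uint64_t doc_count) {
    assert(doc_count > 0);
    return static_cast<double>(total_mib) * 1048576.0 /
           static_cast<double>(doc_count);
}
```

Under that assumption bytes_per_doc(1024, 1000000) is about 1073.7, which matches the 2017/3/3 estimate. For the 100-byte document the results are about 4613.7 bytes for rapidjson (440, 100000), about 1468.0 for jsoncpp (700, 500000) and about 1048.6 for cjson (500, 500000). Note that the rapidjson run used one fifth of the document count, so its total is not directly comparable to the other two.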

2017/3/14

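Solved the 2017/3/3 bug. It was not caused by jsoncpp but by an earlier call to std::sort.

This also explains why jisuan_score, which runs before the sort call, had no problem, while fetching breif after the sort dumped core.

When < or > is overloaded for std::sort, the comparison must return false in the == case, otherwise it can dump core. See

http://blog.sina.com.cn/s/blog_79d599dc01012m7l.html

This corrupted other parts of the program and triggered the core dump inside jsoncpp.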
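
The rule can be shown with a small example. This is a sketch with a hypothetical Doc record, not the comparator from comse: std::sort requires a strict weak ordering, so comp(x, x) must be false. A comparator built on >= or <= breaks that requirement, and the resulting undefined behaviour can let the sort read or write outside the range.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result record; only the score matters here.
struct Doc {
    int id;
    int score;
};

// Valid: strict ordering, returns false when the scores are equal.
bool by_score_desc(const Doc &a, const Doc &b) {
    return a.score > b.score;
}

// Invalid for std::sort: returns true for equal scores, so comp(x, x) is
// true and the strict-weak-ordering requirement is broken.
// Shown for contrast only; never pass it to std::sort.
bool by_score_desc_broken(const Doc &a, const Doc &b) {
    return a.score >= b.score;
}

// Sort a copy of the documents by descending score.
std::vector<Doc> sort_docs(std::vector<Doc> docs) {
    std::sort(docs.begin(), docs.end(), by_score_desc);
    assert(std::is_sorted(docs.begin(), docs.end(), by_score_desc));
    return docs;
}
```

A quick check for any comparator is to call it with the same element on both sides: by_score_desc(x, x) is false, while by_score_desc_broken(x, x) is true.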

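Analysis:

1. When a core dump happens, look not only at the place where it occurred but also at the surrounding code and the related "code environment"

2. A problem is not always caused directly; another problem may have damaged the "code environment", for example the shared heap or the stack

Summary:

1. Third-party libraries should go through integration and stress testing; at the very least, note them as a potential risk point

2. When borrowing outside code, avoid changing the important parts. If you do not know which parts are important, change nothing beyond what the feature requires.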

2017/3/15

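Too many close_wait connections

The original Reactor model was:

(Figure: original Reactor model; image not preserved)

Analysis:

netstat showed too many connections in close_wait

1. close_wait appears on a passive close when the server has not called close, which means the server side did accept the connection successfully

2. The server never closed it

Cause:

When accept succeeds, thread1 sends the 4-byte fd of the new connection's socket to sock0

thread2 asynchronously receives the data on sock1. If every receive returns exactly 4 bytes there is no problem, but if a single receive returns a buffer of fewer than 4 bytes, the data is no longer aligned to 4 bytes and the fd that is read is not a valid connection fd.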
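
The short-read problem is general to stream sockets: one receive may return fewer bytes than one message. The sketch below shows the usual way to read a fixed-size message. It is an illustration only and not what comse ended up doing; read_exact and recv_fd_value are hypothetical helpers, and a blocking descriptor is assumed.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Read exactly `len` bytes from `fd` into `buf`, looping over short reads.
// Returns true on success, false on EOF or on an error other than EINTR.
// Assumes a blocking descriptor.
bool read_exact(int fd, void *buf, size_t len) {
    assert(buf != NULL);
    char *p = static_cast<char *>(buf);
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);
        if (n > 0) {
            got += static_cast<size_t>(n);
        } else if (n == 0) {
            return false;              // peer closed before a full message
        } else if (errno != EINTR) {
            return false;              // real error
        }
    }
    return true;
}

// Receive one int-sized descriptor value sent by the accepting thread.
bool recv_fd_value(int sock, int *out_fd) {
    return read_exact(sock, out_fd, sizeof(*out_fd));
}
```

With a loop like this a 2-byte read followed by another 2-byte read still yields one intact value, so the stream cannot drift off its 4-byte boundaries.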

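Fix:

Multiple threads share the same listenfd. Each thread calls accept (protected by a mutex) and adds the resulting nfd to its own asynchronous event scheduling.

(Figure: revised model; image not preserved)

Threads adapt to the load on their own: a thread that cannot keep up with its tasks does not get around to accept, so no connection is established and the request is dropped.

Processing performance improves because one event registration and trigger is eliminated.

This solved the close_wait pile-up.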

2017/3/16

Compiling requires upgrading gcc to gcc 4.8

The Makefile must link with -lrt, otherwise the clock_time function cannot be found at run time?
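
The notes write clock_time; assuming this refers to the POSIX function clock_gettime, the -lrt requirement is expected on older systems, since glibc versions before 2.17 keep that function in librt. A minimal sketch of a caller follows; monotonic_ms is a hypothetical helper name.

```cpp
#include <cassert>
#include <time.h>

// Monotonic time in milliseconds, via POSIX clock_gettime.
// On glibc older than 2.17 this needs -lrt on the link line.
long long monotonic_ms() {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return static_cast<long long>(ts.tv_sec) * 1000LL + ts.tv_nsec / 1000000LL;
}
```

Adding -lrt is harmless on newer glibc versions, so keeping it in the Makefile is a reasonable default when the build has to work on both.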

2017/3/17

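1. Output max_index_num once every 10,000 entries

2. When load_from_file fails on the content being loaded, print a log

to do:

What should happen when max_index_num reaches its maximum?

To be solved in the next version: dump all di, load them with another search_engine, then swap using double buffering.

1. Log formatting (done 2017/5/4)

git: generate the log folder in service

2. Add logging to record status: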

ret1 = parse_in_json();
ret2 = get_out_json();

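3. Speed up OR retrieval (done 2017/5/4)

Check the importance of each term, that is, the size of the term's posting list. A term whose posting list exceeds a certain threshold is considered unimportant and is skipped, so it takes no part in recall.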
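
The threshold rule can be sketched as follows. PostingIndex, select_recall_terms and max_postings are hypothetical names, and the real index in comse is not a std::map of vectors; the sketch only shows the filtering step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical inverted index: term -> list of document ids.
typedef std::map<std::string, std::vector<uint32_t> > PostingIndex;

// Keep only query terms whose posting list is no longer than max_postings.
// Terms missing from the index are dropped too, since they recall nothing.
std::vector<std::string> select_recall_terms(const PostingIndex &index,
                                             const std::vector<std::string> &terms,
                                             size_t max_postings) {
    assert(max_postings > 0);
    std::vector<std::string> kept;
    for (size_t i = 0; i < terms.size(); ++i) {
        PostingIndex::const_iterator it = index.find(terms[i]);
        if (it == index.end()) { continue; }
        if (it->second.size() > max_postings) { continue; }  // too common
        kept.push_back(terms[i]);
    }
    return kept;
}
```

One design question this leaves open is what to do when every term of a query is above the threshold; a fallback such as keeping the rarest term may be needed so the query still recalls something.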

4. The declaration and the definition of policy_compute_score in search_engine.h do not match (done 2017/5/4)

5. The record count after load differs from the count after dump

1282887 dump.json.file
1284625 load.json.file

Some records were not loaded during load

cat comse.log | grep "json parse error" | wc -l
1739

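The load entry point and the service's add entry point are probably inconsistent: records that add accepted were rejected by load

6. Format problem when dumping logs (done 5/12)

Dirty data from earlier queries was written out as well

When recv and send transfer data, the buff still contains dirty data that was not cleared

Fix: append \0 after recv and send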
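
For the recv side the fix can be sketched like this. recv_cstr is a hypothetical wrapper, not a function from comse; it terminates the buffer directly after the bytes that were just received.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>
#include <sys/socket.h>

// Receive at most cap-1 bytes and terminate the buffer right after them, so
// leftovers from a previous, longer request can never be read as part of
// this one. Returns the byte count from recv (<= 0 on close or error).
ssize_t recv_cstr(int sock, char *buf, size_t cap) {
    assert(buf != NULL && cap > 0);
    ssize_t n = recv(sock, buf, cap - 1, 0);
    buf[n > 0 ? n : 0] = '\0';
    return n;
}
```

If a 14-byte request is followed by a 3-byte request in the same buffer, the buffer then reads as just the 3 new bytes when treated as a C string. An alternative is to carry the length returned by recv alongside the buffer and never rely on the terminator.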

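7. max_ret_num determines the number of results actually returned (done 2017/5/10)

8. When generating the md5 of the json content, does swapping the field order matter? (done 2017/5/16)

Even with the fields swapped, the string produced by json_writer.write always lists the keys in alphabetical order

10. Time spent in each stage (done 2017/5/12)

11. Add view or type filtering (match on type during scoring and give a score of 0, then filter out the 0 scores in the search_filter stage; type:0 means all types)

12. Too many parameters are passed around; wrap them in a struct?

13. When All_time logs the timings, also print the query and the recall count

Add a logid?

14. Speed up the search process

A: Analyze the importance of each term and filter out the terms of low importance?

B: Compute the text relevance score during recall? That does not fit the program's original architecture and its consistent strategy of recall first, then scoring, and so on.

C: Change terms keyed by word into terms keyed by number?

Terms keyed by number: the test runs 500 times per 1 ms. Using this requires recording the term length; use a map?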

#include <iostream>      //std::cout
#include <algorithm>     //std::lower_bound, std::upper_bound, std::sort
#include <cstdio>        //printf, fflush
#include <set>           //std::set
#include <vector>        //std::vector
#include <string>        //std::string
#include <sys/time.h>    //gettimeofday

// Runs FUNC in an endless loop and prints how many calls completed each second.
#define TIMER(FUNC) { \
    struct timeval prev_time,cur_time; \
    int count_time = 0; \
    gettimeofday(&prev_time,NULL); \
    gettimeofday(&cur_time,NULL); \
    while(1) \
    { \
        gettimeofday(&cur_time,NULL); \
        FUNC; \
        count_time++; \
        if (cur_time.tv_sec - prev_time.tv_sec >= 1) { \
            prev_time = cur_time; \
            printf("output timer count %d\n",count_time); \
            fflush(stdout); \
            count_time=0; \
        } \
    } \
}

int test_string_set(std::vector<int> & term_list,std::set<int> & term4se_set )
{
    int or_terms_length = 0;
    int or_terms_num = 0;

    if ( term4se_set.size() > 0)
    {
        std::set<int>::iterator term_hash_it;

        for (int i = 0 ; i < term_list.size();i++)
        {
            term_hash_it = term4se_set.find(term_list[i]);
            if (term_hash_it != term4se_set.end()) 
            {
                or_terms_length += term_list[i];
                or_terms_num++;
            }
        }
    }

    int ret = (or_terms_num * 1000 + or_terms_length);
    return ret;
}

int main () {
    std::vector<int> term_list;
    term_list.push_back(1);
    term_list.push_back(8);
    term_list.push_back(16);
    term_list.push_back(3012);
    term_list.push_back(18);
    term_list.push_back(7);
    term_list.push_back(462);
    term_list.push_back(129992);
    
    std::set<int> term4se_set;
    term4se_set.insert(1);
    term4se_set.insert(16);
    term4se_set.insert(7);
    term4se_set.insert(8);
    term4se_set.insert(1234);
    term4se_set.insert(34111111);
    TIMER(test_string_set(term_list,term4se_set));
    return 0;
}

Terms keyed by string: the test runs 311 times per 1 ms

#include <iostream>      //std::cout
#include <algorithm>     //std::lower_bound, std::upper_bound, std::sort
#include <cstdio>        //printf, fflush
#include <set>           //std::set
#include <vector>        //std::vector
#include <string>        //std::string
#include <sys/time.h>    //gettimeofday

// Runs FUNC in an endless loop and prints how many calls completed each second.
#define TIMER(FUNC) { \
    struct timeval prev_time,cur_time; \
    int count_time = 0; \
    gettimeofday(&prev_time,NULL); \
    gettimeofday(&cur_time,NULL); \
    while(1) \
    { \
        gettimeofday(&cur_time,NULL); \
        FUNC; \
        count_time++; \
        if (cur_time.tv_sec - prev_time.tv_sec >= 1) { \
            prev_time = cur_time; \
            printf("output timer count %d\n",count_time); \
            fflush(stdout); \
            count_time=0; \
        } \
    } \
}

int test_string_set(std::vector<std::string> & term_list,std::set<std::string> & term4se_set )
{
    int or_terms_length = 0;
    int or_terms_num = 0;

    if ( term4se_set.size() > 0)
    {
        std::set<std::string>::iterator term_hash_it;

        for (int i = 0 ; i < term_list.size();i++)
        {
            term_hash_it = term4se_set.find(term_list[i]);
            if (term_hash_it != term4se_set.end()) 
            {
                or_terms_length += term_list[i].size();
                or_terms_num++;
            }
        }
    }

    int ret = (or_terms_num * 1000 + or_terms_length);
    return ret;
}

int main () {
    std::vector<std::string> term_list;
    term_list.push_back("book");
    term_list.push_back("apple");
    term_list.push_back("zoo");
    term_list.push_back("school");
    term_list.push_back("action");
    term_list.push_back("gogogogo");
    term_list.push_back("computer");
    term_list.push_back("machine");
    
    std::set<std::string> term4se_set;
    term4se_set.insert("abcd");
    term4se_set.insert("action");
    term4se_set.insert("computer");
    term4se_set.insert("book");
    term4se_set.insert("zzzzzzzzzzz");
    term4se_set.insert("abcd");
    TIMER(test_string_set(term_list,term4se_set));
    return 0;
}

Now live. Performance roughly doubled, in line with what the test predicted, but memory usage is still fairly high.

==============================================

After optimization
[2017-06-08 15:42:51]cal_time recall=5488|compute=64173|sort=5711|package=373 search_mode=2:recall_num=17417:query=《三生三世十裏桃花》終極預告花絮首發
[2017-06-08 15:42:51]Thread[a1bb7700] Client[20]All_time[accept2recv_event=211,recv=4,parse_recv=53,do_policy=84612,send=213]:Send=HTTP/1.1 200 OK
20913 video 20 0 2393m 2.0g 1748 S 0.0 2.2 4:00.27 comse_test

==============================================
Before optimization
[2017-06-08 16:56:17]cal_time recall=5489|compute=127615|sort=5730|package=463 search_mode=2:recall_num=17417:query=《三生三世十裏桃花》終極預告花絮首發
[2017-06-08 16:56:16]Thread[4f0ae700] Client[18]All_time[accept2recv_event=144,recv=4,parse_recv=46,do_policy=145616,send=188]:Send=HTTP/1.1 200 OK

23150 video 20 0 2979m 2.6g 1712 S 0.0 2.8 3:13.20 comse_test

==============================================

15. Speed up the recall process

Use bitmaps to save space, and compute intersections with bitwise AND
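
A minimal sketch of the idea, assuming document ids are dense integers starting at 0. DocBitmap is a hypothetical class and not part of comse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bitmap over document ids 0..doc_count-1 (hypothetical layout).
class DocBitmap {
public:
    explicit DocBitmap(size_t doc_count) : words_((doc_count + 63) / 64, 0) {}

    void set(uint32_t doc_id) {
        assert(doc_id / 64 < words_.size());
        words_[doc_id / 64] |= (uint64_t(1) << (doc_id % 64));
    }

    // AND this bitmap with another of the same size (posting intersection).
    void intersect(const DocBitmap &other) {
        assert(words_.size() == other.words_.size());
        for (size_t i = 0; i < words_.size(); ++i) { words_[i] &= other.words_[i]; }
    }

    // Expand the set bits back into an ascending list of document ids.
    std::vector<uint32_t> to_ids() const {
        std::vector<uint32_t> ids;
        for (size_t w = 0; w < words_.size(); ++w) {
            for (uint32_t b = 0; b < 64; ++b) {
                if (words_[w] & (uint64_t(1) << b)) {
                    ids.push_back(static_cast<uint32_t>(w * 64 + b));
                }
            }
        }
        return ids;
    }

private:
    std::vector<uint64_t> words_;
};
```

The intersection handles 64 documents per AND instruction. Whether it also saves space depends on density: a bitmap costs one bit per possible document for every term, so for rare terms a plain id list is smaller.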

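16. title:

<我們的愛>第十二集看點 ("Our Love", episode 12 highlights)

The keyword used for recall is 我們 ("we")

Why it is bad: the terms of a work's title are made up of many common words.

Fix: term merging, n-gram, multi-level retrieval. The key question is how to combine the scoring systems of the different levels.

==============================================

17. ret_num and max_ret_num are too large

This makes intermediate data take up too much memory. output_buff has a limit and will not emit it all, so the memory and computation spent in the intermediate steps are wasted.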
