Boost Study Notes ---> Strings & Text Processing

阿新 • Published: 2019-01-18

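Build environment: Win10, VS2015, Boost version 1.65.0

Overview:
lexical_cast, string_algo, and format are the core Boost facilities for handling strings and text. They cover the following areas:
a. converting between numeric values and strings
b. formatting output precisely
c. the concrete representation of strings

lexical_cast:
This function plays a role similar to atoi in C: it converts literal values between string, int, and float. A simple example follows.

Example 1:
    #include <iostream>
    #include <string>
    #include <typeinfo>                 // for typeid
    #include <boost/lexical_cast.hpp>   // to use lexical_cast
    using namespace std;
    using namespace boost;

    // Base class template that supplies an overloaded operator<< for T
    template<typename T>
    struct outable
    {
        friend ostream& operator<<(ostream& os, const T& x)
        {
            os << typeid(T).name();     // implementation-defined type name
            return os;
        }
    };

    class DemoClass : public outable<DemoClass> { };

    void case1()
    {
        // Converts the object to a string through operator<<, which yields the type name
        cout << lexical_cast<string>(DemoClass()) << endl;
    }

    int main()
    {
        case1();
    }

Simple usage examples: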

Example 2:
    #include <iostream>
    #include <string>
    #include <boost/lexical_cast.hpp>
    using namespace boost;

    int main()
    {
        int x = lexical_cast<int>("100");               // string ---> int
        long y = lexical_cast<long>("20000");           // string ---> long
        float z = lexical_cast<float>("3.14159e5");     // string ---> float
        double j = lexical_cast<double>("2.1767675");   // string ---> double
        std::cout << x << ' ' << y << ' ' << z << ' ' << j << std::endl;
        /*
         * Output: 100 20000 314159 2.17677
         * (operator<< prints a double with 6 significant digits by default)
         */

        std::string str = lexical_cast<std::string>(456);            // int ---> string
        std::cout << str << std::endl;
        std::cout << lexical_cast<std::string>(0.618) << std::endl;  // double ---> string
        std::cout << lexical_cast<std::string>(0x10) << std::endl;   // hexadecimal integer literal ---> string
        /*
         * Output (one value per line): 456 0.61799999999999999 16
         */
    }

Notes:

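When lexical_cast converts a string to a number, the string may contain only digits and a decimal point, optionally preceded by a sign. Letters (other than the e/E used for an exponent) and any other non-digit characters are not allowed.

lexical_cast cannot convert strings such as "123L" or "0x100", even though C++ accepts them as numeric literals. It also offers no advanced format control, so it cannot convert a number into a string with a chosen format. When you need that level of control, use one of:
std::stringstream
boost::format

The bad_lexical_cast exception:
When a conversion fails, lexical_cast throws bad_lexical_cast, which derives from std::bad_cast. To keep the program robust, protect the conversion code with a try/catch block, as follows.

Example 3: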

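    try
    {
        cout << lexical_cast<int>("0x100");          // throws here, so the lines below never run
        cout << lexical_cast<double>("HelloWorld");
        cout << lexical_cast<long>("1000L");
        cout << lexical_cast<bool>("flase") << endl;
    }
    catch (const bad_lexical_cast& e)
    {
        cout << "error: " << e.what() << endl;
    }

Running the code above prints:
    error: bad lexical cast: source type value could not be interpreted as target

The exception can also be used to validate numeric strings. This can be packaged as a function template.

Example 4:
    template<typename T>
    bool Num_valid(const char *str)
    try
    {
        lexical_cast<T>(str);       // attempt the conversion
        return true;
    }
    catch (const bad_lexical_cast&)
    {
        return false;
    }
    /*
     * Num_valid uses a function-try-block to catch bad_lexical_cast.
     * It returns true if lexical_cast succeeds on the string and false otherwise.
     */

    int main()
    {
        assert(Num_valid<double>("3.14"));
        assert(!Num_valid<int>("3.14"));
        assert(Num_valid<int>("65535"));
    }

Requirements on the converted types:
lexical_cast only imitates a cast operator; it is actually a function template. Internally it uses the standard library stream operators, so the types involved must meet these requirements:
a. the source type must be output-streamable, that is, it overloads operator<<;
b. the target type must be input-streamable, that is, it overloads operator>>;
c. the target type must be default-constructible and copy-constructible.
Built-in types such as int and double, and std::string, satisfy all three conditions, and they are the types most often used with lexical_cast. STL containers and other user-defined types generally do not satisfy them, so they cannot be converted with lexical_cast.

Using it with your own class:
To convert your own class to a string with lexical_cast, it is enough to overload operator<<, as Example 1 does. Converting into your class additionally requires operator>>, per the requirements above.

Format:
boost.format implements printf()-like formatting objects that format arguments into a string. Unlike printf in C, the formatting is fully type-safe.
The format component lives in namespace boost. To use it, include the header:
    #include <boost/format.hpp>
    using namespace boost;

A simple program using format:

Example 5:
    #include <iostream>
    #include <boost/format.hpp>
    using namespace std;
    using namespace boost;

    void case1()
    {
        cout << format("%s:%d + %d = %d\n") % "Sum" % 1 % 2 % (1 + 2);

        format fmt("(%1% + %2%) * %2% = %3%\n");
        fmt % 2 % 5;
        fmt % ((2 + 5) * 5);
        cout << fmt.str();
    }

    int main()
    {
        case1();
    }

The program prints:
    Sum:1 + 2 = 3
    (2 + 5) * 5 = 35

Walkthrough:
The first statement shows the simplest use of format. format(...) constructs a temporary (anonymous) format object whose constructor argument is the format string. The string follows standard printf() syntax and uses %x specifiers to describe each argument. Because the number of arguments is not fixed, printf() relies on C variadic arguments (the ellipsis in the parameter declaration), which is unsafe. format instead imitates the stream operator << and overloads the binary operator% as its argument-feeding operator, which can likewise chain any number of arguments. So:
    format(...) % a % b % c
can be read as:
    format(...) << a << b << c;
The operator feeds the arguments to the format object one by one and formats them. The format object supports stream output, so the formatted string it holds can be written directly to cout. The printf() call equivalent to the first format statement is:
    printf("%s:%d + %d = %d\n", "Sum", 1, 2, (1 + 2));

The last three lines show another way to use format: create a format object in advance and reuse it for several formatting operations. The object still takes its arguments through operator%, and they may be supplied over several statements rather than all at once, but the number of arguments must match what the format string requires. Finally, the str() member function returns the formatted string, which is written to cout. The second format uses a syntax slightly different from printf(): "(%1% + %2%) * %2% = %3%", somewhat like C#. %N% names the position of an argument, so a repeated argument only has to be supplied once, which improves on printf() syntax. The printf() call equivalent to the second format object is:
    printf("(%d + %d) * %d = %d\n", 2, 5, 5, (2 + 5) * 5);

Class synopsis:
format is not a class in its own right but a typedef. The real implementation is basic_format, declared as follows:
    template<class charT, class Traits = std::char_traits<charT>>
    class basic_format;
    typedef basic_format<char> format;

    // basic_format synopsis:
    template<class charT, class Traits = std::char_traits<charT>>
    class basic_format
    {
    public:
        explicit basic_format(const charT *str);
        explicit basic_format(const string &s);
        basic_format& operator=(const basic_format& x);

        string_t str() const;
        size_type size() const;
        void clear();
        basic_format& parse(const string_t&);

        // pass arguments through this operator:
        template<class T> basic_format& operator%(T& x);
        friend std::basic_ostream& operator<<(...);
    };  // basic_format

    typedef basic_format<char>    format;
    typedef basic_format<wchar_t> wformat;
    string str(const format&);

Member overview:
a. The basic_format constructors accept a C string (a null-terminated character array) or a std::string as the format string. The format string follows printf-like rules. The constructors are declared explicit, so construction must be written out explicitly.
b. str() returns the formatted string held by the format object without clearing it. It throws if not all arguments required by the format string have been supplied. The library also provides a free function of the same name, str(), in namespace boost, which returns the formatted string of a format object.
c. size() returns the length of the formatted string, equivalent to str().size(). It likewise throws if the required arguments have not all been supplied.
d. parse() clears the internal buffer of the format object and switches to a new format string. To clear the buffer only, use clear(), which restores the object to its initial state. After either call, str() and size() throw until the required arguments are supplied again.
e. format overloads operator%, which accepts arguments of any type. The number of arguments fed with % must exactly match what the format string requires; supplying too many or too few causes an exception. After str() has produced the string, or after clear() has emptied the buffer, % can be used again.
f. format also overloads the stream output operator, so the formatted string can be written directly to an IO stream; this is equivalent to streaming str().

Format syntax:
format largely inherits printf's format syntax, with a few incompatibilities that are rarely met in practice. Each printf format specifier starts with % and is followed by rules for alignment, width, precision, and type:
    %05d   : an integer of width 5, padded with 0
    %-8.3f : a left-aligned floating-point number, total width 8, 3 decimal places
    % 10s  : a string of width 10, padded with spaces
    %05X   : an uppercase hexadecimal integer of width 5, padded with 0

Code example:
    format fmt("%05d\n%-8.3f\n% 10s\n%05X\n");
    cout << fmt % 62 % 2.236 % "123456789" % 48;

Output:
    00062
    2.236
     123456789
    00030

Beyond classic printf-style specifiers, format adds new forms:
a. %|spec| : same function as a printf specifier, but delimited by vertical bars on both sides, which separates the specifier from ordinary characters more clearly;
b. %N% : marks the N-th argument. It is a placeholder and carries no other formatting options.
Using the %|spec| form, the example above can be written as:
    format fmt("%|05d|\n%|-8.3f|\n%| 10s|\n%|05X|\n");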
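Performance of format:
printf() performs no type-safety checks and writes straight to stdout, so it is very fast. format does much more safety checking than printf(), so it is noticeably slower; by the author's estimate it is roughly 2 to 5 times slower than printf(). If the performance of format matters, first create a const format object and then copy it for each formatting operation. This is somewhat faster than constructing a format object from the format string each time:
    const format fmt("%10d %020.8f %010X %10.5e\n");
    cout << format(fmt) % 62 % 2.236 % 255 % 0.618;

Advanced usage:
format offers printf-like functionality but is not the same thing as the printf function, which is one of the benefits of an object-oriented design. Beyond ordinary format strings, the format class has several advanced features: formatting options can be changed at run time, and input arguments can be bound.
a. basic_format& bind_arg(int argN, const T& val)
   Fixes the argument at position argN of the format string to val. The binding survives clear() and stays until clear_bind() or clear_binds() is called.
b. basic_format& clear_bind(int argN)
   Removes the argument binding at position argN.
c. basic_format& clear_binds()
   Removes the bindings at all positions and calls clear().
d. basic_format& modify_item(int itemN, T manipulator)
   Sets the formatting options of the item at position itemN. manipulator is an object returned by boost::io::group().
e. boost::io::group(T1 a1, ..., Var const& var)
   A function template taking up to 10 parameters (10 overloads). It applies IO stream manipulators to specify the format of an argument value. The IO stream manipulators are declared in <iomanip>.

These features are used as follows.

Example 6:
    #include <iostream>
    #include <iomanip>
    #include <boost/format.hpp>
    using namespace std;
    using namespace boost;
    using boost::io::group;

    void case1()
    {
        // A format object with three input arguments and five format items
        format fmt("%1% %2% %3% %2% %1% \n");
        cout << fmt % 1 % 2 % 3;

        fmt.bind_arg(2, 10);            // fix the second argument to 10
        cout << fmt % 1 % 3;            // supply the remaining two arguments

        fmt.clear();                    // clear the buffer; bound arguments are kept

        // Use group() with operator% to show the first argument in octal
        cout << fmt % group(showbase, oct, 111) % 333;

        fmt.clear_binds();              // remove all bound arguments

        // Set the first format item: hexadecimal, width 8, right-aligned, padded with '*'
        fmt.modify_item(1, group(hex, right, showbase, setw(8), setfill('*')));
        cout << fmt % 49 % 20 % 100;
    }

    int main()
    {
        case1();
    }
    /*
     * Output:
     * 1 2 3 2 1
     * 1 10 3 10 1
     * 0157 10 333 10 0157
     * ****0x31 20 100 20 49
     */

string_algo:
A very comprehensive string algorithm library. It provides a large number of string functions, such as case-insensitive comparison, trimming, and searching for substrings that match a pattern, and it can solve most string problems without regular expressions.
The string_algo library lives in namespace boost::algorithm but is brought into namespace boost with using declarations. To use it, include and declare the following:
    #include <boost/algorithm/string.hpp>
    using namespace boost;

Example code:
    #include <cassert>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <boost/smart_ptr.hpp>
    #include <boost/make_shared.hpp>
    #include <boost/algorithm/string.hpp>   // for the string_algo library
    using namespace std;
    using namespace boost;

    void case1()
    {
        // shared_ptr and make_shared avoid the memory problems caused by raw new/delete
        boost::shared_ptr<std::string> ps = boost::make_shared<std::string>(", I made a stupid decision to leave the world forever");
        std::cout << "The ps content is: " << *ps << std::endl;
    }

    void case2()
    {
        std::string str("ReadMe.txt");
        if (boost::ends_with(str, "txt"))           // test the suffix
        {
            std::cout << boost::to_upper_copy(str) + " UPPER" << std::endl;
            assert(boost::ends_with(str, "txt"));
        }
        boost::replace_first(str, "ReadMe", "followme");    // replace in the original string
        cout << "The replace_first str content: " << str << endl;

        vector<char> v(str.begin(), str.end());     // a vector of single characters
        // erase_first_copy removes "txt" first, then to_upper_copy upper-cases the result
        vector<char> v2 = to_upper_copy(erase_first_copy(v, "txt"));
        /*
        for (int i = 0; i < v2.size(); ++i)
        {
            cout << v2[i];
        }*/
        for (auto tmp : v2)     // range-based for; each char is copied by value, which is cheap
        {
            cout << tmp;
        }
        cout << endl;
    }

    int main()
    {
        case1();
        case2();
        system("pause");
    }

This example demonstrates the basic use of ends_with(), to_upper_copy(), replace_first(), and erase_first_copy() from string_algo. Their names are self-explanatory and can be understood literally.

string_algo overview:
string_algo is designed for processing strings, but what it operates on does not have to be string or basic_string<T>. It can be any container that meets the requirements of boost.range, and the elements do not have to be char or wchar_t; any copy-constructible and assignable type works. If copying or assigning the element type is expensive, however, the performance of string_algo drops.
The algorithm names follow standard library conventions. All names are lower case, and prefixes or suffixes distinguish the variants:
a. prefix i: the algorithm is case-insensitive; without it, the algorithm is case-sensitive;
b. suffix _copy: the algorithm leaves its input unchanged and returns a copy of the result; without it, the algorithm works in place and the input is also the output;
c. suffix _if: the algorithm takes a predicate function object; without it, the default criterion is used.
The algorithms provided by string_algo fall into five groups:
a. case conversion
b. predicates and classification
c. trimming
d. find and replace
e. split and join

A. Case conversion:
string_algo converts the case of strings efficiently with two algorithms, to_upper() and to_lower(), declared as:
    template<typename T> void to_upper(T &Input);
    template<typename T> void to_lower(T &Input);

Usage:
    #include <iostream>
    #include <string>
    #include <boost/algorithm/string.hpp>
    using namespace std;
    using namespace boost;

    void case1()
    {
        string str("I Don't Know.\n");
        cout << "to_upper_copy: " << to_upper_copy(str);    // returns an upper-case copy
        cout << "str content: " << str;                     // the original string is unchanged
        to_lower(str);                                      // lower-case the string in place
        cout << "to_lower: " << str;                        // the original string has changed
    }

Output:
    to_upper_copy: I DON'T KNOW.
    str content: I Don't Know.
    to_lower: i don't know.

B. Predicate algorithms:
Predicate algorithms test the relationship between two strings. They include:
1) starts_with : tests whether one string is a prefix of another
2) ends_with : tests whether one string is a suffix of another
3) contains : tests whether one string is contained in another
4) equals : tests whether two strings are equal
5) lexicographical_compare : tests whether one string is lexicographically less than another
6) all : tests whether every element of a string satisfies a given predicate
Every algorithm except all has an i-prefixed version. None of them modifies the original string, so there are no _copy versions. Examples:
    #include <cassert>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <boost/algorithm/string.hpp>   // for the string_algo library
    using namespace std;
    using namespace boost;

    void case4()
    {
        // starts_with() & ends_with() & contains() & equals() & lexicographical_compare() & all()
        string str("Power Bomb");
        assert(iends_with(str, "bomb"));    // case-insensitive suffix test
        assert(!ends_with(str, "bomb"));    // case-sensitive suffix test
        assert(starts_with(str, "Pow"));    // prefix test
        assert(contains(str, "er"));        // containment test

        string str2 = to_lower_copy(str);   // convert to lower case and return a copy
        assert(iequals(str, str2));         // case-insensitive equality

        string str3 = "power suit";
        assert(lexicographical_compare(str, str3));     // case-sensitive comparison: 'P' sorts before 'p'
        assert(all(str2.substr(0, 5), is_lower()));     // the substring is entirely lower case
    }

    int main()
    {
        /*
        case1();
        case2();
        case3();
        */
        case4();
        system("pause");
    }

C. Predicates (function objects):
string_algo extends the standard library function objects equal_to<> and less<>. Its versions allow arguments of different types to be compared and come in case-insensitive forms as well. They include:
1) is_equal : like the equals algorithm, tests whether two objects are equal
2) is_less : tests whether one object is less than the other
3) is_not_greater : tests whether one object is not greater than the other
Usage example:
    void case5()
    {
        cout << "In case5() functions" << endl;
        // is_equal() & is_less() & is_not_greater()
        string str1 = "Samus", str2 = "samus";
        assert(!is_equal()(str1, str2));
        assert(is_less()(str1, str2));
    }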
Note the two pairs of parentheses after the function object name. The first pair calls the constructor of the function object and creates a temporary; the second pair is the actual call to the function call operator, operator().

D. Classification:
string_algo provides a set of classification functions that test whether a character has a certain property. They are mainly used together with other algorithms:
1) is_space : whether the character is whitespace
2) is_alnum : whether the character is a letter or a digit
3) is_alpha : whether the character is a letter
4) is_cntrl : whether the character is a control character
5) is_digit : whether the character is a decimal digit
6) is_graph : whether the character is a graphic character
7) is_print : whether the character is printable
8) is_lower : whether the character is lower case
9) is_punct : whether the character is punctuation
10) is_upper : whether the character is upper case
11) is_xdigit : whether the character is a hexadecimal digit
12) is_any_of : whether the character is any of the characters in the argument sequence
13) is_from_range : whether the character lies in the given range, that is, from <= ch <= to
Note that these functions do not test a character themselves. Each one is a factory function that returns a function object of type detail::is_classifiedF, and the operator() of that object performs the actual classification.

E. Trimming:
string_algo provides three trimming algorithms: trim_left, trim_right, and trim.
They remove whitespace from the beginning or end of a string. Each accepts the _if and _copy suffixes, so each algorithm has four versions. The _if versions take a predicate IsSpace and remove every character the predicate classifies as whitespace (IsSpace(c) == true).

Examples for D and E:
    void case7()
    {
        format fmt("|%s|\n");
        string str = "   samus aran   ";
        cout << "Delete Both Spaces: " << fmt % trim_copy(str) << endl;         // trim both ends
        cout << "Delete Left Space : " << fmt % trim_left_copy(str) << endl;    // trim the left end
        cout << "Delete Right Space: " << fmt % trim_right_copy(str) << endl;   // trim the right end
        trim_right(str);                                                        // trim the right end in place
        cout << "In Situ Delete: " << fmt % str << endl;

        string str1 = "2017 is a year of egg pain;";
        cout << "Delete Left Nums: " << fmt % trim_left_copy_if(str1, is_digit());      // remove digits on the left
        cout << "Delete Right Punct: " << fmt % trim_right_copy_if(str1, is_punct());   // remove punctuation on the right
        cout << "Delete Both Nums & Punct & Spaces: " << fmt % trim_copy_if(str1, is_punct() || is_digit() || is_space());
    }

    int main()
    {
        case7();
        system("pause");
    }

F. Find:
The find algorithms in string_algo resemble search() from the standard library, but the interface differs. Instead of returning an iterator to the position found, they return a boost.range iterator_range that covers the whole match, which carries more information.
The find algorithms are:
1) find_first : finds the first occurrence of a string in the input
2) find_last : finds the last occurrence of a string in the input
3) find_nth : finds the n-th occurrence (counting from 0) of a string in the input
4) find_head : takes the first N characters of a string, equivalent to substr(0, n)
5) find_tail : takes the last N characters of a string
These algorithms do not modify the string, so there are no _copy versions. The first three have i-prefixed versions. Example:
    void case8()
    {
        // find_first & find_last & find_nth & find_head & find_tail
        format fmt("|%s| .Pos value is: %d\n");
        string str = "Long Long Ago,There Have A King;";
        iterator_range<string::iterator> rge;   // iterator range that receives each result

        rge = find_first(str, "Long");          // case-sensitive search for the first occurrence
        cout << "Find First: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

        rge = ifind_first(str, "Long");         // case-insensitive search for the first occurrence
        cout << "Ifind first: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

        rge = find_nth(str, "ng", 2);           // the third occurrence of "ng" in str
        cout << "Find nth: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

        rge = find_head(str, 4);                // the first four characters
        cout << "Find Head: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

        rge = find_tail(str, 5);                // the last five characters
        cout << "Find Tail: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

        rge = find_first(str, "samus");         // not found
        assert(rge.empty() && !rge);
    }

    int main()
    {
        case8();
        system("pause");
    }

G. Replace and erase:
Replace and erase are closely related to the find algorithms: they locate a match and then modify the string, so their names are similar:
1) replace/erase_first : replaces/erases the first occurrence of a string in the input
2) replace/erase_last : replaces/erases the last occurrence of a string in the input
3) replace/erase_nth : replaces/erases the n-th occurrence (counting from 0) of a string in the input
4) replace/erase_all : replaces/erases every occurrence of a string in the input
5) replace/erase_head : replaces/erases the beginning of the input
6) replace/erase_tail : replaces/erases the end of the input
These algorithms form a large family. Each of the first eight (first, last, nth, and all, for both replace and erase) comes in four versions from combining the prefix "i" and the suffix "_copy". Each of the last four (head and tail) has only two versions, with and without "_copy". Example:
    void case9()
    {
        // replace_*** & erase_***
        string str = "Samus beat the monster.\n";
        cout << "replace_first_copy: " << replace_first_copy(str, "Samus", "samus") << endl;
        replace_last(str, "beat", "kill");
        cout << "replace_last: " << str << endl;
        cout << "ierase_all_copy: " << ierase_all_copy(str, "samus") << endl;
        cout << "replace_nth_copy: " << replace_nth_copy(str, "l", 1, "L") << endl;    // second "l" (index 1)
        cout << "erase_tail_copy: " << erase_tail_copy(str, 8) << endl;
    }

    int main()
    {
        case9();
        system("pause");
    }

H. Split:
string_algo provides two algorithms for this, find_all and split. They break a string into parts according to some strategy and copy the parts into a given container. Example:
    void case10()
    {
        string str = "Samus,Link.Zelda::Mario-Luigi+zelda";
        deque<string> d;
        ifind_all(d, str, "zELDA");     // case-insensitive search for all occurrences
        assert(d.size() == 2);
        cout << "deque size: " << d.size() << endl;
        for (BOOST_AUTO(pos, d.begin()); pos != d.end(); ++pos)
        {
            cout << "Pos:[ " << *pos << " ]";
        }
        cout << endl;

        list<iterator_range<string::iterator>> ls;
        split(ls, str, is_any_of(",.:-+"));     // split on punctuation characters
        for (auto tmp : ls)
        {
            cout << "Pos: [ " << tmp << " ]";
        }
        cout << endl;

        ls.clear();
        split(ls, str, is_any_of(".:-"), token_compress_on);
        for (auto tmp : ls)
        {
            cout << "Pos:[ " << tmp << " ];";
        }
        cout << endl;
    }

    int main()
    {
        case10();
        system("pause");
    }

I. Join:
The join algorithm is the inverse of split. It concatenates the strings stored in a container into a new string, with a separator of your choice. Example:
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <boost/assign.hpp>             // for list_of()
    #include <boost/algorithm/string.hpp>   // for the string_algo library
    using namespace std;
    using namespace boost;
    using namespace boost::assign;

    void case11()
    {
        vector<string> str = list_of("Samus")("Link")("Zelda")("Mario");
        cout << "Vector str size is: " << str.size() << endl;
        cout << "Vector str Content: " << join(str, "+") << endl;   // join everything

        struct is_contains_a
        {
            bool operator()(const string &st)
            {
                return contains(st, "a");
            }
        };
        // join only the strings that contain "a"
        cout << "After Operator() str Content: " << join_if(str, "**", is_contains_a()) << endl;
    }

    int main()
    {
        case11();
        system("pause");
    }

J. Find and split iterators:
Besides the general find_all and split, string_algo provides two iterators, find_iterator and split_iterator. They step through the matches in a string the way an iterator does, finding or splitting without a container to hold the results. Example:
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <boost/algorithm/string.hpp>   // for the string_algo library
    using namespace std;
    using namespace boost;

    void case12()
    {
        string str("Samus||samus||mario||||Link");

        typedef find_iterator<string::iterator> string_find_iterator;   // find iterator type
        string_find_iterator pos, end;                                  // find iterator variables
        for (pos = make_find_iterator(str, first_finder("samus", is_iequal())); pos != end; ++pos)
        {
            cout << "Pos Content is: " << *pos << ";";
        }
        cout << endl;

        typedef split_iterator<string::iterator> string_split_iterator; // split iterator type
        string_split_iterator p, endp;                                  // split iterator variables
        // is_iequal() compares characters without regard to case
        for (p = make_split_iterator(str, first_finder("||", is_iequal())); p != endp; ++p)
        {
            cout << "P Content is: " << *p << ";";
        }
        cout << endl;
    }

    int main()
    {
        case12();
        system("pause");
    }

Walkthrough:
To use these iterators, first declare a find_iterator or split_iterator object. Their template parameter is an iterator type, such as string::iterator or char*.
To obtain the starting position, call first_finder() to describe what to match, then call make_find_iterator or make_split_iterator to actually create the iterator. Other finders in the same family include last_finder, nth_finder, and token_finder. Their meanings parallel the find algorithms, and each starts the search from a different position. Once initialized, the iterator is advanced like a standard iterator or a pointer and dereferenced to get the matched content, until no further match is found. Pay special attention to the split iterator: it can split on a separator string of any length, whereas the plain split algorithm can only use single characters as separators.

tokenizer:
The tokenizer library is a string library dedicated to tokenizing. It breaks a string into words in a simple, easy-to-use way. It resembles the split algorithms of string_algo, but there are also many differences.
tokenizer lives in namespace boost. To use it, include and declare the following:
    #include <boost/tokenizer.hpp>
    using namespace boost;

    /*
     * tokenizer class prototype
     */
    template<typename TokenizerFunc = char_delimiters_separator<char>,
             typename Iterator = std::string::const_iterator,
             typename Type = std::string>
    class tokenizer
    {
        tokenizer(Iterator first, Iterator last, const TokenizerFunc& f);
        tokenizer(const Container& c, const TokenizerFunc& f);
        void assign(Iterator first, Iterator last);
        void assign(const Container& c);
        void assign(const Container& c, const TokenizerFunc& f);
        iterator begin() const;
        iterator end() const;
    };

Parameters:
A. TokenizerFunc : the tokenizing function object of the tokenizer library; by default it splits on whitespace and punctuation
B. Iterator : the iterator type of the character sequence
C. Type : the type that holds each token
All three template parameters have defaults, but usually only the first two vary. The third is generally limited to std::string or std::wstring, which is why it comes last in the template parameter list.
The tokenizer constructor takes the string to tokenize, either as an iterator range or as a container with begin() and end() member functions.
assign() specifies a new string to tokenize so that the tokenizer can be reused.
tokenizer has an interface like that of a standard container. begin() starts the tokenizing and returns an iterator to the first token; end() marks the end of the token sequence.

Usage:
Using tokenizer is much like using the split iterator of string_algo, only simpler, and it can be used like a container. Construct a tokenizer from the string to tokenize, then obtain an iterator with begin() and iterate. A detailed example:
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <boost/tokenizer.hpp>          // for tokenizer<>
    #include <boost/typeof/typeof.hpp>      // for BOOST_AUTO
    using namespace std;
    using namespace boost;

    void case13()
    {
        // tokenizer<> tok(std::string);
        string str = "Link raise the master-sword.";
        tokenizer<> tok(str);   // default template arguments: split on whitespace and punctuation
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
        {
            cout << " Pos Content: " << *pos << endl;
        }
    }

    int main()
    {
        case13();
        system("pause");
    }

Tokenizer function objects:
Any function object with suitable operator() and reset() semantics can be passed to the tokenizer constructor. The tokenizer library provides four predefined ones:
a. char_delimiters_separator : splits on punctuation; it has been deprecated and is not recommended;
b. char_separator : takes a set of characters as separators; its default behaviour is similar to char_delimiters_separator;
c. escaped_list_separator : for tokenizing CSV (comma-separated) data;
d. offset_separator : splits by offsets, which is useful for strings in flat-file formats.
The three main objects are described below.

a. char_separator: uses a set of characters as the splitting criterion and behaves much like the split algorithm. Its constructor is:
    char_separator(const char* dropped_delims,
                   const char* kept_delims = 0,
                   empty_token_policy empty_tokens = drop_empty_tokens);
The constructor parameters mean:
1) dropped_delims : a set of separators that do not appear in the result;
2) kept_delims : a set of separators that are kept in the result as tokens;
3) empty_tokens : similar to the eCompress parameter of the split algorithm; it controls what happens with two consecutive separators. With keep_empty_tokens, consecutive separators denote an empty string, like token_compress_off in split. With drop_empty_tokens, empty tokens are left out of the result.
A default-constructed object with no arguments is equivalent to char_separator(" ", punctuation characters, drop_empty_tokens): it splits on whitespace and punctuation, keeps the punctuation, and drops empty tokens. Example:
    #include <cstdlib>
    #include <cstring>                      // for strlen
    #include <iostream>
    #include <boost/tokenizer.hpp>          // for tokenizer<>
    #include <boost/typeof/typeof.hpp>      // for BOOST_AUTO
    using namespace std;
    using namespace boost;

    template<typename T>
    void print(T &tok)
    {
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
        {
            cout << " Pos Content: " << *pos << endl;
        }
    }

    void case14()
    {
        // char_separator()
        const char *str = "Link ;; <master-sword> zelda";
        char_separator<char> seq;       // a default char_separator object
        tokenizer<char_separator<char>, const char*> tok(str, str + strlen(str), seq);  // build the tokenizer from it
        cout << "tokenizer: " << endl;
        print(tok);                     // tokenize and print

        tok.assign(str, str + strlen(str), char_separator<char>(" ;-", "<>"));  // tokenize again, keeping < and >
        cout << "tok.assign: " << endl;
        print(tok);

        tok.assign(str, str + strlen(str), char_separator<char>(" ;-<>", "", drop_empty_tokens));
        cout << "Second assign: " << endl;
        print(tok);
    }

    int main()
    {
        case14();
        system("pause");
    }

b. escaped_list_separator: a tokenizer function object dedicated to the CSV format (Comma Separated Values). Its constructor is declared as:
    escaped_list_separator(char e = '\\', char c = ',', char q = '\"');
The parameters usually keep their defaults. They mean:
1) e : the escape character within the string, '\' by default;
2) c : the separator, ',' by default;
3) q : the quote character, '"' by default.
Example:
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <boost/tokenizer.hpp>          // for tokenizer<>
    #include <boost/typeof/typeof.hpp>      // for BOOST_AUTO
    using namespace std;
    using namespace boost;

    template<typename T>
    void print(T &tok)
    {
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
        {
            cout << " Pos Content: " << *pos << endl;
        }
    }

    void case15()
    {
        // escaped_list_separator()
        string str = "id,100,name,\"mario\"";
        escaped_list_separator<char> seq;
        tokenizer<escaped_list_separator<char>> tok(str, seq);
        print(tok);
    }

    int main()
    {
        case15();
        system("pause");
    }
    /*
     * Output:
     *  Pos Content: id
     *  Pos Content: 100
     *  Pos Content: name
     *  Pos Content: mario
     */

c. offset_separator: unlike the previous two, this tokenizer function object does not search for separators. It works with offsets, which is very useful for text that uses fixed field widths instead of separators. Its constructor is:
    template<typename Iter>
    offset_separator(Iter begin, Iter end, bool wrap_offsets = true, bool return_partial_last = true);
The constructor takes two iterators (array pointers also work), begin and end, that delimit a sequence of integer offsets. Each element of the sequence is the width of one field.
The bool parameter wrap_offsets decides whether tokenizing continues after the offsets have been used up. The bool parameter return_partial_last decides whether an incomplete final token is returned when the input ends before the offset sequence does. Both default to true. Example:
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <boost/tokenizer.hpp>          // for tokenizer<>
    #include <boost/typeof/typeof.hpp>      // for BOOST_AUTO
    using namespace std;
    using namespace boost;

    template<typename T>
    void print(T &tok)
    {
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
        {
            cout << " Pos Content: " << *pos;
        }
        cout << endl;
    }

    void case16()
    {
        // offset_separator
        string str = "2233344445";
        i
