Boost String Processing (1)

阿新 • Published: 2019-01-21

String algorithms
Header: #include <boost/algorithm/string.hpp>

I. Starting with split

string str1("hello abc-*-ABC-*-aBc goodbye");
vector<string> SplitVec; // result
split(SplitVec, str1, is_any_of("-*"), token_compress_on);

1. Let's start with the simplest argument, token_compress_on, which is an enumerator of the following enumeration type:

namespace boost {
    namespace algorithm {

    //! Token compression mode 
    /*!
        Specifies token compression mode for the token_finder.
    */ 

    enum token_compress_mode_type
    {
        token_compress_on,    //!< Compress adjacent tokens
        token_compress_off  //!< Do not compress adjacent tokens
    };

    } // namespace algorithm

    // pull the names to the boost namespace
    using algorithm::token_compress_on;
    using algorithm::token_compress_off;

} // namespace boost

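token_compress_on turns compression on: a run of consecutive '-' and '*' characters in str1 is treated as a single separator.
With this argument the result is:
+ &SplitVec 0x005dfa9c [3]("hello abc","ABC","aBc goodbye")

token_compress_off turns compression off, so every separator character splits on its own. The result is:
+ &SplitVec 0x0059fc88 [7]("hello abc","","","ABC","","","aBc goodbye")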

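That is not the main point, though. The main point is how the enumeration is declared: using-declarations lift the enumerators from the algorithm namespace into the boost namespace. This idiom is fairly common and helps avoid clashes between enumerator names.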
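
The same idiom can be tried outside Boost. The following is only a minimal sketch: the names mylib, text, other, trim_mode_type, trim_on and trim_off are all hypothetical and exist only for illustration.

```cpp
#include <cassert>

namespace mylib {        // hypothetical library namespace
    namespace text {     // hypothetical sub-namespace that owns the enum

        // Hypothetical enumeration, declared where it logically belongs.
        enum trim_mode_type
        {
            trim_on,
            trim_off
        };

    } // namespace text

    // Pull the names up so that callers can simply write mylib::trim_on.
    using text::trim_mode_type;
    using text::trim_on;
    using text::trim_off;

} // namespace mylib

namespace other {
    // Another component can reuse the same enumerator names without a clash,
    // because each set of names lives in its own namespace.
    enum mode { trim_on = 10, trim_off = 20 };
}

// Both spellings refer to the same enumerator.
inline void check_lifted_names()
{
    assert(mylib::trim_on == mylib::text::trim_on);
    assert(static_cast<int>(other::trim_on) == 10);
}
```

Callers only ever see the short spelling, while the declaration stays in the sub-namespace next to the code that uses it.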

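2. is_any_of("-*")
This function returns a function object (functor). In the Boost sources its type is detail::is_any_ofF, not a struct literally named is_any_of.
Boost provides several similar functor-generating functions: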

// pull names to the boost namespace
    using algorithm::is_classified;
    using algorithm::is_space;
    using algorithm::is_alnum;
    using algorithm::is_alpha;
    using algorithm::is_cntrl;
    using algorithm::is_digit;
    using algorithm::is_graph;
    using algorithm::is_lower;
    using algorithm::is_upper;
    using algorithm::is_print;
    using algorithm::is_punct;
    using algorithm::is_xdigit;
    using algorithm::is_any_of;
    using algorithm::is_from_range;

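This makes split easier to understand: while split runs, it calls the functor returned by is_any_of() on each character to decide whether to cut there. If the functor returns true, the input is cut at that character; if it returns false, the search continues.
Each piece produced by a cut is stored in the SplitVec container. With this in mind, you can write such a functor yourself.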
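
As a rough illustration of what such a hand-written functor could look like, here is a sketch that uses only the standard library. The names one_of and first_separator are hypothetical, and this is a simplified stand-in, not Boost's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the functor returned by is_any_of:
// it remembers a set of characters and answers "is this one of them?".
struct one_of
{
    explicit one_of(const std::string& chars) : m_chars(chars) {}

    // Called once per character; true means "this is a separator".
    bool operator()(char c) const
    {
        return m_chars.find(c) != std::string::npos;
    }

private:
    std::string m_chars;
};

// Returns the index of the first separator in text, or text.size() if there is none.
inline std::size_t first_separator(const std::string& text, const one_of& pred)
{
    std::string::const_iterator it = std::find_if(text.begin(), text.end(), pred);
    return static_cast<std::size_t>(it - text.begin());
}

inline void check_one_of()
{
    assert(one_of("-*")('*'));
    assert(!one_of("-*")('a'));
    assert(first_separator("hello abc-*-ABC", one_of("-*")) == 9);
}
```

Since the predicate is only ever called with a single character and asked for a bool, an object of this shape should also be usable as the third argument of split in place of is_any_of("-*").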

II. Going deeper into split

First, a rough flowchart:

[Figure: flowchart; image not available]

split: Split input into parts
iter_split: Use the finder to find matching substrings in the input and use them as separators to split the input into parts

template< typename SequenceSequenceT, typename RangeT, typename PredicateT >
inline SequenceSequenceT& split(
    SequenceSequenceT& Result,
    RangeT& Input,
    PredicateT Pred,
    token_compress_mode_type eCompress=token_compress_off )
{
    return ::boost::algorithm::iter_split(
        Result,
        Input,
        ::boost::algorithm::token_finder( Pred, eCompress ) );         
}

Internally, split calls iter_split, which does its work through iterators. Here is the implementation of iter_split:

template< 
    typename SequenceSequenceT,
    typename RangeT,
    typename FinderT >
inline SequenceSequenceT&
iter_split(
    SequenceSequenceT& Result,
    RangeT& Input,
    FinderT Finder )
{
    BOOST_CONCEPT_ASSERT((
        FinderConcept<FinderT,
        BOOST_STRING_TYPENAME range_iterator<RangeT>::type>
        ));

    iterator_range<BOOST_STRING_TYPENAME range_iterator<RangeT>::type> lit_input(::boost::as_literal(Input));

    typedef BOOST_STRING_TYPENAME 
        range_iterator<RangeT>::type input_iterator_type;
    typedef split_iterator<input_iterator_type> find_iterator_type;
    typedef detail::copy_iterator_rangeF<
        BOOST_STRING_TYPENAME 
            range_value<SequenceSequenceT>::type,
        input_iterator_type> copy_range_type;

    input_iterator_type InputEnd=::boost::end(lit_input);

    typedef transform_iterator<copy_range_type, find_iterator_type>
        transform_iter_type;

    transform_iter_type itBegin=
        ::boost::make_transform_iterator( 
            find_iterator_type( ::boost::begin(lit_input), InputEnd, Finder ),
            copy_range_type() );

    transform_iter_type itEnd=
        ::boost::make_transform_iterator( 
            find_iterator_type(),
            copy_range_type() );

    SequenceSequenceT Tmp(itBegin, itEnd);

    Result.swap(Tmp);
    return Result;
}

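iter_split first wraps Input in an iterator range, lit_input. It then builds a split_iterator over that range and wraps it with make_transform_iterator, so that each substring the split_iterator yields is copied into an element of the result container. At this point the split_iterator starts at the beginning of the string. The split_iterator class implements the iterator's ++ operation in increment(). A match_type holds two iterators, begin and end, that delimit a range; each call to do_find returns the next separator match, and m_Match and m_Next move forward accordingly.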

void increment()
{
     match_type FindMatch=this->do_find( m_Next, m_End );

     if(FindMatch.begin()==m_End && FindMatch.end()==m_End)
     {
         if(m_Match.end()==m_End)
         {
             // Mark iterator as eof
             m_bEof=true;
         }
     }

     m_Match=match_type( m_Next, FindMatch.begin() );
     m_Next=FindMatch.end();
 }
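
To make the bookkeeping in increment() easier to follow, the sketch below replays the same steps with the standard library only. The names iter, range, find_comma and split_sketch are hypothetical. The priming step before the loop is an assumption about how the iterator is constructed, which the excerpt above does not show.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

typedef std::string::const_iterator iter;
typedef std::pair<iter, iter> range;   // [first, second), plays the role of match_type

// Hypothetical finder: matches a single ',' and returns (end, end) when none is left.
inline range find_comma(iter begin, iter end)
{
    for (iter it = begin; it != end; ++it)
        if (*it == ',') return range(it, it + 1);
    return range(end, end);
}

// Walks the input with the same state as split_iterator: match, next, end, eof.
template <typename FinderT>
std::vector<std::string> split_sketch(const std::string& input, FinderT finder)
{
    const iter end = input.end();
    iter next = input.begin();
    range match(next, next);   // current token, like m_Match
    bool eof = false;          // like m_bEof

    // Same steps as the increment() shown above.
    auto increment = [&]() {
        range found = finder(next, end);   // like do_find
        if (found.first == end && found.second == end && match.second == end)
            eof = true;                    // no separator left and last token consumed
        match = range(next, found.first);  // token ends where the separator begins
        next = found.second;               // resume after the separator
    };

    // Assumption: the first token is computed up front unless the input is empty.
    if (input.begin() != end) increment();

    std::vector<std::string> parts;
    while (!eof) {
        parts.push_back(std::string(match.first, match.second));
        increment();
    }
    return parts;
}

inline void check_split_sketch()
{
    std::vector<std::string> parts = split_sketch("a,b,,c", find_comma);
    assert(parts.size() == 4);
    assert(parts[0] == "a" && parts[1] == "b" && parts[2] == "" && parts[3] == "c");
}
```

The token is always the text between the previous resume point and the start of the next separator match, which is why two adjacent separators produce an empty string.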

So where does do_find come from?
Looking at the base classes of split_iterator, we find detail::find_iterator_base; do_find is inherited from that class.

template<typename IteratorT>
        class split_iterator : 
            public iterator_facade<
                split_iterator<IteratorT>,
                const iterator_range<IteratorT>,
                forward_traversal_tag >,
            private detail::find_iterator_base<IteratorT>

Now look at do_find. Its m_Finder is the last argument of iter_split, FinderT Finder, which is passed on to the split_iterator. In other words, m_Finder is the functor object created by ::boost::algorithm::token_finder( Pred, eCompress ).

// Find operation
match_type do_find( 
    input_iterator_type Begin,
    input_iterator_type End ) const
{
    if (!m_Finder.empty())
    {
        return m_Finder(Begin,End);
    }
    else
    {
        return match_type(End,End);
    }
}

token_finder is itself only a thin wrapper; the functor type is actually token_finderF.

template< typename PredicateT >
inline detail::token_finderF<PredicateT>
token_finder(
    PredicateT Pred,
    token_compress_mode_type eCompress=token_compress_off )
{
    return detail::token_finderF<PredicateT>( Pred, eCompress );
}

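Now look at the implementation of the token_finderF functor. The line
ForwardIteratorT It=std::find_if( Begin, End, m_Pred );
is the heart of the search. m_Pred is is_any_of("-*"),
a functor that returns true for any character contained in "-*".
In this way the token_finderF functor returns the range that satisfies m_Pred.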

template< typename ForwardIteratorT >
iterator_range<ForwardIteratorT>
operator()(
    ForwardIteratorT Begin,
    ForwardIteratorT End ) const
{
    typedef iterator_range<ForwardIteratorT> result_type;

    ForwardIteratorT It=std::find_if( Begin, End, m_Pred );

    if( It==End )
    {
        return result_type( End, End );
    }
    else
    {
        ForwardIteratorT It2=It;

        if( m_eCompress==token_compress_on )
        {
            // Find first non-matching character
            while( It2!=End && m_Pred(*It2) ) ++It2;
        }
        else
        {
            // Advance by one position
            ++It2;
        }

        return result_type( It, It2 );
    }
}
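
The effect of the two compression modes on a single match can be reproduced with a small standalone sketch. It mirrors the logic above but is written for std::string only. The names compress_mode, is_dash_or_star, find_token and first_match_length are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical enum that mirrors token_compress_mode_type.
enum compress_mode { compress_on, compress_off };

// Hypothetical predicate: true for '-' and '*'.
inline bool is_dash_or_star(char c) { return c == '-' || c == '*'; }

typedef std::string::const_iterator iter;

// Returns the first separator match in [begin, end), or (end, end) if there is none.
template <typename PredT>
std::pair<iter, iter> find_token(iter begin, iter end, PredT pred, compress_mode mode)
{
    iter it = std::find_if(begin, end, pred);
    if (it == end) return std::make_pair(end, end);

    iter it2 = it;
    if (mode == compress_on) {
        // Swallow the whole run of separator characters.
        while (it2 != end && pred(*it2)) ++it2;
    } else {
        // Match exactly one separator character.
        ++it2;
    }
    return std::make_pair(it, it2);
}

// Length of the first separator match in text under the given mode.
inline std::size_t first_match_length(const std::string& text, compress_mode mode)
{
    std::pair<iter, iter> m =
        find_token(text.begin(), text.end(), is_dash_or_star, mode);
    return static_cast<std::size_t>(m.second - m.first);
}

inline void check_find_token()
{
    assert(first_match_length("abc-*-ABC", compress_on) == 3);   // "-*-" as one match
    assert(first_match_length("abc-*-ABC", compress_off) == 1);  // "-" only
}
```

With compression off, each of the three characters in "-*-" is a separate match, so two empty strings appear between neighbouring words. That accounts for the 7 elements in the earlier result, against 3 with compression on.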

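III. Beyond split

As split shows, Boost's string handling is built almost entirely on iterators.
boost::algorithm mainly implements the following families of algorithms:
1. to_upper, to_lower: case conversion
2. trim_left, trim_right, trim: trimming whitespace from the left, the right, or both ends
3. starts_with, ends_with, contains, …: substring relationship tests
4. find: searching
5. replace: replacement
6. split: splitting
7. join: joining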
