Fixing RapidXML's poor UTF-16/Unicode support

阿新 • Published: 2019-01-23

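RapidXML is widely recommended and used as a fast XML parsing library, but most usage examples process UTF-8 encoded strings, and the library's manual also says that UTF-16 is not well supported.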

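In my own tests the project was built with UNICODE, strings were std::wstring, and the XML content even contained Chinese characters, yet simple cases all parsed correctly. So why is UTF-16 support said to be poor? Reading the source, the internals are indeed almost entirely written against the template parameter Ch, so the code adapts automatically to whether char or wchar_t is passed in. Had a newer version of the library already fixed these problems?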

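The optimism did not last. To compare the performance of RapidXML and TinyXML on realistic data, I used a fairly large local table: several hundred rows, multiple fields, with both English and Chinese content. TinyXML parsed it correctly, while RapidXML kept failing. At first I suspected my own handling of the string memory (something RapidXML users do need to be careful about), so I tried various adjustments: an array (too large, did not work), reading from a file (the read itself had problems), a vector, appending the terminator myself, and so on. Re-allocating the buffer with RapidXML's own string allocator did not help either. Something was off: Chinese text had parsed fine in isolation, so why not here?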

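I then changed the query from the whole table to a single row, and parsing succeeded. That suggested some content in the table was breaking the parser (a particular character being misparsed). Eliminating rows by ID, I found one record that always failed. Printing its content in hexadecimal showed a Chinese character whose UTF-16 code unit is 0x5f00, which looks different from the others (mostly 0x00yy or 0xyyyy) because its low byte is 0x00. RapidXML checks for the string terminator while parsing, so perhaps 0x5f00 was being mistaken for the end of the string, which would explain the failure that followed.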
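The hex inspection described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the original test: `hex_dump` and `has_zero_low_byte` are hypothetical helper names, and `std::u16string` is assumed here in place of `std::wstring` so that the code units are 16 bits wide on every platform.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format each UTF-16 code unit as 0xhhhh,
// separated by single spaces.
std::string hex_dump(const std::u16string& text)
{
    std::string out;
    char buf[8]; // "0x" + 4 hex digits + NUL fits in 7 bytes
    for (char16_t unit : text)
    {
        std::snprintf(buf, sizeof buf, "0x%04x", static_cast<unsigned>(unit));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: report whether any non-zero code unit has a
// zero low byte, which is the pattern that stood out in the bad record.
bool has_zero_low_byte(const std::u16string& text)
{
    for (char16_t unit : text)
    {
        if (unit != 0 && (unit & 0x00FF) == 0)
            return true;
    }
    return false;
}
```

Calling `hex_dump(u"\u6253\u5f00\u6743\u9650")` yields `0x6253 0x5f00 0x6743 0x9650`, and `has_zero_low_byte` flags that string because of the second code unit.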

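Testing the character for 0x5f00, “开”, on its own, parsing failed every time. Stepping through in the debugger showed that RapidXML matches key XML characters such as <, >, and whitespace against an array indexed by unsigned char, so 0x5f00 is bound to go wrong.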

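The specific cause: when the low-level parser looks for key characters (whitespace, the terminator, >, <, and so on), it matches against statically defined lookup arrays of unsigned char with 256 entries. If the input character is a wchar_t, it is cast to unsigned char before the lookup. For UTF-16 this cast discards the high byte, so a Chinese character can be treated as a single-byte char. In particular, a code unit whose low byte is 0x00 is treated as the terminator (for example 0x5f00 in the sequence 0x6253 0x5f00 0x6743 0x9650), and a code unit whose low byte happens to equal an ASCII key character can be misinterpreted as well.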
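The truncation can be reproduced in isolation. The sketch below does not use RapidXML's real tables: `demo_text_table` is a hypothetical stand-in that only marks NUL and `<` as "ends a text run", and `char16_t` is assumed instead of `wchar_t` so the result does not depend on the platform's `wchar_t` width.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 256-entry "is text" table:
// 0 means the character ends a text run (here only NUL and '<'),
// 1 means ordinary text. This is NOT RapidXML's actual table.
inline unsigned char demo_text_table(std::size_t index)
{
    return (index == 0x00 || index == 0x3C) ? 0 : 1;
}

// Mimics an unguarded lookup: the cast keeps only the low 8 bits
// of the 16-bit code unit before indexing the table.
inline unsigned char unguarded_is_text(char16_t ch)
{
    return demo_text_table(static_cast<unsigned char>(ch));
}
```

With this stand-in, `unguarded_is_text(0x5F00)` returns 0 because the low byte is 0x00, and a code unit such as 0x4E3C returns 0 because its low byte equals `<` (0x3C), while 0x6253 returns 1 because its low byte 0x53 is an ordinary letter.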

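Once the cause is clear, the fix is easy: Chinese characters are always content, never key characters, so they need no special handling and the table lookup can simply be skipped for them.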

---------------------------------------

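// The library code is modified as follows (note: the whitespace predicate returns a different value from the others; otherwise content containing spaces is parsed as empty):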

///////////////////////////////////////////////////////////////////////
// Internal character utility functions

// Detect whitespace character
struct whitespace_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 0; // Consistent with lookup_whitespace: not whitespace
        }
        return internal::lookup_tables<0>::lookup_whitespace[static_cast<unsigned char>(ch)];
    }
};

// Detect node name character
struct node_name_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1;
        }
        return internal::lookup_tables<0>::lookup_node_name[static_cast<unsigned char>(ch)];
    }
};

// Detect attribute name character
struct attribute_name_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1; // Consistent with lookup_attribute_name
        }
        return internal::lookup_tables<0>::lookup_attribute_name[static_cast<unsigned char>(ch)];
    }
};

// Detect text character (PCDATA)
struct text_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1; // Consistent with lookup_text
        }
        return internal::lookup_tables<0>::lookup_text[static_cast<unsigned char>(ch)];
    }
};

// Detect text character (PCDATA) that does not require processing
struct text_pure_no_ws_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1; // Consistent with lookup_text_pure_no_ws
        }
        return internal::lookup_tables<0>::lookup_text_pure_no_ws[static_cast<unsigned char>(ch)];
    }
};

// Detect text character (PCDATA) that does not require processing
struct text_pure_with_ws_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1; // Consistent with lookup_text_pure_with_ws
        }
        return internal::lookup_tables<0>::lookup_text_pure_with_ws[static_cast<unsigned char>(ch)];
    }
};

// Detect attribute value character
template<Ch Quote>
struct attribute_value_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1;
        }
        if (Quote == Ch('\''))
            return internal::lookup_tables<0>::lookup_attribute_data_1[static_cast<unsigned char>(ch)];
        if (Quote == Ch('\"'))
            return internal::lookup_tables<0>::lookup_attribute_data_2[static_cast<unsigned char>(ch)];
        return 0; // Should never be executed, to avoid warnings on Comeau
    }
};

// Detect attribute value character
template<Ch Quote>
struct attribute_value_pure_pred
{
    static unsigned char test(Ch ch)
    {
        if (ch >= 0x80) // Work around the UTF-16 lookup problem
        {
            return 1; // Consistent with lookup_attribute_data_1_pure
        }
        if (Quote == Ch('\''))
            return internal::lookup_tables<0>::lookup_attribute_data_1_pure[static_cast<unsigned char>(ch)];
        if (Quote == Ch('\"'))
            return internal::lookup_tables<0>::lookup_attribute_data_2_pure[static_cast<unsigned char>(ch)];
        return 0; // Should never be executed, to avoid warnings on Comeau
    }
};

After this change, extensive performance testing ran without problems.
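Since every predicate repeats the same guard, the idea can also be expressed once as a small helper. The following is only a sketch and not part of RapidXML: `guarded_lookup` is a hypothetical name, and it compares the value as an unsigned type, which is an assumption that goes slightly beyond the patch above. With a signed `char`, `ch >= 0x80` is never true, so the patch leaves the UTF-8 path untouched, whereas this helper also short-circuits bytes 0x80 to 0xFF and therefore only behaves the same if the table holds `non_ascii_result` in those slots.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not a RapidXML API): look up `ch` in a
// 256-entry table, but return `non_ascii_result` for any value
// >= 0x80 instead of truncating it to its low byte.
template<class Ch>
inline unsigned char guarded_lookup(const unsigned char (&table)[256],
                                    Ch ch,
                                    unsigned char non_ascii_result)
{
    // Compare as unsigned so that negative char values and wide
    // code units are both handled without sign surprises.
    typedef typename std::make_unsigned<Ch>::type UCh;
    const UCh u = static_cast<UCh>(ch);
    if (u >= 0x80)
        return non_ascii_result;
    return table[static_cast<std::size_t>(u)];
}
```

A predicate could then pass 0 as `non_ascii_result` for the whitespace table and 1 for the text, name, and attribute tables, mirroring the return values chosen in the patch.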
