Lua UTF-8: Converting Full-Width Characters to Half-Width
阿新 • Published: 2019-01-23
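From the Unicode code points of the characters involved, we know:
1. The full-width space is 12288 (U+3000); the half-width space is 32.
2. Every other half-width character (33-126) and its full-width counterpart (65281-65374) differ by exactly 65248.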
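The two rules above are plain arithmetic on code points, which a short sketch can make concrete. It assumes Lua 5.3 or later (for the standard utf8 library and "\u{...}" escapes); to_half_width_codepoint is a hypothetical helper name used only for illustration.

```lua
-- Requires Lua 5.3+ (utf8 library, "\u{...}" string escapes).

-- Hypothetical helper: map a full-width code point to its half-width
-- counterpart; any other code point is returned unchanged.
function to_half_width_codepoint(cp)
    if cp == 12288 then                      -- U+3000, full-width space
        return 32
    elseif cp >= 65281 and cp <= 65374 then  -- U+FF01 .. U+FF5E
        return cp - 65248
    end
    return cp
end

local full_a = "\u{FF21}"                    -- full-width "A"
local cp = utf8.codepoint(full_a)            -- 65313
print(cp, utf8.char(to_half_width_codepoint(cp)))
```

Here 65313 - 65248 = 65, the code point of the ASCII letter "A".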
However, the bytes of a UTF-8 character cannot be used directly as an integer, so a function is needed to do the conversion:
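To turn a UTF-8 character into an integer, we also need to know how many bytes the character occupies, which takes a helper function, get_continuous_1_count_of_byte, defined right after utf8_to_num (for details see: http://blog.csdn.net/SKY453589103/article/details/76337557). The code uses bitwise operators, so it requires Lua 5.3 or later:

-- Decode a UTF-8 string into a list of integer code points.
function utf8_to_num(raw_string)
    local result = {}
    local pos = 1
    while pos <= #raw_string do
        local lead = string.byte(raw_string, pos)
        -- For a multi-byte character, the number of leading 1 bits in the
        -- first byte equals its length in bytes; 0 means a single ASCII byte.
        local count_1_of_byte = get_continuous_1_count_of_byte(lead)
        local num = 0
        if count_1_of_byte < 1 then
            num = lead
            count_1_of_byte = 1
        else
            local boundary = 8  -- bits per byte
            -- i is a bit index counted from the most significant bit of the
            -- lead byte; start just past the length prefix and its 0 bit.
            local i = count_1_of_byte + 1
            while i < boundary * count_1_of_byte do
                -- At the start of each continuation byte, skip its "10" prefix.
                if 0 == i % boundary then
                    i = i + 2
                end
                local byte = string.byte(raw_string, pos + math.floor(i / boundary))
                -- Append the current payload bit to num.
                if (1 << (boundary - i % boundary - 1)) & byte ~= 0 then
                    num = (num << 1) | 1
                else
                    num = num << 1
                end
                i = i + 1
            end
        end
        pos = pos + count_1_of_byte
        table.insert(result, num)
    end
    return result
end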

-- Count the consecutive 1 bits in a byte, starting from the most significant bit.
function get_continuous_1_count_of_byte(num)
    if nil == num then
        return -1
    end
    local count = 0
    while num & 0x80 ~= 0 do
        count = count + 1
        num = num << 1
    end
    return count
end
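
The bit-by-bit loop in utf8_to_num works for any sequence length. For the three-byte case that matters here (12288 and 65281-65374 all lie in the range U+0800..U+FFFF, which UTF-8 encodes in three bytes), the same decoding can be sketched with masks and shifts. This is only an illustration, assuming well-formed input and Lua 5.3 or later; decode_3byte is a hypothetical name, not part of the original code.

```lua
-- Requires Lua 5.3+ (bitwise operators).

-- Hypothetical helper: decode the 3-byte UTF-8 sequence starting at pos.
-- Assumes s really contains a well-formed 3-byte sequence at that position.
function decode_3byte(s, pos)
    local b1, b2, b3 = string.byte(s, pos, pos + 2)
    -- Layout: 1110xxxx 10yyyyyy 10zzzzzz  ->  xxxx yyyyyy zzzzzz
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

-- EF BC A1 is the UTF-8 encoding of the full-width "A" (U+FF21).
print(decode_3byte("\xEF\xBC\xA1", 1))  --> 65313
```

The masks 0x0F and 0x3F strip the "1110" and "10" prefixes that the loop above skips one bit at a time.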
Next comes the conversion function itself:
-- Replace every full-width character in a UTF-8 string with its half-width form.
function full_width_to_half_width(raw_string)
    local new_string = {}
    local pos = 1
    while pos <= #raw_string do
        local count_1_of_byte = get_continuous_1_count_of_byte(string.byte(raw_string, pos))
        if 3 == count_1_of_byte then
            -- All full-width characters are encoded in three bytes.
            local char = string.sub(raw_string, pos, pos + 2)
            local num_of_char = utf8_to_num(char)[1]
            if 12288 == num_of_char then
                -- Full-width space -> half-width space.
                table.insert(new_string, string.char(32))
            elseif 65281 <= num_of_char and num_of_char <= 65374 then
                table.insert(new_string, string.char(num_of_char - 65248))
            else
                -- Any other three-byte character (for example, a Chinese
                -- character) is copied unchanged rather than dropped.
                table.insert(new_string, char)
            end
            pos = pos + count_1_of_byte
        else
            -- Everything else is copied through one byte at a time.
            table.insert(new_string, string.sub(raw_string, pos, pos))
            pos = pos + 1
        end
    end
    return table.concat(new_string)
end
The logic is simple enough that it needs no further explanation.
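
For comparison, the same conversion can be sketched on top of the utf8 library that ships with Lua 5.3 and later, which removes the need for hand-written decoding. This is an alternative under stated assumptions, not the approach used above: full_width_to_half_width_utf8lib is a hypothetical name, and utf8.codes raises an error on invalid UTF-8, so the input is assumed to be well-formed.

```lua
-- Requires Lua 5.3+ (utf8 library, "\u{...}" string escapes).

-- Hypothetical alternative built on the standard utf8 library.
function full_width_to_half_width_utf8lib(s)
    local out = {}
    for _, cp in utf8.codes(s) do
        local mapped = cp
        if cp == 12288 then                      -- full-width space
            mapped = 32
        elseif cp >= 65281 and cp <= 65374 then  -- other full-width characters
            mapped = cp - 65248
        end
        out[#out + 1] = utf8.char(mapped)
    end
    return table.concat(out)
end

-- Full-width "H", "i", space, "!" become their ASCII forms.
print(full_width_to_half_width_utf8lib("\u{FF28}\u{FF49}\u{3000}\u{FF01}"))  --> Hi !
```

Characters outside the two full-width ranges, including ASCII and Chinese characters, pass through unchanged.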