Handling Chinese Text in SQL Server (Part 2)
阿新 • Published: 2019-01-24
IV. UTF-8
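For the sake of internationalization and platform independence, more and more websites and applications are adopting UTF-8 as their default character encoding. Encoding conversion is usually done in the front-end programming language. Below are two functions, written in T-SQL, that convert between UCS-2 and UTF-8. They have little practical value and are meant only as an example. (Then again, you may one day need to encode or decode UTF-8 strings at the database level, in which case these two functions will come in handy.)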
CREATE FUNCTION dbo.ucs2_to_utf8(
    @ucs2 varbinary(max)
)
RETURNS varbinary(max)
AS
/*
Code point range            UTF-8 byte sequence
U-00000000 ... U-0000007F   0xxxxxxx
U-00000080 ... U-000007FF   110xxxxx 10xxxxxx
U-00000800 ... U-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
*/
BEGIN
    DECLARE
        @output varbinary(max),
        @i int,
        @code int
    SET @output = 0x
    SET @i = 1
    WHILE 1 = 1
    BEGIN
        -- The input is little-endian UCS-2: swap the two bytes to get the code point.
        SET @code = CAST(SUBSTRING(@ucs2, @i + 1, 1) + SUBSTRING(@ucs2, @i, 1) AS int)
        -- Past the end of the input @code is 0; a NUL character also ends the loop.
        IF @code = 0
            BREAK
        IF @code >= 0x0800
            -- Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            SET @output = @output +
                CAST(@code / 4096 + 224 AS binary(1)) +
                CAST((@code % 4096) / 64 + 128 AS binary(1)) +
                CAST((@code % 4096) % 64 + 128 AS binary(1))
        ELSE IF @code >= 0x0080
            -- Two-byte sequence: 110xxxxx 10xxxxxx
            SET @output = @output +
                CAST(@code / 64 + 192 AS binary(1)) +
                CAST(@code % 64 + 128 AS binary(1))
        ELSE
            -- Single byte: 0xxxxxxx
            SET @output = @output + CAST(@code AS binary(1))
        SET @i = @i + 2
    END
    RETURN @output
END
GO
CREATE FUNCTION dbo.utf8_to_ucs2(
    @utf8 varbinary(max)
)
RETURNS varbinary(max)
AS
BEGIN
    DECLARE
        @output varbinary(max),
        @i int,
        @next int,   -- number of continuation bytes still expected
        @code int,
        @tmp varbinary(1)
    SET @output = 0x
    SET @i = 1
    SET @next = 0
    WHILE 1 = 1
    BEGIN
        SET @tmp = SUBSTRING(@utf8, @i, 1)
        -- SUBSTRING returns an empty value once @i is past the end of the input.
        IF @tmp = 0x
            BREAK
        IF @tmp BETWEEN 0x01 AND 0x7F
            -- ASCII: emit the byte followed by a zero high byte (little-endian).
            SET @output = @output + @tmp + 0x00
        ELSE IF @tmp BETWEEN 0xC0 AND 0xDF
        BEGIN
            -- Lead byte of a two-byte sequence.
            SET @code = (CAST(@tmp AS int) & 0x1F) * 64
            SET @next = 1
        END
        ELSE IF @tmp BETWEEN 0xE0 AND 0xEF
        BEGIN
            -- Lead byte of a three-byte sequence.
            SET @code = (CAST(@tmp AS int) & 0x0F) * 4096
            SET @next = 2
        END
        ELSE IF @tmp BETWEEN 0x80 AND 0xBF AND @next IN (1,2)
        BEGIN
            -- Continuation byte.
            IF @next = 1
            BEGIN
                -- Last continuation byte: complete the code point and emit it.
                SET @code = @code + (CAST(@tmp AS int) & 0x3F)
                SET @output = @output + CAST(NCHAR(@code) AS binary(2))
            END
            IF @next = 2
                SET @code = @code + (CAST(@tmp AS int) & 0x3F) * 64
            SET @next = @next - 1
        END
        ELSE
            -- Anything else is treated as malformed input.
            RETURN NULL
        SET @i = @i + 1
    END
    RETURN @output
END
GO
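As a quick check of the two functions above, the following sketch encodes a short string and decodes it again. It assumes both functions were created in the dbo schema of the current database; the sample string 中文 is chosen only for illustration. The expected values in the comments follow from the bit layout in the comment table (for example, U+4E2D becomes E4 B8 AD).

```sql
-- nvarchar stores N'中文' as 0x2D4E8765 (U+4E2D, U+6587, little-endian).
SELECT dbo.ucs2_to_utf8(CAST(N'中文' AS varbinary(max)))
-- 0xE4B8ADE69687

-- Decode the UTF-8 bytes and view the result as nvarchar.
SELECT CAST(dbo.utf8_to_ucs2(0xE4B8ADE69687) AS nvarchar(max))
-- 中文

-- A continuation byte with no lead byte is rejected as malformed.
SELECT dbo.utf8_to_ucs2(0x80)
-- NULL
```

Note that both functions only cover code points up to U+FFFF, as the comment table in ucs2_to_utf8 indicates; four-byte UTF-8 sequences make utf8_to_ucs2 return NULL.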
V. Other Common Issues
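As described above, when Chinese_PRC_CI_AS is the default collation, char/varchar uses the cp936 encoding and can also store Chinese. My personal recommendation, however, is to use char/varchar only for ASCII characters and to store any string that contains non-ASCII characters (codes above 127) in nchar/nvarchar. This not only supports multiple languages without losing characters from other languages, but also avoids many calculation problems.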
For example:
DECLARE @str varchar(100)
SET @str = '1234567一二三四五六七'
SELECT LEN(@str) --14
SELECT LEFT(@str,10) --1234567一二三
DECLARE @col varchar(10)
SET @col = LEFT(@str,10)
SELECT @col --1234567一
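If a char/varchar value contains Chinese characters, SQL Server's string functions (including LEN, LEFT/RIGHT, SUBSTRING, STUFF, and CHARINDEX/PATINDEX) treat each Chinese character (two bytes) as a single character, whereas the length given for char/varchar when defining a variable or column is measured in bytes. The result is what the example above shows: the 10 characters taken from a string do not fit into a varchar(10) variable or column. This counterintuitive inconsistency can introduce some annoying bugs into a system.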
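For comparison, here is a minimal sketch of the same test using nvarchar, as recommended above. The variable names mirror the earlier example. With nvarchar, the declared length and the string functions both count characters for this string, so the 10 characters returned by LEFT fit into an nvarchar(10) variable. DATALENGTH is used to show the byte count next to the character count.

```sql
DECLARE @str nvarchar(100)
SET @str = N'1234567一二三四五六七'
SELECT LEN(@str)         -- 14 (characters)
SELECT DATALENGTH(@str)  -- 28 (bytes: 2 per character)
SELECT LEFT(@str, 10)    -- 1234567一二三

DECLARE @col nvarchar(10)
SET @col = LEFT(@str, 10)
SELECT @col              -- 1234567一二三 (nothing is truncated)
```

The N prefix on the literal matters: without it the literal is first interpreted as varchar under the database's default code page before being converted to nvarchar.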