
NSStringEncoding: Fixing Text Encoding Problems

I came across a great blog today. I couldn't find a way to follow it, so I'm reposting a few of its most useful posts instead.

Reposted from: http://www.cnblogs.com/zhwl/archive/2012/12/31/2840746.html

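While trying to fetch the home page of Qidian (起點中文網) today, I ran into a problem: if you don't use the right encoding, you can't read anything at all.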

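This is a bad habit I picked up from using C# too much; I had hardly ever thought about encodings before. In C#, even if the encoding is wrong you still read something in, it just comes out garbled. Cocoa is stricter: the read simply fails (in the code below, the method returns nil and fills in the NSError).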

Here is my first attempt, which downloads the Qidian home page into a string.

NSURL *url = [[NSURL alloc] initWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// Try to decode the page as UTF-8; on failure this returns nil and fills in error.
NSString *xml = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
if (xml == nil) {
    NSLog(@"Error reading url at %@", [error localizedFailureReason]);
} else {
    [result setString:xml];
}

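The download failed no matter what, and the error message said the encoding was wrong. Fine, so I opened the documentation and looked up all the encodings: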

enum {
    NSASCIIStringEncoding = 1,
    NSNEXTSTEPStringEncoding = 2,
    NSJapaneseEUCStringEncoding = 3,
    NSUTF8StringEncoding = 4,
    NSISOLatin1StringEncoding = 5,
    NSSymbolStringEncoding = 6,
    NSNonLossyASCIIStringEncoding = 7,
    NSShiftJISStringEncoding = 8,
    NSISOLatin2StringEncoding = 9,
    NSUnicodeStringEncoding = 10,
    NSWindowsCP1251StringEncoding = 11,
    NSWindowsCP1252StringEncoding = 12,
    NSWindowsCP1253StringEncoding = 13,
    NSWindowsCP1254StringEncoding = 14,
    NSWindowsCP1250StringEncoding = 15,
    NSISO2022JPStringEncoding = 21,
    NSMacOSRomanStringEncoding = 30,
    NSUTF16StringEncoding = NSUnicodeStringEncoding,
    NSUTF16BigEndianStringEncoding = 0x90000100,
    NSUTF16LittleEndianStringEncoding = 0x94000100,
    NSUTF32StringEncoding = 0x8c000100,
    NSUTF32BigEndianStringEncoding = 0x98000100,
    NSUTF32LittleEndianStringEncoding = 0x9c000100,
};

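I tried them one by one, and not a single one worked! I was going crazy: in this day and age, could Cocoa really not support Chinese? Impossible.

My guess was that the documentation above lists only the most commonly used encodings (most commonly used in Apple's opinion, that is, which evidently leaves China out; boo!), so I wrote the following code to print every supported encoding: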

const NSStringEncoding *encodings = [NSString availableStringEncodings];
NSMutableString *str = [[NSMutableString alloc] init];
NSStringEncoding encoding;
// The array returned by availableStringEncodings is zero-terminated.
while ((encoding = *encodings++) != 0)
{
    // Cast to int so the value prints as a signed 32-bit number, as in the list below.
    [str appendFormat:@"%@ === %i\n", [NSString localizedNameOfStringEncoding:encoding], (int)encoding];
}
[result setString:str];

Sure enough, I had guessed right. Here is the full list of supported encodings:

Western (Mac OS Roman) === 30

Japanese (Mac OS) === -2147483647

Traditional Chinese (Mac OS) === -2147483646

Korean (Mac OS) === -2147483645

Arabic (Mac OS) === -2147483644

Hebrew (Mac OS) === -2147483643

Greek (Mac OS) === -2147483642

Cyrillic (Mac OS) === -2147483641

Devanagari (Mac OS) === -2147483639

Gurmukhi (Mac OS) === -2147483638

Gujarati (Mac OS) === -2147483637

Thai (Mac OS) === -2147483627

Simplified Chinese (Mac OS) === -2147483623

Tibetan (Mac OS) === -2147483622

Central European (Mac OS) === -2147483619

Symbol (Mac OS) === 6

Dingbats (Mac OS) === -2147483614

Turkish (Mac OS) === -2147483613

Croatian (Mac OS) === -2147483612

Icelandic (Mac OS) === -2147483611

Romanian (Mac OS) === -2147483610

Celtic (Mac OS) === -2147483609

Gaelic (Mac OS) === -2147483608

Keyboard Symbols (Mac OS) === -2147483607

Farsi (Mac OS) === -2147483508

Cyrillic (Mac OS Ukrainian) === -2147483496

Inuit (Mac OS) === -2147483412

Unicode (UTF-32LE) === -1677721344

Unicode (UTF-8) === 4

Unicode (UTF-16) === 10

Unicode (UTF-16BE) === -1879047936

Unicode (UTF-16LE) === -1811939072

Unicode (UTF-32) === -1946156800

Unicode (UTF-32BE) === -1744830208

Western (ISO Latin 1) === 5

Central European (ISO Latin 2) === 9

Western (ISO Latin 3) === -2147483133

Central European (ISO Latin 4) === -2147483132

Cyrillic (ISO 8859-5) === -2147483131

Arabic (ISO 8859-6) === -2147483130

Greek (ISO 8859-7) === -2147483129

Hebrew (ISO 8859-8) === -2147483128

Turkish (ISO Latin 5) === -2147483127

Nordic (ISO Latin 6) === -2147483126

Thai (ISO 8859-11) === -2147483125

Baltic Rim (ISO Latin 7) === -2147483123

Celtic (ISO Latin) === -2147483122

Western (ISO Latin 9) === -2147483121

Romanian (ISO Latin 10) === -2147483120

Latin-US (DOS) === -2147482624

Greek (DOS) === -2147482619

Baltic Rim (DOS) === -2147482618

Western (DOS Latin 1) === -2147482608

Greek (DOS Greek 1) === -2147482607

Central European (DOS Latin 2) === -2147482606

Cyrillic (DOS) === -2147482605

Turkish (DOS) === -2147482604

Portuguese (DOS) === -2147482603

Icelandic (DOS) === -2147482602

Hebrew (DOS) === -2147482601

Canadian French (DOS) === -2147482600

Arabic (DOS) === -2147482599

Nordic (DOS) === -2147482598

Cyrillic (DOS) === -2147482597

Greek (DOS Greek 2) === -2147482596

Thai (Windows, DOS) === -2147482595

Japanese (Windows, DOS) === 8

Simplified Chinese (Windows, DOS) === -2147482591

Korean (Windows, DOS) === -2147482590

Traditional Chinese (Windows, DOS) === -2147482589

Western (Windows Latin 1) === 12

Central European (Windows Latin 2) === 15

Cyrillic (Windows) === 11

Greek (Windows) === 13

Turkish (Windows Latin 5) === 14

Hebrew (Windows) === -2147482363

Arabic (Windows) === -2147482362

Baltic Rim (Windows) === -2147482361

Vietnamese (Windows) === -2147482360

Western (ASCII) === 1

Japanese (Shift JIS X0213) === -2147482072

Chinese (GBK) === -2147482063

Chinese (GB 18030) === -2147482062

Japanese (ISO 2022-JP) === 21

Korean (ISO 2022-KR) === -2147481536

Japanese (EUC) === 3

Simplified Chinese (EUC) === -2147481296

Traditional Chinese (EUC) === -2147481295

Korean (EUC) === -2147481280

Japanese (Shift JIS) === -2147481087

Cyrillic (KOI8-R) === -2147481086

Traditional Chinese (Big 5) === -2147481085

Western (Mac Mail) === -2147481084

Simplified Chinese (HZ GB 2312) === -2147481083

Traditional Chinese (Big 5 HKSCS) === -2147481082

Ukrainian (KOI8-U) === -2147481080

Traditional Chinese (Big 5-E) === -2147481079

Western (NextStep) === 2

Non-lossy ASCII === 7

Western (EBCDIC Latin 1) === -2147480574

At last, the familiar GBK encoding, whose code is -2147482063. OK, let's change the original code:

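NSURL *url = [[NSURL alloc] initWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// 0x80000631 is the same 32-bit pattern as -2147482063 (GBK); it is written in hex
// so the value is not sign-extended where NSStringEncoding is 64 bits wide.
NSStringEncoding encoder = 0x80000631;
NSString *xml = [NSString stringWithContentsOfURL:url encoding:encoder error:&error];
if (xml == nil) {
    NSLog(@"Error reading url at %@", [error localizedFailureReason]);
} else {
    [result setString:xml];
}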

Finally, it works! Seeing familiar Chinese text on screen was a real thrill.
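The large negative numbers in the list are easier to read in hexadecimal. The following is a minimal sketch in plain C (which also compiles inside an Objective-C file), assuming the listed values are 32-bit patterns that were printed as signed integers. The helper names are hypothetical and are not part of Cocoa.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: reinterpret a signed 32-bit value from the list
   as the unsigned bit pattern it was printed from. The conversion is
   well defined in C: it reduces the value modulo 2^32. */
uint32_t encoding_bits(int32_t printed)
{
    return (uint32_t)printed;
}

/* Hypothetical helper: the bits that remain once the top bit is cleared. */
uint32_t encoding_low_bits(int32_t printed)
{
    return encoding_bits(printed) & 0x7FFFFFFFu;
}

/* Hypothetical helper: print one list entry in decimal and in hex. */
void print_encoding(const char *name, int32_t printed)
{
    printf("%s: %ld = 0x%08lX (low bits 0x%04lX)\n",
           name, (long)printed,
           (unsigned long)encoding_bits(printed),
           (unsigned long)encoding_low_bits(printed));
}
```

Calling print_encoding("Chinese (GBK)", -2147482063) shows the pattern 0x80000631, with low bits 0x0631. Those low bits appear to match CoreFoundation's CFStringEncoding constants (kCFStringEncodingGBK_95 is 0x0631), so passing that constant to CFStringConvertEncodingToNSStringEncoding is probably a more readable way to obtain the same value than a magic number. Treat the "top bit plus CFStringEncoding" layout as an observation from this list rather than a documented guarantee.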

 

Further notes on the encoding problem______________________________________________

// our secure service :-)
NSURL *server = [NSURL URLWithString:@"http://www.cocoanetics.com/feed/"];
NSURLRequest *request = [NSURLRequest requestWithURL:server];

// use synchronous convenience method
NSURLResponse *response = nil;
NSError *error = nil;
NSData *returnedData = [NSURLConnection sendSynchronousRequest:request
					returningResponse:&response
					error:&error];
if (!returnedData)
{
	NSLog(@"Error retrieving data, %@", [error localizedDescription]);
	return NO;
}

// get the correct text encoding
// http://stackoverflow.com/questions/1409537/nsdata-to-nsstring-converstion-problem
// fall back to UTF-8 when the response names no charset or an unknown one
NSStringEncoding encoding = NSUTF8StringEncoding;
NSString *encodingName = [response textEncodingName];
if (encodingName)
{
	CFStringEncoding cfEncoding = CFStringConvertIANACharSetNameToEncoding((CFStringRef)encodingName);
	if (cfEncoding != kCFStringEncodingInvalidId)
		encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

// output
NSString *xml = [[[NSString alloc] initWithData:returnedData encoding:encoding]
					autorelease];
NSLog(@"%@", xml);

 

Tip: instead of hard-coding UTF-8, we take the appropriate encoding from the response.
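For HTTP responses, the encoding name typically comes from the charset parameter of the Content-Type header. As a rough illustration of what that lookup involves, here is a sketch in plain C. The function name content_type_charset is hypothetical, and the parsing is deliberately simplified: it handles the common "type/subtype; charset=NAME" shape but not escaped quotes or other corner cases of the HTTP grammar.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the charset parameter of a Content-Type value
   (for example "text/html; charset=GBK") into out.
   Returns 1 if a non-empty charset was found, 0 otherwise. */
int content_type_charset(const char *content_type, char *out, size_t out_size)
{
    static const char key[] = "charset=";
    const size_t key_len = sizeof key - 1;
    const char *p;

    if (content_type == NULL || out == NULL || out_size == 0)
        return 0;

    /* Parameters follow ';' separators, so start at the first one. */
    for (p = strchr(content_type, ';'); p != NULL; p = strchr(p, ';')) {
        size_t i, n = 0;

        p++;                                    /* skip the ';' */
        while (*p == ' ' || *p == '\t')         /* skip leading whitespace */
            p++;

        /* Case-insensitive match against "charset=". The loop stops at
           the terminating NUL, so it never reads past the string. */
        for (i = 0; i < key_len; i++) {
            if (tolower((unsigned char)p[i]) != key[i])
                break;
        }
        if (i < key_len)
            continue;                           /* some other parameter */

        p += key_len;
        if (*p == '"')                          /* optional opening quote */
            p++;

        /* Copy up to the next delimiter, leaving room for the NUL. */
        while (*p != '\0' && *p != ';' && *p != '"' &&
               *p != ' ' && *p != '\t' && n + 1 < out_size)
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

A name extracted this way (say "GBK") is the kind of IANA charset name that CFStringConvertIANACharSetNameToEncoding accepts. When the function returns 0, the caller still has to choose a default encoding.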

____________________________________________________________

 

About HTTP request headers


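Accept: MIME types the browser can accept
Accept-Charset: character sets the browser can accept
Accept-Encoding: content encodings the browser is able to decode, such as gzip
Accept-Language: the languages the browser prefers
Authorization: authorization credentials
Connection: whether a persistent connection is wanted
Content-Length: length of the request message body
Cookie: cookies previously set by the server, sent back with the request
From: email address of the sender of the request
Host: host and port from the original URL
If-Modified-Since: return the requested content only if it has been modified since the given date; otherwise return a Not Modified response
Pragma: a value of "no-cache" means the server must return a fresh copy of the document, even if it is a proxy server that already holds a local copy of the page
Referer: contains a URL; the user reached the currently requested page from the page at that URL
User-Agent: the type of browser
UA-Pixels, UA-Color, UA-OS and UA-CPU: non-standard request headers giving the screen size, color depth, operating system and CPU type

HTTP response headers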

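setContentType: sets the Content-Type header. Most servlets use this method.
setContentLength: sets the Content-Length header. This is useful for browsers that support persistent HTTP connections.
addCookie: sets a cookie
Allow: the request methods the server supports
Content-Encoding: the encoding method of the document
Content-Length: the length of the response body
Content-Type: the MIME type of the document that follows
Date: the current GMT time
Expires: when the document should be considered out of date and no longer cached
Last-Modified: when the document was last changed
Location: where the client should go to fetch the document. Location is usually not set directly but through the sendRedirect() method of HttpServletResponse, which also sets the status code to 302
Refresh: how long the browser should wait before refreshing the page
Server: the server name
Set-Cookie: sets a cookie associated with the page
WWW-Authenticate: the type of authorization information the client should supply in the Authorization header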