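NSStringEncoding: Solving Text Encoding Problems
Today I came across a great blog, but there is no way to follow it, so I am reposting a few of its most useful posts instead.
Reposted from: http://www.cnblogs.com/zhwl/archive/2012/12/31/2840746.html
Today, while trying to scrape the home page of Qidian (起點中文網), I ran into a problem: if you do not use the right encoding, you cannot read anything at all.
This is a bad habit I picked up from using C# too much; I basically never had to think about encodings before. In C#, even with the wrong encoding you still get something back,
it is just garbage. Cocoa is stricter: the call simply fails, returning nil and filling in an NSError.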
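To make the failure less mysterious, here is a rough sketch in plain C of the kind of check a strict UTF-8 decoder performs. It is only an illustration, not Foundation's actual implementation: the function name looks_like_utf8 is made up, and the check is simplified (it does not reject every overlong form or surrogate code point). The point is that the GBK bytes for "中文" (D6 D0 CE C4) are not well-formed UTF-8, so a decoder that refuses to guess has nothing to return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..len) has the byte structure of UTF-8.
   Simplified sketch: checks lead and continuation byte patterns only. */
bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                               /* continuation bytes expected */
        if (b < 0x80)                    extra = 0; /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1; /* 2-byte sequence */
        else if (b >= 0xE0 && b <= 0xEF) extra = 2; /* 3-byte sequence */
        else if (b >= 0xF0 && b <= 0xF4) extra = 3; /* 4-byte sequence */
        else return false;          /* stray continuation or invalid lead byte */
        if (extra > len - i - 1) return false;      /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            /* every continuation byte must look like 10xxxxxx */
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Called on the four GBK bytes above it returns false at the second byte (0xD0 is not a continuation byte), while the UTF-8 bytes for the same text (E4 B8 AD E6 96 87) pass.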
Below is the code I first wrote, which downloads the Qidian home page into a string.
NSURL *url = [NSURL URLWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// Try to decode the downloaded bytes as UTF-8.
NSString *xml = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
if (xml == nil)
{ NSLog(@"Error reading url at %@", [error localizedFailureReason]); }
else { [result setString:xml]; }
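It stubbornly refused to download, and the error said the encoding was wrong. Fine, so I opened the documentation and looked through all the encodings: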
enum {
NSASCIIStringEncoding = 1,
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5,
NSSymbolStringEncoding = 6,
NSNonLossyASCIIStringEncoding = 7,
NSShiftJISStringEncoding = 8,
NSISOLatin2StringEncoding = 9,
NSUnicodeStringEncoding = 10,
NSWindowsCP1251StringEncoding = 11,
NSWindowsCP1252StringEncoding = 12,
NSWindowsCP1253StringEncoding = 13,
NSWindowsCP1254StringEncoding = 14,
NSWindowsCP1250StringEncoding = 15,
NSISO2022JPStringEncoding = 21,
NSMacOSRomanStringEncoding = 30,
NSUTF16StringEncoding = NSUnicodeStringEncoding,
NSUTF16BigEndianStringEncoding = 0x90000100,
NSUTF16LittleEndianStringEncoding = 0x94000100,
NSUTF32StringEncoding = 0x8c000100,
NSUTF32BigEndianStringEncoding = 0x98000100,
NSUTF32LittleEndianStringEncoding = 0x9c000100,
};
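I tried them one by one,
and none of them worked! I was stunned. In this day and age, does Cocoa really not support Chinese? Impossible.
My guess was that the documentation above lists only the most commonly used encodings (the ones Apple considers most common, that is, which pretty much ignores China),
so I wrote the following code to print every supported encoding: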
const NSStringEncoding *encodings = [NSString availableStringEncodings];
NSMutableString *str = [[NSMutableString alloc] init];
NSStringEncoding encoding;
// The list returned by availableStringEncodings is terminated by 0.
while ((encoding = *encodings++) != 0)
{
// %i with an int cast prints the signed 32-bit values shown in the list below.
[str appendFormat:@"%@ === %i\n", [NSString localizedNameOfStringEncoding:encoding], (int)encoding];
}
[result setString:str];
Sure enough, my guess was right. Here is the full list of supported encodings:
Western (Mac OS Roman) === 30
Japanese (Mac OS) === -2147483647
Traditional Chinese (Mac OS) === -2147483646
Korean (Mac OS) === -2147483645
Arabic (Mac OS) === -2147483644
Hebrew (Mac OS) === -2147483643
Greek (Mac OS) === -2147483642
Cyrillic (Mac OS) === -2147483641
Devanagari (Mac OS) === -2147483639
Gurmukhi (Mac OS) === -2147483638
Gujarati (Mac OS) === -2147483637
Thai (Mac OS) === -2147483627
Simplified Chinese (Mac OS) === -2147483623
Tibetan (Mac OS) === -2147483622
Central European (Mac OS) === -2147483619
Symbol (Mac OS) === 6
Dingbats (Mac OS) === -2147483614
Turkish (Mac OS) === -2147483613
Croatian (Mac OS) === -2147483612
Icelandic (Mac OS) === -2147483611
Romanian (Mac OS) === -2147483610
Celtic (Mac OS) === -2147483609
Gaelic (Mac OS) === -2147483608
Keyboard Symbols (Mac OS) === -2147483607
Farsi (Mac OS) === -2147483508
Cyrillic (Mac OS Ukrainian) === -2147483496
Inuit (Mac OS) === -2147483412
Unicode (UTF-32LE) === -1677721344
Unicode (UTF-8) === 4
Unicode (UTF-16) === 10
Unicode (UTF-16BE) === -1879047936
Unicode (UTF-16LE) === -1811939072
Unicode (UTF-32) === -1946156800
Unicode (UTF-32BE) === -1744830208
Western (ISO Latin 1) === 5
Central European (ISO Latin 2) === 9
Western (ISO Latin 3) === -2147483133
Central European (ISO Latin 4) === -2147483132
Cyrillic (ISO 8859-5) === -2147483131
Arabic (ISO 8859-6) === -2147483130
Greek (ISO 8859-7) === -2147483129
Hebrew (ISO 8859-8) === -2147483128
Turkish (ISO Latin 5) === -2147483127
Nordic (ISO Latin 6) === -2147483126
Thai (ISO 8859-11) === -2147483125
Baltic Rim (ISO Latin 7) === -2147483123
Celtic (ISO Latin) === -2147483122
Western (ISO Latin 9) === -2147483121
Romanian (ISO Latin 10) === -2147483120
Latin-US (DOS) === -2147482624
Greek (DOS) === -2147482619
Baltic Rim (DOS) === -2147482618
Western (DOS Latin 1) === -2147482608
Greek (DOS Greek 1) === -2147482607
Central European (DOS Latin 2) === -2147482606
Cyrillic (DOS) === -2147482605
Turkish (DOS) === -2147482604
Portuguese (DOS) === -2147482603
Icelandic (DOS) === -2147482602
Hebrew (DOS) === -2147482601
Canadian French (DOS) === -2147482600
Arabic (DOS) === -2147482599
Nordic (DOS) === -2147482598
Cyrillic (DOS) === -2147482597
Greek (DOS Greek 2) === -2147482596
Thai (Windows, DOS) === -2147482595
Japanese (Windows, DOS) === 8
Simplified Chinese (Windows, DOS) === -2147482591
Korean (Windows, DOS) === -2147482590
Traditional Chinese (Windows, DOS) === -2147482589
Western (Windows Latin 1) === 12
Central European (Windows Latin 2) === 15
Cyrillic (Windows) === 11
Greek (Windows) === 13
Turkish (Windows Latin 5) === 14
Hebrew (Windows) === -2147482363
Arabic (Windows) === -2147482362
Baltic Rim (Windows) === -2147482361
Vietnamese (Windows) === -2147482360
Western (ASCII) === 1
Japanese (Shift JIS X0213) === -2147482072
Chinese (GBK) === -2147482063
Chinese (GB 18030) === -2147482062
Japanese (ISO 2022-JP) === 21
Korean (ISO 2022-KR) === -2147481536
Japanese (EUC) === 3
Simplified Chinese (EUC) === -2147481296
Traditional Chinese (EUC) === -2147481295
Korean (EUC) === -2147481280
Japanese (Shift JIS) === -2147481087
Cyrillic (KOI8-R) === -2147481086
Traditional Chinese (Big 5) === -2147481085
Western (Mac Mail) === -2147481084
Simplified Chinese (HZ GB 2312) === -2147481083
Traditional Chinese (Big 5 HKSCS) === -2147481082
Ukrainian (KOI8-U) === -2147481080
Traditional Chinese (Big 5-E) === -2147481079
Western (NextStep) === 2
Non-lossy ASCII === 7
Western (EBCDIC Latin 1) === -2147480574
At last, the familiar GBK encoding, with the value -2147482063. OK, let's change the original code:
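NSURL *url = [NSURL URLWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// GBK, printed above as -2147482063. Building it from the CoreFoundation
// constant avoids passing a negative literal to an unsigned parameter,
// which produces a different value on 64-bit builds.
NSStringEncoding gbk = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95);
NSString *xml = [NSString stringWithContentsOfURL:url encoding:gbk error:&error];
if (xml == nil)
{ NSLog(@"Error reading url at %@", [error localizedFailureReason]); }
else { [result setString:xml]; }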
Finally working! Seeing familiar Chinese text was a real thrill.
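A note on where that odd negative number comes from. The sketch below, in plain C so it can be compiled anywhere, works through the arithmetic. It rests on two assumptions that are consistent with the list above but are not taken from the original post: that kCFStringEncodingGBK_95 is 0x0631, and that for encodings outside the small built-in enum the NSStringEncoding value is the CoreFoundation value with the top bit set. All function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from CFStringEncodingExt.h: kCFStringEncodingGBK_95. */
#define CF_GBK_95 0x0631u

/* Assumption: the NSStringEncoding of an "external" encoding is the
   CFStringEncoding value with bit 31 set. */
uint32_t ns_encoding_from_cf(uint32_t cf_encoding)
{
    assert((cf_encoding & 0x80000000u) == 0);   /* CF values leave bit 31 clear */
    return 0x80000000u | cf_encoding;
}

/* The number a 32-bit "%i" prints: the same 32 bits read as signed. */
int64_t signed32_view(uint32_t value)
{
    return (value & 0x80000000u) ? (int64_t)value - INT64_C(4294967296)
                                 : (int64_t)value;
}

/* What a negative int literal becomes when it is widened to a 64-bit
   unsigned parameter: it is sign-extended, so the upper 32 bits are ones. */
uint64_t widened_to_unsigned64(int32_t literal)
{
    return (uint64_t)(int64_t)literal;
}
```

Under those assumptions ns_encoding_from_cf(CF_GBK_95) is 0x80000631, and signed32_view(0x80000631) is exactly the -2147482063 from the list. On the other hand widened_to_unsigned64(-2147482063) is 0xFFFFFFFF80000631, which is a different number, so hard-coding the negative literal is only safe where NSStringEncoding is 32 bits wide.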
Further notes on the encoding problem ______________________________________________
// our secure service :-)
NSURL *server = [NSURL URLWithString:@"http://www.cocoanetics.com/feed/"];
NSURLRequest *request = [NSURLRequest requestWithURL:server];
// use synchronous convenience method
NSURLResponse *response = nil;
NSError *error = nil;
NSData *returnedData = [NSURLConnection sendSynchronousRequest:request
returningResponse:&response
error:&error];
if (!returnedData)
{
NSLog(@"Error retrieving data, %@", [error localizedDescription]);
return NO;
}
// get the correct text encoding
// http://stackoverflow.com/questions/1409537/nsdata-to-nsstring-converstion-problem
// fall back to UTF-8 when the response does not name a usable charset
NSStringEncoding encoding = NSUTF8StringEncoding;
NSString *encodingName = [response textEncodingName];
if (encodingName != nil)
{
CFStringEncoding cfEncoding = CFStringConvertIANACharSetNameToEncoding((CFStringRef)encodingName);
if (cfEncoding != kCFStringEncodingInvalidId)
encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
// output
NSString *xml = [[[NSString alloc] initWithData:returnedData encoding:encoding]
autorelease];
NSLog(@"%@", xml);
Tip: instead of hard-coding UTF-8, we take the appropriate encoding from the response.
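The encoding name in the response normally comes from the charset parameter of the Content-Type header. As a rough sketch of what that lookup involves, here is a small plain-C parser. The function names are hypothetical, and it handles only the simple cases (an optional quoted value, no escaped characters); it is not how Foundation does it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of up to n bytes (ASCII only). */
static int ascii_ncasecmp(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb || ca == '\0') return ca - cb;
    }
    return 0;
}

/* Copies the charset parameter of a Content-Type value into out.
   Returns 1 on success, 0 if there is no charset parameter or the
   value does not fit into out. */
int charset_from_content_type(const char *content_type, char *out, size_t out_size)
{
    assert(content_type != NULL && out != NULL);
    const char *p = content_type;
    while ((p = strchr(p, ';')) != NULL) {
        p++;                                        /* skip the ';' */
        while (*p == ' ' || *p == '\t') p++;        /* skip optional whitespace */
        if (ascii_ncasecmp(p, "charset=", 8) != 0) continue;
        p += 8;
        int quoted = (*p == '"');
        if (quoted) p++;
        size_t n = 0;                               /* length of the value */
        while (p[n] != '\0' && p[n] != ';' &&
               (quoted ? p[n] != '"' : (p[n] != ' ' && p[n] != '\t')))
            n++;
        if (n == 0 || n + 1 > out_size) return 0;
        memcpy(out, p, n);
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

Given "text/html; charset=GBK" it writes "GBK" into the buffer; given a bare "text/html" it returns 0, which is the case where a caller has to fall back to a default or sniff the content.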
____________________________________________________________
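About HTTP request headers
Accept: the MIME types the browser accepts
Accept-Charset: the character sets the browser accepts
Accept-Encoding: the content encodings the browser can decode, such as gzip
Accept-Language: the languages the browser prefers
Authorization: authorization credentials
Connection: whether a persistent connection is wanted
Content-Length: the length of the request body
Cookie: cookies sent back to the server with the request
From: the e-mail address of the request sender
Host: the host and port from the original URL
If-Modified-Since: return the content only if it has been modified after the given date; otherwise answer with a Not Modified response
Pragma: the value "no-cache" means the server must return a fresh document, even if it is a proxy server that already has a local copy of the page
Referer: the URL of the page from which the user reached the currently requested page
User-Agent: the browser type
UA-Pixels, UA-Color, UA-OS and UA-CPU: nonstandard request headers giving screen size, color depth, operating system and CPU type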
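To show how a few of these headers look on the wire, here is a minimal plain-C sketch that formats a GET request. The function name and the particular header values (including the Accept-Charset preference) are illustrative assumptions, not something from the original post.

```c
#include <assert.h>
#include <stdio.h>

/* Writes a minimal HTTP/1.1 GET request into buf.
   Returns the number of bytes written (excluding the terminating NUL),
   or -1 if buf is too small. Header values are examples only. */
int build_get_request(char *buf, size_t size, const char *host, const char *path)
{
    assert(buf != NULL && host != NULL && path != NULL);
    int n = snprintf(buf, size,
        "GET %s HTTP/1.1\r\n"
        "Host: %s\r\n"
        "Accept: text/html\r\n"
        "Accept-Charset: GBK, utf-8;q=0.7\r\n"
        "Accept-Language: zh-CN\r\n"
        "Connection: close\r\n"
        "\r\n",                 /* an empty line ends the header section */
        path, host);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Each header is one "Name: value" line ending in CRLF, and the empty line marks the end of the headers. A server is free to ignore Accept-Charset, which is why the response's own declared charset still has to be checked.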
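HTTP response headers
setContentType: sets the Content-Type header. Most servlets use this method.
setContentLength: sets the Content-Length header. This is useful for browsers that support persistent HTTP connections.
addCookie: sets a cookie
Allow: the request methods the server supports
Content-Encoding: the encoding applied to the document
Content-Length: the length of the response body
Content-Type: the MIME type of the document that follows
Date: the current time in GMT
Expires: when the document should be considered out of date and no longer cached
Last-Modified: when the document was last changed
Location: where the client should go to fetch the document. Location is usually not set directly but through the sendRedirect() method of HttpServletResponse, which also sets the status code to 302
Refresh: how long the browser should wait before reloading the page
Server: the server name
Set-Cookie: sets a cookie associated with the page
WWW-Authenticate: what kind of authorization information the client should supply in the Authorization header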