An In-Depth Analysis of Garbled Text in Node.js Crawlers

阿新 • Published: 2019-02-05

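The previous article described three kinds of garbled text that the crawler still could not handle. This article looks at the cause behind each of them.
1. The encoding declared in the page source differs from the encoding seen in the captured traffic. This may be deliberate, for example as an anti-crawler measure, or it may simply be a server misconfiguration.
2. Problems caused by the Content-Encoding field being gzip:
During content negotiation between the client and the server, the client sends an Accept-Encoding field. RFC 2616 defines it as follows: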

Examples of its use are:
Accept-Encoding: compress, gzip
Accept-Encoding:
Accept-Encoding: *
Accept-Encoding: compress;q=0.5, gzip;q=1.0
Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
A server tests whether a content-coding is acceptable, according to an Accept-Encoding field, using these rules:
1. If the content-coding is one of the content-codings listed in the Accept-Encoding field, then it is acceptable, unless it is accompanied by a qvalue of 0. (As defined in section 3.9, a qvalue of 0 means “not acceptable.”)
2. The special “*” symbol in an Accept-Encoding field matches any available content-coding not explicitly listed in the header field.
3. If multiple content-codings are acceptable, then the acceptable content-coding with the highest non-zero qvalue is preferred.
4. The “identity” content-coding is always acceptable, unless specifically refused because the Accept-Encoding field includes “identity;q=0”, or because the field includes “*;q=0” and does not explicitly include the “identity” content-coding. If the Accept-Encoding field-value is empty, then only the “identity” encoding is acceptable.
If an Accept-Encoding field is present in a request, and if the server cannot send a response which is acceptable according to the Accept-Encoding header, then the server SHOULD send an error response with the 406 (Not Acceptable) status code.
If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if “identity” is one of the available content-codings, then the server SHOULD use the “identity” content-coding, unless it has additional information that a different content-coding is meaningful to the client.
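
The rules quoted above can be sketched in a few lines of Node.js. This is only an illustration of the RFC 2616 text as I read it, not code taken from any server; the names `parseAcceptEncoding`, `codingQuality` and `pickCoding` are made up for this example.

```javascript
// Parse an Accept-Encoding value into a Map of content-coding -> qvalue.
// Hypothetical helper for illustration; not part of any library.
function parseAcceptEncoding(value) {
  const prefs = new Map();
  for (const part of value.split(",")) {
    const [coding, ...params] = part.split(";").map((s) => s.trim());
    if (coding === "") continue;
    let q = 1; // a coding listed without a qvalue counts as q=1
    for (const p of params) {
      const m = /^q\s*=\s*([0-9.]+)$/i.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    prefs.set(coding.toLowerCase(), q);
  }
  return prefs;
}

// Return the qvalue the client assigns to `coding`; 0 means "not acceptable".
// Pass undefined as headerValue when the request has no Accept-Encoding field.
function codingQuality(headerValue, coding) {
  if (headerValue === undefined) return 1; // server MAY assume anything is accepted
  const prefs = parseAcceptEncoding(headerValue);
  const name = coding.toLowerCase();
  if (prefs.has(name)) return prefs.get(name); // rule 1: listed explicitly
  if (prefs.has("*")) return prefs.get("*"); // rule 2: wildcard covers the rest
  return name === "identity" ? 1 : 0; // rule 4: identity is acceptable unless refused
}

// Rule 3: among the codings the server can produce, prefer the highest qvalue.
// A null result corresponds to the case where the RFC suggests a 406 response.
function pickCoding(headerValue, available) {
  if (headerValue === undefined && available.includes("identity")) {
    return "identity"; // no field present: the server SHOULD use identity
  }
  let best = null;
  let bestQ = 0;
  for (const coding of available) {
    const q = codingQuality(headerValue, coding);
    if (q > bestQ) {
      best = coding;
      bestQ = q;
    }
  }
  return best;
}
```

With these definitions, `pickCoding("*;q=0", ["gzip", "identity"])` returns `null`, because `*;q=0` without an explicit `identity` refuses every coding.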

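So when crawling www.guoguo-app.com, I sent two requests, one without an Accept-Encoding field and one with Accept-Encoding: gzip. The results were as follows: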

GET / HTTP/1.1
Accept-Charset: gbk
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 05:16:22 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

GET / HTTP/1.1
Accept-Charset: gbk
Accept-Encoding: gzip
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 05:21:53 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Content-Language: zh-CN
Content-Encoding: gzip
Server: Tengine/Aserver

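As shown, when the client explicitly states that it accepts gzip, the server responds with gzip; by default the response is not compressed.
For andersonjiang.blog.sohu.com, however, things did not go as smoothly: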

GET / HTTP/1.1
Accept-Encoding: gzip;q=1.0
Host: andersonjiang.blog.sohu.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=GBK
Transfer-Encoding: chunked
Connection: close
Server: nginx
Date: Fri, 07 Apr 2017 05:36:10 GMT
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:00 GMT
RHOST: [email protected]
Pragma: No-cache
Cache-Control: no-cache
Content-Language: en-US
Content-Encoding: gzip
FSS-Cache: MISS from 13998460.19372422.21936590
FSS-Proxy: Powered by 9935166.11245896.17873234

GET / HTTP/1.1
Accept-Encoding: *;q=0
Host: andersonjiang.blog.sohu.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=GBK
Transfer-Encoding: chunked
Connection: close
Server: nginx
Date: Fri, 07 Apr 2017 05:59:08 GMT
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:00 GMT
RHOST: [email protected]
Pragma: No-cache
Cache-Control: no-cache
Content-Language: en-US
Content-Encoding: gzip
FSS-Cache: MISS from 13998460.19372422.21936590
FSS-Proxy: Powered by 10131777.11639115.18069848

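The second request says that the client accepts no content-coding at all, yet the server still returns a gzip-compressed body. This does not follow RFC 2616, so the only way to get at the page content in this case is to decompress the body in the crawler.
As the above shows, whether a server returns a compressed or an uncompressed page depends entirely on the server's own behavior; servers do not strictly follow the recommendations and requirements of the RFC.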
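
Since the server may compress no matter what was requested, the crawler can decide based on the Content-Encoding header of the response rather than on its own request. Below is a minimal sketch using Node's built-in `zlib` module. The function name `decompressBody` is made up for this example, and it assumes `body` is a Buffer holding the whole response body (all chunks already concatenated) and that `headers` uses lower-cased names, as Node's `http` module provides them.

```javascript
const zlib = require("zlib");

// Undo the content-coding the server actually applied, whatever the
// request's Accept-Encoding said. Hypothetical helper for illustration.
function decompressBody(headers, body) {
  const encoding = String(headers["content-encoding"] || "identity")
    .trim()
    .toLowerCase();
  switch (encoding) {
    case "gzip":
    case "x-gzip":
      return zlib.gunzipSync(body);
    case "deflate":
      // Assumes zlib-wrapped deflate data.
      return zlib.inflateSync(body);
    case "identity":
      return body; // not compressed, nothing to do
    default:
      throw new Error(`unsupported Content-Encoding: ${encoding}`);
  }
}
```

The result is still a Buffer of raw bytes; converting it to text is a separate step that depends on the page's character set.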
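3. Garbled text caused by the page's character encoding.
During content negotiation between the client and the server, the client can also send an Accept-Charset field. RFC 2616 defines it as follows: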

An example is
Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
The special value “*”, if present in the Accept-Charset field, matches every character set (including ISO-8859-1) which is not mentioned elsewhere in the Accept-Charset field. If no “*” is present in an Accept-Charset field, then all character sets not explicitly mentioned get a quality value of 0, except for ISO-8859-1, which gets a quality value of 1 if not explicitly mentioned.
If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed.
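
The matching rule in this quote can be sketched as a small function that returns the quality value a client assigns to a given character set. This is just my reading of the RFC text; `charsetQuality` is a name made up for this example.

```javascript
// Return the quality value that an Accept-Charset header assigns to `charset`.
// Pass undefined as headerValue when the request has no Accept-Charset field.
// Hypothetical helper for illustration; not part of any library.
function charsetQuality(headerValue, charset) {
  if (headerValue === undefined) return 1; // no header: any charset is acceptable
  const prefs = new Map();
  for (const part of headerValue.split(",")) {
    const [name, ...params] = part.split(";").map((s) => s.trim());
    if (name === "") continue;
    let q = 1; // a charset listed without a qvalue counts as q=1
    for (const p of params) {
      const m = /^q\s*=\s*([0-9.]+)$/i.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    prefs.set(name.toLowerCase(), q);
  }
  const wanted = charset.toLowerCase();
  if (prefs.has(wanted)) return prefs.get(wanted); // mentioned explicitly
  if (prefs.has("*")) return prefs.get("*"); // wildcard covers everything else
  return wanted === "iso-8859-1" ? 1 : 0; // unmentioned: only ISO-8859-1 gets 1
}
```

For example, `charsetQuality("utf-8;q=1", "GBK")` is 0: a client that lists only utf-8 is, by this rule, saying GBK is not acceptable, although the RFC still allows the server to send it.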

GET / HTTP/1.1
Accept-Charset: gbk;q=0
Accept-Encoding: *;q=0
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 06:23:12 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

GET / HTTP/1.1
Accept-Charset: utf-8;q=1
Accept-Encoding: *;q=0
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Fri, 07 Apr 2017 06:25:57 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver

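In the first request the client refused GBK with q=0, and in the second it asked for UTF-8, but in both cases the Content-Type of the server's response is still GBK.
As the above shows, when a request carries headers such as 'Accept-Charset': 'gbk' and 'Accept-Encoding': 'gzip', the client declares the character sets and content-codings it supports, but the server does not necessarily return the charset or compression the client asked for. Under normal circumstances it is still worth sending both negotiation fields: when the server does have a choice, it will return the form we requested.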
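
Because the server decides the charset, the crawler is safer decoding by what the response declares than by what it requested. The sketch below reads the charset from the Content-Type header, falls back to a `<meta>` declaration in the page, and flags the case where the two disagree (situation 1 above). The function names are made up for this example, and trusting the header over the page source when they disagree is my assumption, not something the captures above prove. It also assumes a Node.js build with full ICU, so that `TextDecoder` accepts labels such as `gbk`; otherwise a library such as iconv-lite would be needed.

```javascript
// Extract the charset parameter from a Content-Type header value, or null.
function charsetFromContentType(contentType) {
  const m = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || "");
  return m ? m[1].toLowerCase() : null;
}

// Look for a charset declared in a <meta> tag near the start of the page.
// The bytes are read as latin1 only to scan the ASCII markup safely.
function charsetFromMeta(body) {
  const head = body.subarray(0, 1024).toString("latin1");
  const m = /<meta[^>]*charset\s*=\s*["']?([^\s"'\/>;]+)/i.exec(head);
  return m ? m[1].toLowerCase() : null;
}

// Decode an (already decompressed) body Buffer into text.
// Assumption: the header wins over the <meta> tag; UTF-8 is the last resort.
function decodeBody(headers, body) {
  const fromHeader = charsetFromContentType(headers["content-type"]);
  const fromMeta = charsetFromMeta(body);
  const charset = fromHeader || fromMeta || "utf-8";
  const text = new TextDecoder(charset).decode(body);
  const mismatch = Boolean(fromHeader && fromMeta && fromHeader !== fromMeta);
  return { text, charset, mismatch };
}
```

When `mismatch` is true, it may be worth decoding with the other candidate as well and checking which result looks sensible, since either declaration can be the wrong one.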
This is an original article by the CSDN blogger 村中少年; please keep this attribution when reposting.
