Garbled text when HttpClient parses the response returned by the server



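While building a web crawler with HttpClient recently, I ran into a modest but annoying problem: after sending a request to a given URL with HttpGet, the Response I received could not be parsed properly and came out as garbage such as 口口??. I had already allowed for Chinese character encodings. The code was as follows: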

// Send the GET request, read the body as text, then convert it for display.
HttpResponse response = HttpUtils.doGet(baseUrl + title + postUrl, headers);
InputStream is = getInputStreamFromResponse(response);
responseText = Utils.getStringFromStream(is);
result = EncodeUtils.unicdoeToGB2312(responseText);

    public static HttpResponse doGet(String url, Map<String, String> headers) {
        HttpClient client = createHttpClient();
        HttpGet getMethod = new HttpGet(url);
        HttpResponse response = null;
        try {
            response = client.execute(getMethod);
        } catch (IOException e) {
            // The request failed; the caller receives null.
            e.printStackTrace();
        }
        return response;
    }

public static String getStringFromStream(InputStream in) {
    StringBuilder buffer = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) when done.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            buffer.append(line).append("\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return buffer.toString();
}



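The logic above, however, does not take into account that the Response's InputStream is compressed and has to be read through the matching stream class. In the figure, the Content-Encoding is gzip, so the stream must be processed with GZIPInputStream. Only the function public static String getStringFromStream(InputStream in) shown above needs to be improved, as follows: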

public static String getStringFromResponse(HttpResponse response) {
    if (response == null) {
        return null;
    }
    String responseText = "";
    InputStream in = getInputStreamFromResponse(response);
    Header[] headers = response.getHeaders("Content-Encoding");
    for (Header h : headers) {
        if (h.getValue().indexOf("gzip") > -1) {
            // For GZip response: decompress first, then decode as UTF-8.
            try {
                GZIPInputStream gzin = new GZIPInputStream(in);
                InputStreamReader isr = new InputStreamReader(gzin, "utf-8");
                responseText = Utils.getStringFromInputStreamReader(isr);
            } catch (IOException exception) {
                exception.printStackTrace();
            }
            // Return here so the consumed stream is not read a second time below.
            return responseText;
        }
    }
    // No gzip Content-Encoding: read the stream directly.
    responseText = Utils.getStringFromStream(in);
    return responseText;
}
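
To reproduce both the symptom and the fix without touching the network, here is a minimal, self-contained sketch using only the JDK. The class and method names (GzipDecodeDemo, gzip, decodeWithoutGunzip, decodeWithGunzip) are made up for illustration and are not part of the code above; the sample string is an arbitrary assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original crawler code.
class GzipDecodeDemo {

    // Compress text the way a server would for "Content-Encoding: gzip".
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The mistake: treat the compressed bytes as if they were UTF-8 text.
    static String decodeWithoutGunzip(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    // The fix: decompress first, then decode the plain bytes as UTF-8.
    static String decodeWithGunzip(byte[] body) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese".
        String original = "\u4e2d\u6587 hello";
        byte[] body = gzip(original);
        System.out.println("without gunzip matches: " + original.equals(decodeWithoutGunzip(body)));
        System.out.println("with gunzip matches:    " + original.equals(decodeWithGunzip(body)));
    }
}
```

The first comparison cannot match because a gzip stream begins with the bytes 0x1f 0x8b, and 0x8b is not a valid leading byte in UTF-8, so the decoder substitutes replacement characters. That is the same kind of garbage seen in the crawler output.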



RFC 2616 for HTTP 1.1 specifies how web servers must indicate encoding transformations using the Content-Encoding header. Although on the surface Content-Encoding (e.g., gzip, deflate, compress) and Content-Type (e.g., application/x-gzip) sound similar, they are in fact two distinct pieces of information. Servers use Content-Type to specify the data type of the entity body, which can be useful for client applications that want to open the content with the appropriate application. Content-Encoding is used solely to specify any additional encoding done by the server before the content was transmitted to the client. Although the HTTP RFC outlines these rules pretty clearly, some web sites respond with “gzip” as the Content-Encoding even though the server has not gzipped the content.
Our testing has shown this problem to be limited to some sites that serve Unix/Linux style “tarball” files. Tarballs are gzip-compressed archive files. By setting the Content-Encoding header to “gzip” on a tarball, the server is specifying that it has additionally gzipped the already gzipped file. This, of course, is unlikely, but it is neither impossible nor non-compliant.
Therein lies the problem. A server responding with content-encoding, such as “gzip,” is specifying the necessary mechanism that the client needs in order to decompress the content. If the server did not actually encode the content as specified, then the client’s decompression would fail.
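
One defensive option on the client side, sketched here as an assumption rather than anything the passage above prescribes, is to peek at the first two bytes of the body and only wrap it in a GZIPInputStream when they are the gzip magic number 0x1f 0x8b. The names GzipSniffer, maybeGunzip, readBody, and gzip are hypothetical, and only JDK classes are used.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of HttpClient or of the code quoted earlier.
class GzipSniffer {

    // Return a decompressing stream only if the body really starts with the
    // gzip magic bytes; otherwise hand the bytes back untouched.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);              // remember the start so the peek can be undone
        int b1 = in.read();
        int b2 = in.read();
        in.reset();              // rewind: nothing has been consumed
        boolean looksGzipped = (b1 == 0x1f && b2 == 0x8b);
        return looksGzipped ? new GZIPInputStream(in) : in;
    }

    // Read a whole body as UTF-8 text, decompressing only when needed.
    static String readBody(byte[] body) {
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzip a string in memory.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A body labelled gzip but sent uncompressed is passed through as-is.
        System.out.println(readBody("plain text".getBytes(StandardCharsets.UTF_8)));
        // A body that really is gzipped is decompressed.
        System.out.println(readBody(gzip("compressed text")));
    }
}
```

Sniffing cannot tell a mislabelled tarball from a genuinely encoded body, because the tarball itself begins with the same magic bytes. It only protects against the case where the header says gzip and the payload is plain.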