Fixing Garbled Text When Fetching Pages in C#
阿新 • Published: 2019-01-24
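While working on a page-scraping task recently, I found that fetched pages came back garbled, but only some of the time. The character encoding turned out to be fine; the real cause was GZIP compression.
With the generous help of some friends I finally solved it. The code below fetches gzip-compressed pages, and other pages, without garbling.
The core code is as follows: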
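To see why compression shows up as garbled text, here is a minimal sketch that compresses a string in memory and then decodes the result two ways. The class name `GzipMojibakeDemo` and its methods are made up for illustration and are not part of the original code.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipMojibakeDemo
{
    // Compress a string the way a server would for "Content-Encoding: gzip".
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the stream has been closed.
            return buffer.ToArray();
        }
    }

    // Decompress first, then decode the bytes as UTF-8.
    public static string Decompress(byte[] compressed)
    {
        using (MemoryStream input = new MemoryStream(compressed))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string html = "<html><body>抓取頁面</body></html>";
        byte[] body = Compress(html);

        // Wrong: treating the compressed bytes as text yields garbage.
        string garbled = Encoding.UTF8.GetString(body);
        Console.WriteLine(garbled == html);          // False

        // Right: decompress, then decode.
        Console.WriteLine(Decompress(body) == html); // True
    }
}
```

The charset is correct in both cases; the only difference is whether the bytes are decompressed before being read as text.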
C# code:
    using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
    {
        if (response.ContentEncoding.ToLower().Contains("gzip"))
        {
            // gzip-compressed body: decompress before decoding the text
            using (GZipStream stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress))
            {
                using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                {
                    sHTML = reader.ReadToEnd();
                }
            }
        }
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
        {
            // deflate-compressed body
            using (DeflateStream stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress))
            {
                using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                {
                    sHTML = reader.ReadToEnd();
                }
            }
        }
        else
        {
            // uncompressed body: read it directly
            using (Stream stream = response.GetResponseStream())
            {
                using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                {
                    sHTML = reader.ReadToEnd();
                }
            }
        }
    }
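The three branches above differ only in which stream wraps the response body, so they can be folded into one helper. This is a minimal sketch, not the author's code: `BodyDecoder`, `WrapDecompression` and `ReadBody` are hypothetical names, and the helper takes a plain `Stream` so it can be tried without a network call.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class BodyDecoder
{
    // Hypothetical helper: wrap the raw body in the decompressor
    // named by the Content-Encoding header value.
    public static Stream WrapDecompression(Stream body, string contentEncoding)
    {
        string encoding = (contentEncoding ?? string.Empty).ToLowerInvariant();
        if (encoding.Contains("gzip"))
            return new GZipStream(body, CompressionMode.Decompress);
        if (encoding.Contains("deflate"))
            return new DeflateStream(body, CompressionMode.Decompress);
        return body; // not compressed
    }

    // Decompress if needed, then decode as UTF-8 (same assumption as the original code).
    public static string ReadBody(Stream body, string contentEncoding)
    {
        using (Stream decoded = WrapDecompression(body, contentEncoding))
        using (StreamReader reader = new StreamReader(decoded, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Build a gzip-compressed body in memory to stand in for a response stream.
        byte[] raw = Encoding.UTF8.GetBytes("<p>hello</p>");
        MemoryStream compressed = new MemoryStream();
        using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Compress, true))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        compressed.Position = 0;

        Console.WriteLine(ReadBody(compressed, "gzip")); // <p>hello</p>
    }
}
```

With a real response the call would look like `BodyDecoder.ReadBody(response.GetResponseStream(), response.ContentEncoding)`. Another option is to set `req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;` before calling `GetResponse()`, which lets `HttpWebRequest` decompress the body itself.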
Complete front-end code, gethtml.aspx:
ASP.NET code:
    <%@ Page Language="C#" AutoEventWireup="true" CodeFile="gethtml.aspx.cs" Inherits="gethtml" ValidateRequest="false" %>
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head runat="server">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <title>Fetch page</title>
    </head>
    <body>
        <form id="form1" runat="server">
        <div>
            URL: <asp:TextBox ID="url" runat="server" Text="http://www.baidu.com" style="width:400px;"></asp:TextBox><asp:Button ID="Button1" runat="server" Text="Fetch" OnClick="Button1_Click" /><br />
            <textarea name="code" id="code" runat="server" style="width:530px;height:300px;"></textarea>
        </div>
        </form>
    </body>
    </html>
Complete code-behind, gethtml.aspx.cs:
C# code:
    using System;
    using System.Net;
    using System.IO;
    using System.Text;
    using System.IO.Compression;

    public partial class gethtml : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
        }

        public static string GetHtmlWithUtf(string url)
        {
            // Prepend a scheme only when the URL does not already start with one.
            if (!(url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                  url.StartsWith("https://", StringComparison.OrdinalIgnoreCase)))
            {
                url = "http://" + url;
            }
            HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
            // UserAgent takes only the header value, without the "User-Agent:" name.
            req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
            req.Accept = "*/*";
            req.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
            req.ContentType = "text/xml";
            string sHTML = string.Empty;
            using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
            {
                if (response.ContentEncoding.ToLower().Contains("gzip"))
                {
                    // gzip-compressed body: decompress before decoding the text
                    using (GZipStream stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress))
                    {
                        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                        {
                            sHTML = reader.ReadToEnd();
                        }
                    }
                }
                else if (response.ContentEncoding.ToLower().Contains("deflate"))
                {
                    // deflate-compressed body
                    using (DeflateStream stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress))
                    {
                        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                        {
                            sHTML = reader.ReadToEnd();
                        }
                    }
                }
                else
                {
                    // uncompressed body: read it directly
                    using (Stream stream = response.GetResponseStream())
                    {
                        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                        {
                            sHTML = reader.ReadToEnd();
                        }
                    }
                }
            }
            return sHTML;
        }

        protected void Button1_Click(object sender, EventArgs e)
        {
            string urlstr = url.Text;
            // InnerText HTML-encodes the source, so it shows as text inside the textarea.
            code.InnerText = GetHtmlWithUtf(urlstr);
        }
    }
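The code above always decodes the body as UTF-8. If a page declares a different charset, garbling can remain even after decompression. Below is a small sketch of picking the encoding from the charset the server reports, falling back to UTF-8. `CharsetResolver` and `Resolve` are hypothetical names added for illustration and are not part of the original code.

```csharp
using System;
using System.Text;

public static class CharsetResolver
{
    // Return the encoding named by charset, or the fallback when the
    // name is missing or not recognised by the runtime.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return fallback;
        try
        {
            // Some servers send the charset wrapped in quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("iso-8859-1", Encoding.UTF8).WebName);      // iso-8859-1
        Console.WriteLine(Resolve("no-such-charset", Encoding.UTF8).WebName); // utf-8
        Console.WriteLine(Resolve(null, Encoding.UTF8).WebName);              // utf-8
    }
}
```

Inside `GetHtmlWithUtf` this could be used as `new StreamReader(stream, CharsetResolver.Resolve(response.CharacterSet, Encoding.UTF8))`, assuming the server's declared charset is trustworthy.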
(End)