A Simple Web Crawler Example in C#

The following code crawls earthquake data for the past 30 years. The actual run took 4.5 hours (the author's machine is fairly low-spec).

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // 1.txt receives the table cells, 2.txt the link texts.
        // The using blocks flush and close the writers so no buffered output is lost.
        using (StreamWriter sw = new StreamWriter("1.txt"))
        using (StreamWriter sw2 = new StreamWriter("2.txt"))
        using (WebClient wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8; // return the downloaded page as a UTF-8 string
            string 起始強度 = "1";  // minimum magnitude
            string 終止強度 = "10"; // maximum magnitude

            for (int m = 1990; m <= 2020; m++)
            {
                for (int n = 1; n <= 12; n++)
                {
                    Console.WriteLine(m * 100 + n); // progress output, e.g. 199001
                    int 頁數 = 1; // page count; reset for every month, updated from the page itself

                    for (int i = 1; i <= 頁數; i++)
                    {
                        string send_message = "http://ditu.92cha.com/dizhen.php?page=" + i
                            + "&dizhen_ly=usa&dizhen_zjs=" + 起始強度
                            + "&dizhen_zje=" + 終止強度
                            + "&dizhen_riqis=" + m + "-" + n + "-01"
                            + "&dizhen_riqie=" + m + "-" + n + "-31";
                        string html = wc.DownloadString(send_message);

                        // Use regular expressions to pick the wanted data out of the page,
                        // then write each match in turn. The lazy (.*?) keeps one match per cell.
                        foreach (Match item in Regex.Matches(html, "text-center\">(.*?)</td"))
                        {
                            sw.WriteLine(item.Groups[1].Value);
                        }

                        // Match the page count
                        foreach (Match item in Regex.Matches(html, "條記錄,分(.*?)頁顯示"))
                        {
                            頁數 = Convert.ToInt32(item.Groups[1].Value);
                        }

                        // Match the link texts
                        foreach (Match item in Regex.Matches(html, "_blank\">(.*?)</a>"))
                        {
                            sw2.WriteLine(item.Groups[1].Value);
                        }
                    }
                }
            }
        }
        Console.ReadKey();
    }
}
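
The crawler above sends day 31 as the end date for every month and does not zero-pad the month number. If the site turns out to be strict about dates, the range could be built as in the minimal sketch below. The `MonthRange` class and its `For` method are hypothetical names introduced here for illustration; only `DateTime.DaysInMonth` is a standard .NET call.

```csharp
using System;

// Hypothetical helper, not part of the original crawler.
static class MonthRange
{
    // Returns the first and last day of the given month as yyyy-MM-dd strings.
    public static (string Start, string End) For(int year, int month)
    {
        int lastDay = DateTime.DaysInMonth(year, month); // handles leap years
        string prefix = $"{year:D4}-{month:D2}-";
        return (prefix + "01", prefix + lastDay.ToString("D2"));
    }
}
```

In the crawler loop, `MonthRange.For(m, n)` would then supply the values for `dizhen_riqis` and `dizhen_riqie`. Whether the site actually needs this is untested here.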

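After crawling, the data needs further processing. Here it is written to both an xlsx file and a database. Below is the xlsx code, which requires adding Spire.XLS to the project dependencies: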

using System;
using System.IO;
using Spire.Xls;

namespace txt_to_xml
{
    class Program
    {
        static void Main(string[] args)
        {
            int i = 0; // last worksheet row written
            Workbook workbook = new Workbook();
            workbook.LoadFromFile("1.xlsx"); // 1.xlsx must already exist
            Worksheet sheet = workbook.Worksheets[0];

            using (StreamReader r1 = new StreamReader("1.txt"))
            using (StreamReader r2 = new StreamReader("2.txt"))
            {
                // 1900391 is the hard-coded upper bound from the original run;
                // the loop also stops as soon as 1.txt is exhausted.
                for (int k = 1; k <= 1900391 && !r1.EndOfStream; k++)
                {
                    i = k;
                    // Five consecutive lines of 1.txt form one row (columns 1 to 5).
                    sheet.Range[i, 1].Text = r1.ReadLine();
                    sheet.Range[i, 2].Text = r1.ReadLine();
                    sheet.Range[i, 3].Text = r1.ReadLine();
                    sheet.Range[i, 4].Text = r1.ReadLine();
                    sheet.Range[i, 5].Text = r1.ReadLine();
                    // Two lines of 2.txt are consumed per row. Both are assigned to
                    // column 6, so only the second one is kept; write the second to
                    // column 7 instead if both values are wanted.
                    sheet.Range[i, 6].Text = r2.ReadLine();
                    sheet.Range[i, 6].Text = r2.ReadLine();
                }
            }

            Console.WriteLine(i);
            workbook.SaveToFile("1.xlsx", ExcelVersion.Version2010);
        }
    }
}
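
The conversion above reads five consecutive lines of 1.txt for each worksheet row. The same grouping can be tried out without Spire.XLS, as in the sketch below. `RecordReader` and `Group` are hypothetical names, and dropping a trailing partial record is an assumption made here, not something the original code does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
static class RecordReader
{
    // Groups consecutive lines into records of fieldsPerRecord lines each.
    // A trailing partial record is dropped.
    public static List<string[]> Group(IEnumerable<string> lines, int fieldsPerRecord)
    {
        if (fieldsPerRecord <= 0)
            throw new ArgumentOutOfRangeException(nameof(fieldsPerRecord));

        var records = new List<string[]>();
        var current = new List<string>(fieldsPerRecord);
        foreach (string line in lines)
        {
            current.Add(line);
            if (current.Count == fieldsPerRecord)
            {
                records.Add(current.ToArray());
                current.Clear();
            }
        }
        return records;
    }
}
```

A call such as `RecordReader.Group(System.IO.File.ReadLines("1.txt"), 5)` would yield one array per earthquake record, which makes it easy to check the record count before writing anything to a workbook.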

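It is worth noting that xls supports only a little over 60,000 records (65,536 rows per worksheet) and xlsx only about 1 million (1,048,576 rows per worksheet). Larger data sets will likely need a database. (For C# database access methods, see the database series.)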
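
The loop bound of 1900391 rows used above is larger than a single xlsx worksheet can hold. One option short of a database, shown here only as a sketch, is to spread records over several worksheets. The `SheetLayout` class and `Locate` method are hypothetical names; the code only does the index arithmetic and does not touch Spire.XLS.

```csharp
using System;

// Hypothetical helper for illustration only.
static class SheetLayout
{
    // Row limit of one worksheet in the xlsx format.
    public const int MaxRowsPerSheet = 1048576;

    // Maps a zero-based record index to a zero-based sheet index
    // and a one-based row number within that sheet.
    public static (int Sheet, int Row) Locate(int recordIndex)
    {
        if (recordIndex < 0)
            throw new ArgumentOutOfRangeException(nameof(recordIndex));

        return (recordIndex / MaxRowsPerSheet, recordIndex % MaxRowsPerSheet + 1);
    }
}
```

With this mapping, record 1,048,576 (zero-based) lands on row 1 of the second sheet. For data of this size a database is still likely to be the more practical choice.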