A Simple Web Crawler Example in C#
阿新 • Published: 2020-07-31
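The following code crawls earthquake data for roughly the past 30 years (1990 to 2020). The actual run took 4.5 hours (on the author's low-spec machine).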
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        FileStream fw = new FileStream("1.txt", FileMode.OpenOrCreate);
        FileStream fw2 = new FileStream("2.txt", FileMode.OpenOrCreate);
        StreamWriter sw = new StreamWriter(fw);
        StreamWriter sw2 = new StreamWriter(fw2);
        WebClient wc = new WebClient();
        wc.Encoding = Encoding.UTF8; // return the page as a UTF-8 string
        string 起始強度 = "1";  // minimum magnitude
        string 終止強度 = "10"; // maximum magnitude
        int 頁數 = 1;           // number of result pages for the current month
        string send_message;
        string html;
        MatchCollection matches; // regex matches for the data we want

        for (int m = 1990; m <= 2020; m++)
        {
            for (int n = 1; n <= 12; n++)
            {
                Console.WriteLine(m * 100 + n); // progress indicator, e.g. 199001
                頁數 = 1; // reset so a stale count from the previous month is not reused
                for (int i = 1; i <= 頁數; i++)
                {
                    send_message = "http://ditu.92cha.com/dizhen.php?page=" + i.ToString()
                        + "&dizhen_ly=usa&dizhen_zjs=" + 起始強度
                        + "&dizhen_zje=" + 終止強度
                        + "&dizhen_riqis=" + m.ToString() + "-" + n.ToString() + "-01"
                        + "&dizhen_riqie=" + m.ToString() + "-" + n.ToString() + "-31";
                    html = wc.DownloadString(send_message);

                    // Table cells holding the earthquake fields
                    matches = Regex.Matches(html, "text-center\">(.*)</td");
                    foreach (Match item in matches)
                    {
                        sw.WriteLine(item.Groups[1].Value);
                    }

                    // Page count; the literal text must match the wording used on the page
                    MatchCollection matches2 = Regex.Matches(html, "條記錄,分(.*)頁顯示");
                    // Link text of each record
                    MatchCollection matches3 = Regex.Matches(html, "_blank\">(.*)</a>");
                    foreach (Match item in matches2)
                    {
                        頁數 = Convert.ToInt32(item.Groups[1].Value);
                    }
                    foreach (Match item in matches3)
                    {
                        sw2.WriteLine(item.Groups[1].Value);
                    }
                }
            }
        }

        // Flush buffered output; without this the tail of each file can be lost
        sw.Close();
        sw2.Close();
        Console.ReadKey();
    }
}
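The crawler above sends "-31" as the end day for every month. As a minimal sketch, the request URL could instead be built with the real last day of each month via DateTime.DaysInMonth. The class name QueryUrl and the method Build are hypothetical and not part of the original post, and whether the site treats an out-of-range day such as February 31 differently from the real month end is an assumption that has not been checked.

```csharp
using System;

// Hypothetical helper, not from the original post.
static class QueryUrl
{
    // Builds the query URL for one result page of one month.
    // The parameter names in the query string are taken from the crawler above.
    public static string Build(int year, int month, int page, string minMag, string maxMag)
    {
        // Real last day of the month (28, 29, 30 or 31).
        int lastDay = DateTime.DaysInMonth(year, month);

        // Same unpadded month format as the original code, e.g. 1990-2-01.
        string start = year + "-" + month + "-01";
        string end = year + "-" + month + "-" + lastDay;

        return "http://ditu.92cha.com/dizhen.php?page=" + page
            + "&dizhen_ly=usa&dizhen_zjs=" + minMag
            + "&dizhen_zje=" + maxMag
            + "&dizhen_riqis=" + start
            + "&dizhen_riqie=" + end;
    }
}

static class QueryUrlDemo
{
    static void Main()
    {
        // February 1990 ends on the 28th, so the URL ends with 1990-2-28.
        Console.WriteLine(QueryUrl.Build(1990, 2, 1, "1", "10"));
    }
}
```

In the crawler, the string concatenation assigned to send_message could be replaced by a call such as QueryUrl.Build(m, n, i, 起始強度, 終止強度).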
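After crawling, the data needs further processing. Here it is written both to an xlsx file and to a database. The xlsx code follows; it requires adding Spire.XLS to the project dependencies: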
using System;
using System.IO;
using Spire.Xls;

namespace txt_to_xml
{
    class Program
    {
        static void Main(string[] args)
        {
            FileStream fw = new FileStream("1.txt", FileMode.OpenOrCreate);
            FileStream fw2 = new FileStream("2.txt", FileMode.OpenOrCreate);
            StreamReader r1 = new StreamReader(fw);
            StreamReader r2 = new StreamReader(fw2);

            Workbook workbook = new Workbook();
            workbook.LoadFromFile("1.xlsx"); // 1.xlsx must already exist
            Worksheet sheet = workbook.Worksheets[0];

            int i = 0;
            // 1900391 is the count hard-coded in the original post; stop early at end of file
            for (int k = 1; k <= 1900391 && !r1.EndOfStream; k++)
            {
                i = k;
                // Each record is five consecutive lines of 1.txt
                sheet.Range[i, 1].Text = r1.ReadLine();
                sheet.Range[i, 2].Text = r1.ReadLine();
                sheet.Range[i, 3].Text = r1.ReadLine();
                sheet.Range[i, 4].Text = r1.ReadLine();
                sheet.Range[i, 5].Text = r1.ReadLine();
                // Two lines of 2.txt are consumed per record; as in the original,
                // both are written to column 6, so the second value is the one kept
                sheet.Range[i, 6].Text = r2.ReadLine();
                sheet.Range[i, 6].Text = r2.ReadLine();
            }

            r1.Close();
            r2.Close();
            Console.WriteLine(i); // last row written
            // FileViewer(path); // FileViewer and path are not defined in the original post
            workbook.SaveToFile("1.xlsx", ExcelVersion.Version2010);
        }
    }
}
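Note that an xls sheet holds only about 65,000 rows (65,536), and an xlsx sheet only about 1 million (1,048,576). Larger data sets will probably need a database (for C# database access methods, see the database topic).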
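To make the row limit concrete, here is a minimal sketch that maps a 1-based record number to a sheet index and a row within that sheet, assuming records beyond the limit are spread over additional sheets instead of moved to a database. The names SheetPlanner, Locate and SheetsNeeded are hypothetical and not from the original post, and the sketch deliberately leaves out the Spire.XLS calls.

```csharp
using System;

// Hypothetical helper, not from the original post.
static class SheetPlanner
{
    // Maximum rows per sheet in the xlsx format.
    public const int MaxRowsXlsx = 1048576;

    // Maps a 1-based record number to (0-based sheet index, 1-based row in that sheet).
    public static (int Sheet, int Row) Locate(int record, int maxRows)
    {
        int zeroBased = record - 1;
        return (zeroBased / maxRows, zeroBased % maxRows + 1);
    }

    // Number of sheets required to hold the given number of records.
    public static int SheetsNeeded(int records, int maxRows)
    {
        return (records + maxRows - 1) / maxRows;
    }
}

static class SheetPlannerDemo
{
    static void Main()
    {
        // 1900391 is the count used in the conversion code above.
        var last = SheetPlanner.Locate(1900391, SheetPlanner.MaxRowsXlsx);
        Console.WriteLine(SheetPlanner.SheetsNeeded(1900391, SheetPlanner.MaxRowsXlsx)); // prints 2
        Console.WriteLine(last.Sheet + ", " + last.Row); // prints 1, 851815
    }
}
```

Splitting across sheets only postpones the problem, so for data that keeps growing, the database route mentioned above remains the more practical choice.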