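Crawling Web Pages with C# + Selenium + ChromeDriver: Simulating Real User Browsing Behavior

Background

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as if a real user were operating it. For crawlers, driving a browser through Selenium to scrape data from the web is one of the most powerful tools available. Here I will introduce the general usage of Selenium with Google Chrome.

Requirements

In everyday crawler development, some pages are built almost entirely from JavaScript and involve a lot of asynchronous work. With a plain HTTP request (for example from a console program), the source you get back is a pile of JS, and you have to assemble the data yourself, which is laborious. With Selenium + ChromeDriver you read the page after the browser has rendered it, so what you see is what you get.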
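To make the difference concrete, here is a minimal sketch that fetches the same page both ways. The class name, method names, and the idea of comparing lengths are illustrative assumptions, not part of the original project. It also assumes a chromedriver that `new ChromeDriver()` can locate (on the PATH, or resolved by a recent Selenium release).

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical helper, for illustration only.
static class RawVersusRendered
{
    // Plain HTTP: returns the markup exactly as the server sent it; no script is executed.
    public static async Task<string> FetchRawAsync(string url)
    {
        using (var http = new HttpClient())
        {
            return await http.GetStringAsync(url);
        }
    }

    // Selenium: Chrome loads the page and runs its scripts, so PageSource
    // reflects the DOM as it stands at the moment it is read.
    public static string FetchRendered(string url)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);
            return driver.PageSource;
        }
    }
}
```

On a script-heavy page, the string returned by `FetchRendered` would typically contain the data that scripts inserted, while the one from `FetchRawAsync` contains only the scripts themselves.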

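Implementation

Project structure: for ease of use, the sample is a WinForms application, with its dependencies installed as NuGet packages.

Below is the code from form1.cs; only the key methods are shown here. You need the latest Chrome browser installed, and the chromedriver used by the code is v2.9.248315.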

    private void crawlingWebFunc()
    {
        SetText("\r\nStarting...");

        // Each subfolder of "圖片url" (image URLs) contains a data.txt holding one search URL.
        List<testfold> surls = new List<testfold>();
        string path = System.Environment.CurrentDirectory + "\\圖片url\\";
        DirectoryInfo root = new DirectoryInfo(path);
        foreach (var itemdic in root.GetDirectories())
        {
            // Join every line of data.txt into a single string, with no separators.
            string txt = "";
            using (StreamReader sr = new StreamReader(itemdic.FullName + "\\data.txt"))
            {
                while (!sr.EndOfStream)
                {
                    txt += sr.ReadLine();
                }
            }
            surls.Add(new testfold() { key = itemdic.FullName, picurl = txt });
        }

        // chromedriver.exe is expected in the current directory.
        ChromeDriverService service = ChromeDriverService.CreateDefaultService(System.Environment.CurrentDirectory);
        // service.HideCommandPromptWindow = true;

        ChromeOptions options = new ChromeOptions();
        options.AddArguments("--test-type", "--ignore-certificate-errors");
        options.AddArgument("enable-automation");
        // options.AddArgument("headless");
        // options.AddArguments("--proxy-server=http://user:[email protected]:8080");

        using (IWebDriver driver = new OpenQA.Selenium.Chrome.ChromeDriver(service, options, TimeSpan.FromSeconds(120)))
        {
            driver.Url = "https://www.1688.com/";
            Thread.Sleep(200);
            try
            {
                int a = 1;
                foreach (var itemsurls in surls)
                {
                    SetText("\r\nItem " + a.ToString());
                    driver.Navigate().GoToUrl(itemsurls.picurl);

                    // Redirected to the login page?
                    if (driver.Url.Contains("login.1688.com"))
                    {
                        SetText("\r\nLogin required, trying to log in...");
                        trylogin(driver);

                        // Try again after the login attempt.
                        driver.Navigate().GoToUrl("https://s.1688.com/youyuan/index.htm?tab=imageSearch&imageType=oss&imageAddress=cbuimgsearch/eWXC7XHHPN1607529600000&spm=");

                        if (driver.Url.Contains("login.1688.com"))
                        {
                            // Still on the login page: give up.
                            SetText("\r\nExiting; switch IP and retry...");
                            return;
                        }
                    }

                    // The page only shows one hover panel at a time, so the hover content cannot all be
                    // displayed and then downloaded; it has to be fetched another way. Commented-out attempt:
                    // var elements = document.getElementsByClassName('hover-container');
                    // Array.prototype.forEach.call(elements, function(element) {
                    //     element.style.display = "block";
                    //     console.log(element);
                    // });
                    // IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
                    // var sss = js.ExecuteScript(" var elements = document.getElementsByClassName('hover-container'); Array.prototype.forEach.call(elements, function(element) { console.log(element); element.setAttribute(\"class\", \"測試title\"); element.style.display = \"block\"; console.log(element); });");

                    Thread.Sleep(500);
                    // Pagetypeenum.列表 = list page
                    var responseModel = Write(itemsurls.key, driver.PageSource, Pagetypeenum.列表);
                    Thread.Sleep(500);

                    int i = 1;
                    foreach (var offer in responseModel?.data?.offerList ?? new List<OfferItemModel>())
                    {
                        driver.Navigate().GoToUrl(offer.information.detailUrl);
                        // Pagetypeenum.詳情 = detail page
                        Write(itemsurls.key, driver.PageSource, Pagetypeenum.詳情);
                        SetText("\r\nItem " + a.ToString() + "-" + i.ToString());
                        Thread.Sleep(500);
                        i++;
                    }

                    a++; // advance the counter shown in the progress messages
                }
            }
            catch (Exception)
            {
                CloseChromeDriver(driver);
                throw;
            }
        }
    }
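The method above pauses with fixed `Thread.Sleep` calls before it reads `driver.PageSource`. One possible alternative, sketched here, is an explicit wait that polls until a given element exists. The helper class and its name are assumptions for illustration. `WebDriverWait` lives in the `OpenQA.Selenium.Support.UI` namespace, which in older Selenium releases ships in the separate Selenium.Support NuGet package.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper, not part of the original form1.cs.
static class WaitHelpers
{
    // Polls until an element matching the locator is found and returns it.
    // Throws WebDriverTimeoutException if none appears before the timeout.
    public static IWebElement WaitForElement(IWebDriver driver, By locator, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);
        // WebDriverWait ignores NotFoundException while polling,
        // so FindElement is simply retried until it succeeds.
        return wait.Until(d => d.FindElement(locator));
    }
}
```

A call might look like `WaitHelpers.WaitForElement(driver, By.CssSelector(".offer-list"), TimeSpan.FromSeconds(10));`, where `.offer-list` is a made-up selector. Which element signals that the data has finished loading depends on the target page.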
    #region Exception handling: quit chromedriver

    [DllImport("user32.dll", EntryPoint = "FindWindow")]
    private extern static IntPtr FindWindow(string lpClassName, string lpWindowName);

    // wParam and lParam are pointer-sized, so IntPtr keeps the call correct in 64-bit processes.
    [DllImport("user32.dll", EntryPoint = "SendMessage")]
    public static extern IntPtr SendMessage(IntPtr hWnd, uint Msg, IntPtr wParam, IntPtr lParam);

    public const uint WM_CLOSE = 0x0010;
    public const int SW_HIDE = 0;
    public const int SW_SHOW = 5;

    [DllImport("user32.dll", EntryPoint = "ShowWindow")]
    public static extern int ShowWindow(IntPtr hwnd, int nCmdShow);

    /// <summary>
    /// Gets the handle of the chromedriver window.
    /// </summary>
    /// <returns>The window handle, or IntPtr.Zero if no such window exists.</returns>
    public IntPtr GetWindowHandle()
    {
        // Looks for a window whose title is the full path of chromedriver.exe.
        string name = (Environment.CurrentDirectory + "\\chromedriver.exe");
        IntPtr hwd = FindWindow(null, name);
        return hwd;
    }

    /// <summary>
    /// Closes the chromedriver window.
    /// </summary>
    public void CloseWindow()
    {
        try
        {
            IntPtr hwd = GetWindowHandle();
            if (hwd != IntPtr.Zero)
            {
                SendMessage(hwd, WM_CLOSE, IntPtr.Zero, IntPtr.Zero);
            }
        }
        catch { }
    }

    /// <summary>
    /// Quits chromedriver.
    /// </summary>
    /// <param name="driver">The driver instance to shut down.</param>
    public void CloseChromeDriver(IWebDriver driver)
    {
        try
        {
            driver.Quit();
            driver.Dispose();
        }
        catch { }
        CloseWindow();
    }

    #endregion Exception handling: quit chromedriver
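The region above closes chromedriver by finding its window by title, which only works while that window exists (for example, not when `HideCommandPromptWindow` is enabled). As a hedged alternative, leftover processes can be ended by name through `System.Diagnostics.Process`. The class and method names below are assumptions for illustration. Note that this ends every process with that name, including ones started by other programs.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the original form1.cs.
static class DriverCleanup
{
    // Kills every process with the given name (no ".exe" suffix)
    // and returns how many were terminated.
    public static int KillLeftoverDrivers(string processName = "chromedriver")
    {
        int killed = 0;
        foreach (Process p in Process.GetProcessesByName(processName))
        {
            try
            {
                p.Kill();
                p.WaitForExit(5000); // wait up to 5 seconds for the process to end
                killed++;
            }
            catch (Exception)
            {
                // The process already exited or access was denied; skip it.
            }
            finally
            {
                p.Dispose();
            }
        }
        return killed;
    }
}
```

It could be called after `driver.Quit()` as a last resort, for instance from the `catch` block of the crawl method.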

Summary

The approach, in brief:

1. Navigate to the target page with driver.Navigate().GoToUrl

2. Identify the data source and read the data from driver.PageSource

3. Parse the HTML data
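The three steps can be sketched end to end as follows. The original does not show how the HTML is parsed, so this sketch assumes the third-party HtmlAgilityPack NuGet package and simply collects link targets. The class name, method names, and the choice of links as the extracted data are all illustrative assumptions.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical helper, for illustration only.
static class CrawlSketch
{
    // Step 3: parse an HTML string and collect the href of every link.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return links;
        }

        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static List<string> Crawl(string url)
    {
        // Assumes a chromedriver that the default constructor can locate.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);   // Step 1: navigate to the page
            string html = driver.PageSource;  // Step 2: read the rendered source
            return ExtractLinks(html);        // Step 3: parse it
        }
    }
}
```

In a WinForms project, write `HtmlAgilityPack.HtmlDocument` in full if `System.Windows.Forms` is also imported, because that namespace defines its own `HtmlDocument` type.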