My Date with Web Crawlers (Java)

阿新 • Published: 2018-12-18

As I understand it, crawling means using technical means to get hold of a web page's source (here, in Java).

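There are many ways to do it. You can issue the HTTP request from your own code and read back the page source, use a packaged tool such as httpclient or htmlunit, or bring out the heavy artillery: phantomjs (which seems to be falling out of use) and selenium.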

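Simple sites, such as cnblogs, hand over their source to a plain request. Some sites expect a little request information, such as the User-Agent in the request header. Some require cookie-based authentication. And some truly fiendish sites load most of their content dynamically; if you would rather not grind through analyzing their ajax requests, give selenium a try.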
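To make the User-Agent and cookie requirements concrete, here is a minimal sketch that uses only JDK classes. The class name `PoliteRequest`, the helper names and the User-Agent value are made up for illustration; nothing here is specific to any particular site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;

final class PoliteRequest {
    // Assumed browser-like value; any realistic User-Agent string is set the same way.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    /** Installs a JVM-wide in-memory cookie store, so cookies a site sets are replayed on later requests. */
    static CookieManager enableCookies() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Prepares a GET request carrying the User-Agent header. Nothing is sent yet. */
    static HttpURLConnection prepare(String address) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("User-Agent", USER_AGENT);
            connection.setConnectTimeout(5_000); // milliseconds
            connection.setReadTimeout(5_000);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        enableCookies();
        HttpURLConnection connection = prepare("https://www.cnblogs.com/");
        System.out.println(connection.getResponseCode()); // the request is sent here
        connection.disconnect();
    }
}
```

Sites that check more than these two things (signed parameters, JavaScript challenges) still need one of the heavier tools.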

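Next I'll go through how each technique is actually implemented. It's getting late, though, so I'm off to bed. I'll continue tomorrow.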

---------------------------------------------------------------------------------------------------------------------------------------------------

I was more than halfway through writing this when Windows 10 blue-screened. Furious.

Back to work.

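Method 1: issue the HTTP request with Java's own classes and get a connection. This can do a great deal, but it takes a lot of code; a packaged tool such as httpClient can be used instead.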

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

URL url = new URL("https://www.cnblogs.com/");
HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
int responseCode = httpConnection.getResponseCode();
System.out.println(responseCode);
InputStream inputStream = httpConnection.getInputStream();
// Pass the charset explicitly; the one-argument IOUtils.toString(InputStream) is deprecated
String html = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
System.out.println(html);
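If commons-io is not on the classpath, the stream can be read with the JDK alone. The following is a rough sketch, not part of the original code; the class name `StreamText` and its method are hypothetical. It collects all bytes first and decodes once at the end, so a multi-byte character split across two reads is not corrupted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamText {
    /** Reads the stream to its end and decodes the collected bytes with the given charset. */
    static String read(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for httpConnection.getInputStream()
        InputStream in = new java.io.ByteArrayInputStream(
                "<html></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(in, StandardCharsets.UTF_8));
    }
}
```

With the connection from the snippet above, the call would be `StreamText.read(httpConnection.getInputStream(), StandardCharsets.UTF_8)`.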

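Method 2: use a simple wrapper such as IOUtils or jsoup to fetch the source. This suits simple pages with no cookie or User-Agent restrictions. jsoup can also parse the HTML document, which saves writing regular expressions or XPath expressions.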

// Needs org.apache.commons.io.IOUtils
String html = IOUtils.toString(new URL("https://www.cnblogs.com/"), "utf-8");
System.out.println(html);

// Needs org.jsoup.Jsoup and org.jsoup.nodes.Document; the second argument is the timeout in milliseconds
Document htmlDocument = Jsoup.parse(new URL("https://www.cnblogs.com/"), 5 * 1000);
String text = htmlDocument.toString();
System.out.println(text);

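Method 3: use a dedicated HTTP tool such as httpclient. It lets you set the request parameters and read the information that comes back in the response, but pages generated dynamically through ajax are still painful to handle.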

CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet get = new HttpGet("https://www.cnblogs.com/");
get.addHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36");

CloseableHttpResponse response = httpClient.execute(get);
System.out.println(response.getStatusLine().getStatusCode());
// Get the page source; "UTF-8" is only used when the response does not declare a charset
String html = EntityUtils.toString(response.getEntity(), "UTF-8");
System.out.println(html);
Header[] allHeaders = response.getAllHeaders(); // headers returned with the response
for (Header header : allHeaders) {
    System.out.println(header);
}
response.close();
httpClient.close();
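One of the response headers printed above is Content-Type, which is where a page declares its charset. As a rough illustration of reading it by hand, here is a JDK-only sketch; the class `ContentTypeCharset` and its method are hypothetical helpers, not part of httpclient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=utf-8", or the fallback when none is usable.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
    }
}
```

In the httpclient snippet, the value to pass in would come from `response.getFirstHeader("Content-Type")`, which may be null if the server omits the header.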

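Method 4: htmlunit, which can handle pages generated by simple ajax. Its browser engine is not that of a mainstream browser, though, so it has some compatibility problems when processing pages. Create the client object, set a handful of true/false options as needed, set how long to wait for background scripts, then issue the request and read the source. It goes roughly as follows; I have not looked into it in depth.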

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.getOptions().setUseInsecureSSL(true); // accept https sites without validating certificates
webClient.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (true by default)
webClient.getOptions().setCssEnabled(true); // CSS support
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.setJavaScriptTimeout(10000); // timeout for JS execution, in milliseconds
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
URL url = new URL("https://www.cnblogs.com/");
HtmlPage htmlpage = webClient.getPage(url);
// Wait up to 10 seconds for background scripts started by the page (this only makes sense after getPage)
webClient.waitForBackgroundJavaScript(10000);
System.out.println(htmlpage.asXml());
webClient.close();

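Method 5: selenium + chrome. The code opens a real browser and simulates human input, clicks and other events, then reads the source.

(Windows version) Install a recent version of Chrome. Other browsers work too, but I use Chrome here. Then put the chromedriver driver (it can be downloaded online) in Chrome's installation folder, as shown below.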

System.setProperty("webdriver.chrome.driver", "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver.exe");

Map<String, Object> prefs = new HashMap<String, Object>();
prefs.put("profile.default_content_setting_values.notifications", 2);
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("prefs", prefs);
// These flags run Chrome without opening a window and without GPU rendering; they are needed on a Linux server
// options.addArguments("--headless", "--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu");
options.setExperimentalOption("useAutomationExtension", false);

WebDriver driver = new ChromeDriver(options);

// We now have the driver object, which is used to control the browser

Call driver.get("https://www.xxxx.com"); to visit a site, and driver.getPageSource() to get the page source.


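// Close the browser and end the driver process
driver.close(); // closes the current window
driver.quit();  // ends the session and the chromedriver process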

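You can also clear and fill in input boxes, click buttons, and so on.

By the way, this also works when deployed on a Linux server, even one with no graphical interface. You need to install Chrome first, upload the chromedriver build for Linux, and point the code at the chromedriver path.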
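When the same code runs on both a Windows desktop and a Linux server, the driver path can be chosen at startup. This is only a sketch under assumptions: the class `DriverPath` is hypothetical, and the Linux path `/usr/local/bin/chromedriver` is a placeholder for wherever the driver was actually uploaded.

```java
import java.util.Locale;

final class DriverPath {
    // Windows location used earlier in this post
    static final String WINDOWS =
            "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver.exe";
    // Assumed location on the Linux server; adjust to the real upload path
    static final String LINUX = "/usr/local/bin/chromedriver";

    /** Picks the chromedriver path from an os.name value such as "Windows 10" or "Linux". */
    static String forOs(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        return os.startsWith("windows") ? WINDOWS : LINUX;
    }

    public static void main(String[] args) {
        String path = forOs(System.getProperty("os.name"));
        System.setProperty("webdriver.chrome.driver", path);
        System.out.println(path);
    }
}
```

On the server, the commented-out headless arguments shown earlier would also need to be enabled.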

over

I still haven't quite got the hang of this editor's formatting.
