A Step-by-Step Guide to Fetching Web Page Data in Java

阿新 • Published: 2020-03-26

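In many industries we need to analyze the market, which means classifying and summarizing that industry's data. Analyzing the data promptly gives a company a useful reference point, and a basis for comparison with its peers, when planning its future. At present, collecting the data over the network is an effective and fast way to do this.
First, a brief introduction to the steps involved in scraping web page data with Java. If anything here falls short, corrections are welcome, haha. Enough chatter.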

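The work generally breaks down into the following steps:

1: Use HttpClient to send a request to the page's URL (pay particular attention to the request method)

2: Get the page's source code

3: Check whether the source code contains the data we want to extract

4: Break the source code apart, usually with string splitting, regular expressions, or a third-party jar

5: Take the data we need and assign it to the fields of an object we created ourselves

6: Store the extracted data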
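Step 4 mentions regular expressions as one way to take the source apart. The following is a minimal sketch of that idea, not code from this article: the class name `CellExtractor`, the method `extractCells`, and the sample HTML are all made up for illustration. It pulls the text out of every `<td>` cell with a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of the original article.
public class CellExtractor {
    // Captures whatever sits between <td ...> and </td>; DOTALL lets a cell span line breaks.
    private static final Pattern TD =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of every td cell found in the given HTML. */
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            // Remove any tags nested inside the cell, then trim the surrounding whitespace.
            cells.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample markup.
        String html = "<table><tr><td>Area A</td><td class=\"t\"> <b>2018</b> </td></tr></table>";
        System.out.println(extractCells(html)); // prints [Area A, 2018]
    }
}
```

A pattern like this is fragile on real pages (nested tables and malformed markup will trip it up), which is one reason to reach for a third-party HTML parser when the page is more than trivially structured.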

Below is some of the source code used for extracting the data, along with what each part is for:

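import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/**
   * Sends a GET request to the given URL.
   *
   * @param url
   *      the URL to send the request to
   * @param param
   *      the request parameters, in the form name1=value1&name2=value2
   * @return the response body of the remote resource the URL points to
   */
  public static String sendGet(String url, String param) {
    StringBuilder result = new StringBuilder();
    BufferedReader in = null;
    try {
      // Append the request parameters, if any, to the URL as its query string
      String urlNameString = url;
      if (param != null && !param.isEmpty()) {
        urlNameString += (url.contains("?") ? "&" : "?") + param;
      }
      URL realUrl = new URL(urlNameString);
      // Open a connection to the URL
      URLConnection connection = realUrl.openConnection();
      // Set the common request headers
      connection.setRequestProperty("accept", "*/*");
      connection.setRequestProperty("connection", "Keep-Alive");
      connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
      // Establish the actual connection
      connection.connect();
      // Get all the response header fields
      Map<String, List<String>> map = connection.getHeaderFields();

      // Read the response through a BufferedReader. UTF-8 is assumed here; if the text
      // comes out garbled, pass the page's actual charset to InputStreamReader instead.
      in = new BufferedReader(new InputStreamReader(
          connection.getInputStream(), StandardCharsets.UTF_8));
      String line;
      while ((line = in.readLine()) != null) {
        // readLine() strips the line terminator, so put it back
        result.append(line).append('\n');
      }
    } catch (Exception e) {
      System.out.println("Exception while sending the GET request: " + e);
      e.printStackTrace();
    }
    // Close the input stream in a finally block
    finally {
      try {
        if (in != null) {
          in.close();
        }
      } catch (Exception e2) {
        e2.printStackTrace();
      }
    }
    return result.toString();
  }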
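The `param` argument of `sendGet` is expected in the form name1=value1&name2=value2. If the values can contain spaces, `&`, or non-ASCII text, they need to be percent-encoded first. The following is a rough sketch of one way to build such a string; the class `QueryStrings`, its `build` method, and the sample parameter names are assumptions made for illustration, and `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; it is not part of the original article.
public class QueryStrings {
    /** Joins the entries as name1=value1&name2=value2, encoding each name and value as UTF-8. */
    public static String build(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up parameters; LinkedHashMap keeps them in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keyword", "office supplies");
        params.put("page", "1");
        System.out.println(build(params)); // prints keyword=office+supplies&page=1
    }
}
```

The result can then be passed as the second argument, for example `sendGet(url, QueryStrings.build(params))`.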

Parsing and storing the data

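import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

  public Bid getData(String html) throws Exception {
    // The extracted data is stored in a Bid object; you can define your own class to hold it instead
    Bid bid = new Bid();
    // Parse the HTML with Jsoup
    Document doc = Jsoup.parse(html);
    // System.out.println("doc content: " + doc.text());
    // Select every <tr> element in the document
    Elements elements = doc.select("tr");
    System.out.println(elements.size() + "**** rows");
    // Loop over the rows
    for (Element element : elements) {
      Elements tdes = element.select("td");
      // Skip rows that contain no <td> cell
      if (tdes.isEmpty()) {
        continue;
      }
      for (int i = 0; i < tdes.size(); i++) {
        // relation(...) is a helper that is not shown in this article; it receives the cells,
        // the current cell's text, the Bid to fill in, and the 1-based column index
        this.relation(tdes, tdes.get(i).text(), bid, i + 1);
      }
    }
    return bid;
  }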
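The `relation` helper called in `getData` is not shown in the article, so its real behavior is unknown. As a guess at the general idea behind step 5 (assigning extracted values to your own object), the sketch below maps each cell to a field by its 1-based column index. Everything in it is hypothetical: the simplified signature, the cut-down `Bid` class with three fields, the column order, and the sample values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch for illustration only; the article's real relation(...) may work differently.
public class ColumnMapping {
    /** A cut-down stand-in for the article's Bid class, with only three of its fields. */
    public static class Bid {
        public String itemName;
        public String areaName;
        public String noticeTime;
    }

    /** Copies one cell's text into the field chosen by its 1-based column index. */
    public static void relation(Bid bid, String text, int column) {
        switch (column) {
            case 1: bid.itemName = text; break;   // assumed column order
            case 2: bid.areaName = text; break;
            case 3: bid.noticeTime = text; break;
            default: break;                       // other columns are ignored
        }
    }

    public static void main(String[] args) {
        // Made-up cell texts standing in for one table row.
        List<String> cells = Arrays.asList("Item A", "Area B", "2018-10-22 18:41");
        Bid bid = new Bid();
        for (int i = 0; i < cells.size(); i++) {
            relation(bid, cells.get(i), i + 1);
        }
        System.out.println(bid.itemName + " | " + bid.areaName + " | " + bid.noticeTime);
    }
}
```

Mapping by position is simple but breaks as soon as the page reorders its columns, so on a real site it may be safer to key the mapping on the header text of each column.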

The resulting data

Bid {
  h2 = '詳見內容',itemName = '訴訟服務中心裝置採購',item = '貨物/辦公消耗用品及類似物品/其他辦公消耗用品及類似物品',itemUnit = '詳見內容',areaName = '港北區',noticeTime = '2018年10月22日 18:41',itemNoticeTime = 'null',itemTime = 'null',kaibiaoTime = '2018年10月26日 09:00',winTime = 'null',kaibiaoDiDian = 'null',yusuanMoney = '￥67.00元（人民幣）',allMoney = 'null',money = 'null',text = ''
}

That is all for this article. I hope it helps with your studies, and thank you for your continued support.