
Groovy crawler practice: company information

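Without further ado: the data source is hidden for now. The data is still extracted with regular expressions (regex). The HTTP request framework is written in Java and the crawler itself in Groovy. The SQL statements are assembled locally and sent to a MySQL server for storage.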

The code is as follows:

package com.fan

import com.fantest.httpclient.FanLibrary
import com.fantest.mysql.MySqlTest
import com.fantest.utils.Regex
import net.sf.json.JSONObject

class Company extends FanLibrary {
    static void main(String[] args) {
        for (def i in 1..1060) {
            getPage(i)
//                getInfo("/eportal/ui?pageId=307900&t=toDetail&ZSBH=D311056737")
        }
        testOver()
    }

    static getPage(int page) {
        def url = "http://www.***.gov.cn/eportal/ui?pageId=307900"
        def params = new JSONObject()
        params.put("filter_LIKE_QYMC", EMPTY)
        params.put("filter_LIKE_YYZZZCH", EMPTY)
        params.put("filter_LIKE_ZSBH", EMPTY)
        params.put("filter_LIKE_XXDZ", EMPTY)
        params.put("currentPage", page)
        params.put("pageSize", 15)
        params.put("OrderByField", EMPTY)
        params.put("OrderByDesc", EMPTY)
        def response = getHttpResponse(getHttpPost(url, params))
        def s = response.getString("content")
        def all = Regex.regexAll(s, "<td s.*?瀏覽")
        for (int i = 1; i < all.size(); i++) {
            def get = all.get(i)
            def regex = Regex.getRegex(get, "href=\".*?\"").replace("amp;", EMPTY)
            getInfo(regex)
            sleep(3)
        }
        return response;
    }

    static getInfo(String url) {
        try {
            url = "http://www.***.gov.cn" + url;
            def response = getHttpResponse(getHttpGet(url))
            def content = response.getString("content")
            def all = Regex.regexAll(content, "<td class=\"label\".*?\n.*\n.*\n.*\n.*\n.*")
            // Each match is one label/value block: strip tags, drop newlines and spaces, keep the text after the colon
            def field = { int index -> all.get(index).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split(":")[1] }
            def name = field(0)
            def adress = field(1)
            def money = field(2)
            def sid = field(3)
            def type = field(4)
            def man = field(5)
            def paper = field(6)
            def level = field(7)
            def gov = field(8)
            def time = field(9)
            def start = time.split("~")[0]
            def end = time.split("~")[1]
            String sql = "INSERT INTO company (name,adress,money,sid,type,man,paper,level,gov,start,end) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");"
            sql = String.format(sql, name, adress, money, sid, type, man, paper, level, gov, start, end)
            output(sql)
            MySqlTest.sendWork(sql)
        }
        catch (Exception e) {
            output(e)
        }
    }
}
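
The INSERT above is assembled with String.format, so a value that contains a double quote (a company name, for example) would break the statement. As a possible alternative, here is a minimal sketch that uses the standard JDBC PreparedStatement instead. The class name CompanyDao, the method insertCompany, and the way the Connection is obtained are assumptions for illustration; they are not part of the MySqlTest helper used in this post.

```groovy
import java.sql.Connection
import java.sql.PreparedStatement

class CompanyDao {
    // Same columns as the INSERT in getInfo; spelling kept so it matches the table.
    static final List<String> COLUMNS = ['name', 'adress', 'money', 'sid', 'type',
                                         'man', 'paper', 'level', 'gov', 'start', 'end']

    // One "?" placeholder per column; the JDBC driver fills them in safely.
    static final String INSERT_SQL = 'INSERT INTO company (' + COLUMNS.join(',') +
            ') VALUES (' + (['?'] * COLUMNS.size()).join(',') + ')'

    // Hypothetical helper: binds the values of one scraped row in column order.
    static int insertCompany(Connection conn, Map<String, String> row) {
        PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)
        try {
            COLUMNS.eachWithIndex { String column, int i ->
                stmt.setString(i + 1, row[column])   // JDBC parameters are 1-based
            }
            return stmt.executeUpdate()
        } finally {
            stmt.close()
        }
    }
}

println CompanyDao.INSERT_SQL
```

With this approach the scraped strings never become part of the SQL text, so quotes inside a value need no manual escaping.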

Structure of the first page (the list page): [screenshot not included]

Structure of the second page (the detail page): [screenshot not included]
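
To make the parsing in getInfo easier to follow, here is a minimal sketch of the same cleanup chain (strip tags, drop newlines and spaces, split on the colon) applied to a made-up fragment. The sample markup and the value in it are invented for illustration and are not the real page; only the replaceAll and split calls come from the code above.

```groovy
// Invented sample: a label cell followed by a value cell, spread over several lines.
String cell = '<td class="label">Name:</td>\n    <td>\n        Example Co\n    </td>'

// Same steps as in getInfo, with EMPTY written out as ''.
String noTags = cell.replaceAll('<.*?>', '')      // strip HTML tags, one tag per match
String compact = noTags.replaceAll('(\n| )', '')  // drop newlines and spaces
String value = compact.split(':')[1]              // keep the text after the label

println value
```

Note that the second step also removes spaces inside the value itself, so a multi-word value comes out joined together.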

Regex is a simple regex-matching wrapper class I wrote myself; the code is available on my Gitee (碼雲) page.
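
The Regex class itself is not shown in this post. As a rough idea of what such a wrapper could look like, here is a minimal sketch built on java.util.regex. The class name SimpleRegex and the behavior of regexAll (all matches) and getRegex (first match) are guesses based on how the two methods are called above; the author's version may differ, for example in how it trims the match.

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

class SimpleRegex {
    // Returns every substring of text that matches the pattern, in order.
    static List<String> regexAll(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text)
        List<String> found = []
        while (matcher.find()) {
            found << matcher.group()
        }
        return found
    }

    // Returns the first match, or an empty string when nothing matches.
    static String getRegex(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text)
        return matcher.find() ? matcher.group() : ''
    }
}

// Usage on an invented fragment (not the real page).
String html = '<a href="/a?x=1&amp;y=2">one</a> <a href="/b">two</a>'
println SimpleRegex.regexAll(html, 'href=".*?"')
println SimpleRegex.getRegex(html, 'href=".*?"').replace('amp;', '')
```

The non-greedy `.*?` keeps each match inside a single attribute, and removing `amp;` turns the HTML-escaped `&amp;` back into a plain `&`, as in getPage.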

The framework has already been open-sourced on Gitee.

Anyone interested is welcome to join the discussion. Group number: 340964272