Jsoup - Simple Crawling of Zhihu's Recommendation Page (Bonus: get_agent())
阿新 • Published: 2019-01-23
Overview
Today we'll put Jsoup to work and look at a crawler from an end-to-end perspective.
A basic crawler framework consists of:
- [x] Parsing the page
- [x] Retrying on failure
- [x] Saving the crawled content locally
- [x] Multi-threaded crawling
Module-by-module walkthrough
We'll go through the modules of the framework above in logical order and reproduce the implementation step by step.
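One note before diving in: the snippets in this post are fragments of a single Scala source file, so they rely on a handful of imports and on a shared Url constant that the original code listings do not show explicitly. Below is a minimal sketch of that boilerplate as I reconstruct it, assuming Scala 2.12 with the jsoup library on the classpath; treat it as an assumption, not part of the original post.

import java.io.{File, PrintWriter}
import java.text.SimpleDateFormat
import java.util.{Date, Random}
import java.util.concurrent.{ConcurrentHashMap, ForkJoinPool}
import java.util.concurrent.atomic.AtomicInteger

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

import scala.collection.JavaConverters._              // gives Java collections an .asScala view
import scala.collection.parallel.ForkJoinTaskSupport
import scala.util.{Failure, Success, Try}

// Assumed to live inside the enclosing crawler object: the page every request targets
val Url = "https://www.zhihu.com/explore/recommendations"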
Retrying on failure
A good module always catches and handles its exceptions.
We covered a simple piece of exception handling in an earlier post; do you still remember it?
Simple version
// the URL to crawl
val url = "https://www.zhihu.com/explore/recommendations"
// wrap the request in a Try to catch failures
Try(Jsoup.connect(url).get()) match {
  case Failure(e) =>
    // print the exception message
    println(e.getMessage)
  case Success(doc: Document) =>
    // parsing succeeded and returned a Document; extract what we need from it
    println(doc.body())
}
Today we'll flesh this out a little and wrap it into something more robust.
Robust version
var count = 0                                    // number of items parsed from the page
// track the total number of successful requests and the number of failures
val sum, fail: AtomicInteger = new AtomicInteger(0)

// on an exception, wait 1s and retry, up to 100 times
def requestGetUrl(times: Int = 100, delay: Long = 1000): Unit = {
  Try(Jsoup.connect(Url).userAgent(get_agent()).get()) match {
    case Failure(e) =>
      if (times != 0) {
        println(e.getMessage)            // print the error message
        Thread.sleep(delay)              // wait 1s
        fail.addAndGet(1)                // failure count +1
        requestGetUrl(times - 1, delay)  // retry with times - 1
      } else throw e
    case Success(doc) =>
      parseDoc(doc)
      if (count == 0) {                  // nothing was parsed, so treat it as an empty grab and retry
        Thread.sleep(delay)
        requestGetUrl(times - 1, delay)
      }
      sum.addAndGet(1)                   // success count +1
  }
}
- Notes on get_agent()
// set a user-agent yourself, or better, pick a random well-formed one from a list of user-agents
def get_agent() = {
  // mimic the user-agent field of the request header by returning a randomly chosen user-agent string
  val agents = Array(
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11")
  val ran = new Random().nextInt(agents.length)
  agents(ran)
}
Parsing the page
The method we wrote in the previous post works as-is.
// parse the Document
var count = 0   // the same counter used by the retry logic above (declare it only once in the real file)
// use a ConcurrentHashMap to hold the crawled content
val text = new ConcurrentHashMap[String, String]()

def parseDoc(doc: Document): Unit = {
  // parsing succeeded and returned a Document; extract the pieces we need from it
  val links = doc.select("div.zm-item")                   // select every div whose class is "zm-item"
  for (link <- links.asScala) {                           // iterate over each such div
    val title = link.select("h2").text()                  // select the "h2" tags inside the div and read their text
    val approve = link.select("div.zm-item-vote").text()  // locate the upvote element and read its text
    // drill down level by level to a uniquely identifying tag, then select it (unique identification is the key)
    val author = link.select("div.answer-head").select("span.author-link-line").select("a").text()
    val content = link.select("div.zh-summary.summary.clearfix").text() // for multiple classes just chain the dots, e.g. .A.B.C
    text.put(title, author + "\t" + approve + "\t" + content)
    count += 1
  }
}
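A quick way to sanity-check these selectors without hitting the live site is to feed parseDoc a hand-written fragment. The markup below is a made-up stand-in that only reproduces the class names the selectors look for; Zhihu's real page is richer and may change at any time.

val sample =
  """<div class="zm-item">
    |  <h2>Example question title</h2>
    |  <div class="zm-item-vote">42</div>
    |  <div class="answer-head"><span class="author-link-line"><a>Example author</a></span></div>
    |  <div class="zh-summary summary clearfix">Example answer summary...</div>
    |</div>""".stripMargin

parseDoc(Jsoup.parse(sample))   // `text` should now hold one entry and `count` should be 1
println(text)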
- Saving the crawled content locally
// get the current date
def getNowDate(): String = {
  new SimpleDateFormat("yyMMdd").format(new Date())
}

// write the crawled content to a file
def output(zone: String): Unit = {
  val writer = new PrintWriter(new File(getNowDate() + "_" + zone + ".txt"))
  for ((title, value) <- text.asScala) {   // .asScala so the Java map can be iterated with a for comprehension
    writer.println(title + value)
  }
  writer.flush()
  writer.close()
}
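As a quick illustration of the naming scheme (the zone label here is just an example): with the "yyMMdd" pattern above, a call made on 2019-01-23 writes everything collected in `text` to a file in the working directory.

output("recommend")   // -> 190123_recommend.txt when run on 2019-01-23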
Preview of the crawled content
- Multi-threaded crawling
// multi-threaded crawling
def concurrentCrawler(zone: String, maxPage: Int, threadNum: Int): Unit = {
  val loopar = (1 to maxPage).par                                             // a parallel range over the pages
  loopar.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threadNum))   // cap the pool at threadNum threads
  loopar.foreach(x => requestGetUrl())                                        // one request per page (the index x is not used here)
  output(zone)                                                                // write everything collected in `text` to disk
}
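To tie the pieces together, an entry point might look like the sketch below. The object name, zone label, and page/thread counts are arbitrary choices of mine, assuming all of the fields and methods shown above are members of this one object.

object ZhihuCrawler {
  // ... the fields and methods shown in this post go here ...

  def main(args: Array[String]): Unit = {
    concurrentCrawler("recommend", maxPage = 20, threadNum = 4)   // 20 pages, 4 threads; adjust to taste
    println(s"successful requests: ${sum.get()}, failed attempts: ${fail.get()}")
  }
}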
- Additional notes on get_agent(), plus a bonus user-agent list
def get_agent() = {
  // mimic the user-agent field of the request header by returning a randomly chosen user-agent string
  val agents = Array(
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Macintosh; U; Mac OS X Mach-O; en-US; rv:2.0a) Gecko/20040614 Firefox/3.0.0 ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.0.3) Gecko/2008092414 Firefox/3.0.3",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.14) Gecko/20110218 AlexaToolbar/alxf-2.0 Firefox/3.6.14",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")
  val ran = new Random().nextInt(agents.length)
  agents(ran)
}
A few closing words
If you find my articles interesting, feel free to open the next one; I'll walk you through more small cases, hands-on, step by step. And if you have good ideas of your own, I'd love to hear from you.
Today we crawled Zhihu's recommendation page together as a hands-on exercise. Practice makes perfect!