On Java crawlers and Python crawlers
阿新 · Published: 2019-02-08
Preface
Many people say that learning data mining should start with web crawlers. After working on projects large and small, I have found that acquiring the data is a crucial piece of work before any modeling can begin. So here I summarize the basic crawler workflow, with both a Python version and a Java version.
URL requests
The Java version:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public String call(String url) {
    StringBuilder content = new StringBuilder();
    BufferedReader in = null;
    try {
        URL realUrl = new URL(url);
        URLConnection connection = realUrl.openConnection();
        connection.connect();
        // The target page is GBK-encoded, so decode the byte stream as GBK
        in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "gbk"));
        String line;
        while ((line = in.readLine()) != null) {
            content.append(line).append("\n");
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    return content.toString();
}
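A minimal usage sketch, assuming the method above sits in a crawler class (the class name and main method here are illustrative, not from the original):

// Hypothetical harness around the call(String url) method above
public class Crawler {
    public static void main(String[] args) {
        Crawler crawler = new Crawler();
        String html = crawler.call("http://www.baidu.com");
        System.out.println(html);
    }
}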
The Python version:
# coding=utf-8
import chardet
import urllib2

url = "http://www.baidu.com"
# Fetch the raw bytes of the page (Python 2 / urllib2)
data = urllib2.urlopen(url).read()
# Detect the page's encoding instead of hardcoding it
charset = chardet.detect(data)
code = charset['encoding']
# Decode with the detected encoding, then re-encode as UTF-8
content = data.decode(code, 'ignore').encode('utf8')
print content
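Note that the Java version above hardcodes "gbk", while the Python version lets chardet detect the encoding. A rough Java counterpart is to read the charset from the HTTP Content-Type header when the server declares one. This is a sketch under that assumption, not code from the original, and it falls back to GBK otherwise:

import java.net.URLConnection;

// Hypothetical helper: extract the charset from a header such as
// "text/html; charset=utf-8"; fall back to GBK when none is declared
public static String detectCharset(URLConnection connection) {
    String contentType = connection.getContentType();
    if (contentType != null) {
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length());
            }
        }
    }
    return "gbk";
}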
Regular expressions
The Java version:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String call(String content) throws Exception {
    // Match fragments of the form content":"..." in the fetched page
    Pattern p = Pattern.compile("content\":\".*?\"");
    Matcher match = p.matcher(content);
    StringBuilder sb = new StringBuilder();
    String tmp;
    while (match.find()) {
        tmp = match.group();
        tmp = tmp.replaceAll("\"", "");    // drop the quotes
        tmp = tmp.replace("content:", ""); // drop the field name
        tmp = tmp.replaceAll("<.*?>", ""); // strip embedded HTML tags (non-greedy)
        sb.append(tmp).append("\n");
    }
    return sb.toString();
}
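Wiring the two Java methods together might look like the following; fetchPage and extractComments are hypothetical renamings of the two call methods above, used only to keep the sketch readable:

// Hypothetical end-to-end wiring of the two steps above
String html = fetchPage("http://www.baidu.com"); // the URL-request method
String comments = extractComments(html);         // the regex-extraction method
System.out.println(comments);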
The Python version:
import re

# Compile the same pattern used in the Java version; the capture group
# makes findall return just the quoted text
pattern = re.compile(r'content":"(.*?)"')
group = pattern.findall(content)