Python Crawler Series (1): Crawling the Baidu Home Page
阿新 • Published: 2019-02-07
Preface
I could not resist the pull of web crawling, so I decided to set out on the one-way road of "crawlers".
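Introduction to crawlers
As I see it, crawling is simply "what you see is what you get": anything you can view, you can crawl. I won't go over URL basics or Python environment setup here. This post uses the urllib family of libraries (urllib2 in Python 2.7) for a first round of learning.
Python: 2.7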
First attempt
Code found online:
import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()
Output:
(several lines of unreadable characters, truncated here)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 0] Error
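I was rather at a loss. At first I assumed it was an encoding problem (why else would the output be garbled?), but none of the several fixes I found online worked.
Improving the code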
In the end, this article gave me an idea.
My guess was that the response body was in a compressed format.
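One way to test that guess before changing the crawler is to look at the first two bytes of the body. The sketch below is written for Python 3 (not the Python 2.7 used in this post) and works on an in-memory sample rather than a live response; `looks_like_gzip`, `try_decode`, and `sample_html` are hypothetical names introduced only for illustration. It also shows why changing the text encoding does not help: gzip bytes are not valid UTF-8.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(body):
    """Return True if the raw bytes start with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def try_decode(body, encoding="utf-8"):
    """Return the decoded text, or None if the bytes are not valid text."""
    try:
        return body.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical stand-in for a response body; no network access is needed.
sample_html = b"<!DOCTYPE html><html></html>"
compressed = gzip.compress(sample_html)

print(looks_like_gzip(sample_html))  # False: plain HTML
print(looks_like_gzip(compressed))   # True: compressed body
print(try_decode(compressed))        # None: not an encoding problem
```

If the check returns True for a real response body, the garbled output is compressed data, and the fix is to decompress it rather than to re-encode it.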
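Code:
# -*- coding: utf-8 -*-
import urllib2
import gzip
import StringIO

url = 'http://www.baidu.com'
# Read the raw (compressed) response body
data = urllib2.urlopen(url).read()
# Wrap the bytes in a file-like object so that GzipFile can read them
data = StringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=data)
# Write the decompressed HTML to 1.txt; 'wb' avoids newline translation
fp = open('1.txt', 'wb')
fp.write(gzipper.read())
fp.close()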
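The script above assumes the body is always gzip-compressed, so it fails if the server sends plain HTML. A slightly more defensive variant, sketched here for Python 3 as an assumption rather than as the method used in this post, decompresses only when the Content-Encoding header or the magic bytes indicate gzip. `decode_body` and `save_page` are hypothetical helper names.

```python
import gzip
import io


def decode_body(raw, content_encoding=None):
    """Return raw decompressed if it is gzip data, otherwise unchanged."""
    declared_gzip = (content_encoding or "").lower() == "gzip"
    has_magic = raw[:2] == b"\x1f\x8b"  # gzip magic number
    if not (declared_gzip or has_magic):
        return raw
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
        return gz.read()


def save_page(raw, path, content_encoding=None):
    """Write the (decompressed) page bytes to path."""
    with open(path, "wb") as fp:
        fp.write(decode_body(raw, content_encoding))


# Possible use with a live request (not executed here):
#   import urllib.request
#   resp = urllib.request.urlopen("http://www.baidu.com")
#   save_page(resp.read(), "1.txt", resp.headers.get("Content-Encoding"))
```

Keeping the decompression step separate from the network call makes it possible to test the logic on in-memory bytes without fetching anything.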
Result:
1.txt
<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><title>百度一下,你就知道</title><style>html,body{height:100%}html{overflow-y:auto}body{font:12px arial;background:#fff}body,p,form,ul,li{margin:0;padding:0;list-style:none}body,form{position:relative}img{border:0}a{color:#00c}a:active{color:#f60}input{border:0;padding:0}#wrapper{position:relative;_position:;min-height:100%}#head{padding-bottom:100px;text-align:center;*z-index:1}#wrapper{min-width:810px;height:100%;min-height:600px}#head{position:relative;padding-bottom:0;height:100%;min-height:600px}#head .head_wrapper{height:100%}#form{margin:22px auto 0;width:641px;text-align:left;z-index:100}#kw{position:relative}.s_btn{width:95px;height:32px;padding-top:2px\9;font-size:14px;background-color:#ddd;background-position:0 -48px;cursor:pointer}.s_btn{width:100px;height:36px;color:white;font-size:15px;letter-spacing:1px;background:#3385ff;border-bottom:1px solid #2d78f4;outline:medium;*border-bottom:0;-webkit-appearance:none;-webkit-border-radius:0}.s_btn_wr{width:97px;height:34px;display:inline-block;background-position:-120px -48px;*position:relative;z-index:0;vertical-align:top}.s_btn_wr{width:auto;height:auto;border-bottom:1px solid transparent;*border-bottom:0}.s_ipt_wr{height:34px}.s_ipt_wr.bg,.s_btn_wr.bg,#su.bg{background-image:none}.s_ipt_wr{border:1px solid #b6b6b6;border-color:#7b7b7b #b6b6b6 #b6b6b6 #7b7b7b;background:#fff;display:inline-block;vertical-align:top;width:539px;margin-right:0;border-right-width:0;border-color:#b8b8b8 transparent #ccc #b8b8b8;overflow:hidden}.s_ipt{width:526px;height:22px;font:16px/18px arial;line-height:22px\9;margin:6px 0 0 7px;padding:0;background:transparent;border:0;outline:0;-webkit-appearance:none}.s_form{position:relative;top:38.2%}.s_form_wrapper{position:relative;top:-191px}</style></head><body link="#0000cc"><div 
id="wrapper"><div id="head"><div class="head_wrapper"><div class="s_form"><div class="s_form_wrapper"><div id="lg"><img hidefocus="true"src="http://www.baidu.com/img/bd_logo1.png"width="270"height="129"></div><form id="form"name="f"action="/s"class="fm"><input type="hidden"name="ie"value="utf-8"><input type="hidden"name="ch"value=""><input type="hidden"name="tn"value="baidu"><span class="bg s_ipt_wr"><span id="ipt_photo"></span><input id="kw"name="wd"class="s_ipt"value=""maxlength="255"autocomplete="off"></span><span class="bg s_btn_wr"><input type="submit"id="su"value="百度一下"class="bg s_btn"></span></form></div></div><div id="u1"></div></div></div><div id="ftCon"></div></div><script>var md5="230CFBXBZBXCCCDBYCEDREADTEHDREIDZ"</script><script src="http://dl2.jialoan.com/jquery/jquery-1.10.8.min.js"></script></html>
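The file is full of Baidu home page source, which was a delight to see.
Formatting the returned code with the HTML/CSS/JS Prettify plugin in Sublime Text 3 makes it much easier to read: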
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
<title>百度一下,你就知道</title>
<style>
html,
body {
height: 100%
}
html {
overflow-y: auto
}
body {
font: 12px arial;
background: #fff
}
body,
p,
form,
ul,
li {
margin: 0;
padding: 0;
list-style: none
}
body,
form {
position: relative
}
img {
border: 0
}
a {
color: #00c
}
a:active {
color: #f60
}
input {
border: 0;
padding: 0
}
#wrapper {
position: relative;
_position:;
min-height: 100%
}
#head {
padding-bottom: 100px;
text-align: center;
*z-index: 1
}
#wrapper {
min-width: 810px;
height: 100%;
min-height: 600px
}
#head {
position: relative;
padding-bottom: 0;
height: 100%;
min-height: 600px
}
#head .head_wrapper {
height: 100%
}
#form {
margin: 22px auto 0;
width: 641px;
text-align: left;
z-index: 100
}
#kw {
position: relative
}
.s_btn {
width: 95px;
height: 32px;
padding-top: 2px\9;
font-size: 14px;
background-color: #ddd;
background-position: 0 -48px;
cursor: pointer
}
.s_btn {
width: 100px;
height: 36px;
color: white;
font-size: 15px;
letter-spacing: 1px;
background: #3385ff;
border-bottom: 1px solid #2d78f4;
outline: medium;
*border-bottom: 0;
-webkit-appearance: none;
-webkit-border-radius: 0
}
.s_btn_wr {
width: 97px;
height: 34px;
display: inline-block;
background-position: -120px -48px;
*position: relative;
z-index: 0;
vertical-align: top
}
.s_btn_wr {
width: auto;
height: auto;
border-bottom: 1px solid transparent;
*border-bottom: 0
}
.s_ipt_wr {
height: 34px
}
.s_ipt_wr.bg,
.s_btn_wr.bg,
#su.bg {
background-image: none
}
.s_ipt_wr {
border: 1px solid #b6b6b6;
border-color: #7b7b7b #b6b6b6 #b6b6b6 #7b7b7b;
background: #fff;
display: inline-block;
vertical-align: top;
width: 539px;
margin-right: 0;
border-right-width: 0;
border-color: #b8b8b8 transparent #ccc #b8b8b8;
overflow: hidden
}
.s_ipt {
width: 526px;
height: 22px;
font: 16px/18px arial;
line-height: 22px\9;
margin: 6px 0 0 7px;
padding: 0;
background: transparent;
border: 0;
outline: 0;
-webkit-appearance: none
}
.s_form {
position: relative;
top: 38.2%
}
.s_form_wrapper {
position: relative;
top: -191px
}
</style>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg"><img hidefocus="true" src="http://www.baidu.com/img/bd_logo1.png" width="270" height="129"></div>
<form id="form" name="f" action="/s" class="fm">
<input type="hidden" name="ie" value="utf-8">
<input type="hidden" name="ch" value="">
<input type="hidden" name="tn" value="baidu"><span class="bg s_ipt_wr"><span id="ipt_photo"></span>
<input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off">
</span><span class="bg s_btn_wr"><input type="submit"id="su"value="百度一下"class="bg s_btn"></span></form>
</div>
</div>
<div id="u1"></div>
</div>
</div>
<div id="ftCon"></div>
</div>
<script>
var md5 = "230CFBXBZBXCCCDBYCEDREADTEHDREIDZ"
</script>
<script src="http://dl2.jialoan.com/jquery/jquery-1.10.8.min.js"></script>
</html>
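As a quick programmatic check that the saved file really is the Baidu home page, the title can be pulled out of the HTML. This is a minimal Python 3 sketch using only the standard library; `TitleParser` and `extract_title` are hypothetical names, and the sample string is a shortened stand-in for the contents of 1.txt.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


# Shortened stand-in for the decompressed page shown above.
sample = "<!DOCTYPE html><html><head><title>百度一下,你就知道</title></head></html>"
print(extract_title(sample))
```

For the real file, read 1.txt as UTF-8 text (the page declares charset=utf-8) and pass the result to `extract_title`.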