Scrapy Crawler Essentials: urljoin() from urlparse

First, import the module and use help() to view the documentation:

>>> from urllib import parse
>>> help(parse.urljoin)
Help on function urljoin in module urllib.parse:

urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.

In other words, it combines a base address and a relative address into an absolute address. That description is rather abstract, so here are some concrete examples:

>>> parse.urljoin("http://www.google.com/1/aaa.html","bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> parse.urljoin("http://www.google.com/1/aaa.html","2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> parse.urljoin("http://www.google.com/1/aaa.html","/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> parse.urljoin("http://www.google.com/1/aaa.html","http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> parse.urljoin("http://www.google.com/1/aaa.html","http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> parse.urljoin("http://www.google.com/1/aaa.html","javascript:void(0)")
'javascript:void(0)'

The pattern is not hard to infer from these examples, but that is not the end of the story: special cases still need handling, such as a link that points back to the page itself, or a link containing invalid characters. The fragment below (taken from inside a crawl loop) shows this cleanup:

url = urljoin("****","****")

### find() is the substring-search function: it returns the position of the
### first occurrence if found, otherwise -1
if url.find("'") != -1:
    continue

### keep only the part before the '#'
url = url.split('#')[0]

### isindexed() is a function I defined myself; it checks that the link is
### not already in the database of stored links
if url[0:4] == 'http' and not self.isindexed(url):

    ### newpages = set(), an unordered collection of unique elements
    newpages.add(url)
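To run the fragment above on its own, here is a minimal self-contained sketch: collect_links() is a hypothetical helper name, and isindexed() is stubbed out to stand in for the database lookup mentioned above.

from urllib.parse import urljoin

def isindexed(url):
    ### stub standing in for the database lookup; assume nothing is indexed yet
    return False

def collect_links(base_url, hrefs):
    newpages = set()  ### unordered collection of unique URLs
    for href in hrefs:
        url = urljoin(base_url, href)
        ### skip links containing invalid characters such as a quote
        if url.find("'") != -1:
            continue
        ### strip the fragment so aaa.html and aaa.html#top count as one page
        url = url.split('#')[0]
        ### keep only http(s) links that have not been indexed yet
        if url[0:4] == 'http' and not isindexed(url):
            newpages.add(url)
    return newpages

print(collect_links("http://www.google.com/1/aaa.html",
                    ["bbbb.html", "javascript:void(0)", "aaa.html#top"]))
### prints a set with the two kept URLs (set order may vary):
### {'http://www.google.com/1/bbbb.html', 'http://www.google.com/1/aaa.html'}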