Crawler learning: urljoin() from urlparse
阿新 • Published: 2018-09-20
First, import the module and use help() to look at its documentation:
>>> from urlparse import urljoin
>>> help(urljoin)
Help on function urljoin in module urlparse:

urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an
    absolute interpretation of the latter.
In other words, it joins a base URL with a possibly relative URL to produce an absolute URL. That description is rather abstract, though.
So let's look at a few examples and work out the rules from them.
>>> urljoin("http://www.google.com/1/aaa.html", "bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html", "javascript:void(0)")
'javascript:void(0)'
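The rules above are standard RFC 3986 relative-reference resolution. A minimal runnable check of the same behaviour (note this sketch uses Python 3, where the function moved from urlparse to urllib.parse; the results are the same):

```python
# In Python 3, urljoin lives in urllib.parse instead of urlparse.
from urllib.parse import urljoin

base = "http://www.google.com/1/aaa.html"

# A bare relative path replaces the base's last path segment.
print(urljoin(base, "bbbb.html"))     # http://www.google.com/1/bbbb.html

# A leading slash resolves against the site root.
print(urljoin(base, "/2/bbbb.html"))  # http://www.google.com/2/bbbb.html

# An absolute URL replaces the base entirely.
print(urljoin(base, "http://www.google.com/3/ccc.html"))
```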
The rules are not hard to spot, but that is not the whole story: special cases still need handling, such as a link that points back to the page itself, or a link containing invalid characters.
url = urljoin("****", "****")

# find() searches a string: it returns the index of the first
# occurrence, or -1 if the substring is not found.
if url.find("'") != -1:
    continue

# Keep only the part before the '#' (strip the fragment).
url = url.split('#')[0]

# isindexed() is a method I defined myself; it checks that the
# link is not already stored in the link database.
if url[0:4] == 'http' and not self.isindexed(url):
    # newpages = set(), an unordered collection of unique elements
    newpages.add(url)
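The fragment above assumes a surrounding crawler loop and the author's own isindexed() database check. A self-contained sketch of the same filtering steps, with a hypothetical `seen` set standing in for isindexed() (Python 3 import shown; use `from urlparse import urljoin` on Python 2):

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

def collect_links(base, hrefs, seen):
    """Resolve and filter candidate links, mirroring the steps above.

    `seen` is a hypothetical stand-in for the author's isindexed() check.
    """
    newpages = set()
    for href in hrefs:
        url = urljoin(base, href)
        # Skip links containing a quote character.
        if url.find("'") != -1:
            continue
        # Drop the fragment: keep only the part before '#'.
        url = url.split('#')[0]
        # Keep only http(s) links that have not been seen yet.
        if url[0:4] == 'http' and url not in seen:
            newpages.add(url)
    return newpages
```

For example, given the base page itself in `seen`, a fragment-only link and a javascript: link are both filtered out, and only the resolved relative link survives:

```python
collect_links("http://www.google.com/1/aaa.html",
              ["bbbb.html", "#top", "javascript:void(0)"],
              {"http://www.google.com/1/aaa.html"})
# → {'http://www.google.com/1/bbbb.html'}
```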