Crawling Baidu search results with Python 3

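Not long ago, to win meme battles in a group chat, I wanted to collect more reaction images, so I learned a bit about Python crawlers and gathered more than ten thousand of them. I downloaded so many that I have no idea what is in there, and I still lose the battles!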

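Today I wanted to crawl Baidu's search results as well. I am still a beginner and am writing this down so I don't forget it; corrections from more experienced readers are welcome.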

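As before, the example crawls images. It is still crude and has little practical value.
Crawling mobile Baidu search differs somewhat from crawling the PC version, mainly because the HTML is different.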

1. First, fetch the HTML of the Baidu search page. Be sure to set a User-Agent header.

    import urllib.error
    import urllib.request

    # Fetch the HTML of the given address; returns None if the request fails
    def getHtml(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8'
        }
        try:
            req = urllib.request.Request(url, None, headers, None, False)
            response = urllib.request.urlopen(req)
            html = response.read()
            return html
        except urllib.error.URLError as e:
            # URLError also covers HTTPError (4xx/5xx responses)
            print(e)
            return None
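The function above needs the address of a search page, which this post does not show how to build. Here is a minimal sketch, assuming the PC search endpoint is `https://www.baidu.com/s`, that the `wd` parameter carries the keyword, and that `pn` is the result offset in steps of 10. The name `build_search_url` and the `page` argument are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: PC search endpoint; not taken from the code in this post
BAIDU_SEARCH = 'https://www.baidu.com/s'


def build_search_url(keyword, page=0):
    """Return a search URL for keyword; page is zero-based."""
    # Assumption: 'wd' is the query and 'pn' the result offset (10 per page)
    params = {'wd': keyword, 'pn': page * 10}
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BAIDU_SEARCH + '?' + urlencode(params)


print(build_search_url('表情包'))
print(build_search_url('python', page=2))
```

The returned string can be passed straight to `getHtml`.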

2. With the HTML in hand, walk through each search result and collect the corresponding site addresses.

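    from bs4 import BeautifulSoup

    # Collect the address of every result on a PC Baidu search page
    def getPCItemUrl(html):
        urls = []
        if not html:
            return urls
        try:
            bsObj = BeautifulSoup(html, 'html.parser')
            bq = bsObj.find('div', {'id': 'content_left'}).findAll('h3', {'class': 't'})
            for uu in bq:
                # Each result title is an <h3 class="t"> wrapping the link
                link = uu.find('a')
                if link is not None and link.has_attr('href'):
                    urls.append(link['href'])
            return urls
        except AttributeError:
            # find() returned None: the page has no div#content_left
            return []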
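To check the selection logic offline, here is a sketch that does a similar extraction with the standard library's `html.parser` instead of BeautifulSoup. It is a substitute technique, not what the code above uses. The HTML fragment is invented and only imitates the structure the function expects (`h3` elements with class `t` inside `div#content_left`); real Baidu markup may differ. For brevity the sketch does not check the enclosing `div`.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the first <a href> inside every <h3> that has class "t"."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_title = False    # currently inside an <h3 class="t">
        self._link_taken = False  # first link of this title already stored

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3' and 't' in (attrs.get('class') or '').split():
            self._in_title = True
            self._link_taken = False
        elif tag == 'a' and self._in_title and not self._link_taken:
            href = attrs.get('href')
            if href:
                self.urls.append(href)
                self._link_taken = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self._in_title = False


# Invented sample markup, for illustration only
SAMPLE = """
<div id="content_left">
  <h3 class="t"><a href="http://www.baidu.com/link?url=aaa">First</a></h3>
  <h3 class="t c-title"><a href="http://www.baidu.com/link?url=bbb">Second</a></h3>
  <h3 class="other"><a href="http://example.com/skip">Skipped</a></h3>
</div>
"""

parser = ResultLinkParser()
parser.feed(SAMPLE)
print(parser.urls)
```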

3. With the list of addresses, fetch the HTML of each address and extract the information you want from it.

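    import GetHtml  # presumably my own module holding the GetHtml class (getHtml)
    import GetImg   # presumably my own module holding the GetImg class (getImgs, SaveImg)

    # Assumed value: img_ends is not defined anywhere in this post
    img_ends = ('jpg', 'gif', 'png', 'bmp')

    # Visit every site address and download the images found there
    def getImage(urls):
        if not urls:
            return

        get_html = GetHtml.GetHtml
        getImg = GetImg.GetImg

        for n, url in enumerate(urls):
            # Fetch the HTML of the page
            one_html = get_html.getHtml(url)
            if one_html is None:
                continue
            # Collect the <img> tags found at that address
            images = getImg.getImgs(one_html) or []
            for i, img in enumerate(images):
                src = img['src']
                print(src)
                # src[-4:] takes the last four characters, e.g. '.jpg'
                endname = src[-4:]
                if endname[-3:] not in img_ends:
                    endname = endname + '.jpg'
                endname = endname.replace('?', '')
                # The '_' keeps page 1, image 11 distinct from page 11, image 1
                getImg.SaveImg(str(n) + '_' + str(i) + 'img' + endname, src)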

This is only a rough idea of mine, and it retrieves only a small number of images.
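Taking the last four characters of `src` breaks when the address carries a query string or has no extension at all. As a hedged alternative, the extension can be read from the path part of the URL. The helper names `image_extension` and `image_filename` are hypothetical, and `IMG_ENDS` is my guess at what `img_ends` contains.

```python
import os
from urllib.parse import urlparse

# Assumption: the extensions that img_ends is meant to hold
IMG_ENDS = ('jpg', 'gif', 'png', 'bmp')


def image_extension(src, default='.jpg'):
    """Return the extension of an image URL, ignoring any query string."""
    path = urlparse(src).path                # drops '?query' and '#fragment'
    ext = os.path.splitext(path)[1].lower()  # '' when there is no extension
    return ext if ext.lstrip('.') in IMG_ENDS else default


def image_filename(page_index, image_index, src):
    """Build a local file name such as '1_11_img.gif'."""
    return '{}_{}_img{}'.format(page_index, image_index, image_extension(src))


print(image_filename(0, 3, 'http://example.com/a/b.PNG?size=large'))
print(image_filename(1, 11, 'http://example.com/photo'))
```

Unknown or missing extensions fall back to `.jpg`, which mirrors what the loop above does.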

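4. Finally, save each image. The helper that extracts the image addresses from a page is included here as well.

    import re
    import urllib.error
    import urllib.request
    from bs4 import BeautifulSoup

    # Save the image at url to the file filename
    def SaveImg(filename, url):
        print(filename)
        try:
            response = urllib.request.urlopen(url)
            cat_img = response.read()
            with open(filename, 'wb') as f:
                f.write(cat_img)
        except urllib.error.URLError as reason:
            # URLError also covers HTTPError
            print(reason)

    # Collect the <img> tags whose src is an http address ending in jpg|gif|png|bmp
    def getImgs(html):
        if not html:
            return []
        bsObj = BeautifulSoup(html, 'html.parser')
        bq = bsObj.findAll('img', {'src': re.compile(r'http[/:A-Za-z0-9\.]+\.(jpg|gif|png|bmp)')})
        return bq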
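One possible reason so few images come back is the `src` pattern itself: its character class allows only letters, digits, `/`, `:` and `.`, and it is case-sensitive. The sketch below compares it with a looser pattern on a few made-up addresses. `BROADER` is a hypothetical alternative of my own, not something the code above uses.

```python
import re

# The pattern used in getImgs
ORIGINAL = re.compile(r'http[/:A-Za-z0-9\.]+\.(jpg|gif|png|bmp)')
# Hypothetical looser pattern: any non-space characters, case-insensitive
BROADER = re.compile(r'https?://\S+\.(?:jpg|jpeg|gif|png|bmp)', re.IGNORECASE)

# Made-up addresses for illustration
samples = [
    'http://example.com/a/b.jpg',
    'http://example.com/my-pic.jpg',   # hyphen is outside the character class
    'https://example.com/a_b.png',     # so is the underscore
    'http://example.com/photo.JPG',    # upper-case extension
    '/static/logo.png',                # relative address, no http prefix
]

for url in samples:
    print(url, bool(ORIGINAL.search(url)), bool(BROADER.search(url)))
```

Neither pattern matches relative addresses; those would have to be joined with the page URL first, for example with `urllib.parse.urljoin`.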

Personally, I feel that crawling mobile Baidu would be a bit easier.