Crawling URLs by keyword with Python
阿新 • Published: 2018-10-21
Python web crawler --------- search Baidu with a keyword, then crawl the URLs of the search results
Development environment: Windows 7 + Python 3.6.3
Language: Python
IDE: PyCharm
Third-party package: lxml 4.0 must be installed. Installing a bare lxml alone caused errors for me, because the script also needs etree from lxml.
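To see why etree matters, here is a minimal sketch of the parsing step the crawler relies on: feed an HTML string to `etree.HTML` and pull out a link with XPath. The HTML fragment below is made up for illustration; it mimics the `id`-numbered result blocks on a Baidu results page.

```python
from lxml import etree

# Hypothetical HTML fragment shaped like one Baidu search result block.
html = '<html><body><div id="1"><h3><a href="http://example.com">hit</a></h3></div></body></html>'

# Parse the page and extract the first link's href via XPath,
# mirroring the selector logic used in the crawler below.
selector = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
hrefs = selector.xpath('//*[@id="1"]/h3/a[1]/@href')
print(hrefs)
```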
Enough talk, here is the code:
The crawled data is saved in TXT format; later I will try saving it to an Excel sheet or a database instead.
import requests, time
from lxml import etree


def redirect(url):
    """Follow redirects so Baidu's jump link resolves to the real URL."""
    try:
        res = requests.get(url, timeout=10)
        url = res.url
    except Exception as e:
        print('redirect failed:', e)
        time.sleep(1)
    return url


def baidu_search(wd, pn_max, save_file_name):
    url = 'http://www.baidu.com/s'
    return_set = set()

    for page in range(pn_max):
        pn = page * 10  # Baidu pages by multiples of 10
        querystring = {'wd': wd, 'pn': pn}
        headers = {
            'pragma': 'no-cache',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.8',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'cache-control': 'no-cache',
            'connection': 'keep-alive',
        }

        try:
            response = requests.get(url, headers=headers, params=querystring)
            print(response.url)
            selector = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
        except Exception as e:
            print('page load failed:', e)
            continue

        with open(save_file_name, 'a+', encoding='utf-8') as f:
            # Each organic result on the page has a numeric id of pn+1 .. pn+10
            for i in range(1, 10):
                try:
                    context = selector.xpath('//*[@id="' + str(pn + i) + '"]/h3/a[1]/@href')
                    print(len(context), context[0])
                    real_url = redirect(context[0])
                    print('context=' + context[0])
                    print('real_url=' + real_url)
                    f.write(real_url + '\n')
                    return_set.add(real_url)
                except Exception as e:
                    print(i, return_set)
                    print('result parse failed:', e)

    return return_set


if __name__ == '__main__':
    wd = '網絡貸款'
    pn = 100
    save_file_name = 'save_url_soup.txt'
    return_set = baidu_search(wd, pn, save_file_name)
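As for the planned Excel export, one minimal sketch is to write the collected URL set to a CSV file, which Excel opens directly. This is my own assumption about the next step, not code from the original post; `urls` stands in for the set that `baidu_search` returns, and the filename `save_url.csv` is made up.

```python
import csv

# Placeholder for the set returned by baidu_search() (hypothetical data).
urls = {'http://example.com/a', 'http://example.com/b'}

# utf-8-sig adds a BOM so Excel detects the encoding correctly.
with open('save_url.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])          # header row
    for u in sorted(urls):            # sort for a stable file order
        writer.writerow([u])
```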