Data Acquisition and Fusion Technology - Lab 3
-
Assignment ①:
1) Crawling images from the China Weather Network
– Requirements: specify a website and crawl all of the images on it, for example the China Weather Network (http://www.weather.com.cn).
– Crawl using both a single-threaded and a multi-threaded approach. (The number of images crawled is limited to the last 3 digits of the student ID.)
– Output: print the URL of every downloaded image to the console, save the downloaded images into the images subfolder, and provide screenshots.
Process (single-threaded):
1. Send a request to the start page and collect the links of the pages that contain images:
def get_url(start_url):
    req = urllib.request.Request(start_url, headers=headers)
    data = urllib.request.urlopen(req)
    data = data.read()
    dammit = UnicodeDammit(data, ["utf-8", "gbk"])
    data = dammit.unicode_markup
    soup = BeautifulSoup(data, "lxml")
    urls = soup.select("a")
    i = 0
    for a in urls:
        href = a["href"]
        imageSpider(href, i + 1)
        i = i + 1
        if count > 110:  # crawl at most 110 images (last 3 digits of the student ID)
            break
2. Collect the download links of all images on each of those pages and download them locally:
def imageSpider(start_url, cous):
    try:
        urls = []
        req = urllib.request.Request(start_url, headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("img")
        for image in images:
            try:
                if count > 110:
                    break
                src = image["src"]
                url = urllib.request.urljoin(start_url, src)
                if url not in urls:
                    urls.append(url)
                    print(url)
                    download(url, cous)
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
3. Function that downloads one image to the specified path:
def download(url, cous):
    global count
    try:
        count = count + 1
        # extract the file extension, if any
        if url[len(url) - 4] == ".":
            ext = url[len(url) - 4:]
        else:
            ext = ""
        req = urllib.request.Request(url, headers=headers)
        data = urllib.request.urlopen(req, timeout=100)
        data = data.read()
        # download path
        path = r"C:\Users\黃杜恩\PycharmProjects\pythonProject3\images\\" + "第" + str(count) + "張" + ".jpg"
        with open(path, 'wb') as f:
            f.write(data)
        print("downloaded " + str(cous) + "頁" + str(count) + ext)
    except Exception as err:
        print(err)
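get_url, imageSpider and download all rely on a shared headers dict, a global count and an entry point that are not shown in the excerpt. A minimal sketch of that scaffolding, assuming the same names as above (the User-Agent string is an arbitrary browser UA):

import urllib.request
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

# browser-like User-Agent so the site does not reject the requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
count = 0  # global image counter shared by get_url() and download()

start_url = "http://www.weather.com.cn"
get_url(start_url)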
4. Console output:
5. Crawled images:
6. Code: https://gitee.com/huang-dunn/crawl_project/blob/master/實驗三作業1/project_three_test1_1.py
Process (multi-threaded):
1. The modified part of the single-threaded code:
def imageSpider(start_url, cous):
    global threads
    global count
    try:
        urls = []
        req = urllib.request.Request(start_url, headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("img")
        for image in images:
            try:
                if count >= 110:
                    break
                src = image["src"]
                url = urllib.request.urljoin(start_url, src)
                if url not in urls:
                    urls.append(url)
                    count = count + 1
                    T = threading.Thread(target=download, args=(url, cous, count))
                    T.daemon = False  # non-daemon, so the main thread can join it
                    T.start()
                    threads.append(T)
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
Add the following to the main function (threads must be initialized before get_url runs, since imageSpider appends each worker thread to it):
threads = []
get_url(start_url)
for t in threads:
    t.join()
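The final join loop makes the main program wait until every download thread has finished. As an alternative sketch (not the original code), the standard-library thread pool can manage the same workers; download is the function defined earlier, and max_workers is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

# Alternative sketch: let a pool manage the workers instead of creating one
# Thread per image by hand.
def fetch_all(urls, cous):
    with ThreadPoolExecutor(max_workers=8) as pool:
        for i, url in enumerate(urls):
            pool.submit(download, url, cous, i + 1)
    # leaving the with-block joins every worker, replacing the manual t.join() loop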
2. Output screenshots:
3. Code: https://gitee.com/huang-dunn/crawl_project/blob/master/實驗三作業1/project_three_test1_2.py
2) Reflections: this deepened my understanding of how to program multi-threaded image crawling.
-
Assignment ②
1) Reproducing Assignment ① with Scrapy
– Requirements: use the Scrapy framework to reproduce Assignment ①.
– Output: same as Assignment ①.
Process:
1. Write the Item class:
class Pro3Test2Item(scrapy.Item):
    data = scrapy.Field()   # image bytes
    count = scrapy.Field()  # image counter
    ext = scrapy.Field()    # file extension
    url = scrapy.Field()    # image URL
2. Write the Spider class:
class Test2Spider(scrapy.Spider):
    name = 'pic_test'
    global count  # module-level counter, updated in parse2
    count = 1
    # allowed_domains = ['XXX.com']
    # start_urls = ['http://www.weather.com.cn/']

    def start_requests(self):
        yield scrapy.Request(url='http://www.weather.com.cn', callback=self.parse)

    def parse(self, response):
        href_list = response.xpath("//a/@href")  # links of the pages that contain images
        for href in href_list:
            H = str(href.extract())
            if count > PIC_LIMIT:
                return
            if len(H) > 0 and H[0] == 'h':
                yield scrapy.Request(url=href.extract(), callback=self.parse1)

    def parse1(self, response):
        a_list = response.xpath("//img/@src")  # image download links
        for a in a_list:
            if count > PIC_LIMIT:
                return
            url = urllib.request.urljoin(response.url, a.extract())
            yield scrapy.Request(url=url, callback=self.parse2)

    def parse2(self, response):
        global count
        count += 1
        if count > PIC_LIMIT:
            return
        item = Pro3Test2Item()
        item["ext"] = response.url[-4:]
        item["data"] = response.body
        item["count"] = count
        item["url"] = response.url
        return item
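The spider reads a module-level cap PIC_LIMIT and uses urllib.request.urljoin, neither of which appears in the excerpt. A minimal sketch of the missing header; the project module name in the import is an assumption:

import urllib.request
import scrapy
from pro3_test2.items import Pro3Test2Item  # project module name is an assumption

PIC_LIMIT = 110  # same cap as Assignment ①: last 3 digits of the student ID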
3. Write the pipeline class:
class Pro3Test2Pipeline:
    def process_item(self, item, spider):
        # download path
        path = "D:/py_download/" + "第" + str(item["count"]) + "張" + item["ext"]
        with open(path, 'wb') as f:
            f.write(item["data"])
        print("downloaded " + str(item["count"]) + "張" + item["ext"] + " 圖片連結:" + item["url"])
        return item
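Scrapy only calls this pipeline if it is registered in the project's settings.py. A minimal sketch, assuming the project module is named pro3_test2:

ITEM_PIPELINES = {
    'pro3_test2.pipelines.Pro3Test2Pipeline': 300,
}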
4. Output screenshots:
5. Image crawl results:
6. Code: https://gitee.com/huang-dunn/crawl_project/tree/master/實驗三作業2
2) Reflections: I became more proficient with the Scrapy framework and gained a deeper understanding of matching text with XPath.
-
Assignment ③
1) Crawling Douban movie data
– Requirements: use Scrapy and XPath to crawl Douban movie data, store the content in a database, and save the cover images under the imgs directory.
– Candidate site: https://movie.douban.com/top250
– Output (column names follow the database schema below):

序號 | 電影名稱 | 導演 | 演員 | 簡介 | 電影評分 | 電影封面 |
---|---|---|---|---|---|---|
1 | 肖申克的救贖 | 弗蘭克·德拉邦特 | 蒂姆·羅賓斯 | 希望讓人自由 | 9.7 | ./imgs/xsk.jpg |
2... |

Process:
1. Write the Item class:
class Pro3Test3Item(scrapy.Item):
    no = scrapy.Field()        # serial number
    name = scrapy.Field()      # movie title
    director = scrapy.Field()  # director
    actor = scrapy.Field()     # actors
    grade = scrapy.Field()     # rating
    url = scrapy.Field()       # cover image URL
    inf = scrapy.Field()       # synopsis
2. Write the Spider class:
class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['XXX.com']
    # start_urls = ['http://XXX.com/']

    def start_requests(self):
        for i in range(0, 10):  # 10 pages of 25 entries each
            yield scrapy.Request(url='https://movie.douban.com/top250?start=' + str(i * 25) + '&filter=',
                                 callback=self.parse)

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li')  # locate each movie entry with XPath
        for li in li_list:
            item = Pro3Test3Item()  # a fresh item per movie, so yielded items are not aliased
            item["no"] = li.xpath('./div/div[1]/em/text()').extract_first().strip()
            item["name"] = li.xpath('./div/div[2]/div[1]/a/span[1]/text()').extract_first()
            temp_ = li.xpath('./div/div[2]/div[2]/p[1]/text()[1]').extract_first().split(" ")[9]
            temp = temp_.split(":")
            # the separator character was lost in extraction; '\xa0' (the page's
            # non-breaking-space padding) is an assumption
            item["director"] = temp[1].split("\xa0")[0]
            if len(temp) > 2:
                item["actor"] = temp[2]
            else:
                item["actor"] = 'None'
            item["grade"] = li.xpath('./div/div[2]/div[2]/div/span[2]//text()').extract_first()
            item["inf"] = li.xpath('./div/div[2]/div[2]/p[2]/span/text()').extract_first()
            if item["inf"] == '':
                item["inf"] = 'None'
            item["url"] = li.xpath('./div/div[1]/a/img/@src').extract_first()
            print(item["no"], item["name"], item["director"], item["grade"], item["inf"])
            yield item
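The positional split(" ")[9] depends on the exact whitespace of the info line, so it is fragile. A hedged alternative sketch (not the original author's method) that parses director and actors from the full text of the info paragraph; the label and separator characters are assumptions about how Douban renders the line:

import re

# Sketch: parse "導演: ... 主演: ..." from the whole info line. The page may use
# simplified labels (导演) and either an ASCII or a fullwidth colon.
def split_credits(info):
    m = re.search(r'(?:導演|导演)[::]\s*(.+?)\s*(?:主演[::]\s*(.+))?\s*$', info.strip())
    if not m:
        return 'None', 'None'
    director = m.group(1).split(' ')[0]  # keep the Chinese name, matching the sample output
    actor = (m.group(2) or 'None').strip()
    return director, actor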
3. Write the database class:
class MovieDB:
    def __init__(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()

    def openDB(self):
        try:
            self.cursor.execute(
                "create table movies (序號 int(128),電影名稱 varchar(128),導演 varchar(128),"
                "演員 varchar(128),簡介 varchar(128),電影評分 varchar(128),電影封面 varchar(128),"
                "constraint pk_movies primary key (序號))")
        except Exception:
            self.cursor.execute("delete from movies")  # table already exists: empty it

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, no, name, director, actor, grade, inf, image):
        try:
            self.cursor.execute("insert into movies (序號,電影名稱,導演,演員,簡介,電影評分,電影封面) "
                                "values (?,?,?,?,?,?,?)",
                                (int(no), name, director, actor, inf, grade, image))
        except Exception as err:
            print(err)
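A quick usage sketch of the class on its own; the values are illustrative, taken from the sample output row:

db = MovieDB()
db.openDB()   # creates the movies table, or empties it if it already exists
db.insert(1, '肖申克的救贖', '弗蘭克·德拉邦特', '蒂姆·羅賓斯',
          '9.7', '希望讓人自由', './imgs/xsk.jpg')
db.closeDB()  # commit and close the connection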
4. Write the pipeline class:
class Pro3Test3Pipeline:
    def __init__(self):
        self.db = MovieDB()

    def open_spider(self, spider):
        self.db.openDB()

    def process_item(self, item, spider):
        data = requests.get(item['url']).content
        # download path
        path = r"D:/example/pro3_test3/pro3_test3/images/" + "第" + str(item["no"]) + "張" + ".jpg"
        with open(path, 'wb') as f:
            f.write(data)
        print("downloaded " + str(item["no"]) + "張" + "jpg" + " 圖片連結:" + item["url"])
        self.db.insert(int(item["no"]), item["name"], item["director"], item["actor"], item["grade"],
                       item["inf"], item["url"])
        return item

    def close_spider(self, spider):
        self.db.closeDB()
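As in Assignment ②, the pipeline only runs if it is registered in settings.py, and it additionally needs import requests and an existing images directory. A minimal settings sketch; the module path is an assumption, and the user agent is included because Douban rejects Scrapy's default one:

ITEM_PIPELINES = {
    'pro3_test3.pipelines.Pro3Test3Pipeline': 300,
}
# Douban returns HTTP 403 for the default Scrapy user agent, so use a browser UA.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'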
5. Output screenshots:
6. Crawled images:
7. Database contents:
8. Code: https://gitee.com/huang-dunn/crawl_project/tree/master/實驗3作業3