
Scraping data: patent names and abstracts

# -*- coding:UTF-8 -*-
#########################################################################
# File Name: getsoopt.py
# Author: Ev
# mail: [email protected]
# Created Time: Mon 24 Dec 2018 10:35:12 AM CST
#########################################################################
#!/usr/bin/python
import sys
import requests
import re
from bs4 import BeautifulSoup

def get_html(url):
    # Pretend to be a browser so the site serves the normal result page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    response = requests.get(url, headers=headers)   # request the page
    #with open('./1.html', 'w+') as f:
    #    f.write(response.text.encode('utf-8'))
    html = response.text   # page source
    return html

index = 27   # result page number: 0 is the first page
soup = BeautifulSoup(get_html("http://www.soopat.com/....PatentIndex=" + str(index*10)), "lxml")
#soup = BeautifulSoup(open("./1.html"), "lxml")

# Python 2 hack so the Chinese text can be written out as UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')

# The title of the captcha page contains "請輸入驗證碼" ("please enter the captcha")
if "請輸入驗證碼" in soup.title.string:
    print soup.title.string
    sys.exit()
print "get result ok!\n"

#p = soup.body.attr
# Patent names: the <a> inside each PatentTypeBlock element
title = []
p = soup.find_all(class_="PatentTypeBlock")
for m in p:
    titleTemp = m.find("a").get_text()
    #print type(titleTemp)
    title.append(titleTemp)

# Abstracts: the PatentContentBlock elements
content = []
p = soup.find_all(class_="PatentContentBlock")
for m in p:
    titleTemp = m.get_text()
    #print type(titleTemp)
    content.append(titleTemp)
    # break

# Append numbered "name + abstract" pairs to get.txt
with open("get.txt", "a+") as f:
    for i in range(len(content)):
        f.write(str(index*10 + i) + ":")
        f.write(title[i])
        f.write("\n")
        f.write(content[i])
        f.write("\n\n")
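
A quick caveat about the code: as written it is Python 2 (bare print statements, reload(sys) / sys.setdefaultencoding, str/unicode handling), so it will not run under Python 3. The following is only a minimal sketch, with placeholder data instead of the real parsed results, of how the encoding-sensitive part would look on Python 3:

# Minimal Python 3 sketch of the encoding-sensitive part of the script above.
# No setdefaultencoding hack is needed (str is already Unicode); the output file
# just gets an explicit encoding. The two lists are placeholders, not real data.
title = [u"某專利名稱"]
content = [u"某專利摘要……"]
with open("get.txt", "a+", encoding="utf-8") as f:
    for i in range(len(content)):
        f.write("%d:%s\n%s\n\n" % (i, title[i], content[i]))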

The page being scraped is a patent keyword search results page.

I'm doing this on Ubuntu with Python + BeautifulSoup + requests; just search Baidu for how to set up the environment.
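
For what it's worth, the three packages usually come from pip under their standard PyPI names (requests, beautifulsoup4, lxml); those names are my assumption, not something spelled out in the post. A small check that the stack is in place, in the same Python 2 style as the script:

# Sanity check of the scraping environment (Python 2 style, matching the script above)
import requests
import bs4
from bs4 import BeautifulSoup

print "requests", requests.__version__
print "beautifulsoup4", bs4.__version__
BeautifulSoup("<html></html>", "lxml")   # raises bs4.FeatureNotFound if the lxml parser is missing
print "lxml parser OK"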

index is the page number: 0 is the first results page, 1 the second, and so on (the site's PatentIndex parameter is index*10). A sketch of looping over several pages instead of editing index by hand follows below.
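
Here is that minimal sketch of covering a few pages in one run. It reuses get_html and the captcha check from the script above, keeps the URL abbreviated exactly as in the original, and the 10-second pause is only a guess at being polite to the site, not a guaranteed way to avoid the captcha:

import time

base_url = "http://www.soopat.com/....PatentIndex="   # kept abbreviated, as in the script above
for index in range(0, 3):                             # first three result pages
    soup = BeautifulSoup(get_html(base_url + str(index * 10)), "lxml")
    if "請輸入驗證碼" in soup.title.string:            # captcha page: stop instead of parsing garbage
        print "captcha required, stopped at page", index
        break
    # ...collect PatentTypeBlock / PatentContentBlock and append to get.txt exactly as above...
    time.sleep(10)                                     # pause between pages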

The point of this script is to grab patent names and brief abstracts, so they are easy to consult and to design around ^_^

The script's drawback is that it only runs one page at a time, and after a few runs the site demands a captcha; right now I don't know what to do about that.
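
One possible workaround, purely a sketch of the control flow and not something I have verified against soopat: when the captcha page comes back, pause the script, solve the captcha manually in a browser (whether that actually clears the block for this script's requests is an assumption), and then retry the same page instead of exiting:

def fetch_result_page(index):
    # Retry the same result page until it no longer comes back as the captcha page (Python 2 raw_input)
    while True:
        soup = BeautifulSoup(get_html("http://www.soopat.com/....PatentIndex=" + str(index * 10)), "lxml")
        if "請輸入驗證碼" not in soup.title.string:
            return soup
        raw_input("Captcha page returned. Solve it in a browser, then press Enter to retry...")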