Scrapy (2): Extracting data from inside a script tag
阿新 • Published: 2019-02-19
1. Sample data
1.1 The main content to extract
2. Writing the code (Python 3.6)
Only the main parts of the code are shown here.
import scrapy
from bs4 import BeautifulSoup
import js2xml
from lxml import etree


class HdbSpider(scrapy.Spider):
    name = 'hdb'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['www.hdb.com']
    start_urls = ['http://www.hdb.com/']
    # Nationwide listing
    globalUrl = ['http://www.hdb.com/quanguo/']

    def start_requests(self):
        # Scrapy calls start_requests() to produce the first requests
        url = 'http://www.hdb.com/party/a0lz2.html'
        yield scrapy.Request(url, self.parse, dont_filter=True)

    def parse(self, response):
        # Main content
        resp = response.text
        soup = BeautifulSoup(resp, 'lxml')
        # On this page the data sits in the 7th <script> inside <head>
        src = soup.select('head script')[6].string
        # Convert the JavaScript source into an XML tree
        src_text = js2xml.parse(src, debug=False)
        src_tree = js2xml.pretty_print(src_text)
        print('treeeeeeeeeeeeeeeeeeeeeeeeeeeee')
        print(src_tree)  # the output is shown in Figure 1
        selector = etree.HTML(src_tree)
        # print(selector)
        # Adjust the XPath to match the data you want
        content = selector.xpath("//property[@name = '_id']/string/text()")[0]
        print(content)
Figure 1
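The spider above picks its script by position (`[6]`), which stops working as soon as the page adds or reorders a script. Below is a minimal sketch of picking the script by content instead. Note that it swaps in the standard library (`html.parser`, `re`, `json`) in place of BeautifulSoup and js2xml, so it runs without extra packages. The sample HTML, the variable name `partyInfo`, and the helper `find_script` are all hypothetical; the real hdb.com markup is not shown in this article.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return the first inline script containing `marker`, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        if marker in text:
            return text
    return None


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
<script src="/static/lib.js"></script>
<script>var tracking = {enabled: true};</script>
<script>var partyInfo = {"_id": "a0lz2", "title": "demo"};</script>
</head><body></body></html>
"""

script = find_script(SAMPLE_HTML, "partyInfo")
# Assumption: the object literal is valid JSON. For arbitrary JavaScript
# a real parser such as js2xml is still the better choice.
match = re.search(r"var partyInfo\s*=\s*(\{.*?\});", script, re.S)
party_id = json.loads(match.group(1))["_id"]
print(party_id)
```

Selecting by a marker string keeps the lookup stable when unrelated scripts are inserted before the one you need. The regex-plus-JSON step only works for literals that happen to be valid JSON, which is why the article's js2xml approach remains the more general one.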
Full source code
[email protected]:yzw1/python-Reptilian-content.git
References
1. https://blog.csdn.net/fan3652/article/details/72780301 (removing the content inside)
2. https://blog.csdn.net/qq_34246164/article/details/80700399
3. https://blog.csdn.net/freeking101/article/details/64461574