Scrapy (2): Extracting Data from Inside a script Tag

1. Sample data

1.1 Main content to extract

(Figure: the main content to extract)

2. Writing the code (Python 3.6)

Only the main parts of the code are shown.

import scrapy
from bs4 import BeautifulSoup
import js2xml
from lxml import etree


class HdbSpider(scrapy.Spider):
    name = 'hdb'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['hdb.com']
    start_urls = ['http://www.hdb.com/']
    # Nationwide listing
    globalUrl = ['http://www.hdb.com/quanguo/']

    # Scrapy calls start_requests() to obtain the first requests
    def start_requests(self):
        url = 'http://www.hdb.com/party/a0lz2.html'
        yield scrapy.Request(url, self.parse, dont_filter=True)

    def parse(self, response):
        # Main content
        resp = response.text
        soup = BeautifulSoup(resp, 'lxml')
        # On this page the data sits in the seventh <script> inside <head>
        src = soup.select('head script')[6].string
        # Parse the JavaScript source into an XML tree
        src_text = js2xml.parse(src, debug=False)
        # Serialize the tree to an indented XML string
        src_tree = js2xml.pretty_print(src_text)
        print('treeeeeeeeeeeeeeeeeeeeeeeeeeeee')
        print(src_tree)
        # Figure 1 shows the printed result
        selector = etree.HTML(src_tree)
        # print(selector)
        # Write your own XPath to match the data you need
        content = selector.xpath("//property[@name = '_id']/string/text()")[0]
        print(content)
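
The positional index [6] used in parse stops working as soon as the page adds or removes a script. A minimal sketch of picking the script by its content instead is given here. It uses only the standard library's html.parser so that it runs without extra packages; the sample HTML, the variable name partyInfo, and the helper find_script are hypothetical and not taken from the real page.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element in a document."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")  # one entry per script, even if empty

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return the first script whose text contains marker, else None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        if marker in text:
            return text
    return None


# Hypothetical page: the data script is not at a fixed position.
SAMPLE_HTML = """
<html><head>
<script src="/static/app.js"></script>
<script>var tracker = {enabled: true};</script>
<script>var partyInfo = {_id: 'a0lz2', title: 'demo'};</script>
</head><body></body></html>
"""

src = find_script(SAMPLE_HTML, "_id")
print(src)
```

With BeautifulSoup the same idea is a filter over soup.select('head script') that keeps the first element whose .string is not None and contains the marker.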

Figure 1

The generated result

Full code

[email protected]:yzw1/python-Reptilian-content.git

References

1. https://blog.csdn.net/fan3652/article/details/72780301 (removing the content inside)
2. https://blog.csdn.net/qq_34246164/article/details/80700399
3. https://blog.csdn.net/freeking101/article/details/64461574