Crawling and analyzing dota2 Tieba data with Scrapy
阿新 · Published 2019-02-20
I've always been curious about which words the regulars of the forum have used most over the years, so let's scrape the titles of the Tieba posts, extract the text, and see what everyone has been talking about.
First, we use Scrapy to scrape the titles of all posts in the forum:
scrapy startproject btspider
cd btspider
scrapy genspider -t basic btspiderx tieba.baidu.com
Edit the generated spider, btspiderx:
# -*- coding: utf-8 -*-
import scrapy
from btspider.items import BtspiderItem

class BTSpider(scrapy.Spider):
    name = "btspider"
    allowed_domains = ["baidu.com"]
    # Tieba lists 50 threads per page, so the pn parameter advances in steps of 50.
    # range() instead of Python 2's xrange so this also runs under Python 3.
    start_urls = []
    for x in range(91320):
        if x == 0:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8"
        else:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8&pn=" + str(x * 50)
        start_urls.append(url)

    def parse(self, response):
        for sel in response.xpath('//div[@class="col2_right j_threadlist_li_right "]'):
            item = BtspiderItem()
            item['title'] = sel.xpath('div/div/a/text()').extract()
            item['link'] = sel.xpath('div/div/a/@href').extract()
            item['time'] = sel.xpath(
                'div/div/span[@class="threadlist_reply_date pull_right j_reply_data"]/text()').extract()
            yield item

Edit items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class BtspiderItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    time = scrapy.Field()
In practice we only store the title content here; the link and time fields are scraped but go unused.
Edit pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json

class BtspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('info', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item)) + "\n"
        titlex = dict(item)["title"]
        if len(titlex) == 0:
            return item  # skip threads whose title failed to extract
        title = titlex[0]
        # linkx = dict(item)["link"]
        # if len(linkx) != 0:
        #     link = 'http://tieba.baidu.com' + linkx[0]
        # timex = dict(item)["time"]
        # if len(timex) != 0:
        #     time = timex[0].strip()
        line = title + '\n'  # + link + '\n' + time + '\n'
        self.file.write(line)
        return item

    # Scrapy calls close_spider (not spider_closed) on pipelines when the crawl ends.
    def close_spider(self, spider):
        self.file.close()

Edit settings.py:
BOT_NAME = 'btspider'
SPIDER_MODULES = ['btspider.spiders']
NEWSPIDER_MODULE = 'btspider.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
'btspider.pipelines.BtspiderPipeline': 300,
}
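The original post stops at the settings above. If Baidu starts throttling the crawl, a few standard Scrapy settings (my addition, not from the original) can slow it down to a politer pace:
DOWNLOAD_DELAY = 0.5  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # keep per-domain concurrency modest
USER_AGENT = 'Mozilla/5.0'  # placeholder UA string; substitute a realistic one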
Start the crawler:
scrapy crawl btspider
All the titles will be written to a file named info.
Once the crawl finishes, we can analyze the contents of the info file.
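As an aside (not how the original pipeline works), Scrapy's built-in feed exports can also dump the items straight to a file, skipping the custom pipeline entirely:
scrapy crawl btspider -o items.json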
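Before generating the full word cloud, a quick sanity check is useful. This sketch is mine, not from the original post; it assumes jieba is installed and the info file sits in the current directory, and it prints the 20 most common words:
# coding: utf-8
# Quick word-frequency check on the scraped titles (assumes ./info exists).
from collections import Counter
import jieba

with open('info', encoding='utf-8') as f:
    text = f.read()

# Cut the text into words and drop whitespace/single-character tokens.
words = [w for w in jieba.cut(text) if len(w.strip()) > 1]
for word, count in Counter(words).most_common(20):
    print(word, count)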
There is an example project on GitHub that works after a few changes:
git clone https://github.com/FantasRu/WordCloud.git
Edit main.py as follows:
# coding: utf-8
from os import path
import numpy as np
# import matplotlib.pyplot as plt
# matplotlib.use('qt4agg')
from wordcloud import WordCloud, STOPWORDS
import jieba
class WordCloud_CN:
    '''
    use package wordcloud and jieba
    generating wordcloud for chinese character
    '''

    def __init__(self, stopwords_file, text_file):
        # Take the text file as a parameter instead of relying on a global variable.
        self.stopwords_file = stopwords_file
        self.text_file = text_file

    @property
    def get_stopwords(self):
        # Build a dict of stopwords, one per line in the stopwords file.
        self.stopwords = {}
        with open(self.stopwords_file, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.rstrip()
                if line:
                    self.stopwords[line] = 1
        return self.stopwords

    @property
    def seg_text(self):
        with open(self.text_file, encoding='utf-8') as f:
            text = f.readlines()
        text = r' '.join(text)
        seg_generator = jieba.cut(text)
        # Look the stopword dict up once rather than re-reading the file per token.
        stopwords = self.get_stopwords
        self.seg_list = [i for i in seg_generator if i not in stopwords]
        self.seg_list = [i for i in self.seg_list if i != u' ']
        self.seg_list = r' '.join(self.seg_list)
        return self.seg_list

    def show(self):
        # wordcloud = WordCloud(max_font_size=40, relative_scaling=.5)
        wordcloud = WordCloud(font_path=u'./static/simheittf/simhei.ttf',
                              background_color="black", margin=5, width=1800, height=800)
        wordcloud = wordcloud.generate(self.seg_text)
        # plt.figure()
        # plt.imshow(wordcloud)
        # plt.axis("off")
        # plt.show()
        wordcloud.to_file("./demo/" + self.text_file.split('/')[-1] + '.jpg')


if __name__ == '__main__':
    stopwords_file = u'./static/stopwords.txt'
    text_file = u'./demo/info'
    generater = WordCloud_CN(stopwords_file, text_file)
    generater.show()
Then start the analysis:
python main.py
Since the data set is fairly large, the analysis takes a while; you can push it to a cheap single-core cloud host, run it in the background, and just wait for the result.
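One common way to do that (my suggestion; the original doesn't show the command) is to detach the job with nohup so it keeps running after you log out:
nohup python main.py > wordcloud.log 2>&1 &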
Below are the word-cloud images I generated from two popular game forums.