Building a Small Securities Knowledge Graph

阿新 • Published: 2018-12-24

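The project follows three main steps:

Data acquisition
Data processing
Import into neo4j

The project needs two data sources: information on company directors, and the industry and concept classification of each stock.
The director information is crawled with scrapy and covers the name, position, sex, and age of the board members of each listed company. The industry and concept information for stocks is obtained through Tushare.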

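1. Obtaining director information

We fetch the director information of listed companies from http://pycs.greedyai.com/, collecting each director's name, position, sex, and age.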

Part of the scrapy code is shown below:

# -*- coding: utf-8 -*-
import scrapy
import re
from bs4 import BeautifulSoup
from securities.spiders.savecsv import save_csv
from securities.items import GetItem


class UserinfoSpider(scrapy.Spider):
    name = 'userinfo'
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com']

    def parse(self, response):
        # Find quoted links that start with a dot and end in "html"; the captured
        # part omits the leading dot so it can be appended to the site root.
        body = response.css('body').extract()
        url = re.findall(r'"\.(.*?html)"', body[0])
        for u in url:
            yield scrapy.Request(url='http://pycs.greedyai.com' + u, callback=self.get_info)

    def get_info(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')

        # The six-digit stock code is taken from the page title.
        code = re.findall(r'\d{6}', soup.head.title.text)[0]
        tbody_tag = soup.select('#ml_001 > table > tbody')[0]
        m = 0
        for cd in tbody_tag.children:
            # Skip the first child of <tbody>, process the second one, then stop.
            if m % 2 != 0:
                n = 0
                for user_tag in cd.children:
                    # Children at index 3, 11, 19 and so on each hold one person's details.
                    if n % 8 == 3:
                        thead_tag = str(user_tag.div.table.thead)
                        name = re.findall('target="_blank">(.*?)</a>', thead_tag)[0]
                        jobs = re.findall('<td class="jobs" style="width: 150px;">(.*?)</td>', thead_tag)[0]
                        intro = re.findall('<td class="intro">(.*?)</td>', thead_tag)[0]
                        sex = re.findall(r'\u7537|\u5973', intro)[0]  # 男 (male) or 女 (female)
                        age = re.findall(r'(\d*?)\u5c81', intro)[0]  # digits before 岁 (years old)
                        info = GetItem()
                        info['name'] = name
                        info['sex'] = sex
                        info['age'] = age
                        info['code'] = code
                        info['jobs'] = jobs
                        yield info

                    n += 1
                break
            m = 1
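
The sex and age extraction in get_info can be tried in isolation. The following is a minimal sketch, not the author's code: the sample string and the helper name parse_sex_and_age are made up for illustration, and the real page text may differ. Unlike the spider, which indexes findall(...)[0] and raises IndexError when a field is missing, this variant returns None for a missing field.

```python
import re

# Hypothetical sample of the text in the "intro" cell; the real page may differ.
SAMPLE_INTRO = "张三，男，52岁，硕士学历。"


def parse_sex_and_age(intro):
    """Return (sex, age) from an intro string, using None for a missing field."""
    sex_match = re.search(r"男|女", intro)       # first occurrence of 男 or 女
    age_match = re.search(r"(\d+)岁", intro)     # digits directly before 岁
    sex = sex_match.group(0) if sex_match else None
    age = int(age_match.group(1)) if age_match else None
    return sex, age


print(parse_sex_and_age(SAMPLE_INTRO))
```

Note that a simple pattern like this picks the first 男 or 女 in the text, so it can be fooled if one of those characters appears earlier in the string, for instance inside a name.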

The scraped information is saved to the file executive_prep.csv.
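
The post does not show how the items are written out (the save_csv helper imported by the spider is not listed). One way to append rows to a CSV file with only the standard library is sketched below. The function name append_rows and the sample rows are assumptions for illustration; the field list matches the item fields set in get_info.

```python
import csv
import os
import tempfile

FIELDS = ["name", "sex", "age", "code", "jobs"]  # the item fields set in get_info


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Usage with made-up data, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "executive_prep.csv")
append_rows(demo_path, [{"name": "张三", "sex": "男", "age": "52", "code": "123456", "jobs": "董事长"}])
append_rows(demo_path, [{"name": "李四", "sex": "女", "age": "47", "code": "123456", "jobs": "董事"}])
```

Opening the file in append mode lets each parsed page add its rows without rewriting what earlier pages produced.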

2. Obtaining industry and concept information for stocks

This part is obtained directly through Tushare:

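import tushare as ts

# Use separate variables so the second call does not overwrite the first result.
industry_df = ts.get_industry_classified()  # industry classification of each stock
concept_df = ts.get_concept_classified()    # concept classification of each stock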

The results are saved to the files stock_industry_prep.csv and stock_concept_prep.csv.

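3. Data processing

The first two steps fetched the data; this step converts it into the specific format that neo4j can read.
For the format requirements, see: https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/

The code is just simple CSV file manipulation, so it is not covered in detail here. The processed data is saved to separate CSV files: "executive.csv", "stock.csv", "concept.csv", "industry.csv", "executive_stock.csv", "stock_industry.csv", "stock_concept.csv".
Only one node and one relationship are shown here as a brief demonstration.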

The executive node stores each director's name, sex, age, stock code, and position.
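
As a rough illustration of what such a node file can look like, the sketch below turns rows from the scraped data into CSV text with a header in the neo4j import tool's format (an ID column, typed properties, and a :LABEL column). The column names, the label "executive", and the MD5-based ID scheme are assumptions, not the author's actual layout.

```python
import csv
import hashlib
import io


def executive_id(name, sex, age):
    """Derive a stable node ID from the fields (an assumed scheme, not the author's)."""
    return hashlib.md5(f"{name},{sex},{age}".encode("utf-8")).hexdigest()


def build_executive_nodes(prep_rows):
    """Return CSV text with an import-tool header and one row per unique executive."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["executive_id:ID", "name", "sex", "age:int", "code", "jobs", ":LABEL"])
    seen = set()
    for row in prep_rows:
        node_id = executive_id(row["name"], row["sex"], row["age"])
        if node_id in seen:  # the import tool expects node IDs to be unique
            continue
        seen.add(node_id)
        writer.writerow([node_id, row["name"], row["sex"], row["age"],
                         row["code"], row["jobs"], "executive"])
    return out.getvalue()


# Made-up sample rows; the second one duplicates the first and is dropped.
sample = [
    {"name": "张三", "sex": "男", "age": "52", "code": "123456", "jobs": "董事长"},
    {"name": "张三", "sex": "男", "age": "52", "code": "123456", "jobs": "董事长"},
]
print(build_executive_nodes(sample))
```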

The executive_stock relationship declares the relationship between a director and a company as work_in.
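
A relationship file for the import tool needs a start ID, an end ID, and a type. The sketch below is one possible way to produce it, assuming that stock nodes use the stock code as their ID and that executive IDs have already been assigned; the function name and the sample IDs are hypothetical.

```python
import csv
import io


def build_executive_stock(pairs):
    """Return CSV text linking executive node IDs to stock node IDs with type work_in."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    # Drop duplicate edges and sort so the output is stable between runs.
    for exec_id, stock_id in sorted(set(pairs)):
        writer.writerow([exec_id, stock_id, "work_in"])
    return out.getvalue()


# Made-up IDs: two executives at the same company, with one repeated pair.
pairs = [("e1", "123456"), ("e2", "123456"), ("e1", "123456")]
print(build_executive_stock(pairs))
```

Every ID in the :START_ID and :END_ID columns has to match an ID in one of the node files, otherwise the import tool cannot resolve the relationship.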

4. Importing into neo4j

Import all of the files above into neo4j:

bin/neo4j-admin import --nodes import/executive.csv --nodes import/stock.csv --nodes import/concept.csv --nodes import/industry.csv --relationships import/executive_stock.csv --relationships import/stock_industry.csv --relationships import/stock_concept.csv

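The output shows that 27889 nodes and 36742 relationships were imported in total. After starting neo4j, the imported data can be browsed.

For example, we can look at the information linked to any one stock, such as “東方集團”.

Or we can look at any one concept, such as “資產注入” (asset injection).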
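
Lookups like these can be written as Cypher queries. The sketch below only builds the query text and its parameters; it assumes that the nodes carry the labels stock and concept and have a name property, none of which is confirmed by the post. The helper name neighbours_query is also made up.

```python
def neighbours_query(label, name):
    """Return (cypher, params) matching a named node of the given label and its direct neighbours."""
    # Labels cannot be passed as query parameters, so validate before interpolating.
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    cypher = f"MATCH (n:{label} {{name: $name}})-[r]-(m) RETURN n, r, m"
    return cypher, {"name": name}


# The two lookups from the text, with assumed labels.
stock_query = neighbours_query("stock", "東方集團")
concept_query = neighbours_query("concept", "資產注入")
print(stock_query[0])
```

The resulting pair can be pasted into the neo4j browser (with the parameter filled in) or, for instance, passed to session.run(cypher, params) of the official neo4j Python driver. If the imported data uses Simplified Chinese names, the name values need to be written in the same form.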

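That completes the small securities knowledge graph. All source code and test data have been uploaded to git, where they can be viewed directly. If you have questions, please leave a comment below the blog post or open an issue. Thanks!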
