python: Building a Literature Citation Network from Web of Science Data

阿新 • Published: 2018-12-12

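Besides the textual features of their content, the citation relationships between papers are an important basis for judging how similar they are. The CR field in data downloaded from WOS lists each paper's references. (Figure: the CR field of a WOS record.) As the figure shows, WOS identifies references by their DOI, so to find out which papers cite which, the set of DOIs cited by each paper has to be extracted first: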

def DOISET(raw, export_url, num):
    # Extract the DOIs cited by one record (its CR field) and append them
    # to the output file as: <record number>\t<doi1>,<doi2>,...
    doi_set = []
    # References in CR are separated by '; ', the fields of one reference by ', '.
    # (raw or '') guards against a NULL CR value coming back from MySQL.
    for reference in (raw or '').strip().split('; '):
        for field in reference.split(', '):
            if 'DOI' in field:
                doi_set.append(field.replace('DOI ', ''))
    with open(export_url, 'a') as re_out:
        re_out.write(str(num) + '\t' + ','.join(doi_set) + '\n')
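The function above relies on the exact `'; '` and `', '` separators. As an alternative, here is a minimal sketch that pulls DOIs out of a CR string with a regular expression. The pattern, the helper name `extract_dois`, and the sample string (with made-up DOIs) are all assumptions for illustration, not part of the original workflow; real CR fields may contain forms this pattern does not cover.

```python
import re

# Assumption: a DOI starts with "10.", then a 4-9 digit registrant code,
# a slash, and a suffix that ends at whitespace, a comma, ';' or ']'.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s,;\]]+')

def extract_dois(cr_field):
    """Return the DOIs found in one CR string, lower-cased, in first-seen order."""
    seen = []
    for match in DOI_PATTERN.findall(cr_field or ''):
        doi = match.lower()  # DOIs are case-insensitive, so normalise them
        if doi not in seen:
            seen.append(doi)
    return seen

# Hypothetical CR value; the authors, journal and DOIs are invented.
sample = ('Author A, 2001, J EXAMPLE, V1, P1, DOI 10.1000/abc.001; '
          'Author B, 1999, J EXAMPLE, V2, P9; '
          'Author C, 2005, J EXAMPLE, V3, P5, DOI 10.1000/XYZ-002')
print(extract_dois(sample))
```

If the DOIs are lower-cased here, the papers' own DOIs (the DI field used later) would need the same normalisation before the two are matched.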
  

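import mysql.connector

def connect_mysql():
    # Read the CR field of every record, ordered by UT, and write its DOI set to file
    conn = mysql.connector.connect(host='localhost', user='root', passwd='database password', db='database name', charset='utf8')
    cursor = conn.cursor()
    cursor.execute('select CR from test1.cris order by UT')
    rows = cursor.fetchall()
    # Record numbers start at 1 and follow the UT ordering
    for i, row in enumerate(rows, start=1):
        DOISET(row[0], 'output file name', i)
    conn.close()

connect_mysql()
print('finish!')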

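Result (see figure): the output file has one line per record, holding the record number, a tab, and the comma-separated DOIs that the record cites.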

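The citation network is built on the principle of the AMSLER network, which takes both co-citation and bibliographic coupling between papers into account: whenever two papers are co-cited or coupled, the citation count between them is incremented by 1. The network is built according to this idea. My approach:
1. Co-citation count: pair up the DOIs cited by each paper; if both DOIs of a pair belong to papers in the database, that pair's count is +1.
2. Coupling count: compare the cited-DOI sets of any two papers; each DOI they share adds 1 to that pair's count.
3. For now, the matrix is stored as a nested dictionary: {1: {2: [citationnum]}}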
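The two counting rules above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the paper numbers, the DOIs, and the names `cited`, `own_doi` and `amsler_weights` are all hypothetical, and the result is stored as a flat `{(i, j): weight}` dictionary rather than the nested one described above.

```python
from itertools import combinations

# Hypothetical toy data: paper number -> set of DOIs it cites.
cited = {
    1: {'10.1000/a', '10.1000/b', '10.1000/p3'},
    2: {'10.1000/b', '10.1000/c', '10.1000/p3', '10.1000/p4'},
    3: {'10.1000/a'},
    4: set(),
}
# Hypothetical own DOIs of papers that are in the database: DOI -> paper number.
own_doi = {'10.1000/p3': 3, '10.1000/p4': 4}

def amsler_weights(cited, own_doi):
    """Return {(i, j): coupling count + co-citation count} for i < j."""
    weights = {}
    # Coupling: every DOI shared by the reference sets of i and j adds 1.
    for i, j in combinations(sorted(cited), 2):
        shared = len(cited[i] & cited[j])
        if shared:
            weights[(i, j)] = weights.get((i, j), 0) + shared
    # Co-citation: every pair of database papers cited by the same paper adds 1.
    for refs in cited.values():
        in_db = sorted({own_doi[d] for d in refs if d in own_doi})
        for i, j in combinations(in_db, 2):
            weights[(i, j)] = weights.get((i, j), 0) + 1
    return weights

print(amsler_weights(cited, own_doi))
# -> {(1, 2): 2, (1, 3): 1, (3, 4): 1}
```

Papers 1 and 2 share two references, papers 1 and 3 share one, and papers 3 and 4 are cited together by paper 2.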

The full code is as follows:

import mysql.connector

class compute_citation:
    def esCRset(self, filepath):
        # Load the cited-DOI sets extracted earlier into a list.
        # Index 0 is a placeholder so that paper n sits at index n.
        datatemp = [[]]
        with open(filepath, 'r') as data_source:
            for data in data_source:
                # Each line is: <record number>\t<doi1>,<doi2>,...
                fields = data.strip('\n').split('\t')
                datatemp.append(fields[1].split(','))
        return datatemp

    def esDOIset(self):
        # Map each paper's own DOI (DI field) to its record number
        datatemp = {}
        conn = mysql.connector.connect(host='localhost', user='root', passwd='database password', db='test1', charset='utf8')
        cursor = conn.cursor()
        cursor.execute('select DI from test1.cris order by UT')
        rows = cursor.fetchall()
        for i, row in enumerate(rows, start=1):
            # Skip records that have no DOI
            if row[0]:
                datatemp[row[0]] = i
        conn.close()
        return datatemp

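    def compute_bibli(self, filepath):
        # Bibliographic coupling: number of cited DOIs shared by each pair of papers
        CRset = self.esCRset(filepath)
        total = len(CRset) - 1  # number of papers (20478 in my data)
        # Build each paper's reference set once; drop the '' left by empty lines
        ref_sets = [set(refs) - {''} for refs in CRset]
        net = {}
        for i in range(1, total + 1):
            net[i] = {}
            list1 = CRset[i]
            for t in range(i + 1, total + 1):
                # Count the DOIs of paper i that paper t also cites
                net[i][t] = sum(1 for x in list1 if x in ref_sets[t])
        return net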
    def compute_add(self, filepath):
        # Add the co-citation counts on top of the coupling counts
        DOIset = self.esDOIset()
        CRset = self.esCRset(filepath)
        net = self.compute_bibli(filepath)
        for CR in CRset:
            if CR == [''] or CR == []:
                continue
            # Record numbers of the cited papers that are themselves in the database
            listok = sorted({DOIset[i] for i in CR if i in DOIset})
            # Every pair cited together by this paper is co-cited once
            for i in range(len(listok) - 1):
                for p in range(i + 1, len(listok)):
                    net[listok[i]][listok[p]] += 1
        return net

compute = compute_citation()
dicx = compute.compute_add('C:/users/49509/desktop/citation.txt')
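The nested dictionary holds an entry for every pair, including the many pairs whose count is 0. For later network analysis it may be handier to keep only the non-zero pairs as a weighted edge list. A minimal sketch, assuming the `{i: {j: weight}}` layout used above; the helper name `to_edge_list` and the toy input are hypothetical:

```python
def to_edge_list(net):
    """Flatten {i: {j: weight}} into sorted (i, j, weight) tuples, skipping zeros."""
    edges = []
    for i, row in net.items():
        for j, weight in row.items():
            if weight > 0:
                edges.append((i, j, weight))
    return sorted(edges)

# Toy input in the same shape as the dictionary built by compute_add
toy_net = {1: {2: 2, 3: 0}, 2: {3: 1}, 3: {}}
print(to_edge_list(toy_net))
# -> [(1, 2, 2), (2, 3, 1)]
```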

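PS: The coupling computation is rather slow. For 10,000 papers it ran for a bit over half an hour, so screening the literature up front really matters. For this kind of bibliometric analysis I think seven or eight thousand papers is about enough; 20,000 nearly killed me. Looking back, the code I write is all very basic, so it is bound to be inefficient. Maybe people in my field just don't dig into algorithms much and only want the problem solved. The citation network is more or less built (a few small details are still off, but I don't want to look at them for now), so I'm off to keep tinkering with my topological features~ That's it~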
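One possible way to speed up the coupling step is to avoid comparing every pair of papers and instead go through an inverted index: for each cited DOI, collect the papers that cite it, then add 1 to every pair in that group. The sketch below is an assumption-laden illustration, not the code used above. The name `coupling_counts` and the toy input are hypothetical, it returns only pairs with at least one shared reference, and a DOI repeated within one paper's list is counted once.

```python
from collections import defaultdict
from itertools import combinations

def coupling_counts(cr_sets):
    """cr_sets[i] lists the DOIs cited by paper i (index 0 is an unused placeholder).

    Returns {(i, j): number of shared cited DOIs} for i < j, non-zero pairs only.
    """
    # Inverted index: cited DOI -> set of papers that cite it
    citing = defaultdict(set)
    for paper, dois in enumerate(cr_sets):
        for doi in dois:
            if doi:  # skip the '' produced by papers without cited DOIs
                citing[doi].add(paper)
    # Each DOI contributes 1 to every pair of papers that both cite it
    counts = defaultdict(int)
    for papers in citing.values():
        for i, j in combinations(sorted(papers), 2):
            counts[(i, j)] += 1
    return dict(counts)

# Toy input in the shape returned by esCRset
toy = [[], ['a', 'b'], ['b', 'c'], [''], ['a', 'b']]
print(coupling_counts(toy))
```

The work now depends on how many papers share each reference rather than on the square of the number of papers, which usually helps when most pairs have nothing in common.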
