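neo4j | A complete Cypher example: CSV import, connecting relationships, and advanced queries (Part 3)
Commonly used graph databases include neo4j (which supports a great many languages), JanusGraph/Titan (distributed), and OrientDB; Google has also open-sourced the graph database Cayley (written in Go), and PostgreSQL can store data in RDF format.
This third post covers a fairly complete example of importing CSV files and then querying them. It involves a larger amount of data, so it is closer to a real-world scenario.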
NorthWind Introduction
To run the whole import in one go, you can enter the command:
bin/neo4j-shell -path northwind.db -file import_csv.cypher
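This post follows a fairly complete official example in three parts: loading the CSV files, creating the relationships between entities, and querying.
The CSV loading and relationship steps show how a dataset is built for Neo4j;
Cypher queries range from easy to hard, and this example makes good use of both basic and advanced ones.
Looks complicated, right? Let's walk through the logic:
1. Loading the basic entity data
Make sure the data is in the right format
Neo4j reads CSV files as UTF-8, while CSV files are often saved as ANSI by default, so they need to be re-saved as UTF-8 with Notepad's "Save As".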
// Create customers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});
// Create products
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
CREATE (:Product {productName: row.ProductName, productID: row.ProductID, unitPrice: toFloat(row.UnitPrice)});
// Create suppliers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///suppliers.csv" AS row
CREATE (:Supplier {companyName: row.CompanyName, supplierID: row.SupplierID});
// Create employees
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS row
CREATE (:Employee {employeeID:row.EmployeeID, firstName: row.FirstName, lastName: row.LastName, title: row.Title});
// Create categories
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///categories.csv" AS row
CREATE (:Category {categoryID: row.CategoryID, categoryName: row.CategoryName, description: row.Description});
// Create orders
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MERGE (order:Order {orderID: row.OrderID})
ON CREATE SET order.shipName = row.ShipName;
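Notes:
- Running the CREATE statements twice loads the data twice, so be careful!
- Pay attention to the '///' in "file:///customers.csv"!
- In CREATE (:Product {productName: row.ProductName}):
  - Product is the node label; the nodes can be inspected with MATCH (p:Product) RETURN p;
  - row.ProductName reads a column of the current CSV row, much like column access on a dataframe;
  - the {...} part works like a dict, in which productName is the key.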
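One table is a bit unusual, namely the last one: orders.csv. It is loaded with MERGE rather than CREATE, because the same file also carries the ProductID, UnitPrice and Quantity columns used in section 2.1, so one OrderID can appear on several rows.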
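Since CREATE adds new nodes on every run, a re-runnable variant of the other loads could key a MERGE on the ID, the way the orders load already does. A minimal sketch for customers, assuming customerID is unique and never empty in customers.csv (I have not verified this against the file):

```cypher
// Assumption: every row has a non-empty, unique CustomerID.
// MERGE finds the node by its ID or creates it, so a second run adds nothing.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///customers.csv" AS row
MERGE (c:Customer {customerID: row.CustomerID})
ON CREATE SET c.companyName = row.CompanyName,
              c.fax = row.Fax,
              c.phone = row.Phone;
```

The trade-off is speed: MERGE has to look up each ID first, so it benefits from an index or constraint on :Customer(customerID).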
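To make queries faster, you can create indexes:
CREATE INDEX ON :Product(productID);
CREATE INDEX ON :Product(productName);
CREATE INDEX ON :Category(categoryID);
CREATE INDEX ON :Employee(employeeID);
CREATE INDEX ON :Supplier(supplierID);
CREATE INDEX ON :Customer(customerID);
// Note: the Customer nodes created above store companyName, not customerName
CREATE INDEX ON :Customer(customerName);
This creates an index on the more important ID field of each node.
The statements cannot all be executed in one go (run them one at a time), otherwise an error is raised: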
Neo.ClientError.Statement.SyntaxError
Also add a constraint:
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
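If part of the data needs to be modified later, the following example can serve as a reference:
Suppose Janet is now reporting to Steven.
Then the change can be made like this: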
// The key is employeeID (lower-case e), and the value was loaded from CSV as a string
MATCH (mgr:Employee {employeeID: "5"})
MATCH (emp:Employee {employeeID: "3"})-[rel:REPORTS_TO]->()
DELETE rel
CREATE (emp)-[:REPORTS_TO]->(mgr)
RETURN *;
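This locates emp, DELETEs its existing REPORTS_TO relationship, and then CREATEs the new one.
There are also two ways to load a CSV: from a local file or from a file online:
Loading from a URL: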
LOAD CSV FROM 'https://neo4j.com/docs/developer-manual/3.3/csv/artists.csv' AS line
CREATE (:Artist { name: line[1], year: toInteger(line[2])})
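Local loading has a pitfall: how should the path be written? Should it be file:///C:\Users\mattzheng\Desktop\categories.csv? Clearly that is not right.
For local files, the CSV has to be placed in one specific folder, called import.
It may be under XXX\Neo4j\graph.db\import,
or somewhere else entirely; in my case the folder was buried quite deep: C:\Users\matt\.Neo4jDesktop\neo4jDatabases\database-b82284eb-23ab-4a42-8a83-f13af055ecf0\installation-3.3.4\import
I found this path by accident, from the error message produced by running: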
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///C:\\Desktop\\categories.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});
It then fails with:
Couldn't load the external resource at: file:/C:\Users\matt\.Neo4jDesktop\neo4jDatabases\database-b82284eb-23ab-4a42-8a83-f13af055ecf0\installation-3.3.4\import\categories.csv
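Once the file has been copied into that import folder, a read-only preview is a cheap way to confirm that the path resolves before creating any nodes. A small sketch, assuming categories.csv now sits directly inside the import folder named in the error message:

```cypher
// Read-only preview: nothing is written to the graph.
// "file:///categories.csv" is resolved relative to the import folder.
LOAD CSV WITH HEADERS FROM "file:///categories.csv" AS row
RETURN row
LIMIT 3;
```

If this returns rows, the same URL can be used in the real load statements.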
2. Creating relationships
2.1 Relating orders to products and employees
Relationships between orders and products, employees and (in the last statement) customers:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (product:Product {productID: row.ProductID})
MERGE (order)-[pu:PRODUCT]->(product)
ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.Quantity);
// ON CREATE SET above sets the relationship properties only when MERGE actually creates the relationship
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (employee:Employee {employeeID: row.EmployeeID})
MERGE (employee)-[:SOLD]->(order);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (customer:Customer {customerID: row.CustomerID})
MERGE (customer)-[:PURCHASED]->(order);
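toFloat(row.UnitPrice)
LOAD CSV reads every field as a string, so numeric values need an explicit conversion to the intended data type.
Text fields need no conversion.
MATCH (order:Order {orderID: row.OrderID}) binds the variable order to the nodes labeled Order whose orderID equals row.OrderID;
In [pu:PRODUCT], pu is the variable that refers to the relationship and PRODUCT is the relationship type.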
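The conversion functions can be tried on their own, without any data loaded. A small sketch with made-up string values standing in for CSV fields:

```cypher
// The literals are hypothetical stand-ins for row.UnitPrice / row.Quantity.
RETURN toFloat("18.5") AS unitPrice,    // 18.5
       toInteger("12") AS quantity,     // 12
       toFloat("n/a") AS unparsable;    // null, since the string is not a number
```

Because an unparsable value becomes null rather than raising an error, it is worth checking for nulls after a load if the source file is not clean.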
2.2 Relating products, suppliers, and categories
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (supplier:Supplier {supplierID: row.SupplierID})
MERGE (supplier)-[:SUPPLIES]->(product);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (category:Category {categoryID: row.CategoryID})
MERGE (product)-[:PART_OF]->(category);
2.3 Relationships among employees
Build a 'REPORTS_TO' relationship among employees to express who reports to whom.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS row
MATCH (employee:Employee {employeeID: row.EmployeeID})
MATCH (manager:Employee {employeeID: row.ReportsTo})
MERGE (employee)-[:REPORTS_TO]->(manager);
The end result looks like this:
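To check the result without the visual view, the graph can be summarized by counting nodes per label and relationships per type. A sketch of two generic queries (run them one at a time); the counts depend on your CSV files, so none are shown here:

```cypher
// Nodes per label
MATCH (n)
RETURN labels(n) AS label, count(*) AS nodes
ORDER BY nodes DESC;

// Relationships per type
MATCH ()-[r]->()
RETURN type(r) AS type, count(*) AS relationships
ORDER BY relationships DESC;
```

A label whose count is exactly double what you expect usually means a CREATE-based load was run twice.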
3. Basic queries
Query 1: querying two related entities on their own
MATCH (:Order)<-[:SOLD]-(e:Employee)
return *
Query 2: product prices, sorted:
match (p:Product)
return p.productName,p.unitPrice order by p.unitPrice DESC
limit 10;
Logic: first match p in the graph; ORDER BY sorts the result; LIMIT caps the number of rows returned.
Query 3: price of the product 'Chocolade', sorted: using WHERE and ORDER BY
// Variant 1:
match (p:Product)
where p.productName = 'Chocolade'
return p.productName,p.unitPrice order by p.unitPrice DESC limit 10;
// Variant 2:
match (p:Product {productName : 'Chocolade'})
return p.productName,p.unitPrice order by p.unitPrice DESC limit 10;
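Variant 1 locates the product with WHERE; variant 2 does it by specifying the property inside the MATCH pattern.
Query 4: prices of the products 'Chocolade' and 'Chai', sorted: using WHERE and ORDER BY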
match (p:Product)
where p.productName IN ['Chocolade','Chai']
return p.productName,p.unitPrice order by p.unitPrice DESC limit 10;
Query 5: filtering on conditions: using WHERE
MATCH (p:Product)
WHERE p.productName STARTS WITH "C" AND p.unitPrice > 100
RETURN p.productName, p.unitPrice;
Meaning: select the products whose productName starts with 'C' and whose unitPrice is greater than 100.
Using indexes
To speed up lookups on a given property, you can create an index:
CREATE INDEX ON :Product(productName);
CREATE INDEX ON :Product(unitPrice);
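Whether an index is actually used can be checked by prefixing a query with PROFILE, which runs it and shows the execution plan. A sketch reusing the Chocolade lookup; I would expect an index seek operator (NodeIndexSeek) rather than a label scan (NodeByLabelScan) once the index on :Product(productName) is online, though the exact plan can differ between Neo4j versions:

```cypher
// PROFILE executes the query and reports the operators and rows touched.
PROFILE
MATCH (p:Product {productName: 'Chocolade'})
RETURN p.productName, p.unitPrice;
```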
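Query 6: who bought 'Chocolade'? (a join-style query)
This involves four tables:
- Product, the product table, keyed by productID;
- Customer, the customer table, keyed by CustomerID;
- orders, a linking table: orderID + CustomerID;
- orders_Details, a linking table: orderID + productID
// Correct:
MATCH (p:Product {productName:"Chocolade"})<-[:PRODUCT]-(:Order)<-[:PURCHASED]-(c:Customer)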
RETURN distinct c.companyName;
// Wrong
// The pattern must start and end with a node; this one stops at a dangling relationship [] and never reaches Product
MATCH (c:Customer)-[:PURCHASED]
RETURN distinct c.companyName
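// Something to think about: why is this wrong, and why does OPTIONAL MATCH not fix it?
match (c:Customer)
where (p:Product {productName:"Chocolade"})<-[:Product]-(:Order)<-[:PURCHASED]-(c)
return distinct c.companyName
My first thought here was that Product has to be the "main table". In fact the pattern only has to follow the relationships as they were created, Customer -> Order -> Product; it can be written starting from either end as long as the arrows point the same way.
On the question above: with OPTIONAL MATCH the command returns every c.companyName, not just the customers who bought Chocolade, because OPTIONAL MATCH is a step that binds variables from a pattern, not a step that adds a constraint. The WHERE form shown here does not work either: a pattern used in WHERE can only test variables that are already bound, so it cannot introduce the new variable p (and the relationship type is PRODUCT, not Product).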
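For comparison, the WHERE attempt can be turned into a working query by using the pattern purely as a predicate, with no new variable inside it. A sketch of that form, which should return the same customers as the correct MATCH version:

```cypher
// The pattern in WHERE only tests existence; c is already bound by MATCH,
// and the Product node is left anonymous so no new variable is introduced.
MATCH (c:Customer)
WHERE (c)-[:PURCHASED]->(:Order)-[:PRODUCT]->(:Product {productName: "Chocolade"})
RETURN DISTINCT c.companyName;
```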
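Query 7: what did this customer buy, and how much? (aggregation)
What did 'Drachenblut Delikatessen' buy, and what was the volume per product (unit price times quantity, summed)?
Turning the match between the customer and its orders into an optional match makes it equivalent to an outer join.
// Variant 1: plain MATCH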
MATCH (p:Product)<-[pu:PRODUCT]-(:Order)<-[:PURCHASED]-(c:Customer {companyName:"Drachenblut Delikatessen"})
RETURN p.productName, toInt(sum(pu.unitPrice * pu.quantity)) AS volume
ORDER BY volume DESC;
// Variant 2: OPTIONAL MATCH
MATCH (c:Customer {companyName:"Drachenblut Delikatessen"})
OPTIONAL MATCH (p:Product)<-[pu:PRODUCT]-(:Order)<-[:PURCHASED]-(c)
RETURN p.productName, toInt(sum(pu.unitPrice * pu.quantity)) AS volume
ORDER BY volume DESC ;
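As I see it, OPTIONAL MATCH is mostly a binding operation, and it can also pick up the part of a pattern that does not fit into the MATCH. Unlike MATCH, it keeps the row and binds null when nothing matches.
In variant 2, MATCH first binds the variable, and OPTIONAL MATCH then supplies the connecting pattern.
Here toInt() converts to an integer (newer Neo4j versions use toInteger() instead) and sum() adds the values up; AS volume names the resulting column 'volume'.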
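The outer-join behaviour is easiest to see on customers that have no orders at all. A sketch, assuming the PURCHASED relationships from section 2.1 exist; whether any customer really has zero orders depends on the data:

```cypher
// With a plain MATCH on the second pattern, customers without orders would vanish.
// OPTIONAL MATCH keeps them with o = null, and count(o) ignores nulls, giving 0.
MATCH (c:Customer)
OPTIONAL MATCH (c)-[:PURCHASED]->(o:Order)
RETURN c.companyName, count(o) AS orders
ORDER BY orders ASC
LIMIT 5;
```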
Query 8: counting by employee ID
MATCH (:Order)<-[:SOLD]-(e:Employee)
RETURN e.employeeID,count(*) AS cnt ORDER BY cnt DESC LIMIT 10
Rows are grouped by e.employeeID and counted with count(*).
| e.employeeID | cnt |
|---|---|
| "4" | 156 |
| "3" | 127 |
| "1" | 123 |
Query 9: returning the result as a list/array
MATCH (o:Order)<-[:SOLD]-(e:Employee)
RETURN collect(e.lastName)
collect aggregates the values into a list (array).
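Like count(*), collect groups by whatever non-aggregated columns appear in RETURN. A sketch that returns one row per employee instead of one big list; the column names are my own, and the slice only keeps a few IDs for display, in no guaranteed order:

```cypher
// One row per employee: collect() gathers that employee's order IDs into a list.
MATCH (o:Order)<-[:SOLD]-(e:Employee)
RETURN e.lastName AS employee,
       count(o) AS orders,
       collect(o.orderID)[0..3] AS sampleOrderIDs
ORDER BY orders DESC;
```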
4. Advanced queries
Query 1: Which Employee had the Highest Cross-Selling Count of 'Chocolade' and Which Product?
The query is:
MATCH (choc:Product {productName:'Chocolade'})<-[:PRODUCT]-(:Order)<-[:SOLD]-(employee),
(employee)-[:SOLD]->(o2)-[:PRODUCT]->(other:Product)
RETURN employee.employeeID, other.productName, count(distinct o2) as count
ORDER BY count DESC
LIMIT 5;
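In [:PRODUCT]-(:Order), [] holds a relationship type and () holds a node label;
First pattern: (employee)-(:Order)-(choc:Product), which finds the employees who sold the product named Chocolade
Second pattern: (employee)-()-(other:Product), which finds the other Products (all of them) that those employees sold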
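Because the second pattern matches all products these employees sold, Chocolade itself can show up as an "other" product. If that is not wanted, one option is to exclude it explicitly. A sketch of that variant, otherwise identical to the query above:

```cypher
MATCH (choc:Product {productName:'Chocolade'})<-[:PRODUCT]-(:Order)<-[:SOLD]-(employee),
      (employee)-[:SOLD]->(o2)-[:PRODUCT]->(other:Product)
WHERE other <> choc   // drop Chocolade from the cross-sold products
RETURN employee.employeeID, other.productName, count(distinct o2) as count
ORDER BY count DESC
LIMIT 5;
```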
Query 2: How are Employees Organized? Who Reports to Whom?
MATCH path = (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN e.employeeID AS manager, sub.employeeID AS employee;
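A simple pattern that finds Employees connected by a REPORTS_TO relationship. Here e is the manager and sub is the subordinate.
Note that employee 5 has people reporting to him, but he also reports to employee 2.
The logic here is that managers and subordinates all carry the same Employee label, so the REPORTS_TO relationship is the way in.
Query 3: Which Employees Report to Each Other Indirectly?
This goes a step beyond query 2: indirect reports.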
MATCH path = (e:Employee)<-[:REPORTS_TO*]-(sub)
// After WITH, path is a list of intermediate employee IDs, so its length is taken with SIZE
WITH e, sub, [person in NODES(path) | person.employeeID][1..-1] AS path
RETURN e.employeeID AS manager, sub.employeeID AS employee, CASE WHEN SIZE(path) = 0 THEN "Direct Report" ELSE path END AS via
ORDER BY SIZE(path);
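Step one follows the same logic as query 2: staying within the Employee label, find the employees connected by REPORTS_TO (the * makes the path variable-length).
Step two uses WITH: a WITH clause chains query parts together, passing the result of the previous part on as the starting point of the next.
(Ha, the rest is still a bit unclear to me; I'll fill it in after reading up on it...)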
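As far as I can tell, the remaining piece is plain list handling: the list comprehension turns the nodes of the path into employee IDs, and the slice [1..-1] drops the first element (the manager) and the last (the subordinate), leaving only the people in between. A standalone sketch with made-up ID lists, no graph needed:

```cypher
// Hypothetical ID chains, in the order NODES(path) would list them (manager first).
WITH ["2", "5", "6"] AS indirect, ["2", "5"] AS direct
RETURN indirect[1..-1] AS viaIndirect,   // ["5"]: one person in between
       direct[1..-1] AS viaDirect;       // []: nobody in between, a direct report
```

An empty list is exactly the case the CASE expression relabels as "Direct Report".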
Query 4: How Many Orders were Made by Each Part of the Hierarchy?
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[:REPORTS_TO*0..]-(sub)-[:SOLD]->(order)
RETURN e.employeeID, [x IN COLLECT(DISTINCT sub.employeeID) WHERE x <> e.employeeID] AS reports, COUNT(distinct order) AS totalOrders
ORDER BY totalOrders DESC;
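The key detail in this query is the lower bound in [:REPORTS_TO*0..]: zero hops are allowed, so sub also matches e itself. That is how each employee's own orders are included in totalOrders, and why the list comprehension filters out x = e.employeeID from reports. A sketch that shows this for a single employee, assuming employeeID is stored as a string as in the load above:

```cypher
// *0.. also matches the zero-length path, so the result includes employee "5" itself
// alongside everyone who reports to "5" directly or indirectly.
MATCH (e:Employee {employeeID: "5"})<-[:REPORTS_TO*0..]-(sub)
RETURN sub.employeeID AS member;
```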