mit6.830 - lab1 - Storage Model - Solutions
1.Intro
github : https://github.com/CreatorsStack/CreatorDB
lab1 implements the database's basic storage structures, specifically: Tuple, TupleDesc, HeapPage, HeapFile, SeqScan, BufferPool, and so on.
- Tuple and TupleDesc are the most basic elements of a database table. A Tuple is a collection of Fields; a TupleDesc is a table's meta-data, recording each column's field name and type.
- HeapPage and HeapFile are implementations of the Page and DbFile interfaces, respectively.
- BufferPool is the cache: getPage looks there first, and only on a miss calls the file's readPage to read the corresponding page from disk; pages read from disk are cached in the pool.
- SeqScan traverses all the tuples of a table, wrapping HeapFile's iterator.
Here is a rough diagram of how they relate:
2.SimpleDB Architecture and Implementation Guide
2.1. The Database Class
The Database class provides the static, global objects used throughout the database. It includes methods for accessing the catalog (the set of all tables in the database), the buffer pool (the set of database file pages currently resident in memory), and the log file. You do not need to worry about the log file in this lab.
2.2. Fields and Tuples
In a database, rows are called records or tuples.
Exercise 1
Implement the skeleton methods in:
- src/simpledb/TupleDesc.java
- src/simpledb/Tuple.java
At this point, your code should pass the unit tests TupleTest and TupleDescTest. At this point, modifyRecordId() should fail because you haven't implemented it yet.
TupleDesc defines the structure of a Tuple. Here it is a list of TDItem objects; each TDItem holds two attributes, fieldType and fieldName, which together describe one column of the schema.
The TupleDesc.java code is as follows:
private List<TDItem> descList;
private int fieldNum;
public static class TDItem implements Serializable {
private static final long serialVersionUID = 1L;
/**
* The type of the field
* */
public final Type fieldType;
/**
* The name of the field
* */
public String fieldName;
public TDItem(Type t, String n) {
this.fieldName = n;
this.fieldType = t;
}
}
public TupleDesc(Type[] typeAr, String[] fieldAr) {
// some code goes here
if (typeAr.length != fieldAr.length) {
throw new IllegalArgumentException("typeAr length must equal fieldAr length");
}
this.descList = new ArrayList<>(typeAr.length);
this.fieldNum = typeAr.length;
for (int i = 0; i < typeAr.length; i++) {
final TDItem item = new TDItem(typeAr[i], fieldAr[i]);
this.descList.add(item);
}
}
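The remaining TupleDesc accessors fall out of this layout. Below is a hedged, self-contained sketch (with a simplified stand-in for simpledb's Type, assuming 4-byte ints and 132-byte strings; the real enum may differ) showing how numFields, getSize, and fieldNameToIndex might be written:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

public class TupleDescSketch {
    // Simplified stand-in for simpledb.Type (byte lengths are assumptions).
    enum Type { INT_TYPE(4), STRING_TYPE(132); final int len; Type(int l) { len = l; } }

    static class TDItem {
        final Type fieldType; final String fieldName;
        TDItem(Type t, String n) { fieldType = t; fieldName = n; }
    }

    private final List<TDItem> descList = new ArrayList<>();

    TupleDescSketch(Type[] typeAr, String[] fieldAr) {
        for (int i = 0; i < typeAr.length; i++) descList.add(new TDItem(typeAr[i], fieldAr[i]));
    }

    int numFields() { return descList.size(); }

    // Total size in bytes of a tuple with this schema.
    int getSize() {
        int size = 0;
        for (TDItem item : descList) size += item.fieldType.len;
        return size;
    }

    // Linear scan from field name to index; throws if the name is absent.
    int fieldNameToIndex(String name) {
        for (int i = 0; i < descList.size(); i++)
            if (descList.get(i).fieldName.equals(name)) return i;
        throw new NoSuchElementException("no field named " + name);
    }

    public static void main(String[] args) {
        TupleDescSketch td = new TupleDescSketch(
            new Type[] { Type.INT_TYPE, Type.STRING_TYPE },
            new String[] { "id", "name" });
        if (td.numFields() != 2) throw new AssertionError();
        if (td.getSize() != 136) throw new AssertionError();
        if (td.fieldNameToIndex("name") != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```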
The Tuple code is as follows:
private TupleDesc tupleDesc;
private Field[] fields;
private RecordId recordId;
public Tuple(TupleDesc td) {
// some code goes here
this.tupleDesc = td;
this.fields = new Field[this.tupleDesc.numFields()];
}
public void setField(int i, Field f) {
// some code goes here
if (i >= this.tupleDesc.numFields()) {
return;
}
this.fields[i] = f;
}
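One thing worth noting: the setField above silently ignores an out-of-range index, which can hide bugs. Here is a hedged standalone sketch (using plain ints as a stand-in for simpledb's Field objects) that bounds-checks instead, together with a tab-separated toString similar to what Tuple's javadoc describes:

```java
public class TupleSketch {
    // Stand-in for the Field[] array: here a field is just an int value.
    private final int[] fields;

    TupleSketch(int numFields) { fields = new int[numFields]; }

    void setField(int i, int f) {
        // Fail fast instead of silently ignoring, so bugs surface early.
        if (i < 0 || i >= fields.length)
            throw new IllegalArgumentException("field index out of range: " + i);
        fields[i] = f;
    }

    // Tab-separated field values, one line per tuple.
    @Override public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch(3);
        t.setField(0, 1); t.setField(1, 2); t.setField(2, 3);
        if (!t.toString().equals("1\t2\t3")) throw new AssertionError();
        System.out.println(t);
    }
}
```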
2.3. Catalog
The Catalog class describes a database instance. It holds the tables currently in the database along with their schema information. In this exercise we need to support adding new tables and retrieving information about a particular table; retrieval uses the table's associated TupleDesc to determine the types and number of fields involved.
Across SimpleDB the Catalog is globally unique; it is obtained via Database.getCatalog(), and the global buffer pool is obtained via Database.getBufferPool().
Exercise 2
Implement the skeleton methods in:
- src/simpledb/Catalog.java
At this point, your code should pass the unit tests in CatalogTest.
To keep each table's information together, I added an extra TableInfo class:
public class TableInfo {
private int tableId;
private String tableName;
private DbFile dbFile;
private String primaryKeyName;
}
The Catalog.java code is as follows:
private final Map<Integer, TableInfo> tableInfoMap;
// maps table name -> tableId
private final Map<String, Integer> nameToIdMap;
/**
* Constructor.
* Creates a new, empty catalog.
*/
public Catalog() {
// some code goes here
this.tableInfoMap = new HashMap<>();
this.nameToIdMap = new HashMap<>();
}
public void addTable(DbFile file, String name, String pkeyField) {
// some code goes here
final int tableId = file.getId();
final TableInfo tableInfo = new TableInfo(tableId, name, file, pkeyField);
this.tableInfoMap.put(tableId, tableInfo);
this.nameToIdMap.put(name, tableId);
}
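With the two maps in place, the lookup methods are simple gets plus a not-found check. A hedged standalone sketch (with the table name standing in for the full TableInfo object):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class CatalogSketch {
    // Minimal stand-in for the real maps: id -> name and name -> id.
    private final Map<Integer, String> idToName = new HashMap<>();
    private final Map<String, Integer> nameToId = new HashMap<>();

    void addTable(int tableId, String name) {
        // Re-adding under the same name replaces the old binding,
        // matching the put() semantics used in addTable above.
        idToName.put(tableId, name);
        nameToId.put(name, tableId);
    }

    String getTableName(int tableId) {
        String name = idToName.get(tableId);
        if (name == null) throw new NoSuchElementException("unknown table id " + tableId);
        return name;
    }

    int getTableId(String name) {
        Integer id = nameToId.get(name);
        if (id == null) throw new NoSuchElementException("unknown table " + name);
        return id;
    }

    public static void main(String[] args) {
        CatalogSketch c = new CatalogSketch();
        c.addTable(42, "users");
        if (!c.getTableName(42).equals("users")) throw new AssertionError();
        if (c.getTableId("users") != 42) throw new AssertionError();
        System.out.println("ok");
    }
}
```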
2.4. BufferPool
The buffer pool (the BufferPool class in SimpleDB) is also globally unique; it caches pages that were recently accessed. All reads and writes against the files on disk go through the buffer pool. BufferPool's numPages parameter fixes the maximum number of pages the pool can hold, and we can pair it with an LRU (least-recently-used) eviction policy to implement the BufferPool.
In addition, the Database class provides a static method, Database.getBufferPool(), which returns a reference to the single BufferPool instance of the whole SimpleDB process.
Exercise 3
Implement the getPage() method in:
- src/simpledb/BufferPool.java
We have not provided unit tests for BufferPool. The functionality you implemented will be tested in the implementation of HeapFile below. You should use the DbFile.readPage method to access pages of a DbFile.
The LruCache code:
public class LruCache<K, V> {
// LruCache node
public class Node {
public Node pre;
public Node next;
public K key;
public V value;
public Node(final K key, final V value) {
this.key = key;
this.value = value;
}
}
private final int maxSize;
private final Map<K, Node> nodeMap;
private final Node head;
private final Node tail;
public LruCache(int maxSize) {
this.maxSize = maxSize;
this.head = new Node(null, null);
this.tail = new Node(null, null);
this.head.next = tail;
this.tail.pre = head;
this.nodeMap = new HashMap<>();
}
public void linkToHead(Node node) {
Node next = this.head.next;
node.next = next;
node.pre = this.head;
this.head.next = node;
next.pre = node;
}
public void moveToHead(Node node) {
removeNode(node);
linkToHead(node);
}
public void removeNode(Node node) {
if (node.pre != null && node.next != null) {
node.pre.next = node.next;
node.next.pre = node.pre;
}
}
public Node removeLast() {
Node last = this.tail.pre;
removeNode(last);
return last;
}
public synchronized void remove(K key) {
if (this.nodeMap.containsKey(key)) {
final Node node = this.nodeMap.get(key);
removeNode(node);
this.nodeMap.remove(key);
}
}
public synchronized V get(K key) {
if (this.nodeMap.containsKey(key)) {
Node node = this.nodeMap.get(key);
moveToHead(node);
return node.value;
}
return null;
}
public synchronized V put(K key, V value) {
if (this.nodeMap.containsKey(key)) {
Node node = this.nodeMap.get(key);
node.value = value;
moveToHead(node);
} else {
// We can't remove page here, because we should implement the logic of evict page in BufferPool
// if (this.nodeMap.size() == this.maxSize) {
// Node last = removeLast();
// this.nodeMap.remove(last.key);
// return last.value;
// }
Node node = new Node(key, value);
this.nodeMap.put(key, node);
linkToHead(node);
}
return null;
}
}
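As a behavioural cross-check of this hand-rolled list, the JDK's LinkedHashMap in access-order mode produces the same least-recently-used ordering. The sketch below only illustrates the semantics; BufferPool still needs the custom cache so it can control eviction itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruDemo {
    public static void main(String[] args) {
        // accessOrder=true makes iteration run from least- to most-recently used.
        Map<Integer, String> lru = new LinkedHashMap<>(16, 0.75f, true);
        lru.put(1, "page1");
        lru.put(2, "page2");
        lru.put(3, "page3");
        lru.get(1); // touch page1 so it becomes most-recently used
        // The least-recently-used entry is now key 2, the natural eviction victim.
        Integer victim = lru.entrySet().iterator().next().getKey();
        if (victim != 2) throw new AssertionError();
        System.out.println("victim: " + victim);
    }
}
```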
The BufferPool.getPage() code is as follows:
public Page getPage(TransactionId tid, PageId pid, Permissions perm) throws TransactionAbortedException,
DbException {
final Page page = this.lruCache.get(pid);
if (page != null) {
return page;
}
return loadPageAndCache(pid);
}
private Page loadPageAndCache(final PageId pid) throws DbException {
final DbFile dbFile = Database.getCatalog().getDatabaseFile(pid.getTableId());
final Page dbPage = dbFile.readPage(pid);
if (dbPage != null) {
this.lruCache.put(pid, dbPage);
if (this.lruCache.getSize() == this.lruCache.getMaxSize()) {
// evict a cached page once the pool is full
evictPage();
}
}
return dbPage;
}
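The cache-or-load shape of getPage/loadPageAndCache can be illustrated with a tiny generic sketch. The names here (getOrLoad, the map-based pool, the first-key eviction) are hypothetical stand-ins, not SimpleDB APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class GetPageFlow {
    // Generic cache-or-load: the shape of BufferPool.getPage above.
    static <K, V> V getOrLoad(Map<K, V> cache, K key, Function<K, V> loader, int maxSize) {
        V cached = cache.get(key);
        if (cached != null) return cached;   // hit: serve from the pool
        V loaded = loader.apply(key);        // miss: go to the DbFile
        cache.put(key, loaded);
        if (cache.size() > maxSize) {
            // Evict before the pool overflows (stand-in for evictPage()).
            K victim = cache.keySet().iterator().next();
            cache.remove(victim);
        }
        return loaded;
    }

    public static void main(String[] args) {
        Map<Integer, String> pool = new HashMap<>();
        int[] diskReads = {0};
        Function<Integer, String> loader = pid -> { diskReads[0]++; return "page-" + pid; };
        getOrLoad(pool, 1, loader, 2);
        getOrLoad(pool, 1, loader, 2); // second access is a cache hit
        if (diskReads[0] != 1) throw new AssertionError();
        System.out.println("disk reads: " + diskReads[0]);
    }
}
```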
2.5. HeapFile access method
An access method provides a way to read and write data from disk; this includes heap files and B-trees. Here, only the heap file access method needs to be implemented.
A HeapFile object holds a set of physical pages, each of a fixed size defined by BufferPool.DEFAULT_PAGE_SIZE and storing row data. In SimpleDB, every table corresponds to one HeapFile object. Each page in a HeapFile contains a number of slots, each slot being the space reserved for one row. In addition to the slots, every physical page has a header: a bitmap with one bit per tuple slot. If the bit for a given tuple is 1, the tuple is valid; otherwise it is invalid (deleted, or never initialized). The pages of a HeapFile are of type HeapPage; they are cached in the buffer pool and read and written through the HeapFile class.
You need to:
- compute the number of tuples that fit on each page;
- compute the number of bytes the header requires.
Hint: all Java virtual machines are big-endian.
- Big-endian means the low-order bytes of a value are stored at the higher memory addresses and the high-order bytes at the lower addresses.
- Little-endian means the low-order bytes are stored at the lower addresses and the high-order bytes at the higher addresses.
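The two calculations use the formulas from the lab: each tuple costs tupleSize * 8 bits of data plus 1 header bit, so tuplesPerPage = floor(pageSize * 8 / (tupleSize * 8 + 1)) and headerBytes = ceil(tuplesPerPage / 8). A small sketch with the default 4096-byte page and a hypothetical 8-byte tuple:

```java
public class HeapPageLayout {
    public static void main(String[] args) {
        int pageSize = 4096; // BufferPool.DEFAULT_PAGE_SIZE
        int tupleSize = 8;   // e.g. two INT fields, 4 bytes each
        // Each tuple consumes tupleSize * 8 data bits plus 1 header bit.
        int tuplesPerPage = (pageSize * 8) / (tupleSize * 8 + 1);
        int headerBytes = (int) Math.ceil(tuplesPerPage / 8.0);
        if (tuplesPerPage != 504 || headerBytes != 63) throw new AssertionError();
        System.out.println(tuplesPerPage + " tuples, " + headerBytes + " header bytes");

        // Slot bits are packed low-bit-first within each header byte:
        byte[] header = { (byte) 0b00000101 }; // slots 0 and 2 used
        boolean slot1Used = ((header[0] >> 1) & 1) == 1;
        if (slot1Used) throw new AssertionError();
        System.out.println("slot 1 used: " + slot1Used);
    }
}
```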
Exercise 4
Implement the skeleton methods in:
- src/simpledb/HeapPageId.java
- src/simpledb/RecordID.java
- src/simpledb/HeapPage.java
Although you will not use them directly in Lab 1, we ask you to implement getNumEmptySlots() and isSlotUsed() in HeapPage. These require pushing around bits in the page header. You may find it helpful to look at the other methods that have been provided in HeapPage or in src/simpledb/HeapFileEncoder.java to understand the layout of pages.
You will also need to implement an Iterator over the tuples in the page, which may involve an auxiliary class or data structure.
At this point, your code should pass the unit tests in HeapPageIdTest, RecordIDTest, and HeapPageReadTest.
After you have implemented HeapPage, you will write methods for HeapFile in this lab to calculate the number of pages in a file and to read a page from the file. You will then be able to fetch tuples from a file stored on disk.
HeapPageId and RecordId are fairly straightforward.
The HeapPage code is as follows; the key is understanding the correspondence between header bits and slots:
/**
* Returns the number of empty slots on this page.
*/
public int getNumEmptySlots() {
// some code goes here
int emptyNum = 0;
for (int i = 0; i < getNumTuples(); i++) {
if (!isSlotUsed(i)) {
emptyNum++;
}
}
return emptyNum;
}
/**
* Returns true if associated slot on this page is filled.
*/
public boolean isSlotUsed(int i) {
// some code goes here
// For Example, byte = 11110111 and posIndex = 3 -> we want 0
int byteIndex = i / 8;
int posIndex = i % 8;
byte target = this.header[byteIndex];
return (byte) (target << (7 - posIndex)) < 0;
}
/**
* Abstraction to fill or clear a slot on this page.
*/
private void markSlotUsed(int i, boolean value) {
// some code goes here
// not necessary for lab1
int byteIndex = i / 8;
int posIndex = i % 8;
byte v = (byte) (1 << posIndex);
byte headByte = this.header[byteIndex];
this.header[byteIndex] = value ? (byte) (headByte | v) : (byte) (headByte & ~v);
}
Exercise 5
Implement the skeleton methods in:
- src/simpledb/HeapFile.java
To read a page from disk, you will first need to calculate the correct offset in the file. Hint: you will need random access to the file in order to read and write pages at arbitrary offsets. You should not call BufferPool methods when reading a page from disk.
You will also need to implement the HeapFile.iterator() method, which should iterate through the tuples of each page in the HeapFile. The iterator must use the BufferPool.getPage() method to access pages in the HeapFile. This method loads the page into the buffer pool and will eventually be used (in a later lab) to implement locking-based concurrency control and recovery. Do not load the entire table into memory on the open() call -- this will cause an out of memory error for very large tables.
At this point, your code should pass the unit tests in HeapFileReadTest.
The HeapFile.java code is as follows.
To read a page from the file we can use Java's RandomAccessFile, which supports seeking to an arbitrary offset via seek():
// see DbFile.java for javadocs
public Page readPage(PageId pid) {
// some code goes here
final int pos = BufferPool.getPageSize() * pid.getPageNumber();
byte[] pageData = new byte[BufferPool.getPageSize()];
    try {
        this.randomAccessFile.seek(pos);
        this.randomAccessFile.read(pageData, 0, pageData.length);
        return new HeapPage((HeapPageId) pid, pageData);
    } catch (final IOException e) {
        e.printStackTrace();
    }
    return null;
}
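The offset arithmetic can be exercised end-to-end against a throwaway file. This sketch writes three toy pages, then seeks back to one of them the same way readPage does (a 64-byte toy page size stands in for the default 4096):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PageOffsetDemo {
    public static void main(String[] args) {
        int pageSize = 64; // toy size; SimpleDB's default is 4096
        try {
            File f = File.createTempFile("heapfile", ".dat");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                // Write three pages whose first byte records the page number.
                for (int p = 0; p < 3; p++) {
                    byte[] page = new byte[pageSize];
                    page[0] = (byte) p;
                    raf.seek((long) p * pageSize);
                    raf.write(page);
                }
                // numPages is just file length divided by page size.
                int numPages = (int) (raf.length() / pageSize);
                // Read page 2 back by seeking to pageSize * pageNumber, as readPage does.
                byte[] buf = new byte[pageSize];
                raf.seek(2L * pageSize);
                raf.readFully(buf);
                if (numPages != 3 || buf[0] != 2) throw new AssertionError();
                System.out.println(numPages + " pages, read back page " + buf[0]);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```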
2.6. Operators
Database operators are responsible for the actual execution of queries. In SimpleDB, operators follow the volcano (iterator) model: every operator implements a next() method.
When SimpleDB interacts with a program, getNext is first called on the root operator, which in turn calls getNext on its children, and so on down the tree until the leaf operators are reached. The leaves read tuples from disk and pass them up the tree. As shown in the figure:
In this lab, only one SimpleDB operator needs to be implemented: SeqScan.
Exercise 6.
Implement the skeleton methods in:
- src/simpledb/SeqScan.java
This operator sequentially scans all of the tuples from the pages of the table specified by the tableid in the constructor. This operator should access tuples through the DbFile.iterator() method.
At this point, you should be able to complete the ScanTest system test. Good work!
You will fill in other operators in subsequent labs.
The SeqScan.java code is as follows:
public SeqScan(TransactionId tid, int tableid, String tableAlias) {
// some code goes here
this.tid = tid;
this.tableId = tableid;
this.tableAlias = tableAlias;
// the DbFile's iterator, which can traverse every page of the file
this.dbFileIterator = Database.getCatalog().getDatabaseFile(tableid).iterator(tid);
}
public SeqScan(TransactionId tid, int tableId) {
this(tid, tableId, Database.getCatalog().getTableName(tableId));
}
public void open() throws DbException, TransactionAbortedException {
// some code goes here
this.dbFileIterator.open();
}
public boolean hasNext() throws TransactionAbortedException, DbException {
// some code goes here
return this.dbFileIterator.hasNext();
}
public Tuple next() throws NoSuchElementException, TransactionAbortedException, DbException {
// some code goes here
final Tuple next = this.dbFileIterator.next();
    final Tuple result = new Tuple(getTupleDesc());
    result.setRecordId(next.getRecordId());
    for (int i = 0; i < next.getTupleDesc().numFields(); i++) {
        result.setField(i, next.getField(i));
    }
return result;
}
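SeqScan's getTupleDesc is expected to prefix each field name with the table alias, so later query-parsing labs can resolve names like t.id. A hedged sketch of just that prefixing step (the helper name aliased is hypothetical):

```java
public class AliasDemo {
    // SeqScan prefixes every field name with the table alias, so that
    // queries like "SELECT t.id FROM table t" can resolve "t.id".
    static String[] aliased(String alias, String[] fieldNames) {
        String[] out = new String[fieldNames.length];
        for (int i = 0; i < fieldNames.length; i++)
            out[i] = alias + "." + fieldNames[i];
        return out;
    }

    public static void main(String[] args) {
        String[] names = aliased("t", new String[] { "id", "name" });
        if (!names[0].equals("t.id") || !names[1].equals("t.name"))
            throw new AssertionError();
        System.out.println(names[0] + " " + names[1]);
    }
}
```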
2.7. A simple query
This section shows how to combine the pieces above to run a simple query.
Suppose we have a data file "some_data_file.txt" with the following contents:
1,1,1
2,2,2
3,4,4
It can be converted into a binary file that SimpleDB can query with java -jar dist/simpledb.jar convert some_data_file.txt 3, where the argument 3 tells the converter the input has 3 columns.
The following code performs a simple query over the file, equivalent to the SQL statement SELECT * FROM some_data_file.
package simpledb;
import java.io.*;
public class test {
public static void main(String[] argv) {
// construct a 3-column table schema
Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE };
String names[] = new String[]{ "field0", "field1", "field2" };
TupleDesc descriptor = new TupleDesc(types, names);
// create the table, associate it with some_data_file.dat
// and tell the catalog about the schema of this table.
HeapFile table1 = new HeapFile(new File("some_data_file.dat"), descriptor);
Database.getCatalog().addTable(table1, "test");
// construct the query: we use a simple SeqScan, which spoonfeeds
// tuples via its iterator.
TransactionId tid = new TransactionId();
SeqScan f = new SeqScan(tid, table1.getId());
try {
// and run it
f.open();
while (f.hasNext()) {
Tuple tup = f.next();
System.out.println(tup);
}
f.close();
Database.getBufferPool().transactionComplete(tid);
} catch (Exception e) {
System.out.println ("Exception : " + e);
}
}
}