
Coursera Assignment: Huffman Coding Trees


https://class.coursera.org/progfun-003/assignment/view?assignment_id=15

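Preface (a bit of rambling):
This assignment took longer than the previous ones, about 6 hours in total. For one exercise I glanced at another student's approach (his solution looked wrong, but that glance still gave me some inspiration).
For another exercise I read a small hint from a TA.

After these few assignments I have found that working through Coursera exercises is more fun than playing games. My first submission scored 9.55, which bothered me; I was not satisfied until I reached 10. Is that a kind of compulsion, or just perfectionism?
I also think the instructor's exercises are excellent. Some of the solutions are really clever and classic, and the hints are pitched just right and often make everything click.
It brings back the feeling of solving problems in high school after all these years: you think a problem over again and again, an idea suddenly strikes, you write down a few words, and it is correct. This course's assignments feel the same. In functional programming what matters is not how much code you write but how precise it is; often fewer than 10 lines implement a function with complex behavior.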

Written on the evening of 2013-12-14


A short introduction to Huffman coding trees is omitted here.


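In the figure, each number is a weight and each string lists the characters encoded under that node.

Note:
Note that a given encoding is only optimal if the character frequencies in the encoded text match the weights in the code tree.
In other words, a Huffman code tree can only be called optimal when the character frequencies of the text equal the weights stored in the tree.

One exercise in the assignment, if not thought through carefully, produces a Huffman code tree that is not optimal, and loses points for it.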

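Encoding
Given a code tree, walk from the root to the leaf of the character: append 0 for every step to the left and 1 for every step to the right. In the code tree above, for example, the letter D is encoded as 1011.

Decoding
Decoding runs the other way: given a code tree and an encoded bit sequence, walk from the root, following one bit per step, until a leaf is reached, output that leaf's character, and repeat from the root. For example, 10001010 decodes to BAC.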
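
To make the two walks concrete, here is a minimal sketch that uses a lookup table in place of the tree. The codes for A, B and C are an assumption: they were chosen to be consistent with the two examples above (D = 1011 and 10001010 = BAC), not copied from the figure. The names TableDemo, exampleCodes, encodeWithTable and decodeWithTable are hypothetical.

```scala
object TableDemo {
  // Assumed codes, consistent with the examples in the text (not taken from the figure).
  val exampleCodes: Map[Char, String] =
    Map('A' -> "0", 'B' -> "100", 'C' -> "1010", 'D' -> "1011")

  // Encoding: concatenate the code of every character.
  def encodeWithTable(text: String): String =
    text.toList.map(exampleCodes).mkString

  // Decoding: grow a prefix bit by bit until it equals a code,
  // emit that character, then start over with an empty prefix.
  def decodeWithTable(bits: String): String = {
    val byCode = exampleCodes.map(_.swap)
    val (out, _) = bits.foldLeft(("", "")) { case ((decoded, prefix), bit) =>
      val next = prefix + bit
      byCode.get(next) match {
        case Some(ch) => (decoded + ch, "")
        case None     => (decoded, next)
      }
    }
    out
  }
}
```

Growing the prefix until it matches only works because no code is a prefix of another, which is exactly what placing characters only at the leaves guarantees.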

Exercise:
The given classes:
abstract class CodeTree
case class Fork (left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
case class Leaf(char: Char, weight: Int) extends CodeTree



Warm-up exercises:

Write a function weight that returns the weight of a tree.
def weight(tree: CodeTree): Int = tree match ...
Write a function chars which returns the list of characters defined in a given Huffman tree.
def chars(tree: CodeTree): List[Char] = tree match ...

def weight(tree: CodeTree): Int = tree match {
  case Fork(left, right, chars, wght) => weight(left) + weight(right)
  case Leaf(char, wght)               => wght
}

def chars(tree: CodeTree): List[Char] = tree match {
  case Fork(left, right, chs, wght) => chars(left) ::: chars(right)
  case Leaf(ch, weight)             => List(ch)
}



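Neither of these is hard. After the recursion practice in the earlier assignments they are easy to write, using a divide-and-conquer split into the left and right subtrees.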

Exercise:
Build the Huffman code tree
Given a text, it’s possible to calculate and build an optimal Huffman tree in the sense that the encoding of that text will be of the minimum possible length, meanwhile keeping all information (i.e., it is lossless).
Given a string, build an optimal Huffman code tree, so that the encoding of that string is as short as possible.

This means eventually implementing the following function:
def createCodeTree(chars: List[Char]): CodeTree = ...

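We split the implementation into several steps (that is, we implement some helper functions first).

Write a function times that counts how many times each character occurs in the input.
For example, times(List('a', 'b', 'a')) returns List(('a', 2), ('b', 1)).
The result works like a HashMap.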
def times(chars: List[Char]): List[(Char, Int)] = ...

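My idea:
Use an accumulator list as the return value. For each character, first check whether the accumulator already contains it. If it does, drop the matching entry and put a new tuple (key, value + 1) in its place; if it does not, append (key, 1) at the end.
Because Scala's List and tuples are immutable, the existing entry cannot be modified; it can only be removed and replaced with a new one.
The code: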

def times(chars: List[Char]): List[(Char, Int)] = {
  def hashMap(chs: List[Char], list: List[(Char, Int)]): List[(Char, Int)] = {
    if (chs.isEmpty) list
    else hashMap(chs.tail, addElement(chs.head, list))
  }
  def addElement(ch: Char, list: List[(Char, Int)]): List[(Char, Int)] = {
    if (list.isEmpty) (ch, 1) :: list
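    // If the head matches, build a new tuple with the count incremented and put it in the head's place
    else if (list.head._1 == ch) (ch, list.head._2 + 1) :: list.tail
    // Otherwise keep the head and search the rest of the list recursively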
    else list.head :: addElement(ch, list.tail)
  }
  hashMap(chars, List[(Char, Int)]())
}
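
For comparison, here is a minimal sketch of the same counting step written with foldLeft over an immutable Map. This is a swapped-in alternative, not the solution above; the names TimesDemo and timesWithMap are hypothetical.

```scala
object TimesDemo {
  // Count characters in an immutable Map, then convert to the
  // list of pairs that the assignment's signature asks for.
  def timesWithMap(chars: List[Char]): List[(Char, Int)] =
    chars.foldLeft(Map.empty[Char, Int]) { (counts, ch) =>
      // updated returns a new Map; nothing is modified in place
      counts.updated(ch, counts.getOrElse(ch, 0) + 1)
    }.toList
}
```

The order of the resulting pairs should not be relied on. That is harmless here, because the next step sorts the pairs by weight anyway.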



Implement a function that turns freqs into a list of leaves ordered by ascending weight.
def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] = ...

This is just the insertion sort covered in the lectures:
def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] = {
  def insertSort(pair: (Char, Int), list: List[Leaf]): List[Leaf] = {
    if (list.isEmpty) List(Leaf(pair._1, pair._2))
    else if (pair._2 <= list.head.weight) Leaf(pair._1, pair._2) :: list
    else list.head :: insertSort(pair, list.tail)
  }
  def loopInsert(list: List[(Char, Int)], retList: List[Leaf]): List[Leaf] = {
    if (list.isEmpty) retList
    else loopInsert(list.tail, insertSort(list.head, retList))
  }
  loopInsert(freqs, List[Leaf]())
}
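
If library functions are allowed, the same result can be sketched with sortBy. This is an alternative under that assumption, not the hand-written insertion sort above; OrderedLeavesDemo and orderedLeaves are hypothetical names.

```scala
object OrderedLeavesDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  // Sort the (char, count) pairs by count, then wrap each pair in a Leaf.
  def orderedLeaves(freqs: List[(Char, Int)]): List[Leaf] =
    freqs.sortBy(_._2).map { case (ch, w) => Leaf(ch, w) }
}
```

sortBy is stable, so leaves with equal weights keep their input order, whereas the insertion sort above places a later equal-weight leaf first. Either order is a valid sorted list.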


Write a function singleton that checks whether trees contains only a single tree.
In other words, whether trees has exactly one element.
def singleton(trees: List[CodeTree]): Boolean = ...

def singleton(trees: List[CodeTree]): Boolean = {
  !trees.isEmpty && trees.tail.isEmpty
}



Write a function combine which (1) removes the two trees with the lowest weight from the list constructed in the previous step,
and (2) merges them by creating a new node of type Fork. Add this new tree to the list - which is now one element shorter - while preserving the order (by weight).

Implement the function combine: take the first two trees from trees, combine them into a new tree, add it to the list, and remove the two old trees.
The test case makes it clear:
val leaflist = List(Leaf('e', 1), Leaf('t', 2))
assert(combine(leaflist) === List(Fork(Leaf('e', 1), Leaf('t', 2), List('e', 't'), 3)))


I overlooked one phrase:
while preserving the order (by weight).
It means the list must still be sorted after combine. I did not do that at first and ended up debugging for more than an hour.

def combine(trees: List[CodeTree]): List[CodeTree] = ...

def combine(trees: List[CodeTree]): List[CodeTree] = {
  def insertSort(ins: CodeTree, list: List[CodeTree]): List[CodeTree] = {
    if (list.isEmpty) List(ins)
    else if (weight(ins) <= weight(list.head)) ins :: list
    else list.head :: insertSort(ins, list.tail)
  }

  if (trees.isEmpty || trees.tail.isEmpty) trees
  else {
    val l = trees.head
    val r = trees.tail.head
    // My earlier mistake: I returned the list below directly, without writing and calling insertSort
    // Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r)) :: trees.tail.tail
    val fork = Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r))
    insertSort(fork, trees.tail.tail)
  }
}
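
The effect of the missing ordering step is easy to see on three leaves. The sketch below is an illustration, not the submitted code: it reads the weight and chars stored in each node instead of recomputing them, and CombineDemo, combineUnsorted and combineSorted are hypothetical names.

```scala
object CombineDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  def weight(t: CodeTree): Int = t match {
    case Fork(_, _, _, w) => w
    case Leaf(_, w)       => w
  }
  def chars(t: CodeTree): List[Char] = t match {
    case Fork(_, _, cs, _) => cs
    case Leaf(c, _)        => List(c)
  }

  // Naive version: prepends the new fork, which can break the ordering.
  def combineUnsorted(trees: List[CodeTree]): List[CodeTree] = trees match {
    case l :: r :: rest => Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r)) :: rest
    case _              => trees
  }

  // Ordered version: puts the new fork back at its sorted position.
  def combineSorted(trees: List[CodeTree]): List[CodeTree] = trees match {
    case l :: r :: rest =>
      val fork = Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r))
      val (lighter, heavier) = rest.span(t => weight(t) < weight(fork))
      lighter ::: fork :: heavier
    case _ => trees
  }

  val leaves: List[CodeTree] = List(Leaf('a', 2), Leaf('b', 3), Leaf('c', 4))
}
```

With these leaves the naive version yields weights 5, 4, so the next combine would no longer merge the two lightest trees, while the ordered version yields 4, 5.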




Write a function until which calls the two functions defined above until this list contains only a single tree. This tree is the optimal coding tree. The function until can be used in the following way:
until(singleton, combine)(trees)
where the argument trees is of the type List[CodeTree].

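This one looks fancy: implement a function until. The assignment gives no explicit signature, only
def until(xxx => ???, yyy => ??? )(zzz :???): List[CodeTree] = ???

So the task is to write a function that is eventually called as
until(singleton, combine)(trees)
After some thought, the meaning is roughly: keep calling combine(trees) until trees is a singleton.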

def until(siglFunc: List[CodeTree] => Boolean,
          combFunc: List[CodeTree] => List[CodeTree])(trees: List[CodeTree]): List[CodeTree] = {
  if (siglFunc(trees)) trees
  else until(siglFunc, combFunc)(combFunc(trees))
}



Finally, implement createCodeTree, which builds the code tree:
def createCodeTree(chars: List[Char]): CodeTree = {
  until(singleton, combine)(makeOrderedLeafList(times(chars))).head
}
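
To see what the repeated combine calls do, the following sketch records every intermediate list for a small input. It is a compact re-statement for illustration, assuming the leaves are already sorted; BuildDemo, steps and trace are hypothetical names.

```scala
object BuildDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  def weight(t: CodeTree): Int = t match {
    case Fork(_, _, _, w) => w
    case Leaf(_, w)       => w
  }
  def chars(t: CodeTree): List[Char] = t match {
    case Fork(_, _, cs, _) => cs
    case Leaf(c, _)        => List(c)
  }

  // Merge the two lightest trees and reinsert the fork in sorted position.
  def combine(trees: List[CodeTree]): List[CodeTree] = trees match {
    case l :: r :: rest =>
      val fork = Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r))
      val (lighter, heavier) = rest.span(t => weight(t) < weight(fork))
      lighter ::: fork :: heavier
    case _ => trees
  }

  // Like until(singleton, combine), but keeps every intermediate list.
  def steps(trees: List[CodeTree]): List[List[CodeTree]] =
    if (trees.isEmpty || trees.tail.isEmpty) List(trees)
    else trees :: steps(combine(trees))

  // Sorted leaves for the text "aabbbcccc": a -> 2, b -> 3, c -> 4.
  val leaves: List[CodeTree] = List(Leaf('a', 2), Leaf('b', 3), Leaf('c', 4))

  // Weights of each intermediate list, first to last.
  val trace: List[List[Int]] = steps(leaves).map(_.map(weight))
}
```

For these leaves the weights go from 2, 3, 4 to 4, 5 and finally to a single tree of weight 9, whose left child is the leaf c and whose right child is the fork of a and b.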



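With the functions above in place, the function that builds the Huffman code tree is complete
(only the tree exists so far; encoding and decoding are not implemented yet).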

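Decoding:
type Bit = Int
Implement the function decode, which takes the encoded bit sequence bits and returns the decoded characters.
The general idea:
1. While the walk has not reached a leaf, look at the first bit: 0 goes left, 1 goes right.
2. At a leaf, output its character. If bits remain, continue from the root (recursing on the bits themselves rather than their tail, because the current head has not been consumed yet).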
def decode(tree: CodeTree, bits: List[Bit]): List[Char] = ...
def decode(tree: CodeTree, bits: List[Bit]): List[Char] = {
  def help(t: CodeTree, b: List[Bit], ret: List[Char]): List[Char] = {
    t match {
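      case Fork(left, right, chs, wght) =>
        // Stop if the bits run out inside the tree, so decode(tree, Nil) returns Nil instead of throwing
        if (b.isEmpty) ret
        else if (b.head == 0) help(left, b.tail, ret)
        else help(right, b.tail, ret)
      case Leaf(ch, weight) => {
        // Restart from the root with b, not b.tail: the head of b has not been consumed yet
        if (b.isEmpty) ret ::: List(ch) else help(tree, b, ret ::: List(ch))
      }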
    }
  }
  help(tree, bits, List())
}
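
The same walk can also be sketched with pattern matching on the bit list. This variant is an illustration, not the submitted code: it prepends to an accumulator and reverses once at the end, and it assumes the tree has at least two leaves (with a single-leaf root and non-empty bits it would not terminate). DecodeDemo, decodeSketch and the sample tree are hypothetical.

```scala
object DecodeDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree
  type Bit = Int

  // Assumption: tree is a Fork at the root (at least two leaves).
  def decodeSketch(tree: CodeTree, bits: List[Bit]): List[Char] = {
    @annotation.tailrec
    def walk(node: CodeTree, rest: List[Bit], acc: List[Char]): List[Char] = node match {
      case Leaf(ch, _) =>
        // A leaf consumes no bit: emit the character and restart at the root.
        if (rest.isEmpty) (ch :: acc).reverse
        else walk(tree, rest, ch :: acc)
      case Fork(left, right, _, _) => rest match {
        case 0 :: tail => walk(left, tail, acc)
        case _ :: tail => walk(right, tail, acc)
        case Nil       => acc.reverse // bits ran out inside the tree
      }
    }
    walk(tree, bits, Nil)
  }

  // Sample tree: c = 0, a = 10, b = 11.
  val tree: CodeTree =
    Fork(Leaf('c', 4), Fork(Leaf('a', 2), Leaf('b', 3), List('a', 'b'), 5), List('c', 'a', 'b'), 9)
}
```

Prepending and reversing once avoids copying the whole result list on every decoded character, which is what appending with ::: does.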



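Encoding:
This section deals with the Huffman encoding of a sequence of characters into a sequence of bits.

Define a function encode that, for a given code tree tree, takes a list of characters and returns the encoded bit sequence.

Your implementation must traverse the coding tree for each character, a task that should be done using a helper function.
You must traverse the whole code tree for each character (inefficient, but it is a working solution, and the assignment asks you to try it first).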
def encode(tree: CodeTree)(text: List[Char]): List[Bit] = ...

def encode(tree: CodeTree)(text: List[Char]): List[Bit] = {
  def encdChar(t: CodeTree, c: Char, ret: List[Bit]): List[Bit] = {
    t match {
      // not ret::encdChar(left, c, ret ::: List(0)) ::: encdChar(right, c, ret ::: List(1))
      case Fork(left, right, chs, wght) => encdChar(left, c, ret ::: List(0)) ::: encdChar(right, c, ret ::: List(1))
      case Leaf(ch, weight)             => if (c == ch) ret else List()
    }
  }
  def encd(t: CodeTree, x: List[Char], ret: List[Bit]): List[Bit] = {
    if (x.isEmpty) ret
    else encdChar(t, x.head, ret) ::: encd(t, x.tail, ret)
  }
  encd(tree, text, List[Bit]())
}



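An ugly function: for every character it walks both the left and the right side of the tree to find the matching code
(and it is easy to get wrong).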
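
One way to avoid searching both sides is to use the chars list to decide which branch contains the character. The sketch below is an alternative under the assumption that every character of the text occurs in the tree; EncodeDemo, encodeSketch and the sample tree are hypothetical.

```scala
object EncodeDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree
  type Bit = Int

  def chars(t: CodeTree): List[Char] = t match {
    case Fork(_, _, cs, _) => cs
    case Leaf(c, _)        => List(c)
  }

  // Assumption: every character in text occurs somewhere in tree.
  def encodeSketch(tree: CodeTree)(text: List[Char]): List[Bit] = {
    def encodeChar(node: CodeTree, c: Char): List[Bit] = node match {
      case Leaf(_, _) => Nil
      case Fork(left, right, _, _) =>
        // Follow only the branch whose character list contains c.
        if (chars(left).contains(c)) 0 :: encodeChar(left, c)
        else 1 :: encodeChar(right, c)
    }
    text.flatMap(c => encodeChar(tree, c))
  }

  // Sample tree: c = 0, a = 10, b = 11.
  val tree: CodeTree =
    Fork(Leaf('c', 4), Fork(Leaf('a', 2), Leaf('b', 3), List('a', 'b'), 5), List('c', 'a', 'b'), 9)
}
```

Each step still scans a chars list, so this is not faster in every case, but only one path from root to leaf is followed per character.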


The assignment then adds a twist: a better encode function can be written, provisionally named quickEncode
def quickEncode(tree: CodeTree)(text: List[Char]): List[Bit] = ...

To implement it, we first define a type:
type CodeTable = List[(Char, List[Bit])]
It works like a HashMap: given a character, it yields that character's bit sequence.


The encoding step uses a function codeBits which, given a table and a character, returns that character's bit sequence
def codeBits(table: CodeTable)(char: Char): List[Bit] = ...

The CodeTable is created by the function convert, which traverses the whole code tree and produces such a table
def convert(t: CodeTree): CodeTable = ...

convert, in turn, is implemented with the function mergeCodeTables
def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable = ...

The task is then to implement the functions mentioned above.

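def quickEncode(tree: CodeTree)(text: List[Char]): List[Bit] = {
  // Build the table once; calling convert(tree) for every character would rebuild it each time
  val table = convert(tree)
  def loop(cs: List[Char]): List[Bit] =
    if (cs.isEmpty) List()
    else codeBits(table)(cs.head) ::: loop(cs.tail)
  loop(text)
}


This one is simple: codeBits looks up the code of the first character, and the recursion handles the rest of the text.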

def codeBits(table: CodeTable)(char: Char): List[Bit] = {
  if (table.head._1 == char) table.head._2
  else codeBits(table.tail)(char)
}



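Looking up a character in the table is also simple: return the bits if the head entry matches, otherwise recurse on the tail.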

def convert(tree: CodeTree): CodeTable = {
  //      def encdChar(t: CodeTree, bits: List[Bit], ret: CodeTable): CodeTable = {
  //        t match {
  //          case Fork(left, right, chs, wght) => encdChar(left, bits ::: List(0), ret) ::: encdChar(right, bits ::: List(1), ret)
  //          case Leaf(ch, weight)             => (ch, bits) :: ret
  //        }
  //      }
  //    encdChar(tree, List[Bit](), List[(Char, List[Bit])]())
  tree match {
    case Fork(left, right, chs, wght) => mergeCodeTables(convert(left), convert(right))
    case Leaf(ch, weight)             => List((ch, List[Bit]()))
  }
}


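Now comes the interesting function!
At first I did not use mergeCodeTables to implement convert; I traversed the tree as before (the commented-out code). Only after finishing did I see that the assignment says to use mergeCodeTables.
On the forum, other students had asked similar questions,
but the TA's answer was to think carefully about what mergeCodeTables does and what it means for leaf nodes.

So I started thinking about how to implement mergeCodeTables, because without knowing that I could not write convert in terms of it.
After about 15 minutes I noticed a pattern.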

For any two nodes being merged, the following holds:
    X
   / \
  X0  X1

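That is, if two nodes M and N are merged into K, then M's code is K's code followed by 0, and N's code is K's code followed by 1.

Now think about it the other way around. Suppose M and N are leaves. Each of them currently has a code table with a single entry whose bit list is empty. When they are merged,
mergeCodeTables(M,N)
must return a CodeTable containing ((M,0),(N,1)).
Going one step further, if a node S is now merged with this new node (the merge of M and N), mergeCodeTables(S,((M,0),(N,1)))
then the merged table is ((S,0),(M,10),(N,11)).

The rule: on every merge, prepend 0 to the bits of every entry from the left table and prepend 1 to the bits of every entry from the right table.

So I wrote: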
def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable = {
  // for each item in a, insert 0 in front of the item
  // for each item in b, insert 1 in front of the item
  def help(t: CodeTable, bit: Bit): CodeTable = {
    if (t.isEmpty) t
    else (t.head._1, bit :: t.head._2) :: help(t.tail, bit)
  }
  help(a, 0) ::: help(b, 1)
}
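
The M, N, S example can be checked directly. The sketch below restates mergeCodeTables with map, an equivalent formulation of the helper above rather than the submitted code, and builds the two trees by hand; ConvertDemo, mn and smn are hypothetical names and the weights are made up for illustration.

```scala
object ConvertDemo {
  abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree
  type Bit = Int
  type CodeTable = List[(Char, List[Bit])]

  // Prepend 0 to every entry of the left table and 1 to every entry of the right table.
  def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable =
    a.map { case (ch, bits) => (ch, 0 :: bits) } :::
      b.map { case (ch, bits) => (ch, 1 :: bits) }

  def convert(tree: CodeTree): CodeTable = tree match {
    case Fork(left, right, _, _) => mergeCodeTables(convert(left), convert(right))
    case Leaf(ch, _)             => List((ch, Nil)) // one entry with an empty bit list
  }

  // M and N merged, then S merged with that fork (weights are illustrative only).
  val mn: CodeTree  = Fork(Leaf('M', 1), Leaf('N', 1), List('M', 'N'), 2)
  val smn: CodeTree = Fork(Leaf('S', 2), mn, List('S', 'M', 'N'), 4)
}
```

Here convert(mn) gives M = 0 and N = 1, and convert(smn) gives S = 0, M = 10 and N = 11, matching the rule above. Prepending works because convert returns from the leaves upward, so the bit of the topmost fork is added last and ends up first.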



(I cannot help admiring the thinking of whoever designed this exercise. Brilliant!)