History of pruning algorithm development and Python implementation (continuously updated)
Tree | Inventor | Article or book | Year |
---|---|---|---|
ID3 | Ross Quinlan | 《Discovering rules by induction from large collections of examples》 | 1979 |
ID3 | Ross Quinlan | Alternative source: 《Learning efficient classification procedures and their application to chess end games》 | 1983a |
CART | Leo Breiman | 《Classification and Regression Trees》 | 1984 |
C4.5 | Ross Quinlan | 《C4.5: Programs for Machine Learning》 | 1993 |
C5.0 | Ross Quinlan | Commercial edition of C4.5; no related papers | - |
Post-pruning algorithm | Article or book | Year | Inventor | Tree pruned | Remark |
---|---|---|---|---|---|
Pessimistic Pruning | 《Simplifying Decision Trees》Part 2.3 | 1986b (some sources say 1987b; the date printed on the paper is used here) | Quinlan | C4.5 | Ross Quinlan invented “Pessimistic Pruning”; John Mingers renamed it “Pessimistic Error Pruning” in his article 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》 |
Reduced Error Pruning | 《Simplifying Decision Trees》Part 2.2 | 1986b | Quinlan | C4.5 | Requires an additional validation set for pruning |
Cost-Complexity Pruning | 《Classification and Regression Trees》Section 3.3 | 1984 | L Breiman | CART | Prunes classification trees |
Error-Complexity Pruning | 《Classification and Regression Trees》Section 8.5.1 | 1984 | L Breiman | CART | Prunes regression trees; ECP was developed on the basis of CCP |
Critical Value Pruning | 《Expert System-Rule Induction with Statistical Data》; another account cites 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》, but that article's author says he is citing his 1987 paper | 1987a | John Mingers | Not stated explicitly in the paper | Accounts of the origin of CVP differ; the source given here follows p. 212 of 《An Empirical Comparison of Pruning Methods for Ensemble Classifiers》 |
Minimum-Error Pruning | 《Learning decision rules in noisy domains》 | 1986 | Niblett and Bratko | - | The paper cannot be downloaded from the Internet |
Error-Based Pruning | 《C4.5: Programs for Machine Learning》Section 4.2 | 1993 | Quinlan | C4.5 | EBP is an evolution of PEP |
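
For the Cost-Complexity Pruning row above, the quantity CART uses to pick the next subtree to prune can be sketched as follows. This is only an illustration, not code from this repository: the dict-based tree layout, the `errors` field, and the function names are assumptions made for the example. It computes the weakest-link value g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1), where R is the training misclassification rate.

```python
def leaves_and_errors(node):
    """Return (leaf count, summed leaf errors) for the subtree rooted at node."""
    children = node.get("children", [])
    if not children:
        return 1, node["errors"]
    n_leaves, errors = 0, 0
    for child in children:
        child_leaves, child_errors = leaves_and_errors(child)
        n_leaves += child_leaves
        errors += child_errors
    return n_leaves, errors


def weakest_link_alpha(node, n_total):
    """g(t) for an internal node t; n_total is the size of the training set.

    node["errors"] is the number of training samples the node would
    misclassify if it were collapsed into a leaf (hypothetical field).
    """
    n_leaves, subtree_errors = leaves_and_errors(node)
    return (node["errors"] - subtree_errors) / (n_total * (n_leaves - 1))


# Hypothetical tree: the root splits into a leaf and a two-leaf subtree.
inner = {"errors": 4, "children": [{"errors": 1}, {"errors": 1}]}
root = {"errors": 10, "children": [{"errors": 2}, inner]}

alpha_root = weakest_link_alpha(root, n_total=100)
alpha_inner = weakest_link_alpha(inner, n_total=100)
```

In this sketch the node with the smallest value (here `inner`) would be collapsed first; choosing among the resulting sequence of trees is a separate step that is not shown.
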
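Goals of pruning a classification tree:
1. Simplify the decision tree (to make knowledge extraction easier) while sacrificing only an acceptable amount of predictive accuracy;
2. Improve accuracy on the validation set (REP)
Goal of pruning a regression tree: reduce or eliminate overfitting
Summary of pruning code: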
------------REP-finished--------
REP principle and a worked example:
https://blog.csdn.net/appleyuchi/article/details/83041047
REP pruning code implementation:
https://github.com/appleyuchi/Decision_Tree_Prune/tree/master/ID3-REP-post_prune-Python-draw
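
As a rough illustration of the idea behind REP (it is not the code in the repository linked above), the sketch below prunes a tree bottom-up using a separate validation set. The nested-dict tree layout and the names `predict` and `rep_prune` are assumptions made for this example.

```python
def predict(node, x):
    """Walk the tree; fall back to the node's majority label on unseen values."""
    while "label" not in node:
        child = node["children"].get(x[node["feature"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def rep_prune(node, val_rows):
    """Bottom-up REP. val_rows holds the (features, label) pairs reaching node."""
    if "label" in node:
        return node
    # Prune the children first, each with the validation rows routed to it.
    for value, child in list(node["children"].items()):
        subset = [(x, y) for x, y in val_rows if x[node["feature"]] == value]
        node["children"][value] = rep_prune(child, subset)
    subtree_errors = sum(predict(node, x) != y for x, y in val_rows)
    leaf_errors = sum(node["majority"] != y for _, y in val_rows)
    # Replace the subtree with a leaf when that does not increase validation
    # error (a node reached by no validation rows is therefore pruned too).
    if leaf_errors <= subtree_errors:
        return {"label": node["majority"]}
    return node


# Hypothetical tree and validation set.
tree = {
    "feature": 0, "majority": "yes",
    "children": {
        "a": {"label": "yes"},
        "b": {"feature": 1, "majority": "no",
              "children": {"x": {"label": "no"}, "y": {"label": "yes"}}},
    },
}
validation = [(("a", "x"), "yes"), (("b", "x"), "no"),
              (("b", "y"), "no"), (("b", "y"), "no")]
pruned = rep_prune(tree, validation)
```

Here the subtree under value `"b"` makes two validation errors while a single `"no"` leaf makes none, so it is collapsed; the root is kept because collapsing it would add errors.
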
----------------PEP-finished-----------------
PEP pruning algorithm: development history, principle, and examples:
https://blog.csdn.net/appleyuchi/article/details/83795521
https://blog.csdn.net/appleyuchi/article/details/83902998
PEP Python implementation:
https://blog.csdn.net/appleyuchi/article/details/83961060
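
The pruning test usually attributed to PEP can be sketched as below. This is a minimal illustration under the common textbook formulation (a 0.5 continuity correction per leaf plus one standard error), not the implementation from the post linked above; the function name and arguments are hypothetical.

```python
import math


def pep_should_prune(node_errors, leaf_errors, n_samples):
    """Decide whether a subtree should be replaced by a leaf.

    node_errors: training errors if the node itself became a leaf
    leaf_errors: training errors at each leaf of the current subtree
    n_samples:   number of training samples reaching the node
    """
    corrected_leaf = node_errors + 0.5
    corrected_subtree = sum(leaf_errors) + 0.5 * len(leaf_errors)
    std_err = math.sqrt(
        corrected_subtree * (n_samples - corrected_subtree) / n_samples
    )
    # Prune when the single leaf is within one standard error of the subtree.
    return corrected_leaf <= corrected_subtree + std_err


prune_small_gain = pep_should_prune(node_errors=4, leaf_errors=[1, 1, 1], n_samples=20)
prune_large_gain = pep_should_prune(node_errors=10, leaf_errors=[0, 1], n_samples=40)
```

Only training-set counts are used, which matches the point that PEP needs no separate validation data.
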
-------------EBP-finished---------------------
EBP pruning: complete algorithm principle, C implementation, and a worked example:
https://blog.csdn.net/appleyuchi/article/details/83863469
Python implementation of the EBP pruning algorithm (actually a Python interface built on Quinlan's EBP pruning):
https://github.com/appleyuchi/Decision_Tree_Prune/tree/master/Quinlan-C4.5-Release8_and_python_interface_for_EBP/
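Some readers may ask why the Python interface to J48 in Weka is not used directly.
Note that Weka takes Quinlan's C code as its reference, yet on some datasets, such as the hypo dataset, Weka performs very poorly.
The purpose of a decision tree is to aid classification and generate knowledge,
so a very large decision tree is clearly impractical to use.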
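
For orientation, the pessimistic estimate that EBP is built on can be sketched as an upper confidence limit on a binomial error rate. The code below is an assumption-laden illustration, not Quinlan's C code or the Python interface linked above: it solves for the exact binomial limit by bisection, whereas the released C4.5 code may use approximations, so values can differ slightly. The function names are hypothetical; 25% is the confidence level C4.5 uses by default.

```python
from math import comb


def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p), with integer e."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(e + 1))


def upper_error_limit(e, n, cf=0.25):
    """Error rate p at which seeing at most e errors in n cases has probability cf."""
    if e >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def predicted_errors(e, n, cf=0.25):
    """Pessimistic error count for a leaf covering n cases with e errors."""
    return n * upper_error_limit(e, n, cf)


limit_0_of_6 = upper_error_limit(0, 6)
limit_0_of_1 = upper_error_limit(0, 1)
```

A subtree would then be compared with a candidate leaf by summing `predicted_errors` over its leaves; that comparison step is left out of the sketch.
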
----------------------------------
Do we need test sets when pruning?
Note: the “test sets” here are actually validation datasets.
Pruning algorithm | Needs an extra test dataset? | Pruning direction |
---|---|---|
REP (Reduced Error Pruning) | yes | bottom-up |
CCP (Cost-Complexity Pruning) | | |
ECP (Error-Complexity Pruning) | | |
CVP (Critical Value Pruning) | | |
MEP (Minimum Error Pruning) | no | bottom-up |
PEP (Pessimistic Error Pruning) | no | top-down |
EBP (Error-Based Pruning) | no | bottom-up |
Markdown table generation tool:
https://tool.lu/tables