Scikit-Learn Cookbook Book Review
The scikit-learn library is the premier library for machine learning in Python.
The online documentation is quite good, but it can sometimes feel fragmented or limited by narrow examples.
In this post you will discover the book Scikit-Learn Cookbook by Trent Hauck, a desktop reference that can supplement the online documentation.
Let’s get into it.
Book Overview
The subtitle for the book is:
Over 50 recipes to incorporate scikit-learn into every step of the data science pipeline, from feature extraction to model building and model evaluation.
It was published at the end of 2014 and it is just under 200 pages long. I like the form factor. Thick reference texts really put me off these days (think Numerical Recipes which sits proudly on my shelf). I would prefer to have 10 smaller focused reference texts, like a mini encyclopedia series.
I like that it is a small, sharp, focused text on scikit-learn recipes.
Book Audience
The book is not for the machine learning beginner. Take note.
It assumes:
- Familiarity with Python.
- Familiarity with the SciPy stack.
- Familiarity with machine learning.
These are reasonable assumptions for someone already using scikit-learn on projects, in which case the book becomes a desktop reference to consult for specific ad hoc machine learning tasks.
Book Contents
The book contains over 50 recipes (57, if I trust the table of contents and my own counting) separated into 5 chapters.
- Chapter 1: Premodel Workflow
- Chapter 2: Working with Linear Models
- Chapter 3: Building Models with Distance Metrics
- Chapter 4: Classifying Data with scikit-learn
- Chapter 5: Postmodel Workflow
The chapters generally map onto the workflow of a standard data science project:
- Acquire and prepare data.
- Try some linear models.
- Try some nonlinear models.
- Try some more nonlinear models.
- Finalize the model.
It is an okay structure for a book. The problem is that scikit-learn alone does not serve all of these steps well. It excels at the modeling part and does a fair job of data pre-processing, but it is poor at the data loading and data analysis steps, which the book generally ignores.
Next we will step through each chapter in turn.
Chapter Walkthrough
In this section we take a closer look at the recipes in each of the five chapters.
Chapter 1: Premodel Workflow
This chapter focuses on data preparation. That is, re-formatting the data to best expose the structure of the problem to the machine learning algorithms we may choose to use later on.
There are 17 recipes in this chapter and I would group them as follows:
- Data Loading: Loading your own data and using the built-in datasets.
- Data Cleaning: Tasks like imputing missing values.
- Data Pre-Processing: Scaling and feature engineering.
- Dimensionality Reduction: SVD, PCA and factor analysis.
- Other: Pipelines, Gaussian Processes and gradient descent.
I’m sad that I had to devise my own structure here. I’m also sad that there is an “other” category. It is indicative that the organization of the recipes in chapters could be cleaner.
I would like more and separate recipes on scaling methods. I find myself doing a lot of scaling on datasets before I can use them. It’s perhaps the most common pre-processing step required to get good results.
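To make the point concrete, here is a minimal sketch of the two scaling methods I reach for most often, standardization and min-max normalization. It is not taken from the book; the toy array is made up for illustration, and the code assumes a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardize: each column is shifted to zero mean and scaled to unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalize: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

In a real project the scaler should be fit on the training data only and then applied to the test data, which is one reason to wrap it in a Pipeline.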
Chapter 2: Working with Linear Models
The focus of this chapter is linear models. This shorter chapter contains 9 recipes.
Generally, the recipes in this chapter cover:
- Linear Regression
- Regularized Regression
- Logistic Regression
- More exotic variations on regression like boosting.
Again, this is a strange grouping of recipes.
I feel that the focus on linear models could have extended beyond regression to LDA, the Perceptron and other linear models supported by the library.
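As a rough sketch of what I mean (my own example, not a recipe from the book), the linear classifiers in scikit-learn share the same interface, so they can be compared side by side in a few lines. The choice of the iris dataset and 5-fold cross validation are my assumptions, and the import paths assume a recent scikit-learn release rather than the version the book was written against.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Three linear classifiers that share the same fit/predict interface.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "Perceptron": Perceptron(random_state=0),
}

# Estimate the accuracy of each model with 5-fold cross validation.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```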
Chapter 3: Building Models with Distance Metrics
Many algorithms do use a distance measure at their core.
The first that may come to mind is KNN, but in fact you could interpret this more broadly and pull in support vector machines and related techniques that use kernels.
This chapter is nominally about techniques that use distance measures, but it really focuses on K-Means almost exclusively (8 of the 9 recipes in this chapter). There is one KNN recipe at the end of the chapter.
The chapter should have been called clustering or K-Means.
Also, it is good to note my bias here: I don’t use clustering methods at all, as I find them utterly useless for predictive modeling.
Chapter 4: Classifying Data with scikit-learn
From the title, this chapter is about classification algorithms.
I would organize the 11 recipes in this chapter as follows:
- Decision Trees (CART and Random Forest)
- Support Vector Machines
- Discriminant Analysis (LDA and QDA)
- Naive Bayes
- Other (semi-supervised learning, gradient descent, etc.)
I would put LDA and QDA in the linear models chapter (Chapter 2) and I would have added a ton more algorithms. A big benefit of scikit-learn is that it offers so many algorithms out of the box.
The algorithms that are covered in this chapter are fine. What I am saying is that I would double or triple the number and make recipes for algorithms the focus of the book.
Chapter 5: Postmodel Workflow
This chapter contains 11 recipes on general post modeling tasks.
This is technically not accurate, as you would perform these tasks as a part of modeling. Nevertheless, I see what the author was going for.
I would summarize the recipes in this chapter as follows:
- Resampling methods (Cross validation and variations).
- Algorithm Tuning (Grid search, random search, manual search, etc.).
- Feature Selection.
- Other (model persistence, model evaluation and baselines).
A good chapter covering important topics. Very important topics.
Generally, I would introduce each algorithm in the context of k-fold cross validation, because for most use cases it gives a more reliable estimate of performance than a single train/test split.
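A minimal sketch of that idea follows. It is my own example rather than a recipe from the book; the dataset, the decision tree, the 10 folds and the random seeds are all arbitrary choices for illustration, and the imports assume a recent scikit-learn release.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into 10 shuffled folds; each fold is held out once for testing.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = DecisionTreeClassifier(random_state=7)

# One accuracy score per fold; report the mean and the spread.
results = cross_val_score(model, X, y, cv=kfold)
print(f"Accuracy: {results.mean():.3f} (+/- {results.std():.3f})")
```

Swapping in a different algorithm only requires changing the line that creates the model, which is why this makes a good template for introducing each algorithm.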
I’m also surprised to see feature selection so late in the book. I would have expected this to have appeared in Chapter 1. It belongs up front with data preparation.
Thoughts On The Book
The book is just fine. I would recommend it for someone looking for a good desktop reference to support the online docs for scikit-learn.
I generally like the way each recipe is presented. In fact it is good to the point of verbosity, whereas in other books the recipes can be too brief. The structure is as follows:
- Recipe name and description.
- Getting ready (e.g. the preconditions or requirements).
- How to do it (actual code and steps required to achieve a result).
- How it works (additional explanation of the API or processes).
- There’s more (optional additional variations on the recipe that are useful).
Given the above soft recommendation, I did note some things while reading.
I was frustrated with the content of many recipes. So much so that I would never use them or make them canon in my own library of scikit-learn recipes that I use from project to project.
I have used scikit-learn a fair bit and I took the time to read and try most of the API. Many recipes in the book are hand-crafted functions that actually already exist in the scikit-learn API. Maybe the API has been updated since publication, or not, but this did bother me. Less code is less maintenance and if you are using a library like scikit-learn then you should use all of it, and well.
Also, there are a few equations sprinkled through the explanations. They are mainly there to provide a shortcut description of a technique and avoid the exposition. It’s fine, but they may as well have been left out in favor of a pointer to a good reference text, keeping a laser focus on the scikit-learn API.
Some recipes are too long. I like recipes that are tight, focused and self-contained. Something I can copy and paste and use to jumpstart a process in my own project.
You cannot cover the whole scikit-learn API, and the coverage in this book was pretty good. It covered the key parts of the library. I would like to see it cover in greater detail some aspects that differentiate the library, such as Pipelines, learning curves and model calibration.
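For example, a recipe along the following lines would show a Pipeline and a plot-ready table of score against training set size in one go. This is only a sketch of my own, not something from the book; the dataset, the steps in the pipeline and the training sizes are assumptions chosen for illustration, and it assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain scaling and the model so the scaler is re-fit inside every fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Score the pipeline on growing fractions of the training data.
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    shuffle=True, random_state=0,
)

# Mean cross-validated accuracy at each training set size.
for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{size} training rows: {score:.3f}")
```

The arrays returned by learning_curve can be passed straight to a plotting library to draw the curve.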
Summary
In this post you discovered the book Scikit-Learn Cookbook by Trent Hauck.
You learned that it is a book of 50+ recipes for using scikit-learn covering topics such as:
- Data preparation.
- Linear and nonlinear algorithms.
- Model evaluation and algorithm tuning.
It is a reasonable cookbook that can be used as a desktop reference to supplement the online documentation for the scikit-learn library.
Do you have any questions about the book? Have you read the book? Leave a comment and let me know what you thought of it.