Data Science From Scratch: Book Review

Programmers learn by implementing techniques from scratch.

It is perhaps a slower way to learn than others, but a fuller one: every micro-decision involved becomes familiar, and you own the implementation from end to end.

I recently finished reading the paperback version of Data Science From Scratch and I think it might be one of my favorite beginner machine learning books for the year. Grab a copy!

Overview of the Book

Let’s take a bird’s-eye look at this book.

The Author: Joel Grus

The author of this book is Joel Grus, a software engineer at Google.

In previous roles he’s been a Data Scientist and Analyst at startups and engineer at Google. He got his PhD from Caltech. A very fine background.

Learn more about Joel on his LinkedIn profile and blog and Twitter.

The Target Audience: Beginners

The target audience for the book is intermediate programmers interested in getting started in data science and machine learning.

Python is not a prerequisite to read this book (there is a Python crash course in Chapter 2), but it would speed things up if you were already a Python programmer.

The book does not assume any mathematical background in machine learning (there is a crash course in Chapters 4-7), but again, some background in stats, probability and algebra would speed things along.

The Book Approach: Write Code (in Python)

This is an introductory book to data science and machine learning.

The majority of the text focuses on the implementation of machine learning algorithms. There is a brief introduction to Python and coverage of some basic math, data visualization and data gathering subjects.

It will take you from being a beginner programmer to being able to implement machine learning algorithms to address various data science problems.

Code From Scratch

The approach taken in the book is to describe the concepts and then to implement them in Python from scratch. This means without the use of machine learning and data handling libraries (e.g. scikit-learn).
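To give a flavor of what "from scratch" means in practice, here is a minimal sketch of my own (it is not code from the book) that computes a mean, a standard deviation and a correlation using nothing beyond the standard library. The function names and the sample data are illustrative assumptions.

```python
import math

def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

def de_mean(xs):
    """Shift the values so that their mean is zero."""
    m = mean(xs)
    return [x - m for x in xs]

def standard_deviation(xs):
    """Sample standard deviation (divides by n - 1)."""
    deviations = de_mean(xs)
    return math.sqrt(sum(d * d for d in deviations) / (len(xs) - 1))

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    dx, dy = de_mean(xs), de_mean(ys)
    covariance = sum(a * b for a, b in zip(dx, dy)) / (len(xs) - 1)
    return covariance / (standard_deviation(xs) * standard_deviation(ys))

# Made-up data: scores rise in lockstep with hours, so the correlation is close to 1.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
print(mean(hours))
print(correlation(hours, scores))
```

Nothing here is fast or robust (there is no handling of empty lists or zero variance), but every step of the calculation is visible, which is the point of the exercise.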

The stated goal by the author of implementing algorithms from scratch is:

…building tools and implementing algorithms by hand in order to better understand them.

Good code examples must be readable first and efficient and effective second. They are written for understandability as a teaching aid, not as production-level code. Take note that the code you write from scratch will be instructional only, not something to put into production.

I put a lot of thought into creating implementations and examples that are clear, well commented, and readable. In most cases, the tools we build will be illuminating but impractical.

Book Contents

The book is 311 pages long and contains 25 chapters. It’s a classic O’Reilly book and is the perfect form factor to have open in front of you while you bash away at the keyboard implementing the code examples.

In this section we take a look at the table of contents:

a data scientist is someone who extracts insights from messy data

  • Chapter 1: Introduction (What is data science?)
  • Chapter 2: A Crash Course in Python (syntax, data structures, control flow, and other features)
  • Chapter 3: Visualizing Data (bar, line and scatter plots with matplotlib)
  • Chapter 4: Linear Algebra (vectors and matrices)
  • Chapter 5: Statistics (central tendency and correlations)
  • Chapter 6: Probability (Bayes’ Theorem, Random Variables, Normality)
  • Chapter 7: Hypothesis and Inference (confidence intervals, p-values, Bayesian inference)
  • Chapter 8: Gradient Descent (gradients, steps, stochastic variation)
  • Chapter 9: Getting Data (scraping HTML, JSON APIs)
  • Chapter 10: Working with Data (basic viz, data transforms)

machine learning… [refers] to creating and using models that are learned from data […] this might be called predictive modeling or data mining

  • Chapter 11: Machine Learning (fitting, bias-variance, feature selection)
  • Chapter 12: k-Nearest Neighbors (also curse of dimensionality)
  • Chapter 13: Naive Bayes
  • Chapter 14: Simple Linear Regression (also gradient descent)
  • Chapter 15: Multiple Regression (also bootstrap, regularization)
  • Chapter 16: Logistic Regression (also SVM)
  • Chapter 17: Decision Trees (also random forest)
  • Chapter 18: Neural Networks (perceptron and back-prop)
  • Chapter 19: Clustering (k-Means)
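As an illustration of how compact these algorithms can be when written by hand, below is a rough sketch of a k-nearest neighbors classifier in plain Python. It is my own simplified version, not the book's code; the names `distance` and `knn_classify` and the toy training data are assumptions made for the example.

```python
import math
from collections import Counter

def distance(p, q):
    """Euclidean distance between two points given as sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(k, labeled_points, new_point):
    """Predict a label by majority vote among the k nearest labeled points.

    labeled_points is a list of (point, label) pairs.
    """
    # Order the training points from nearest to farthest.
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    # On a tie, most_common returns the label that was seen first,
    # which here is the label of the nearer neighbor.
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters labeled "a" and "b".
training = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]
print(knn_classify(3, training, (1.1, 1.0)))
```

Sorting every training point for each prediction is wasteful on large datasets, which is exactly the kind of trade-off a from-scratch implementation makes in exchange for clarity.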

Natural language processing (NLP) refers to computational techniques involving language.

  • Chapter 20: Natural Language Processing (n-gram, grammars, Gibbs sampling)
  • Chapter 21: Network Analysis (Centrality and PageRank)
  • Chapter 22: Recommender Systems (user- and item-based)
  • Chapter 23: Databases and SQL (basic usage)
  • Chapter 24: MapReduce (various worked examples)
  • Chapter 25: Go Forth and Do Data Science (libs you should use)

Implementing things “from scratch” is great for understanding how they work. But it’s generally not great for performance …, ease of use, rapid prototyping, or error handling. In practice, you’ll want to use well-designed libraries that solidly implement the fundamentals.

Opinions of the Book

I generally liked the table of contents, except I would make some changes.

I would drop some of the later chapters like NLP, Network Analysis, and so on (Chapters 20-24) and rename the book “Machine Learning Algorithms from Scratch”. It would be a less sexy but more honest and accurate title.

Data Science is about formulating the questions then gathering the data and building the models to answer them. We don’t really need a data science from scratch book unless it was a bunch of business case studies plus the modeling. From scratch in data science really means the algorithms part.

I’m not upset. In fact, I had a great time reading this book, but I could imagine someone who expected systematic processes for formulating and working through business-data problems, in addition to the modeling, feeling a little bit misled.

I did not implement all of the algorithms from scratch. I read the whole book, studied all of the examples, but I only implemented a few for fun.

I found the code easy to read, commented just enough. I think going vanilla Python (over NumPy) was a good move. It lowered the bar just enough so that all you need is some basic Python syntax and away you go.
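To show what the vanilla Python choice looks like, here is a small sketch in which vectors are just lists of numbers. This is my own illustration rather than code copied from the book, and the helper names are assumptions.

```python
def vector_add(v, w):
    """Add two vectors element by element."""
    return [a + b for a, b in zip(v, w)]

def scalar_multiply(c, v):
    """Multiply every element of a vector by the number c."""
    return [c * a for a in v]

def dot(v, w):
    """Sum of the element-wise products of two vectors."""
    return sum(a * b for a, b in zip(v, w))

def vector_mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    total = vectors[0]
    for v in vectors[1:]:
        total = vector_add(total, v)
    return scalar_multiply(1 / len(vectors), total)

print(vector_add([1, 2], [3, 4]))
print(dot([1, 2, 3], [4, 5, 6]))
print(vector_mean([[1, 2], [3, 4]]))
```

With NumPy arrays the same operations would be written as `v + w` and `np.dot(v, w)`, and they would run much faster, but the list version needs nothing more than `zip` and a list comprehension to follow.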

Resources

I’ve gathered up some additional resources related to the book if you’re interested in diving deeper.

Final Thoughts

I like the book. I had fun, I think primarily because I have always liked working through programming books and because I’ve written a book just like this myself (i.e. Clever Algorithms).

If you’ve been around the block and you’re hard core into scikit-learn or R right now and not interested in the distraction, this book is probably not for you. But remember, the learning never ends and it can be fun to go over the beginner stuff again and tighten up the screws.

If you know some Python (or you’re a solid dev and want to get into Python) and you want to get intimate with machine learning algorithms by implementing them, then this book is for you.

Did you read Data Science From Scratch? What did you think? Leave a comment.
