Reproducible Machine Learning Results By Default

It is good practice to have reproducible outcomes in software projects. It might even be standard practice by now; I hope it is.

You can take any developer off the street and they should be able to follow your process to check out the code base from revision control and make a build of the software ready to use. Even better if you have a procedure for setting up an environment and for releasing the software to users/operational environments.

It is the tools and the process that make the outcome reproducible. In this post you will learn that it is just as important to make the outcomes of your machine learning projects reproducible, and that practitioners and academics in the field of machine learning struggle with this.

As a programmer and a developer you already have the tools and the process to leap ahead, if you have the discipline.

Reproducible Computational Research
Photo credit ZEISS Microscopy, some rights reserved

Reproducibility of Results in Computational Sciences

Reproducibility of experiments is one of the main principles of the scientific method. You write up what you did, but other scientists don’t have to take your word for it: they follow the same process and expect to get the same result.

Work in the computational sciences involves code running on computers that reads and writes data. Experiments that report results without clearly specifying all of these elements are very likely not easily reproducible. If the experiment cannot be reproduced, then what value is the work?

This is an open problem in computational sciences and is becoming ever more concerning as more fields rely on computational results of experiments. In this section we will review this open problem by looking at a few papers that consider the issue.

Ten Simple Rules for Reproducible Computational Research

This was an article in PLoS Computational Biology in 2013 by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor and Eivind Hovig. In the paper, the authors list 10 simple rules that, if followed, are expected to result in more accessible (reproducible!?) computational research. The rules are summarized below.

  • Rule 1: For Every Result, Keep Track of How It Was Produced
  • Rule 2: Avoid Manual Data Manipulation Steps
  • Rule 3: Archive the Exact Versions of All External Programs Used
  • Rule 4: Version Control All Custom Scripts
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
  • Rule 7: Always Store Raw Data behind Plots
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
  • Rule 9: Connect Textual Statements to Underlying Results
  • Rule 10: Provide Public Access to Scripts, Runs, and Results

The authors are commenting from the field of computational biology. Nevertheless, I would argue the rules do not go far enough. I find them descriptive and I would be a lot more prescriptive.

For example, with rule 2 “Avoid Manual Data Manipulation Steps”, I would argue that all data manipulation be automated. For rule 4 “Version Control All Custom Scripts”, I would argue that the entire automated process to create work product be in revision control.

If you are a developer familiar with professional process, your mind should be buzzing with how dependency management, build systems, markup systems for documents that can execute embedded code, and continuous integration tools could bring some rigor.
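As one illustration of rules 1 and 6, here is a minimal Python sketch of storing a result together with how it was produced. The `run_experiment` function is a hypothetical stand-in for a real analysis, and the fields in the record are my own choice for illustration rather than anything prescribed by the paper.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a real analysis: the mean of 1,000 pseudo-random draws."""
    rng = random.Random(seed)  # local generator, so the result depends only on the seed
    draws = [rng.random() for _ in range(1000)]
    return sum(draws) / len(draws)


def run_and_record(seed: int) -> dict:
    """Run the experiment and return the result alongside a record of how it was produced."""
    return {
        "result": run_experiment(seed),
        "seed": seed,  # rule 6: note the underlying random seed
        "python_version": platform.python_version(),
        "command": sys.argv,  # rule 1: keep track of how the result was produced
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Print the record; in a real project it would be written next to the result file.
    print(json.dumps(run_and_record(seed=42), indent=2))
```

Running the script twice with the same seed gives the same `result` value, and the record says which seed and interpreter version produced it.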

Accessible Reproducible Research

This was an article by Jill Mesirov published in Science magazine in 2010. In this short article the author introduces terminology for systems that facilitate reproducible computational research by scientists, specifically:

  • Reproducible Research System (RRS): Comprised of a Reproducible Research Environment and a Reproducible Research Publisher.
  • Reproducible Research Environment (RRE): The computational tools, management of data, analyses and results and the ability to package them together for redistribution.
  • Reproducible Research Publisher (RRP): The document preparation system which links to the Reproducible Research Environment and provides the ability to embed analyses and results.

A prototype system is described that was developed for Gene Expression analysis experiments called the GenePattern-Word RRS.

Again, looking through the eyes of software development and the tools available, the RRE sounds like revision control plus a build system with dependency management plus a continuous integration server. The RRP sounds like a markup system with linking and a build process.

An invitation to reproducible computational research

This was a paper written by David Donoho in Biostatistics, 2010. This is a great paper and I really agree with the points it makes. For example:

“Computational reproducibility is not an afterthought — it is something that must be designed into a project from the beginning.”

I could not articulate it more clearly myself. In the paper, the author lists the benefits of building reproducibility into computational research. For the researcher the benefits are:

  • Improved work and work habits.
  • Improved teamwork.
  • Greater impact (less inadvertent competition and more acknowledgement).
  • Greater continuity and cumulative impact.

The benefits the author lists for the taxpayer that funds the research are:

  • Stewardship of public goods.
  • Public access to public goods.

I made some of the same arguments to colleagues off the cuff and it is fantastic to be able to point to this paper that does a much better job of making a case.

Making scientific computations reproducible

This paper was published in Computing in Science & Engineering, 2000, by Matthias Schwab, Martin Karrenbach and Jon Claerbout. The opening sentences of this paper are terrific:

“Commonly research involving scientific computations are reproducible in principle but not in practice. The published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself. Consequently authors are usually unable to reproduce their own work after a few months or years.”

The paper describes the standardization of computational experiments through the adoption of GNU make, standard project structure and the distribution of experimental project files on the web. These practices were standardized in the Stanford Exploration Project (SEP).

The motivating problem addressed by the adoption was the loss of programming effort when a graduate student left the group because of the inability to reproduce and build upon experiments.

The ideas of a standard project structure and build system seem so natural to a developer.

Reproducibility by Default in Machine Learning

The key point I want to make is to not disregard the excellent practices that have become standard in software development when starting in the field of machine learning. Use them and build upon them.

I have a blueprint I use for machine learning projects and it’s improving with each project I complete. I hope to share it in the future. Watch this space. Until then, here are some tips for reusing software tools to make reproducibility a default for applied machine learning and machine learning projects in general:

  • Use a build system and have all results produced automatically by build targets. If it’s not automated, it’s not part of the project. Have an idea for a graph or an analysis? Automate its generation.
  • Automate all data selection, preprocessing and transformations. I even put in wget commands for acquiring data files when working on machine learning competitions. I want to get up and running from scratch on new workstations and fast servers.
  • Use revision control and tag milestones.
  • Strongly consider checking in dependencies or at least linking to them.
  • Avoid writing code. Write thin scripts and use standard tools and standard unix commands to chain things together. Writing heavy duty code is a last resort during analysis or a last step before operations.
  • Use a markup language to create reports for analysis and presentation output products. I like to think up lots of interesting things in batch, implement them all and let my build system create them when it next runs. This allows me to evaluate and think deeply about the observations at a later time when I’m not in idea mode.
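To make the data acquisition tip concrete, here is a minimal Python sketch of a scripted download step. Checking the file against a SHA-256 checksum is my own addition for illustration and is not something the tips above require. The names `fetch` and `sha256_of` are hypothetical helpers, and any URL and checksum you pass in would be your own.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download url to dest unless a copy with the expected checksum is already there."""
    if dest.exists() and sha256_of(dest) == expected_sha256:
        return dest  # verified copy already present, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    actual = sha256_of(dest)
    if actual != expected_sha256:
        dest.unlink()  # do not leave unverified data where later steps would find it
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    return dest
```

A build target can call `fetch` for each raw data file, so a fresh checkout on a new workstation ends up with the same inputs as the original run, or fails loudly if the upstream file has changed.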

Pro Tip

Use a Continuous Integration server to run your test harness often (daily or hourly).

Continuous Integration
Photo credit regocasasnovas, some rights reserved

I have conditions in my test harness to check for the existence of output products and create them if they are missing. That means that each time I run the harness, only things that have changed or results that are missing are computed. This means I can let my imagination run wild and keep adding algorithms, data transforms and all manner of crazy ideas to the harness and some server somewhere will compute missing outputs on the next run for me to evaluate.

This disconnect I impose between idea generation and result evaluation really speeds up progress on a project.

If I find a bug in my harness, I delete the results and rebuild them all again with confidence on the next cycle.
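The idea of computing only what is missing can be sketched in a few lines of Python. This is a minimal illustration and not my actual harness: the `run_harness` function, the one-JSON-file-per-result layout and the file naming are all assumptions made for the example, and it only handles missing results, not results whose inputs have changed.

```python
import json
from itertools import product
from pathlib import Path
from typing import Callable, Dict, List


def run_harness(
    results_dir: Path,
    transforms: Dict[str, Callable[[List[float]], List[float]]],
    algorithms: Dict[str, Callable[[List[float]], float]],
    data: List[float],
) -> List[str]:
    """Compute every transform/algorithm combination whose result file is missing.

    Returns the names of the results computed during this run.
    """
    results_dir.mkdir(parents=True, exist_ok=True)
    computed = []
    for (t_name, transform), (a_name, algorithm) in product(
        transforms.items(), algorithms.items()
    ):
        out_path = results_dir / f"{t_name}__{a_name}.json"
        if out_path.exists():
            continue  # result already produced on an earlier run
        score = algorithm(transform(data))
        record = {"transform": t_name, "algorithm": a_name, "score": score}
        out_path.write_text(json.dumps(record))
        computed.append(out_path.stem)
    return computed
```

Adding a new entry to `transforms` or `algorithms` means the next run computes only the new combinations. Deleting the results directory forces a full rebuild.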

Summary

In this post you have learned that the practice of machine learning is project work with source data, code, computations with intermediate work product and output work products. There are also likely all manner of things in between.

If you manage a machine learning project like a software project, you reap the benefits of reproducibility by default. You will also get the added benefits of speed and confidence, which will result in better outcomes.

Resources

If you would like to read further on these issues, I have listed the resources used in the research of this post below.

Have you encountered the challenge of reproducible machine learning projects? Do you have ideas for other software development tools that could aid in this cause? Leave a comment and share your experiences.