DataRobot Makes Life Easy

I’m diverging from the previous articles in the series. I’m going to review two tools that are head and shoulders above the others. The design and beautiful visualizations do not come cheap. That doesn’t mean we can’t admire them and use them as a bar to strive for. I will start with DataRobot. It’s an enterprise tool that you may find yourself having access to through work or school.

Why DataRobot?

I have experience using this tool and love it for the business cases for which I use it. My business case is to have a straightforward interface for a non-data scientist to run and deploy models in an automated way. DataRobot adds new features on a regular cadence, each built nicely within the existing user experience. I could go on about the benefits, but I will control my inner fan-girl.

To keep the comparison even with the other tools, I will focus on the most basic task: running a simple .csv file through autoML without any manual intervention or hyper-parameter tuning.

The setup and cost

Straight up, DataRobot is outside the budget range of the individual data scientist. The implementation and cost are definitely in the realm of businesses. AWS Marketplace offers a one-year subscription for $98,000. Pocket change, I’m sure. But if you use AWS GovCloud, it is $9.33/hr (the rate varies). Interesting.
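
As a rough worked comparison of the two prices quoted above, the sketch below computes what the hourly rate would cost if it ran around the clock for a year. It assumes a 365-day year and treats $9.33/hr as constant, although the rate varies, and the two listings may not cover the same product configuration.

```python
# Rough arithmetic on the two prices quoted in the text.
# Assumptions: a 365-day year and a constant hourly rate (the text notes it varies).
ANNUAL_SUBSCRIPTION_USD = 98_000
HOURLY_RATE_USD = 9.33
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def hourly_cost_for(hours: float) -> float:
    """Cost in USD of paying the hourly rate for the given number of hours."""
    return HOURLY_RATE_USD * hours


full_year_hourly = hourly_cost_for(HOURS_PER_YEAR)
break_even_hours = ANNUAL_SUBSCRIPTION_USD / HOURLY_RATE_USD

print(f"Hourly rate, running all year: ${full_year_hourly:,.2f}")
print(f"Hours at which hourly billing matches the subscription: {break_even_hours:,.0f}")
```

Under these assumptions the break-even point is higher than the number of hours in a year, so the hourly listing would come out cheaper even if it never stopped running.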

The Data

To keep parity across the tools in this series, I will stick to the Kaggle training file from the competition "Contradictory, My Dear Watson", which is about detecting contradiction and entailment in multilingual text using TPUs. In this Getting Started Competition, we’re classifying pairs of sentences (consisting of a premise and a hypothesis) into three categories: entailment, contradiction, or neutral.

6 columns x 13k+ rows (Stanford NLP documentation)

  • id
  • premise
  • hypothesis
  • lang_abv
  • language
  • label
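
Before uploading a file like this to any autoML tool, a quick sanity check of the header and the label distribution can save a wasted run. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the two sample rows are made up for illustration and are not taken from the Kaggle file.

```python
import csv
import io
from collections import Counter

# Column names as listed in the article.
EXPECTED_COLUMNS = ["id", "premise", "hypothesis", "lang_abv", "language", "label"]


def summarize(csv_file):
    """Check the header against the expected columns and count rows per label."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    label_counts = Counter()
    rows = 0
    for row in reader:
        rows += 1
        label_counts[row["label"]] += 1
    return rows, label_counts


# Made-up two-row sample standing in for the real training file.
sample = io.StringIO(
    "id,premise,hypothesis,lang_abv,language,label\n"
    "a1,It is raining.,The ground is wet.,en,English,0\n"
    "a2,He left early.,He stayed late.,en,English,2\n"
)
rows, label_counts = summarize(sample)
print(rows, dict(label_counts))
```

For the real file, passing an open file handle (for example one opened with `newline=""` and `encoding="utf-8"`) in place of the `io.StringIO` sample should work the same way.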

Loading the data

You create a project by uploading a dataset. This interface is where you begin.

Image: screenshot by the author

After the data is loaded, there are opportunities to change datatypes or remove features. Some summaries of the data distribution are also shown. A bonus is that warnings appear if there might be data leakage. If data leakage is detected, DataRobot removes that feature from the final training dataset.
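
The article does not describe how DataRobot detects leakage. As a purely illustrative heuristic, and not DataRobot's method, one simple check is to ask how well a single feature predicts the target on its own; a feature that gets close to perfect is suspicious. The function names, the 0.95 threshold, and the sample rows (including the `outcome_code` column) are all hypothetical.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, target):
    """Accuracy of predicting the target with the most common label per feature value."""
    labels_by_value = defaultdict(Counter)
    for row in rows:
        labels_by_value[row[feature]][row[target]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in labels_by_value.values())
    return correct / len(rows)


def flag_possible_leakage(rows, target, threshold=0.95):
    """Return the features that alone predict the target at or above the threshold."""
    features = [name for name in rows[0] if name != target]
    return [f for f in features if single_feature_accuracy(rows, f, target) >= threshold]


# Hypothetical rows: "outcome_code" is a made-up column that simply re-encodes the label.
rows = [
    {"language": "English", "outcome_code": "E", "label": "entailment"},
    {"language": "English", "outcome_code": "C", "label": "contradiction"},
    {"language": "French", "outcome_code": "N", "label": "neutral"},
    {"language": "French", "outcome_code": "E", "label": "entailment"},
    {"language": "German", "outcome_code": "C", "label": "contradiction"},
    {"language": "German", "outcome_code": "N", "label": "neutral"},
]
flagged = flag_possible_leakage(rows, target="label")
print(flagged)
```

A unique identifier column would also score perfectly on the rows it was fitted to, so a more careful version of this check would measure accuracy on held-out rows.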

Image: project screenshot by the author
Image: screenshot by the author

Training your model

Once you choose your target, you hit the big Start button with Modeling Mode set to AutoPilot. When you do that, you will see progress on the right side. As models finish training, they become available on the leaderboard.

One good thing about having access to the early model results is that you can review them for significant issues. Data issues often become glaringly apparent in the Insights tab, and I can halt the process and try again. This quick and easy review helps with rapid iteration.

Evaluate Training Results

The leaderboard begins to fill with the completed models. You can choose several valid metrics in the dropdown. There are also some helpful tags to let you know WHY the leaders are up at the top.
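
The choice of metric in that dropdown matters because different metrics can rank the same models differently. The sketch below is a generic illustration, not DataRobot code: it compares accuracy and log loss for two made-up sets of three-class predictions. The function names and all probabilities are assumptions chosen for the example.

```python
import math


def accuracy(probabilities, labels):
    """Share of rows where the highest-probability class is the true class."""
    hits = sum(1 for probs, y in zip(probabilities, labels) if probs.index(max(probs)) == y)
    return hits / len(labels)


def log_loss(probabilities, labels, eps=1e-15):
    """Mean negative log of the probability assigned to the true class (lower is better)."""
    total = -sum(math.log(max(probs[y], eps)) for probs, y in zip(probabilities, labels))
    return total / len(labels)


labels = [0, 1, 2, 0]

# Model A: always right, but only barely confident.
model_a = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]
# Model B: very confident on three rows, wrong on the last one.
model_b = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.3, 0.4, 0.3]]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, f"accuracy={accuracy(probs, labels):.2f}", f"log_loss={log_loss(probs, labels):.3f}")
```

With these made-up numbers, model A leads on accuracy while model B leads on log loss, which is why it helps to pick the metric that matches the business question before reading the ranking.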

Image: leaderboard screenshot by the author

You can compare the models against each other.

Image: learning curve screenshot by the author

One tab I use often is speed versus accuracy. When you are scoring millions of records, speed can trump accuracy if the accuracy drop is minor.
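
That trade-off can be written down as a simple selection rule: among the models whose accuracy is within a small tolerance of the best, take the fastest. The following is a minimal sketch of that idea; the model names, scores, timings, and the 0.01 tolerance are hypothetical and are not results from this project.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    accuracy: float          # higher is better
    ms_per_1000_rows: float  # prediction time, lower is better


def fastest_within_tolerance(models, max_accuracy_drop=0.01):
    """Pick the fastest model whose accuracy is within the tolerance of the best one."""
    best_accuracy = max(m.accuracy for m in models)
    candidates = [m for m in models if best_accuracy - m.accuracy <= max_accuracy_drop]
    return min(candidates, key=lambda m: m.ms_per_1000_rows)


# Hypothetical leaderboard entries.
models = [
    ModelResult("blender", 0.652, 950.0),
    ModelResult("gradient_boosting", 0.648, 120.0),
    ModelResult("logistic_regression", 0.611, 15.0),
]
choice = fastest_within_tolerance(models)
print(choice.name)
```

Here the second model gives up less than half a point of accuracy for a large speed gain, so it is selected; the fastest model overall is rejected because its accuracy drop exceeds the tolerance.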

Image: speed versus accuracy screenshot by the author
Image: head to head model comparisons screenshot by the author

The Insights tab is handy. You can quickly see if one of your features is popping. It’s up to your business expertise to know if that’s appropriate or not. This tab is where I find data issues early in the autoML model training. If I see something that doesn’t seem correct, I can stop and iterate right away instead of waiting for the entire process to finish.

Image: insights screenshot by the author

DataRobot's model explainability is the best among the tools I have reviewed so far. Each prediction comes with the features that influenced the final score, indicating not only the strength of each influence but also its direction.
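
To make "strength and direction" concrete, here is a small sketch for the simplest possible case, a linear model, where each feature's contribution is its weight times the row's deviation from a baseline value. This is a generic illustration under that assumption and not how DataRobot computes its explanations; the feature names, weights, and values are made up.

```python
def explain_prediction(weights, baseline, row):
    """Return (feature, contribution) pairs sorted by strength, largest first."""
    contributions = {
        name: weights[name] * (row[name] - baseline[name]) for name in weights
    }
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical linear model and a single row to explain.
weights = {"word_overlap": 2.0, "length_difference": -0.25, "negation_count": -1.5}
baseline = {"word_overlap": 0.4, "length_difference": 3.0, "negation_count": 0.5}
row = {"word_overlap": 0.9, "length_difference": 5.0, "negation_count": 0.5}

for feature, contribution in explain_prediction(weights, baseline, row):
    # The sign gives the direction, the absolute value gives the strength.
    direction = "+" if contribution > 0 else "-" if contribution < 0 else "0"
    print(f"{feature}: {direction} {abs(contribution):.2f}")
```

The sign tells you whether the feature pushed the score up or down for this row, and the magnitude tells you how hard; a feature sitting at its baseline contributes nothing.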

Image: prediction explanation screenshot by the author

Documentation is not to be underestimated; it can be a real drain on your time. For this simple dataset, DataRobot generates a 7,000+ word document with all of the charts, model parameters, and challenger model details. This documentation is a unique feature that I haven’t found in any other tool, though I have requested it whenever I was asked for feedback. All done with a single click.

Image: compliance reporting screenshot by the author
Image: compliance documentation screenshot by the author

Conclusions

To loosely compare results between tools, I reran the dataset in classification mode. The metrics are just slightly higher than Azure's. For the most part, the model results are similar.

For my business case, this is the top of the pile so far. A head-to-head comparison on image processing or time-series tasks may give different results. That would be a challenge for another series.

The ease of use, visualizations, access to challenger model details, model explainability, and the automated documentation stand out from the others. Of course, you are paying dearly for this.

Next, I will show you H2O.ai Driverless AI. In my opinion, it is the closest comparison to DataRobot at this time. The team has gone to great lengths to get top data visualization designers on the project, so I’m expecting great things.

If you missed one of the articles in the series, I have posted them below.

Translated from: https://towardsdatascience.com/datarobot-makes-life-easy-8505637241e5
