Data Scientists Should Know Software Engineering Best Practices

Opinion

Introduction

I have been eagerly researching, speaking to friends, and testing new ideas that will help make me a more indispensable Data Scientist. Of course, there is no way I am going to progress in my career without sharing what I learn with the people who have helped me get this far (in case you don’t know, that’s you guys and gals!).

Following a recent poll I carried out on my LinkedIn profile, I was surprised to see the number of people who thought Data Scientists must know programming standards and follow engineering best practices.

Figure 1: Poll Results

Statisticians are often disappointed by the lack of fundamental statistics knowledge that many Data Scientists (myself included) possess. Mathematicians believe that a solid understanding of the underlying principles must come before application, an understanding that, in various scenarios, I admittedly do not have. Software Engineers expect Data Scientists to carry out their experiments while following basic programming principles.

What stung me the most is that every “yes” voter is currently working as a Data Scientist, many of them in leading roles (at the time of the poll), including the likes of 4x Kaggle Grandmaster Abhishek Thakur. OK, I admit, the role you want determines how deep an understanding of Statistics and other Math concepts such as Probability, Linear Algebra and Calculus is required (although the basics are absolutely essential), but Software Engineering practices?

I was once among the Data Scientists who believe we are solely Data Scientists, not Software Engineers, and hence that our responsibility is to extract valuable insights from data. That is still a fact; however, this poll disrupted my mental model and sent me down a deep trail of thought…

Why must I know the fundamentals of Software Engineering when my job title is Data Scientist?

I remembered the goal: to become an indispensable Data Scientist. Am I saying that if I don’t know or learn the fundamentals of Software Engineering, I am not indispensable? Mmm, yeah, basically. Note, however, that this statement makes an assumption: that you are a Data Scientist writing code that will most likely make it to production.

On that note, I’ve curated a list of fundamental software engineering principles that should apply to Data Scientists. Not having a Software Engineering background, I consulted many friends who are Software Engineers to help me make the list, as well as to teach me how to write better production code.

Here are some of the best practices Data Scientists should know:

Clean Code

Note: I want to start off by apologizing to R users. I have not done much research into coding in R, so many of the clean code tips will mainly apply to Python users.

The first programming language I learnt was Python because I am a fluent English speaker, and to me Python significantly resembled the English language. Technically, this refers to the high readability of the Python programming language, which was a deliberate choice by the designers of Python, following the realization that code is read more often than it is written.

When a veteran Python developer (a Pythonista) calls portions of code not “Pythonic”, they usually mean that these lines of code do not follow the common guidelines and fail to express its intent in what is considered the best (hear: most readable) way. — The Hitchhikers Guide to Python

I am going to list a few factors that constitute clean code, but I do not plan to go into much detail, since there are many great resources out there that cover these topics better than I ever could, such as PEP 8 and Clean Code in Python:

  • Meaningful and pronounceable naming conventions
  • Clarity beats consistency
  • Searchable names
  • Make your code easy to read!

Remember, not only will others read your code, but you will too, and if you can’t remember what something means, imagine what hope someone else will have.

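As a small illustration of the naming points above, here is a hedged sketch contrasting terse names with meaningful, searchable ones. The function names, the constant `SECONDS_PER_DAY` and the sample values are hypothetical and chosen only for this example; they do not come from PEP 8 or any of the resources mentioned.

```python
# Hypothetical example: both functions compute the same thing,
# but only the second one tells the reader what that thing is.

def f(d, t):
    # Terse names and a magic number: what are d and t? Why 86400?
    return [x for x in d if x > t * 86400]


SECONDS_PER_DAY = 86_400  # a named, searchable constant instead of a magic number


def filter_sessions_longer_than(session_durations_seconds, minimum_days):
    """Return the session durations (in seconds) that exceed minimum_days."""
    minimum_seconds = minimum_days * SECONDS_PER_DAY
    return [
        duration
        for duration in session_durations_seconds
        if duration > minimum_seconds
    ]


durations = [3_600, 90_000, 200_000]
long_sessions = filter_sessions_longer_than(durations, minimum_days=1)
```

Searching a codebase for `SECONDS_PER_DAY` finds every use of the constant, whereas searching for `86400` may miss variants or hit unrelated numbers.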

Modularity

This one can be partially blamed on the way we learn Data Science. I would be surprised if a Data Scientist could not spin up a Jupyter Notebook and begin doing some exploration. But that is all Jupyter Notebooks are for: EXPERIMENTS! Unfortunately, many of the courses out there on learning Data Science do not do a good job of transporting us from a Jupyter Notebook to scripts, which are much more effective for production environments.

When we talk of modular code, we mean code that is separated into independent modules. Executed effectively, modularization makes code easier to package, test, maintain and reuse.

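One way to picture the move from notebook cells to a script is to wrap each step in its own function and put the top-level run behind a `main()` entry point. The following is only a minimal sketch under that assumption; the function names and the inline data are hypothetical stand-ins for whatever a real notebook would load and compute.

```python
"""Hypothetical script version of a few notebook cells."""
import statistics


def load_values():
    """Stand-in for a data loading step; a real script would read a file here."""
    return [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


def main():
    values = load_values()
    standardized = standardize(values)
    print(f"first standardized value: {standardized[0]:.2f}")
    return standardized


# The guard lets other modules import standardize() without running main().
if __name__ == "__main__":
    main()
```

Because each step is an importable function, another script or a test file can reuse `standardize` without re-running the whole pipeline, which is hard to do with cells in a notebook.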

In the video linked below, Abhishek Thakur builds a Machine Learning package for a Kaggle competition; it was my first exposure to modularity. In the past, I’ve also heard Abhishek mention that the way he learnt more about modularity and software engineering best practices as a whole was by reading through the scikit-learn code on GitHub.

Demo link

Some other things that contribute to writing good modularized code are:

  • Don’t Repeat Yourself (DRY): a principle of software development aimed at reducing repetition of software patterns, replacing it with abstractions or using data normalization to avoid redundancy. (Source: Wikipedia)
  • Single Responsibility Principle (SRP): a computer-programming principle that states that every module, class or function in a computer program should have responsibility over a single part of that program’s functionality, which it should encapsulate. (Source: Wikipedia)
  • Open-Closed Principle: in object-oriented programming, the open-closed principle states that “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”; that is, such an entity can allow its behaviour to be extended without modifying its source code. (Source: Wikipedia)
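The three principles above can be sketched in a few lines of Python. This is an illustrative example under my own assumptions, not code from scikit-learn or from the video: every name here (`strip_and_lower`, `Scaler`, `RangeScaler`, and so on) is hypothetical.

```python
from abc import ABC, abstractmethod


# DRY: the cleaning rule lives in one function instead of being
# copy-pasted for every column that needs it.
def strip_and_lower(text):
    return text.strip().lower()


# SRP: each function has one job. Cleaning does not also count,
# and counting does not also clean.
def clean_labels(labels):
    return [strip_and_lower(label) for label in labels]


def count_labels(labels):
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts


# Open-closed: a new scaler is added by writing a new subclass;
# apply_scaler itself never needs to be modified.
class Scaler(ABC):
    @abstractmethod
    def scale(self, values):
        ...


class RangeScaler(Scaler):
    def scale(self, values):
        # Assumes the values are not all equal (otherwise high - low is 0).
        low, high = min(values), max(values)
        return [(value - low) / (high - low) for value in values]


class MaxAbsoluteScaler(Scaler):
    def scale(self, values):
        # Assumes at least one value is non-zero.
        largest = max(abs(value) for value in values)
        return [value / largest for value in values]


def apply_scaler(scaler, values):
    return scaler.scale(values)


label_counts = count_labels(clean_labels([" Cat", "dog ", "CAT"]))
scaled = apply_scaler(RangeScaler(), [10, 20, 30])
```

Adding, say, a log-based scaler later would mean writing one more `Scaler` subclass, leaving `apply_scaler` and the existing scalers untouched.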

Refactoring

Code refactoring may be defined as the process of restructuring existing code without changing the external behaviour of the code at runtime.

Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality. — Wikipedia

There are many advantages to refactoring our code, for example improved readability and reduced complexity, which in turn make our source code much easier to maintain and give us an internal architecture that improves the extensibility of the code we write.

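Here is a hedged before-and-after sketch of what such a refactor might look like. Both functions are hypothetical and written only for this example; the point is that the external behaviour stays the same while the structure becomes easier to read and test.

```python
# Before: one loop mixes validation, conversion and aggregation.
def summary_before(rows):
    t = 0
    n = 0
    for r in rows:
        if r is not None:
            if r != "":
                t = t + float(r)
                n = n + 1
    if n == 0:
        return None
    return t / n


# After: same external behaviour, but each step is named and testable.
def is_present(raw_value):
    return raw_value is not None and raw_value != ""


def mean_of_present_values(rows):
    values = [float(raw_value) for raw_value in rows if is_present(raw_value)]
    if not values:
        return None
    return sum(values) / len(values)


# A quick check that the refactor did not change the result.
sample = ["1.5", None, "", "2.5", 4]
assert summary_before(sample) == mean_of_present_values(sample)
```

Keeping a check like the final `assert` around while refactoring is a cheap way to confirm that behaviour has been preserved.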

Furthermore, we cannot talk about code refactoring without talking about improving performance. The goal is to write a program that runs faster and uses less memory, especially if we have an end user who will be executing some task.

For more on refactoring in Python, see the link below!

Testing

Note: I learnt briefly about testing (and the majority of the other ideas covered in this post) in the Deployment of Machine Learning Models course on Udemy.

Data Science is a funny field in the sense that our code may still run even though it contains errors, whereas in software-related projects the code will throw an error. Consequently, we can end up with misleading insights (and possibly no job). Hence, tests are imperative, and if you know how to write them, your price goes up.

Here are some reasons why we run tests:

  • Ensure we get the correct outputs
  • Make updates to code easier
  • Prevent pushing broken code to production
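To make the first reason concrete, here is a hedged sketch of the kind of silent error a test can catch. The function `fill_missing_with_mean` and its tests are hypothetical; the tests are plain functions using `assert`, a style that test runners such as pytest can collect, although nothing here depends on pytest.

```python
import math


def fill_missing_with_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    present = [value for value in values if not math.isnan(value)]
    if not present:
        # Without this guard the function would divide by zero; a naive
        # version that averaged over NaN values would instead run happily
        # and return NaN everywhere, which is the "silent" kind of bug.
        raise ValueError("cannot impute: every value is missing")
    mean = sum(present) / len(present)
    return [mean if math.isnan(value) else value for value in values]


def test_missing_values_are_replaced_by_mean():
    result = fill_missing_with_mean([1.0, float("nan"), 3.0])
    assert result == [1.0, 2.0, 3.0]


def test_all_missing_raises_instead_of_returning_nan():
    try:
        fill_missing_with_mean([float("nan")])
    except ValueError:
        return
    raise AssertionError("expected a ValueError")


# Run the tests directly so the example is self-checking.
test_missing_values_are_replaced_by_mean()
test_all_missing_raises_instead_of_returning_nan()
```

If someone later "simplifies" the function and breaks the imputation, these tests fail immediately instead of letting a wrong number flow into a report.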

I’m sure there are more reasons, but for now I will stop here. Check out the link below for more on testing.

Code Review

Code reviews are done to improve code quality by promoting the best programming practices, which help get code ready for production. Additionally, they are beneficial for everyone, since they tend to have positive effects on team and company culture.

The main reason for a code review is to catch errors, though reviews are also extremely useful for improving readability and for ensuring that coding standards are met.

A really great article that goes more into depth is linked below…

Wrap Up

It’s fair to say that this is definitely a whole load of things to learn, but for the exact same reason it increases the overall value of a Data Science practitioner. Being able to whip up a Jupyter Notebook is no longer enough to make you stand out as a Data Scientist because everyone can do it. If you want to be above average, you have to do above-average things, and in this instance that may involve learning software engineering best practices.

Let’s continue the conversation on LinkedIn…

Source: https://towardsdatascience.com/data-scientist-should-know-software-engineering-best-practices-f964ec44cada