A Step-by-Step Guide to Creating a Real Data Science Portfolio Project
As an aspiring data scientist, building interesting portfolio projects is key to showcasing your skills. When I learned coding and data science as a business student through online courses, I disliked that datasets were either made up of fake data or had been solved before, like the Boston House Prices or the Titanic dataset on Kaggle.
In this blog post, I want to show you how I develop interesting data science project ideas and implement them step by step, using the example of exploring Germany's biggest frequent flyer forum, Vielfliegertreff. If you are short on time, feel free to skip ahead to the conclusion TLDR.
Step 1: Choose your passion topic that is relevant
As a first step, I think about a potential project that fulfills the following three requirements to make it as interesting and enjoyable as possible:
Solving my own problem or burning question
Connected to some recent event, making it relevant or especially interesting
Has not been solved or covered before
As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:
As a beginner, do not strive for perfection; choose something you are genuinely curious about and write down all the questions you want to explore in your topic.
Step 2: Start scraping together your own dataset
Given that you followed my third requirement, there will be no dataset publicly available, and you will have to scrape the data together yourself. Having scraped a couple of websites, I use three major frameworks for different scenarios:
For Vielfliegertreff, I used scrapy as the framework for the following reasons (a minimal spider sketch follows the list):
There were no JavaScript-enabled elements hiding data.
The website structure was complex: you have to go from each forum section to all of its threads, and from all the threads to all of the post pages. With scrapy you can easily implement such complex logic in an organized way, yielding requests that lead to new callback functions.
There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to scrape websites asynchronously at an incredible speed.
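To make that callback structure concrete, here is a minimal sketch of such a spider. The entry URL and the CSS selectors are hypothetical placeholders, since the real forum markup differs:

```python
import scrapy


class ForumSpider(scrapy.Spider):
    """Sketch of the section -> thread -> post-page crawling flow."""

    name = "forum"
    # Assumed entry point; all selectors below are placeholders.
    start_urls = ["https://www.vielfliegertreff.de/"]

    def parse(self, response):
        # Follow every forum section to its thread listing.
        for href in response.css("a.section-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_section)

    def parse_section(self, response):
        # Follow every thread in the section.
        for href in response.css("a.thread-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_thread)
        # Paginate through the thread listing.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_section)

    def parse_thread(self, response):
        # Yield one item per post on the page.
        for post in response.css("div.post"):
            yield {
                "url": response.url,
                "author": post.css(".author::text").get(),
                "date": post.css(".date::text").get(),
                "text": " ".join(post.css(".content ::text").getall()),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_thread)
```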
To give you an idea of how powerful scrapy is, I quickly benchmarked it on my MacBook Pro (13-inch, 2018, four Thunderbolt 3 ports) with a 2.3 GHz quad-core Intel Core i5, which was able to scrape around 3,000 pages per minute.
To be polite and avoid getting blocked, it is important that you scrape gently, for example by enabling scrapy's AutoThrottle feature. Furthermore, I saved all data to a SQLite database via an item pipeline to avoid duplicates, and logged each URL request to make sure I would not put extra load on the server if I stopped and restarted the scraping process.
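As a rough illustration of this setup, the sketch below shows the kind of settings and item pipeline involved. The concrete values and the table schema are assumptions for illustration, not necessarily what I used:

```python
# settings.py -- polite crawling (values are illustrative)
AUTOTHROTTLE_ENABLED = True          # adapt delay to server response times
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
JOBDIR = "crawls/vielfliegertreff"   # persist state so a crawl can resume
ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}

# pipelines.py -- SQLite pipeline that skips duplicate posts
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("posts.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(url TEXT, author TEXT, date TEXT, text TEXT, "
            "PRIMARY KEY (url, author, date))"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE drops rows whose key already exists.
        self.conn.execute(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)",
            (item.get("url"), item.get("author"),
             item.get("date"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```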
Knowing how to scrape gives you the freedom to collect datasets yourself, and it also teaches you important concepts about how the internet works, what a request is, and how HTML and XPath are structured.
For my project I ended up with 1.47 GB of data, which was close to 1 million posts from the forum.
Step 3: Cleaning your dataset
With your own scraped, messy dataset comes the most challenging part of the portfolio project, the one where data scientists spend on average 60% of their time: data cleaning.
Unlike clean Kaggle datasets, your own dataset lets you build skills in data cleaning and shows a future employer that you are ready to deal with messy real-life datasets. Additionally, you can explore and take advantage of the Python ecosystem by leveraging libraries that solve common data cleaning tasks others have solved before.
For my dataset from Vielfliegertreff, there were a couple of common tasks, like turning the dates into pandas timestamps, converting numbers from strings into actual numeric data types, and cleaning very messy HTML post text into something readable and usable for NLP tasks. While some tasks are a bit more complicated, I would like to share my top 3 favourite libraries that solved some of my common data cleaning problems (a short usage sketch follows the list):
dateparser: Easily parse localized dates in almost any string format commonly found on web pages.
clean-text: Preprocess your scraped data with clean-text to create a normalized text representation. It is also amazing for removing personally identifiable information, such as emails or phone numbers.
fuzzywuzzy: Fuzzy string matching like a boss.
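Here is a minimal sketch of how the three libraries fit together; the example strings are made up:

```python
import dateparser
from cleantext import clean
from fuzzywuzzy import process

# Parse a German forum timestamp into a datetime object.
posted = dateparser.parse("3. Januar 2020 14:32", languages=["de"])

# Normalize a messy post and strip personally identifiable information.
text = clean(
    "Kontakt: max@example.com   Tolle Lounge!!!",
    no_emails=True,         # replace e-mail addresses
    no_phone_numbers=True,  # replace phone numbers
    lower=False,
)

# Match a misspelled airline name against a list of known carriers.
airline, score = process.extractOne(
    "Lufthanza", ["Lufthansa", "Eurowings", "Condor"]
)
print(posted, text, airline, score, sep="\n")
```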
Step 4: Data Exploration and Analysis
While completing the Data Science Nanodegree on Udacity, I came across the Cross-Industry Standard Process for Data Mining (CRISP-DM), which I thought was quite an interesting framework for structuring your work in a systematic way.
With our current flow, we have implicitly followed CRISP-DM for our project:
Expressing business understanding by coming up with the following questions in step 1:
- How is COVID-19 impacting online frequent flyer forums like Vielfliegertreff?
- What are some of the best posts in the forum?
- Who are the experts I should follow as a new joiner?
- What are some of the worst or best things people say about airlines or airports?
And with the scraped data we are now able to translate our initial business questions from above into specific data exploration questions (a small pandas sketch for the first one follows the list):
- How many posts are published per month? Did posts decrease at the beginning of 2020 after COVID-19? Is there also some indication that fewer people joined the platform because they were unable to travel?
- What are the top 10 posts by number of likes?
- Who posts the most and also receives, on average, the most likes per post? These are the users I should follow regularly to see the best content.
- Could a sentiment analysis on every post, combined with named entity recognition to identify cities/airports/airlines, surface interesting positive or negative comments?
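The first question, for example, boils down to a simple resampling of the post timestamps. A minimal sketch, assuming the posts.db table from the scraping step and its (assumed) column names:

```python
import sqlite3

import pandas as pd

# Load the scraped posts; table and column names are assumptions.
conn = sqlite3.connect("posts.db")
df = pd.read_sql("SELECT author, date, text FROM posts", conn)

# Monthly post volume -- did activity drop from January 2020 onwards?
df["date"] = pd.to_datetime(df["date"])
monthly_posts = df.set_index("date").resample("M").size()
print(monthly_posts.tail(12))

# Most active users, a first proxy for whom to follow.
print(df["author"].value_counts().head(10))
```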
For the Vielfliegertreff project, one can definitely say that there has been a trend of declining posts over the years. With COVID-19, we can clearly see a rapid decrease in posts from January 2020 onwards, when Europe was shutting down and closing borders, which also heavily affected travelling.
User sign-ups have also gone down over the years, and the forum seems to see less and less of the rapid growth it had after its start in January 2009.
Last but not least, I wanted to check what the most-liked post was about. Unfortunately, it is in German, but it was indeed a very interesting post, in which a German guy was allowed to spend some time on a US aircraft carrier and experienced a catapult take-off in a C-2 airplane. The post has some very nice pictures and interesting details. Feel free to check it out here if you can understand some German.
Step 5: Share your work via a Blog Post or Web App
Once you are done with those steps, you can go one step further and create a model that classifies or predicts certain data points. For this project, I did not pursue machine learning further, although I had some interesting ideas about classifying the sentiment of posts in connection with certain airlines.
In another project, however, I modeled a price prediction algorithm that allows a user to get a price estimate for any type of tractor. The model was then deployed with the awesome streamlit framework, which can be found here (be patient, it might load a bit slowly).
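A streamlit app like that needs surprisingly little code. Below is a hypothetical sketch in the spirit of the tractor app; the feature names and the saved model artifact are assumptions, not the actual implementation:

```python
import joblib  # assuming the trained model was saved with joblib
import pandas as pd
import streamlit as st

st.title("Tractor Price Estimator")

# User inputs; the features are illustrative assumptions.
brand = st.selectbox("Brand", ["John Deere", "Fendt", "Claas"])
horsepower = st.slider("Horsepower", 50, 500, 150)
year = st.number_input("Year of manufacture", 1980, 2024, 2015)

if st.button("Estimate price"):
    model = joblib.load("price_model.joblib")  # assumed artifact
    features = pd.DataFrame(
        [{"brand": brand, "hp": horsepower, "year": year}]
    )
    st.write(f"Estimated price: EUR {model.predict(features)[0]:,.0f}")
```

Run it with `streamlit run app.py` and streamlit serves the interactive page for you.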
Another way to share your work is, like me, through blog posts on Medium, Hackernoon, KDnuggets or other popular websites. When writing blog posts about portfolio projects or other topics, such as awesome interactive AI applications, I always try to make them as fun, visual and interactive as possible. Here are some of my top tips:
Include nice pictures for easy understanding and to break up some of the long text
Include interactive elements, like tweets or videos that let the user interact
Swap boring tables or charts for interactive ones using tools and frameworks like airtable or plotly
Conclusion & TLDR
Come up with a blog post idea that answers a burning question you had or solves your own problem. Ideally, the topic is timely and has not been analysed by anyone else before. Based on your experience and the website's structure and complexity, choose the framework that matches the scraping job best. During data cleaning, leverage existing libraries to solve painful tasks like parsing timestamps or cleaning text. Finally, choose how you can best share your work. Both an interactive deployed model/dashboard and a well-written Medium blog post can differentiate you from other applicants on the journey to becoming a data scientist.
As always, feel free to share with me some great data science resources or some of your best portfolio projects!