Understanding Seattle Airbnb Home Data
The data I used comes from a Udacity project. It covers Seattle Airbnb listings from 2016, including the listing price, amenities, neighbourhood, host information, and review scores.
As a beginner in data science, I defined and aimed to answer three questions:
- What are the busiest times of the year to visit Seattle?
- Can we use other listing information to predict the listing price?
- What is the vibe of each Seattle neighbourhood, based on listing descriptions?
Price Fluctuation
The first question is quite straightforward as a visualization question. Looking at the listing price over time, we find that it fluctuates in the same pattern every week: as we might intuitively guess, the price goes up at the weekend because demand increases then.
When we smooth out the weekly effect by looking at the average listing price per week, we find that January and February sit at the bottom of the price range (around $117); the price then gradually increases and peaks in July (around $157), so the start-of-year price is about 25% below the summer peak. Further analysis of the outliers was done to make sure the trend stays consistent.
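As a rough sketch of this step, the weekly and monthly averages can be computed with pandas. The file name and column format are assumptions on my part (the calendar data typically stores prices as strings like "$85.00"):

```python
import pandas as pd

# Assumed input: a calendar file with `date` and `price` columns,
# where price is a string such as "$85.00".
calendar = pd.read_csv("calendar.csv", parse_dates=["date"])
calendar["price"] = (
    calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Average listed price by day of week (the weekend bump)
# and by month (the January trough and July peak).
by_weekday = calendar.groupby(calendar["date"].dt.day_name())["price"].mean()
by_month = calendar.groupby(calendar["date"].dt.month)["price"].mean()
print(by_weekday)
print(by_month)
```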
Predicting the price
The second question is to build a model that predicts the listing price from other listing information. We can first take a look at the distribution of prices: it is slightly right skewed, and a large proportion of listings are priced under $200 per night.
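A minimal sketch of that distribution check, assuming the listings file also stores the price as a string with a dollar sign (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: a listings file with a string `price` column such as "$120.00".
listings = pd.read_csv("listings.csv")
price = listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Right-skewed distribution: most listings fall under $200 per night.
price.plot(kind="hist", bins=50)
plt.xlabel("Price per night ($)")
plt.ylabel("Number of listings")
plt.show()
```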
Before feeding the data into models for training, we first apply different cleaning steps to the numeric and categorical variables: filling in null values, scaling the numeric features, and encoding the categorical ones. We then perform principal component analysis (PCA) to reduce dimensionality while keeping most of the variance in the data.
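Here is a sketch of how that cleaning-plus-PCA step might look with scikit-learn; the column lists are hypothetical placeholders, not the exact set used in the analysis:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA

# Hypothetical column lists -- the real analysis uses whichever
# listing columns survived the initial filtering.
numeric_cols = ["accommodates", "bathrooms", "bedrooms", "review_scores_rating"]
categorical_cols = ["neighbourhood_cleansed", "room_type", "property_type"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric variables: fill nulls with the median, then scale.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical variables: fill nulls with the most frequent level,
        # then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # force a dense matrix so PCA can consume it
)

# PCA keeps enough components to retain ~95% of the variance.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=0.95)),
])
```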
By trying different supervised learning models, including linear regression, lasso regression, and random forest, and using RMSE and R² to evaluate performance on a 25% test set (see the sketch after this list), we finally chose lasso regression (alpha = 0.8) as the final model, with RMSE 3580.413 and R² 0.571. Even though it is the best model so far, the results could be further improved by:
- Including some of the variables that I excluded from the analysis (I still need to learn how to process those features).
- Combining existing variables to construct new, more meaningful features.
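A sketch of the model comparison described above, reusing the pipeline and cleaned price series from the earlier sketches (the exact feature set and random seeds are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Features come from the cleaning/PCA pipeline sketched above; the target
# is the cleaned numeric price. 25% of the rows are held out for testing.
X = pipeline.fit_transform(listings[numeric_cols + categorical_cols])
y = price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "linear regression": LinearRegression(),
    "lasso (alpha=0.8)": Lasso(alpha=0.8),
    "random forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE={rmse:.2f}, R^2={r2_score(y_test, preds):.3f}")
```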
Neighbourhood Vibe
The third and last question is about the vibe of each neighbourhood, which involves text-mining techniques I had not touched before. Referring to the book Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, I found a useful metric called TF-IDF (Term Frequency–Inverse Document Frequency), which indicates how important a word is to a specific document.
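As a rough illustration of the metric (a simple textbook variant; libraries such as scikit-learn apply extra smoothing):

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, given the collection `docs`
    (each document is a list of tokens): term frequency times
    the log of the inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf
```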
I first extracted the five neighbourhoods with the highest word counts as the targets of the analysis. There is not a great deal of text to mine, but it still serves as good practice for the technique.
Afterwards, I calculated the TF-IDF values and listed the top five words for each of these neighbourhoods (a code sketch of this step follows the examples below). We can also compare the words with each neighbourhood's Wikipedia introduction to better understand the context and why these words stand out, for example:
- Ballard: fishing, scandinavian, phinney, delancey, chittenden. Ballard is a great place for fishing, you can appreciate its Scandinavian heritage, and Phinney seems to be a well-known spot there.
- Belltown: adults, clipper, words, caffeinated, energized. Belltown seems to be a great place for adults to hang out, and "caffeinated" probably indicates that there are lots of coffee and chocolate shops nearby.
- Capitol Hill: volunteer, admire, zipcar, fifteen, anderson. Capitol Hill has Zipcars available, and one of its famous spots is Cal Anderson Park.
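Here is a sketch of the extraction step, assuming the listings table from the earlier sketches has a neighbourhood name column and a free-text description column (both column names are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Join all listing descriptions in a neighbourhood into one "document".
# (The post restricts this to the five neighbourhoods with the most text;
# here every neighbourhood is kept for simplicity.)
docs = (
    listings.groupby("neighbourhood_cleansed")["description"]
    .apply(lambda texts: " ".join(texts.dropna()))
)

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Print the top five TF-IDF terms for each neighbourhood document.
for neighbourhood, row in zip(docs.index, tfidf.toarray()):
    top = row.argsort()[::-1][:5]
    print(neighbourhood, [terms[i] for i in top])
```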