Understanding Seattle Airbnb Home Data
The data I used comes from a Udacity project. It covers Seattle Airbnb listings from 2016, including the listing price, amenities, neighbourhood, host information, and review scores.
As a beginner in data science, I defined and aimed to answer three questions:
- What are the busiest times of the year to visit Seattle?
- Can we use other listing information to predict the listing price?
- What is the vibe of each Seattle neighbourhood, based on listing descriptions?
Price Fluctuation
The first question is quite straightforward as a visualization question. Looking at the listing price over time, we find that it fluctuates in the same pattern every week: as we might intuitively guess, the price goes up at the weekend because demand increases then.
When we smooth out the weekly effect by looking at the average listing price per week, we find that January and February sit at the bottom of the price range (around $117); the price then gradually increases and peaks in July (around $157), so the start-of-year price is about 25% below the summer peak. Further analysis of the outliers was done to make sure the trend stays consistent.
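As a rough sketch of this step, the weekly and monthly averages can be computed with pandas. The file name and column format are assumptions on my part (the calendar data typically stores prices as strings like "$85.00"):

```python
import pandas as pd

# Assumed input: a calendar file with `date` and `price` columns,
# where price is a string such as "$85.00".
calendar = pd.read_csv("calendar.csv", parse_dates=["date"])
calendar["price"] = (
    calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Average listed price by day of week (the weekend bump)
# and by month (the January trough and July peak).
by_weekday = calendar.groupby(calendar["date"].dt.day_name())["price"].mean()
by_month = calendar.groupby(calendar["date"].dt.month)["price"].mean()
print(by_weekday)
print(by_month)
```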
Predicting the price
The second question is to build a model that predicts the listing price from other listing information. We can first take a look at the distribution of prices: it is slightly right skewed, and a large proportion of listings are priced under $200 per night.
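A minimal sketch of that distribution check, assuming the listings file also stores the price as a string with a dollar sign (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: a listings file with a string `price` column such as "$120.00".
listings = pd.read_csv("listings.csv")
price = listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Right-skewed distribution: most listings fall under $200 per night.
price.plot(kind="hist", bins=50)
plt.xlabel("Price per night ($)")
plt.ylabel("Number of listings")
plt.show()
```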
Before feeding the data into models for training, we first apply different cleaning steps to the numeric and categorical variables: filling in null values, scaling the numeric features, and encoding the categorical ones. We then perform principal component analysis (PCA) to reduce dimensionality while keeping most of the variance in the data.
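Here is a sketch of how that cleaning-plus-PCA step might look with scikit-learn; the column lists are hypothetical placeholders, not the exact set used in the analysis:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA

# Hypothetical column lists -- the real analysis uses whichever
# listing columns survived the initial filtering.
numeric_cols = ["accommodates", "bathrooms", "bedrooms", "review_scores_rating"]
categorical_cols = ["neighbourhood_cleansed", "room_type", "property_type"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric variables: fill nulls with the median, then scale.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical variables: fill nulls with the most frequent level,
        # then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # force a dense matrix so PCA can consume it
)

# PCA keeps enough components to retain ~95% of the variance.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=0.95)),
])
```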
By trying different supervised learning models, including linear regression, lasso regression, and random forest, and using RMSE and R² to evaluate performance on a 25% test set (see the sketch after this list), we finally chose lasso regression (alpha = 0.8) as the final model, with RMSE 3580.413 and R² 0.571. Even though it is the best model so far, the results could be further improved by:
- Including some of the variables that I excluded from the analysis (I still need to learn how to process those features).
- Combining existing variables to construct new, more meaningful features.
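A sketch of the model comparison described above, reusing the pipeline and cleaned price series from the earlier sketches (the exact feature set and random seeds are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Features come from the cleaning/PCA pipeline sketched above; the target
# is the cleaned numeric price. 25% of the rows are held out for testing.
X = pipeline.fit_transform(listings[numeric_cols + categorical_cols])
y = price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "linear regression": LinearRegression(),
    "lasso (alpha=0.8)": Lasso(alpha=0.8),
    "random forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE={rmse:.2f}, R^2={r2_score(y_test, preds):.3f}")
```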
Neighbourhood Vibe
The third and last question is about the vibe of each neighbourhood, which involves text-mining techniques I had not touched before. Referring to the book Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, I found a useful metric called TF-IDF (Term Frequency–Inverse Document Frequency), which indicates how important a word is to a specific document.
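As a rough illustration of the metric (a simple textbook variant; libraries such as scikit-learn apply extra smoothing):

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, given the collection `docs`
    (each document is a list of tokens): term frequency times
    the log of the inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf
```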
I first extracted the five neighbourhoods with the highest word counts as the targets of the analysis. There is not a great deal of text to mine, but it still serves as good practice for the technique.
Afterwards, I calculated the TF-IDF values and listed the top five words for each of these neighbourhoods (a code sketch of this step follows the examples below). We can also compare the words with each neighbourhood's Wikipedia introduction to better understand the context and why these words stand out, for example:
- Ballard: fishing, scandinavian, phinney, delancey, chittenden. Ballard is a great place for fishing, you can appreciate its Scandinavian heritage, and Phinney seems to be a well-known spot there.
- Belltown: adults, clipper, words, caffeinated, energized. Belltown seems to be a great place for adults to hang out, and "caffeinated" probably indicates that there are lots of coffee and chocolate shops nearby.
- Capitol Hill: volunteer, admire, zipcar, fifteen, anderson. Capitol Hill has Zipcars available, and one of its famous spots is Cal Anderson Park.
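Here is a sketch of the extraction step, assuming the listings table from the earlier sketches has a neighbourhood name column and a free-text description column (both column names are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Join all listing descriptions in a neighbourhood into one "document".
# (The post restricts this to the five neighbourhoods with the most text;
# here every neighbourhood is kept for simplicity.)
docs = (
    listings.groupby("neighbourhood_cleansed")["description"]
    .apply(lambda texts: " ".join(texts.dropna()))
)

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Print the top five TF-IDF terms for each neighbourhood document.
for neighbourhood, row in zip(docs.index, tfidf.toarray()):
    top = row.argsort()[::-1][:5]
    print(neighbourhood, [terms[i] for i in top])
```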