Summary #
I’ve built a Streamlit web app that estimates the nightly price of a short-term rental in London, UK, as of December 2024, based on the property’s characteristics and location. It uses data from Inside Airbnb, a data and advocacy website about Airbnb’s impact on residential communities, and from other public data sources. The estimator takes as input several major property features, the borough where the property is located, the property’s yearly availability, the number of days since the last review (if any), the distance to the nearest Tube station, the local amenities near the property, and the borough crime rate. If you’re curious, take a look at the web app at the link below.
Short-Term Rental Price Estimator • Streamlit
Data sources #
Because the terms of service of major UK estate agents don’t permit web scraping, I decided to use the Inside Airbnb website and filter the data to short-term rentals of entire flats or buildings in London, UK. The data set is anonymously scraped from Airbnb host profiles in a number of major international cities. The data for London can be found on the Inside Airbnb page for London. The specific data set used in this analysis was scraped on 11 December 2024. Because of planning regulations in the Greater London area, made to protect communities and keep homes available, short-term rentals are limited to 90 nights per year.
For the analysis I used several features of the Inside Airbnb data set, notably the borough of the property, the property and room types, the number of people the property can accommodate, the number of bedrooms and bathrooms, the price per night, the availability of the property over the last year, the number of days since the last review (if any), and the latitude and longitude of the property. For the purposes of anonymity, these geographic coordinates are randomly offset by up to 150 meters.
Beyond the main data source from Inside Airbnb, I enriched the data with the crime rate per borough, the distance from the property to the nearest Tube station, and the local amenities in the vicinity of the property. A more detailed list of these data sources, along with a section on the data preparation, can be found in the GitHub repository of this work.
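To illustrate the distance-to-nearest-station feature, here is a minimal sketch that uses the haversine (great-circle) formula. This is an assumption on my part about how such a feature could be computed, not necessarily the method used in the repository. The function names and the station coordinates are hypothetical placeholders.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_km) of the station closest to the given point."""
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical stations with made-up coordinates, for illustration only.
stations = {
    "Station A": (51.5000, -0.1200),
    "Station B": (51.5300, -0.1000),
}

name, dist_km = nearest_station(51.5010, -0.1210, stations)
print(f"Nearest: {name} at {dist_km:.3f} km")
```

Since the listing coordinates are randomly offset for anonymity, any distance computed this way carries some positional noise.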
Exploratory data analysis #
A few initial observations can already be gleaned from the series of histograms in Figure 1. The top two histograms of latitude and longitude show bimodal distributions. In the first plot this can be ascribed primarily to the River Thames, but it is harder to explain in the second. It could be due to the presence of more property listings around the major London parks, mostly found in the east and west of the city.
The price histogram shows that the number of listings drops sharply as the nightly price increases, giving a heavily right-skewed distribution. I chose to limit the x axis to £1000, but there are several outliers priced even higher. These outliers are nevertheless kept in the model analysis. A similar distribution is visible in the plot for the number of days since the last review.
Borough plots #
Figure 2 shows that most short-term rentals are in the borough of Westminster, with Kensington & Chelsea, Camden, and Tower Hamlets in second, third, and fourth place respectively. The borough with the fewest rentals is Sutton.
As for the median price, the borough with the highest median price per rental is, unsurprisingly, Kensington & Chelsea, which has the most exclusive and expensive properties in the city. It is followed closely by Westminster, Camden and Lambeth.
Model generation #
A few regression algorithms from Scikit-Learn, plus the XGBoost library, were used to model the data. These were linear regression, random forest regressor, stochastic gradient descent regressor, support vector regressor and XGBoost regressor. The algorithms were compared on their root mean squared error (RMSE), where lower is better. The support vector regressor achieved the best mean RMSE of 0.37884, which was determined using 10-fold cross-validation. More information on the model generation can be found in the corresponding section in the GitHub repository.
| Model | RMSE mean | RMSE std. dev |
|---|---|---|
| Linear Regression | 0.39468 | 0.03332 |
| Random Forest Regressor | 0.38261 | 0.03354 |
| Stochastic Gradient Descent | 0.39836 | 0.03358 |
| Support Vector Regressor | 0.37884 | 0.03443 |
| XGBoost Regressor | 0.39484 | 0.02909 |
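A comparison like the one in the table can be sketched with Scikit-Learn's `cross_val_score`. This is only an illustration under my own assumptions: it runs on synthetic data from `make_regression` rather than the rental data, the scaling pipelines and hyperparameters are my choices, and XGBoost is left out so that the snippet depends on Scikit-Learn alone.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real feature matrix and target (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
    # SGD and SVR are sensitive to feature scale, so they are wrapped with a scaler.
    "Stochastic Gradient Descent": make_pipeline(StandardScaler(), SGDRegressor(random_state=0)),
    "Support Vector Regressor": make_pipeline(StandardScaler(), SVR()),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for model_name, model in models.items():
    # Scikit-Learn maximizes scores, so RMSE is returned negated.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    rmse = -scores
    results[model_name] = (rmse.mean(), rmse.std())

for model_name, (rmse_mean, rmse_std) in results.items():
    print(f"{model_name}: RMSE mean={rmse_mean:.5f}, std. dev={rmse_std:.5f}")
```

The numbers printed here come from the synthetic data and are not comparable to the values in the table.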
The RMSE for the support vector regressor using the test data set was slightly better at 0.36268. A grid search analysis on the support vector regressor produced the best RMSE value with C=1.0 and epsilon=0.1, which are the default values for the support vector regressor.
| Metric | Value |
|---|---|
| RMSE | 0.36268 |
| \(R^2\) | 0.73458 |
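The grid search over `C` and `epsilon` mentioned above can be sketched with `GridSearchCV`. Again this is a minimal illustration on synthetic data. The parameter grid, the 5-fold setting, and the scaling step are my assumptions and not necessarily what was used for the results reported here.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real training data (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("svr", SVR())])

# Hypothetical grid that includes the SVR defaults C=1.0 and epsilon=0.1.
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_  # undo the sign flip applied by the scorer
print("Best parameters:", search.best_params_)
print(f"Best cross-validated RMSE: {best_rmse:.5f}")
```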
Ordinary least squares (OLS) from Statsmodels allows us to calculate the F-statistic, which tests whether at least one predictor is associated with the outcome. In the regression results the F-statistic is 741.5, which is much greater than 1 and gives strong evidence of an association between at least one predictor and the outcome.
The mean price is £154.24 with a residual standard error of £55.87, giving a range from £98.37 to £210.11. The residual standard error is 36.2% of the mean. This is the typical size of the estimation error relative to the mean price.
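For reference, the F-statistic of a linear regression with \(n\) observations and \(p\) predictors is usually written as below, where TSS is the total sum of squares and RSS is the residual sum of squares. The values of \(n\) and \(p\) are not given in this post, so the formula is kept symbolic.

\[
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)} = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}
\]

If no predictor were associated with the outcome, \(F\) would be expected to be close to 1, which is why a value in the hundreds is read as strong evidence of an association.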
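The figures above follow from simple arithmetic on the mean and the residual standard error:

\[
154.24 - 55.87 = 98.37, \qquad 154.24 + 55.87 = 210.11, \qquad \frac{55.87}{154.24} \times 100\% \approx 36.2\%
\]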
Conclusions #
The project was really enjoyable, and the part I liked the most was creating a new data set by enriching the Inside Airbnb data with other data sources. Once the model was generated, I set up an interactive web app with Streamlit that lets users estimate the nightly rental price of a property from the features described above. Check it out at:
Short-Term Rental Price Estimator • Streamlit
Enjoy!