Summary #
I’ve built a Streamlit web app that estimates the short-term rental price of a property in London, UK, for December 2024 from its characteristics and location. It uses data from Inside AirBNB, a website that provides data and advocacy about Airbnb’s impact on residential communities, and from other public data sources. The app takes as input several major property features, the borough where the property is located, the yearly availability of the property, the number of days since the last review (if any), the distance to the nearest Tube station, the local amenities near the property, and the borough crime rate. If you’re curious, take a look at the web app at the link below.
Short-Term Rental Price Estimator • Streamlit
Data sources #
Because the terms of service of major UK home realtors don’t permit web scraping, I decided to use the Inside AirBNB website and filter the data on short-term rentals of entire flats or buildings for London, UK. For the purposes of this study, short-term rentals are those where the overnight stay is from 1 to 999 days.
I decided to use just a few of the features in the data set: the borough where the property is located, the property and room types, the number of people the property can accommodate, the number of bedrooms and bathrooms, the price per night, the availability of the property over the last year, the number of days since the last review (if any), and the latitude and longitude of the property, which for the purposes of anonymity are randomly offset by up to 150 meters.
The data were also enriched by adding the crime rate per borough, the distance from the property to the nearest Tube station, and the local amenities in the vicinity of the property. Beyond the main source of data, several other sources were used for data enrichment. They are listed below.
Short-term housing data #
The bulk of the data comes from the Inside AirBNB website, a project that allows local communities to understand, decide and control the role of renting residential homes to tourists. The data set is anonymously scraped from AirBNB host profiles in a number of major international cities. The data for the city of London can be found at the webpage for London. The specific data set used in this analysis was scraped on 11 December 2024.
Crime data #
The crime rate data by London borough were retrieved from the CrimeRate webpage for the Greater London Crime Statistics website. They cover the crime rate in each borough over the period from October 2023 to September 2024.
Transport data #
I added the distance from each rental unit to the closest Tube station. I first extracted the geographical coordinates of each Tube station from the StopPoint endpoint of the Transport for London (TfL) developer API, and then calculated the distance from the rental unit to the nearest station using the GeoPy Python package.
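The nearest-station lookup can be sketched as follows. This is only an illustration under stated assumptions: it uses a hand-written haversine formula from the standard library as a stand-in for GeoPy (the post does not say which GeoPy function was used), and the station coordinates and the listing location are approximate, made-up values rather than data from the TfL API.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(listing, stations):
    """Return (station_name, distance_km) for the station closest to the listing."""
    lat, lon = listing
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )


# Approximate, illustrative coordinates (hypothetical input, not TfL output).
stations = {
    "Oxford Circus": (51.5152, -0.1419),
    "Canary Wharf": (51.5036, -0.0186),
}
name, dist_km = nearest_station((51.5145, -0.1400), stations)
```

A brute-force scan over all stations is enough at this scale; a spatial index would only matter for far larger station sets.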
Amenities data #
I also decided to retrieve data on the amenities located in the vicinity of the rental properties using Foursquare. Specifically, I used the Place/Search endpoint, the details of which can be found here. At most three amenity categories are retrieved for each property location using the Foursquare API, and each is then mapped to one of ten broad category types. These can be viewed in the web app under any of the Nearby amenity category drop-down menus.
Data preparation #
Null values were removed from the data set, and only properties reviewed within the last six months were retained. In addition, only properties that were occupied for at least 90 days in the past year were kept, along with just the most frequent property types, those present at least 30 times. These types can be viewed and selected in the web app under the Property Type drop-down menu.
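As a rough sketch of these filters in plain Python: the field names (`last_review`, `days_occupied`, `property_type`) are hypothetical, "six months" is approximated as 183 days before the scrape date, and the order of the steps is an assumption, since the post does not specify any of these details.

```python
from collections import Counter
from datetime import date, timedelta

SCRAPE_DATE = date(2024, 12, 11)     # scrape date given in the post
REVIEW_WINDOW = timedelta(days=183)  # assumption: "six months" taken as 183 days
MIN_OCCUPIED_DAYS = 90
MIN_TYPE_COUNT = 30


def prepare(listings):
    """Apply the four filters to a list of dict rows (hypothetical schema)."""
    # 1. Drop rows with any missing value.
    rows = [r for r in listings if all(v is not None for v in r.values())]
    # 2. Keep listings whose last review falls inside the review window.
    rows = [r for r in rows if SCRAPE_DATE - r["last_review"] <= REVIEW_WINDOW]
    # 3. Keep listings occupied for at least MIN_OCCUPIED_DAYS.
    rows = [r for r in rows if r["days_occupied"] >= MIN_OCCUPIED_DAYS]
    # 4. Keep property types that still appear at least MIN_TYPE_COUNT times.
    counts = Counter(r["property_type"] for r in rows)
    return [r for r in rows if counts[r["property_type"]] >= MIN_TYPE_COUNT]


# Tiny made-up data set: 30 valid flats plus three rows that should be dropped.
good = {"last_review": date(2024, 11, 1), "days_occupied": 120,
        "property_type": "Entire rental unit"}
listings = [dict(good) for _ in range(30)]
listings.append({**good, "property_type": "Boat"})            # rare type
listings.append({**good, "days_occupied": None})              # null value
listings.append({**good, "last_review": date(2023, 1, 1)})    # stale review
kept = prepare(listings)
```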
Exploratory data analysis #
The histogram of short-term rental prices is heavily right-skewed, so it is a good idea to replace the price feature with the logarithm of the price plus 1, log(price + 1). The transformed feature is much closer to a normal distribution than the original.
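A minimal illustration of the transform and its inverse, using made-up prices rather than values from the data set. `math.log1p` computes log(x + 1) and `math.expm1` undoes it, which is what is needed to turn a prediction on the log scale back into pounds.

```python
import math

prices = [45.0, 120.0, 199.0, 850.0, 4000.0]  # made-up nightly prices in GBP

log_prices = [math.log1p(p) for p in prices]     # log(price + 1)
recovered = [math.expm1(v) for v in log_prices]  # inverse: exp(value) - 1

# The raw prices span two orders of magnitude; the log values span far less.
raw_ratio = max(prices) / min(prices)
log_ratio = max(log_prices) / min(log_prices)
```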
The price of most short-term rentals is less than £200 per night, but the distribution has a long right tail. I limited the x axis of the histogram to £1000, although several outliers are priced even higher. These outliers are nevertheless retained in the model analysis.
Most short-term rentals are in the Westminster borough, with Kensington & Chelsea, Camden, and Tower Hamlets in second, third, and fourth places respectively. The borough with the fewest rentals is Sutton.
Cluster analysis #
An analysis of possible clusters of rental properties in London was undertaken by searching for the clustering that maximized the Silhouette coefficient. The maximum Silhouette score (0.524) is achieved with just one big cluster of properties that covers the entire city, with no other discernible subclusters. This score was obtained using DBSCAN in Scikit-Learn with eps=0.03 and min_samples=400. Fine-tuning eps and min_samples, the two most important DBSCAN parameters, does not yield more than one cluster, even at the cost of lower Silhouette scores.
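The sketch below shows how such a check might look with Scikit-Learn. It assumes the clustering was run on latitude and longitude in degrees, which the post does not state, and it uses synthetic coordinates instead of the real listings. `min_samples` is the Scikit-Learn spelling of the parameter. Note that `silhouette_score` needs at least two distinct labels, so with a single cluster the score is only defined because DBSCAN labels outliers as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for listing coordinates: one dense blob around central
# London plus a few scattered points spread over a much wider box.
core = rng.normal(loc=[51.51, -0.12], scale=0.01, size=(2000, 2))
stray = rng.uniform(low=[51.30, -0.50], high=[51.70, 0.30], size=(40, 2))
coords = np.vstack([core, stray])

labels = DBSCAN(eps=0.03, min_samples=400).fit_predict(coords)

# Count real clusters, ignoring the noise label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# The silhouette score is undefined for a single label, so guard the call.
score = silhouette_score(coords, labels) if len(set(labels)) > 1 else None
```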
Model generation #
Several regression algorithms from Scikit-Learn and XGBoost were used to model the data: linear regression, random forest regressor, stochastic gradient descent regressor, support vector regressor, and XGBoost regressor. Performance was compared on root mean squared error (RMSE), where lower is better. The support vector regressor achieved the best RMSE of 0.37884 under 10-fold cross-validation. Its RMSE on the test data set was slightly lower at 0.36268.
A grid search on the support vector regressor produced the best RMSE with C=1.0 and epsilon=0.1, which are the default values for Scikit-Learn’s support vector regressor.
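For illustration, a 10-fold cross-validated RMSE and a small grid search over C and epsilon might look like the sketch below. The data are synthetic and the grid values are assumptions, as the post does not list the values that were searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))  # synthetic, already standardized features
y = X @ np.array([0.5, -0.3, 0.2, 0.0]) + rng.normal(scale=0.3, size=300)

# 10-fold cross-validated RMSE with default SVR settings (C=1.0, epsilon=0.1).
# Scikit-Learn reports the negated error, so the sign is flipped back.
cv_rmse = -cross_val_score(
    SVR(), X, y, cv=10, scoring="neg_root_mean_squared_error"
).mean()

# Grid search over C and epsilon, scored with the same metric and folds.
grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]},
    cv=10,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
best_params = grid.best_params_
best_rmse = -grid.best_score_
```

Because the default values are part of the grid, `best_rmse` can never be worse than `cv_rmse`. When the two are equal, the search has confirmed the defaults.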
Ordinary least squares (OLS) from Statsmodels allows us to calculate the F-statistic, which tests whether at least one predictor is associated with the outcome. In the regression results the F-statistic is 741.5, which is much greater than 1 and provides strong evidence that at least one predictor is associated with the outcome.
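For reference, this is the standard textbook definition of the overall F-statistic for a linear model with an intercept, $p$ predictors and $n$ observations. The symbols are introduced here for illustration and do not appear in the original analysis.

```latex
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

If no predictor were related to the outcome, $F$ would be expected to be close to 1, which is why a value in the hundreds is read as strong evidence of an association.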
The mean and residual standard error of the price (in GBP) are 154.24 ± 55.87 (lower end 98.37, upper end 210.11). The residual standard error is 36.2% of the mean. This is roughly how far, on average, the actual price deviates from the model’s estimate, expressed relative to the mean price.
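The figures above follow from simple arithmetic on the two reported values, as this short check shows:

```python
mean_price = 154.24         # mean nightly price in GBP, from the post
residual_std_error = 55.87  # residual standard error in GBP, from the post

lower = round(mean_price - residual_std_error, 2)
upper = round(mean_price + residual_std_error, 2)
error_pct = round(100 * residual_std_error / mean_price, 1)

print(lower, upper, error_pct)  # 98.37 210.11 36.2
```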
Conclusions #
The project was really enjoyable, and the part I liked the most was creating a new data set by enriching the original data with other sources. Once the model was generated, I set up an interactive web app with Streamlit that allows users to estimate the rental price of a property from the features described above. Check it out at:
Short-Term Rental Price Estimator • Streamlit
Enjoy!
For the data analysis, the following software packages were used: Scikit-Learn (version 1.6.1), Matplotlib (version 3.10.0), Statsmodels (version 0.14.4), XGBoost (version 3.0.1), contextily (version 1.6.2), GeoPy (version 2.4.1), GeoPandas (version 1.0.1), Shapely (version 2.0.6), Streamlit (version 1.45.0), Pandas (version 2.2.3), joblib (version 1.4.2) and NumPy (version 1.26.4).