Paris is always a good idea: Identifying the best -- and most mispriced -- Airbnb units in Paris, using BeautifulSoup, pandas, Matplotlib/Seaborn, and scikit-learn

December 15, 2021

“If you are lucky enough to have lived in Paris as a young man, then wherever you go for the rest of your life, it stays with you, for Paris is a moveable feast.” This was true for Ernest Hemingway, and it is certainly true for me. I studied abroad in Paris from February to June 2009 and it was a seminal point in my life - I fell in love with the city and the culture and grew tremendously as a person.

Since my study abroad tour de force in 2009, I have been back to the city I love several times, each time staying at an Airbnb in a different neighborhood to experience the city from a different perspective. As I look forward to 2022 and a (hopeful) return of global leisure travel, my goal is to use Python to find the most desirable Airbnb locations in the city of Paris...and later, all of Europe.

paris-header

Key takeaways -- the best Airbnb locations in Paris, with expected prices predicted by multiple regression:

For units that accommodate 4 guests or fewer (with links to visit the listing), sorted by highest expected price:

| Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
| --- | --- | --- | --- | --- | --- |
| CLASSY STUDIO TOUR EIFFEL | https://www.airbnb.com/rooms/24236332 | 2 | Palais-Bourbon | 150 | 430.31 |
| Duplex 110 m2, 3 bedrooms, heart of marais | https://www.airbnb.com/rooms/1852465 | 4 | Hôtel-de-Ville | 250 | 343.46 |
| Cœur de paris | https://www.airbnb.com/rooms/17664057 | 4 | Popincourt | 200 | 337.15 |
| Nice apartment/balcony/Montmartre Opera | https://www.airbnb.com/rooms/6058893 | 4 | Opéra | 210 | 319.94 |
| Beautiful XIX c. Style Apartment | https://www.airbnb.com/rooms/348747 | 4 | Élysée | 260 | 319.03 |

For units that accommodate 6 guests or fewer:

| Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
| --- | --- | --- | --- | --- | --- |
| Superb 4 bdrms 185m2 Apt Trocadero | https://www.airbnb.com/rooms/2249308 | 6 | Passy | 650 | 463.18 |
| A typical Parisian apartment: | https://www.airbnb.com/rooms/10800486 | 6 | Vaugirard | 300 | 454.73 |
| Flat 234m2 , 4 bedrooms, near Trocadéro | https://www.airbnb.com/rooms/19903973 | 6 | Passy | 280 | 439.52 |
| Nice family place near Trocadero | https://www.airbnb.com/rooms/3487818 | 6 | Passy | 290 | 434.46 |
| NEW-MAGIC VIEW - PARIS 7 VARENNE- | https://www.airbnb.com/rooms/9353416 | 6 | Palais-Bourbon | 450 | 397.57 |

For all units - no limitation on guest count:

| Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
| --- | --- | --- | --- | --- | --- |
| Luxury 190 m² "Pied à terre" in the Heart of Paris | https://www.airbnb.com/rooms/15658427 | 14 | Bourse | 1,386 | 804.27 |
| République, renovated 2016 6 bdrms,3WC,3 bathrooms | https://www.airbnb.com/rooms/13243111 | 16 | Entrepôt | 280 | 712.90 |
| Luxury 6Bdr 5Bth in Heritage Building - LOUVRE VIEW | https://www.airbnb.com/rooms/35984086 | 12 | Louvre | 810 | 682.61 |
| Maison Parisienne - Central Paris | https://www.airbnb.com/rooms/16905683 | 10 | Popincourt | 707 | 660.66 |
| Wonderful house in the center of Paris | https://www.airbnb.com/rooms/20798740 | 10 | Popincourt | 720 | 653.59 |

* Price refers to the listed price at the time the data was scraped.
Expected price is the output of the regression models, which use the following predictors: number of guests, number of beds, number of bedrooms, availability in the past 90 days, and guest review scores (scale of 1-5) for overall rating, accuracy, cleanliness, location, and value.

The raw code working file for data preprocessing and data exploration can be viewed here.
The raw code working file for regression analysis can be viewed here.

The Datasets:

The dataset used for this project comes from insideairbnb.com, an anti-Airbnb lobbying group that scrapes listings, reviews, and calendar data for multiple cities around the world. The data I used was scraped on September 30, 2021. My initial dataset comprised 50,133 Airbnb listings located in Paris. After data cleaning, it included 50 variables (15 numeric, 34 categorical, 1 datetime) across the 50,133 observations, and 2.8% of the cells were null or missing (71,154 missing cells).

For a detailed summary of the Paris data set please click here.

I extended the analysis to include Airbnb listings for additional European cities on the platform. This dataset has the same number of variables (15 numeric, 34 categorical, 1 datetime) and 280,558 observations for 20 additional cities, from Amsterdam to Vienna.

For a cursory look at the Paris dataset, I have aggregated and visualized the units by neighborhood (in Paris these are referred to as 'arrondissements') and provided summary statistics, including unit count and average price by arrondissement. The box plot below depicts the distribution of nightly listing prices. The Saint-Germain-des-Prés/Odéon and Le Marais arrondissements have the highest average nightly prices, which is no surprise, as these areas are arguably the prettiest and the most culturally important.

paris-arr-code
paris-hood-summary
paris-hood-boxplot-code
paris-hood-boxplot
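
A minimal sketch of the per-neighborhood summary described above, not the original notebook code. The column names `neighbourhood_cleansed` and `price` are assumptions, and the rows are a small toy sample rather than the real listings file.

```python
import pandas as pd

# Toy stand-in for the listings data; column names are assumed, values are illustrative
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["Passy", "Passy", "Louvre", "Louvre", "Louvre"],
    "price": [290.0, 280.0, 810.0, 150.0, 120.0],
})

# Unit count and price statistics per arrondissement, most expensive first
hood_summary = (
    listings.groupby("neighbourhood_cleansed")["price"]
    .agg(unit_count="count", avg_price="mean", median_price="median")
    .sort_values("avg_price", ascending=False)
)
print(hood_summary)
```

With the same dataframe, a box plot of nightly prices by neighborhood could be drawn with something like `sns.boxplot(data=listings, x="price", y="neighbourhood_cleansed")` from Seaborn.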

Below I used longitude and latitude data to plot all Airbnb units, color-coded by arrondissement.

paris-map-code
airbnb-color-code
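
One way such a map could be drawn with plain Matplotlib, shown as a sketch only. The column names (`longitude`, `latitude`, `neighbourhood_cleansed`) are assumptions and the four coordinates are illustrative points in central Paris, not rows from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative coordinates only; column names are assumed
listings = pd.DataFrame({
    "longitude": [2.2945, 2.3522, 2.3376, 2.3622],
    "latitude": [48.8584, 48.8566, 48.8606, 48.8590],
    "neighbourhood_cleansed": ["Palais-Bourbon", "Hôtel-de-Ville", "Louvre", "Hôtel-de-Ville"],
})

fig, ax = plt.subplots(figsize=(8, 6))
# One scatter call per arrondissement so each gets its own color and legend entry
for hood, group in listings.groupby("neighbourhood_cleansed"):
    ax.scatter(group["longitude"], group["latitude"], s=10, label=hood)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(title="Arrondissement", fontsize="small")
```

Calling `fig.savefig("paris_map.png")` afterwards would write the figure to disk; the file name is arbitrary.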

I have also conducted similar analysis for several other EU-27 cities on the Airbnb platform, with the goal of using regression models to find the units with the highest expected prices, as well as to identify the biggest gaps between expected price and actual price. Please see links below:


Data acquisition

The underlying data I used for my analysis is hosted on insideairbnb.com (which is not affiliated with Airbnb the company). To acquire the data in an efficient and repeatable way, I wrote a function that relies on the BeautifulSoup package, which is used for scraping information from websites. My function pulled the data from the website and saved it as a csv file labeled with the date of each data release (the insideairbnb.com site publishes new listings data on a monthly basis). The code is reproduced below:

paris-data-function
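
A hedged sketch of the link-finding part of such a function, not the original code. The helper name `find_listing_urls`, the `data.example.com` host, and the assumed URL layout (`.../<city>/<YYYY-MM-DD>/data/listings.csv.gz`) are all illustrative; the sample HTML stands in for the downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

def find_listing_urls(html, city):
    """Return sorted (scrape_date, url) pairs for a city's listings files found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        parts = href.split("/")
        # Assumed layout: .../<country>/<region>/<city>/<YYYY-MM-DD>/data/listings.csv.gz
        if href.endswith("listings.csv.gz") and city in parts:
            scrape_date = parts[parts.index(city) + 1]
            results.append((scrape_date, href))
    return sorted(results)

# Hypothetical page content; hosts and dates are made up for illustration
sample_html = """
<a href="http://data.example.com/france/ile-de-france/paris/2021-09-30/data/listings.csv.gz">listings</a>
<a href="http://data.example.com/france/ile-de-france/paris/2021-07-08/data/listings.csv.gz">listings</a>
<a href="http://data.example.com/austria/vienna/vienna/2021-09-11/data/listings.csv.gz">listings</a>
<a href="http://data.example.com/france/ile-de-france/paris/2021-09-30/data/reviews.csv.gz">reviews</a>
"""
urls = find_listing_urls(sample_html, "paris")
print(urls)
```

In a live run, the HTML would come from fetching the site's data page, and each URL could be read with `pd.read_csv(url)` and written back out with `DataFrame.to_csv` under a file name that includes the scrape date.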

Next, I wrote functions that converted all csv files from the prior function to pandas dataframes, and then appended these monthly dataframes into a master dataframe for each city. These functions also print summary information for each monthly dataframe as it is loaded into the master dataframe. The output below shows that the December 2012 dataframe contains 65,917 listings with an average price of 114.28 per night, whereas the July 2021 dataframe contains 51,040 listings with an average price of 124.53 per night.

paris-df-function
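
The load-and-append step can be sketched roughly as follows. The function name `build_master`, the `scrape_date` column, and the in-memory CSV text are assumptions made so the example is self-contained; real use would pass file paths instead.

```python
import io
import pandas as pd

def build_master(monthly_csvs):
    """Combine {scrape_date: csv_source} into one dataframe, printing a summary per month."""
    frames = []
    for scrape_date, source in sorted(monthly_csvs.items()):
        df = pd.read_csv(source)
        df["scrape_date"] = pd.to_datetime(scrape_date)  # tag rows with their release date
        print(f"{scrape_date}: {len(df):,} listings, average price {df['price'].mean():.2f}")
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Tiny made-up monthly files standing in for the saved csv files
monthly_csvs = {
    "2021-07-01": io.StringIO("id,price\n1,100\n2,150\n"),
    "2021-06-01": io.StringIO("id,price\n1,90\n2,110\n3,130\n"),
}
master = build_master(monthly_csvs)
```

Collecting the monthly frames in a list and calling `pd.concat` once is generally cheaper than appending to the master dataframe inside the loop.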

Regression Analysis

To predict the price of each Airbnb unit, I relied on regression analysis and the scikit-learn package. Getting to this step required a significant amount of data cleaning and pre-processing; please see the Appendix sections for more detail. The key independent variables in my regression analysis were the number of guests accommodated, number of bedrooms, number of beds, availability in the past 90 days, and user review scores for accuracy, cleanliness, location, value, and overall experience. I ran several regressions for Paris (and all other European cities), constrained by number of guests, with the goal of predicting the expected price of each unit.

paris-join-review-data
paris-regress-code
paris-regress-coeff
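
A minimal sketch of this kind of multiple regression with scikit-learn, under stated assumptions: the data below is synthetic (generated with a known price formula), the column names are guesses at the dataset's naming, and only a subset of the review scores is included. It is not the original model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic listings; the real analysis used the cleaned Paris dataset
rng = np.random.default_rng(0)
n = 500
listings = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bedrooms": rng.integers(1, 5, n),
    "beds": rng.integers(1, 6, n),
    "availability_90": rng.integers(0, 91, n),
    "review_scores_rating": rng.uniform(3, 5, n),
})
# Made-up price formula plus noise, so the fitted coefficients can be sanity-checked
listings["price"] = (
    20
    + 25 * listings["accommodates"]
    + 30 * listings["bedrooms"]
    + 10 * listings["review_scores_rating"]
    + rng.normal(0, 15, n)
)

features = ["accommodates", "bedrooms", "beds", "availability_90", "review_scores_rating"]
X_train, X_test, y_train, y_test = train_test_split(
    listings[features], listings["price"], test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
coefficients = pd.Series(model.coef_, index=features)
r2 = model.score(X_test, y_test)  # R^2 on listings the model has not seen

# Expected price for every listing, stored next to the listed price
listings["expected_price"] = model.predict(listings[features])
print(coefficients.round(2))
print(f"R^2 on held-out listings: {r2:.3f}")
```

Restricting the fit to a guest-count band (for example `listings[listings["accommodates"] <= 4]`) before splitting would mirror the constrained regressions described above.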

From there I was able to use the pandas 'query' function to find the units with the highest predicted prices, taking into account published user reviews and features such as the number of guests accommodated. See below for the top 5 highest expected nightly prices. It makes intuitive sense that the predicted price is positively correlated with the number of guests and the number of bedrooms/beds. Higher predicted prices also tend to be positively correlated with better review scores, but the number of reviews and the length of availability have an inconsistent effect on the predicted price. It's also important to point out that certain optional features such as 'instant bookable' and 'superhost' resulted in higher predicted prices. The presence of advertised amenities such as 'air conditioning', 'internet connectivity' and 'tv', among others, was also positively correlated with price.

paris-regress-highest
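
As a sketch of the `query` step, the snippet below filters by guest count and keeps the highest expected prices. The four rows are taken from the key-takeaways tables at the top of this post; the column names (`name`, `accommodates`, `expected_price`) are assumptions.

```python
import pandas as pd

# Four rows from the summary tables above; column names are assumed
listings = pd.DataFrame({
    "name": [
        "CLASSY STUDIO TOUR EIFFEL",
        "Cœur de paris",
        "A typical Parisian apartment:",
        "Maison Parisienne - Central Paris",
    ],
    "accommodates": [2, 4, 6, 10],
    "price": [150, 200, 300, 707],
    "expected_price": [430.31, 337.15, 454.73, 660.66],
})

# Units for 4 guests or fewer, ranked by expected nightly price
top_small = listings.query("accommodates <= 4").nlargest(5, "expected_price")
print(top_small[["name", "accommodates", "price", "expected_price"]])
```

Changing the condition to `accommodates <= 6`, or dropping the `query` call altogether, gives the other two rankings.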

For those seeking more affordability and better value, I was also able to identify the units with the biggest gap between the predicted price from the regression and the listed price:

paris-regress-discount
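
The gap can be computed as expected price minus listed price, keeping only units listed below their prediction. This is a sketch using four rows from the tables at the top of the post; the `discount` column name is hypothetical.

```python
import pandas as pd

# Four rows from the summary tables above; column names are assumed
listings = pd.DataFrame({
    "name": [
        "CLASSY STUDIO TOUR EIFFEL",
        "Cœur de paris",
        "A typical Parisian apartment:",
        "Maison Parisienne - Central Paris",
    ],
    "price": [150, 200, 300, 707],
    "expected_price": [430.31, 337.15, 454.73, 660.66],
})

# Positive discount means the unit is listed below its predicted price
listings["discount"] = listings["expected_price"] - listings["price"]
best_value = listings.query("discount > 0").nlargest(5, "discount")
print(best_value[["name", "price", "expected_price", "discount"]])
```

A unit such as the last row, listed above its expected price, gets a negative discount and is filtered out.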

Exploratory data analysis

As part of my analysis, and before using scikit-learn to predict nightly prices, I conducted some exploratory data analysis on my dataset. The analysis falls into a few themes, largely dictated by the type of variable. Below I show code and analysis for time series data, numerical data, categorical variables, and boolean features.

Time Series

Observations:

airbnb-first-host-date
airbnb-time-series-1
airbnb-time-series-2
airbnb-price-yoy
airbnb-price-sheet

Numerical features

Observations:

airbnb-describe
airbnb-distribution
airbnb-distribution2
airbnb-price_accommodations

Categorical features

airbnb-pearson-code
airbnb-pearson-plot-code
airbnb-coffee-correlation-code
paris-bedroom-corr-code
paris-bedroom-corr-plot
paris-accomodates-corr-code
paris-accommodates-corr-plot
paris-beds-corr-code
paris-beds-corr-plot

Boolean features

airbnb-coffee-machine-correlation
airbnb-coffee-correlation
paris-instant-corr-code
paris-instant-corr-plot
paris-airc-corr-code
paris-airc-corr-plot
paris-tv-corr-code
paris-tv-corr-plot

Appendix 1: Cleaning and pre-processing - raw data (n=50,133)

The dataset is quite extensive and does have some limitations; to facilitate analysis I started by importing the necessary Python packages and libraries, then explored and ultimately cleaned the data.

Importing the libraries and data

import-paris
Importing the file, which contains 50,133 Airbnb listings

Dropping initial columns

To simplify the dataset, I dropped columns that did not prove helpful in predicting price. I also checked for missing data and dropped the columns in which a majority of values were null.

drop-initial
isna-sum
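
A small sketch of the majority-null check, assuming a 50% threshold (the exact cutoff used in the original is not stated). The toy columns are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame in which one column is mostly empty
listings = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0],
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "license": [np.nan, np.nan, np.nan, "75101"],
})

# Share of missing values per column
null_share = listings.isna().mean()
print(null_share)

# Drop the columns where more than half of the values are missing
mostly_null = null_share[null_share > 0.5].index
cleaned = listings.drop(columns=mostly_null)
```

`listings.isna().sum()` gives the same information as raw counts rather than shares.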

Next I checked the boolean and categorical columns to see whether any of them were worth including:

airbnb-hist

From the above, it can be seen that several columns only contain one category and can be dropped.
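
Besides reading the histograms, the single-category columns can also be found programmatically. A minimal sketch, with made-up column names and values:

```python
import pandas as pd

# Toy frame: one column never varies, so it cannot help predict price
listings = pd.DataFrame({
    "price": [100, 150, 200],
    "has_availability": ["t", "t", "t"],
    "instant_bookable": ["t", "f", "f"],
})

# Columns with exactly one distinct value (missing values counted as a value)
single_valued = [
    col for col in listings.columns
    if listings[col].nunique(dropna=False) == 1
]
trimmed = listings.drop(columns=single_valued)
print(single_valued)
```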

Description of each column:

Cleaning individual variables

After identifying those variables that I expected to prove most useful in my analysis, I still had to clean the data and convert some data to numerical data types.

Observations:

host-response-time
host-response-time-plot
host-response-rate
host-response-bins
host-response-rate-plot
property-type
property-type-replace
property-type-plot
amenities
amenities-drop
price-airbnb
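
As an illustration of the type conversions involved, the sketch below turns text columns into numbers. It assumes prices are stored as strings like "$1,386.00" and response rates as strings like "95%"; those formats are assumptions about the raw file, not something shown in this post.

```python
import pandas as pd

# Assumed raw formats: currency strings and percentage strings
raw = pd.DataFrame({
    "price": ["$1,386.00", "$150.00", None],
    "host_response_rate": ["95%", None, "100%"],
})

clean = raw.copy()
# Strip the currency symbol and thousands separator, then convert to float
clean["price"] = pd.to_numeric(
    raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
# Strip the percent sign and rescale to a 0-1 fraction
clean["host_response_rate"] = (
    pd.to_numeric(raw["host_response_rate"].str.rstrip("%"), errors="coerce") / 100
)
print(clean)
```

Missing entries stay missing after conversion, so they can be binned or imputed in a later step.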

Appendix 2: Feature analysis

paris-pairplot
paris-bedroom-corr

Appendix 3: Regression analysis

paris-linear-reg
paris-linear-reg-plot
paris-linear-reg-plot
paris-linear-reg

Multiple regression, which draws on several features from the dataset, proved a much more reliable way to predict unit prices.

airbnb-multi-reg-prep
airbnb-multi-reg-code
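
One way to check a claim like this is to compare held-out R² for a single-feature model against a multi-feature model. The sketch below does so on synthetic data with an invented price formula, so the numbers say nothing about the real Paris results; the helper `held_out_r2` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on three features plus noise
rng = np.random.default_rng(1)
n = 400
data = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bedrooms": rng.integers(1, 5, n),
    "review_scores_value": rng.uniform(3, 5, n),
})
data["price"] = (
    30 * data["accommodates"]
    + 40 * data["bedrooms"]
    + 15 * data["review_scores_value"]
    + rng.normal(0, 20, n)
)

train, test = train_test_split(data, test_size=0.25, random_state=1)

def held_out_r2(columns):
    """Fit on the training rows and score R^2 on the test rows."""
    model = LinearRegression().fit(train[columns], train["price"])
    return model.score(test[columns], test["price"])

r2_simple = held_out_r2(["accommodates"])
r2_multi = held_out_r2(["accommodates", "bedrooms", "review_scores_value"])
print(f"simple: {r2_simple:.3f}  multiple: {r2_multi:.3f}")
```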