December 15, 2021
“If you are lucky enough to have lived in Paris as a young man, then wherever you go for the rest of your life, it stays with you, for Paris is a moveable feast.” This was true for Ernest Hemingway, and it is certainly true for me. I studied abroad in Paris from February to June 2009 and it was a seminal point in my life - I fell in love with the city and the culture and grew tremendously as a person.
Since my study abroad tour de force in 2009, I have been back to the city I love several times, each time staying at an Airbnb in a different neighborhood to experience the city from a different perspective. As I look forward to 2022 and a (hopeful) return of global leisure travel, my goal is to use Python to find the most desirable Airbnb listings in the city of Paris...and later, all of Europe.
For units that accommodate 4 guests or fewer (with links to visit the listing), sorted by highest expected price:
Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
---|---|---|---|---|---|
CLASSY STUDIO TOUR EIFFEL | https://www.airbnb.com/rooms/24236332 | 2 | Palais-Bourbon | 150 | 430.31 |
Duplex 110 m2, 3 bedrooms, heart of marais | https://www.airbnb.com/rooms/1852465 | 4 | Hôtel-de-Ville | 250 | 343.46 |
Cœur de paris | https://www.airbnb.com/rooms/17664057 | 4 | Popincourt | 200 | 337.15 |
Nice apartment/balcony/Montmartre Opera | https://www.airbnb.com/rooms/6058893 | 4 | Opéra | 210 | 319.94 |
Beautiful XIX c. Style Apartment | https://www.airbnb.com/rooms/348747 | 4 | Élysée | 260 | 319.03 |
For units that accommodate 6 guests or fewer:
Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
---|---|---|---|---|---|
Superb 4 bdrms 185m2 Apt Trocadero | https://www.airbnb.com/rooms/2249308 | 6 | Passy | 650 | 463.18 |
A typical Parisian apartment: | https://www.airbnb.com/rooms/10800486 | 6 | Vaugirard | 300 | 454.73 |
Flat 234m2 , 4 bedrooms, near Trocadéro | https://www.airbnb.com/rooms/19903973 | 6 | Passy | 280 | 439.52 |
Nice family place near Trocadero | https://www.airbnb.com/rooms/3487818 | 6 | Passy | 290 | 434.46 |
NEW-MAGIC VIEW - PARIS 7 VARENNE- | https://www.airbnb.com/rooms/9353416 | 6 | Palais-Bourbon | 450 | 397.57 |
For all units - no limitation on guest count:
Title | Listing url | Accommodates | Neighborhood | Price* | Expected Price |
---|---|---|---|---|---|
Luxury 190 m² "Pied à terre" in the Heart of Paris | https://www.airbnb.com/rooms/15658427 | 14 | Bourse | 1,386 | 804.27 |
République, renovated 2016 6 bdrms,3WC,3 bathrooms | https://www.airbnb.com/rooms/13243111 | 16 | Entrepôt | 280 | 712.90 |
Luxury 6Bdr 5Bth in Heritage Building - LOUVRE VIEW | https://www.airbnb.com/rooms/35984086 | 12 | Louvre | 810 | 682.61 |
Maison Parisienne - Central Paris | https://www.airbnb.com/rooms/16905683 | 10 | Popincourt | 707 | 660.66 |
Wonderful house in the center of Paris | https://www.airbnb.com/rooms/20798740 | 10 | Popincourt | 720 | 653.59 |
* Price refers to the listed price at the time the data was scraped.
Expected price is the output of regression models that use the following inputs: number of guests, number of beds, number of bedrooms, availability in the past 90 days, and guest review scores (on a scale of 1-5) for overall rating, accuracy, cleanliness, location, and value.
The raw code working file for data preprocessing and data exploration can be viewed here.
The raw code working file for regression analysis can be viewed here.
The dataset used for this project comes from Insideairbnb.com, which is actually an anti-Airbnb lobbying group that scrapes listings, reviews and calendar data for multiple cities around the world. The dataset that I used from the site was scraped on September 30, 2021. My initial dataset comprised 50,133 Airbnb listings located in Paris. After data cleansing, it included 50 variables (15 numeric, 34 categorical, 1 datetime) across the 50,133 observations, and 2.8% of the cells were null or missing (71,154 missing cells).
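The missing-data percentage is the number of missing cells divided by the total number of cells (rows times columns). A small sketch of that calculation, assuming the cleaned data sits in a pandas dataframe; the function name `missing_cell_share` is made up for illustration:

```python
import pandas as pd


def missing_cell_share(df):
    """Return (number of missing cells, share of all cells that are missing)."""
    missing = int(df.isna().sum().sum())
    return missing, missing / df.size


# Cross-check of the figures quoted above:
# 50,133 rows x 50 columns, 71,154 missing cells.
reported_share = 71_154 / (50_133 * 50)
print(f"{reported_share:.1%}")
```

The cross-check reproduces the 2.8% figure from the reported row, column, and missing-cell counts.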
For a detailed summary of the Paris data set please click here.
I extended the analysis to include Airbnb listings for additional European cities on the platform. This dataset has the same number of variables (15 numeric, 34 categorical, 1 datetime) and 280,558 observations for 20 additional cities, from Amsterdam to Vienna.
For a cursory look at the Paris dataset, below I have aggregated and visualized the units by neighborhood (in Paris these are referred to as 'arrondissements') and provided summary statistics, including unit count by arrondissement and average price. The box plot below depicts the distribution of nightly listing prices. The Saint-Germain-des-Prés/Odéon and Le Marais arrondissements have the highest average nightly prices, which is no surprise as these areas are arguably the prettiest and the most culturally important.
Below, I used longitude and latitude data to plot all Airbnb units, color-coded by arrondissement.
I have also conducted similar analysis for several other EU-27 cities on the Airbnb platform. The goal was to use regression models to find the units with the highest expected price, as well as to identify the biggest gaps between expected price and actual price. Please see links below:
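The per-neighborhood summary statistics can be sketched with a pandas `groupby`. This is an illustrative version only: the column name `neighbourhood_cleansed` and the function name are assumptions, and the sample rows simply reuse a few neighborhood/price pairs from the tables above.

```python
import pandas as pd


def neighbourhood_summary(df, by="neighbourhood_cleansed"):
    """Unit count and price statistics per neighbourhood, priciest first."""
    return (
        df.groupby(by)["price"]
        .agg(units="count", avg_price="mean", median_price="median")
        .sort_values("avg_price", ascending=False)
    )


# Tiny sample, only to show the shape of the output.
sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Passy", "Passy", "Popincourt", "Louvre"],
    "price": [280.0, 290.0, 200.0, 810.0],
})
summary = neighbourhood_summary(sample)
print(summary)
```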
The underlying data I used for my analysis is located on insideairbnb.com (which is not affiliated with Airbnb the company). In order to acquire the data in the most efficient and systematic (repeatable) way, I wrote a function that relied on the BeautifulSoup package, which is used for web scraping. My function pulled in the data from the website and then saved it as a csv file labeled with the date of each data release (the insideairbnb.com site publishes new listings data on a monthly basis). The code is reproduced below:
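As a stand-in illustration rather than the original function, here is a minimal sketch of the link-collecting part of such a scraper. It assumes the download links are ordinary `<a href>` tags whose URLs end in `/<city>/<YYYY-MM-DD>/data/listings.csv.gz`; that URL layout and the names `find_listing_links` and `output_name` are assumptions made for this example.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_listing_links(html, base_url, city):
    """Return sorted (snapshot_date, url) pairs for one city's listings files."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed URL layout: .../<city>/<YYYY-MM-DD>/data/listings.csv.gz
    pattern = re.compile(
        rf"/{re.escape(city)}/(\d{{4}}-\d{{2}}-\d{{2}})/data/listings\.csv\.gz$"
    )
    links = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])  # resolve relative links
        match = pattern.search(url)
        if match:
            links.append((match.group(1), url))
    return sorted(links)


def output_name(city, snapshot_date):
    """Build a dated csv file name such as 'paris_2021-09-30_listings.csv'."""
    return f"{city}_{snapshot_date}_listings.csv"
```

Each returned URL could then be loaded with `pandas.read_csv(url)` (which decompresses `.gz` files based on the extension) and written out with `to_csv(output_name(city, date), index=False)`.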
Next, I wrote functions that converted all csv files from the prior function to pandas dataframes, and then I appended all these monthly dataframes into a master dataframe for each city. As part of my function I also wrote code that provided summary information for each monthly dataframe that was loaded into the master dataframe. The code below shows that the December 2012 dataframe contains 65,917 listings, with an average price of 114.28 per night, whereas the July 2021 dataset contains 51,040 listings with an average price of 124.53 per night.
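A minimal sketch of this appending step (not the original code), assuming each monthly file has already been read into a dataframe with a numeric `price` column; the function name `build_master` and the `snapshot` label column are hypothetical:

```python
import pandas as pd


def build_master(monthly_frames):
    """Stack monthly snapshots into one dataframe and summarise each month.

    monthly_frames maps a snapshot label (e.g. '2021-07') to a dataframe
    that has a numeric 'price' column.
    """
    pieces, summary_rows = [], []
    for label, df in monthly_frames.items():
        # Tag every row with the snapshot it came from before stacking.
        pieces.append(df.assign(snapshot=label))
        summary_rows.append({
            "snapshot": label,
            "listings": len(df),
            "avg_price": round(df["price"].mean(), 2),
        })
    master = pd.concat(pieces, ignore_index=True)
    summary = pd.DataFrame(summary_rows)
    return master, summary
```

The `summary` frame gives one row per month with the listing count and average nightly price, which is the kind of per-month readout described above.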
In order to predict the price of each Airbnb unit, I relied on regression analysis and the scikit-learn package. But in order to get to this step, I did a significant amount of data cleaning and pre-processing - please see the Appendix section for more detail. Price was the dependent variable; the key independent variables (predictors) I relied on for my regression analysis included the number of guests accommodated, number of bedrooms, number of beds, availability in the past 90 days, and user review scores for accuracy, cleanliness, location, value, and overall experience. I ran several regressions for Paris (and all other European cities), each constrained by the number of guests, with the goal of predicting the expected price of the unit.
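To make the setup concrete, here is a hedged sketch of one such regression using scikit-learn's `LinearRegression`. The column names, the `add_expected_price` helper, and the synthetic data are all assumptions for illustration; the actual models used more predictors and real listings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed column names; swap in whatever the cleaned dataframe actually uses.
FEATURES = ["accommodates", "bedrooms", "beds", "availability_90", "review_scores_rating"]


def add_expected_price(df, features=FEATURES, max_guests=None):
    """Fit price ~ features; return (rows with 'expected_price', fitted model)."""
    subset = df if max_guests is None else df[df["accommodates"] <= max_guests]
    subset = subset.dropna(subset=list(features) + ["price"]).copy()
    model = LinearRegression().fit(subset[features], subset["price"])
    subset["expected_price"] = model.predict(subset[features])
    return subset, model


# Synthetic stand-in data (not real listings), just to exercise the function.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bedrooms": rng.integers(1, 5, n),
    "beds": rng.integers(1, 6, n),
    "availability_90": rng.integers(0, 91, n),
    "review_scores_rating": rng.uniform(3, 5, n),
})
toy["price"] = 30 + 25 * toy["accommodates"] + 20 * toy["bedrooms"] + rng.normal(0, 5, n)

fitted, model = add_expected_price(toy, max_guests=4)
print(dict(zip(FEATURES, model.coef_.round(2))))
```

Passing `max_guests` mirrors the guest-count constraint: the model is fit only on units at or below that capacity.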
From there I was able to use the pandas `query` method to find the units with the highest predicted prices, taking into account published user reviews and various features such as the number of guests accommodated. See below for the top 5 highest expected nightly prices. It makes intuitive sense that the predicted price is positively correlated with the number of guests and the number of bedrooms and beds. Higher predicted prices also tend to be positively correlated with better review scores, but the number of reviews and the length of availability have an indeterminate effect on the predicted price. It's also important to point out that certain optional features such as 'instant bookable' and 'super host' status resulted in higher predicted prices. The presence of advertised amenities such as 'air conditioning', 'internet connectivity' and 'tv', among others, was also positively correlated with price.
For those seeking more affordability and better value, I was also able to identify the biggest gap between predicted price based on the regression, and listed price:
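Both lookups (highest expected price, and largest gap between expected and listed price) can be sketched with `query` and `sort_values`. The helper names and the column names `accommodates`, `price`, and `expected_price` are assumptions; the four sample rows reuse values from the tables above.

```python
import pandas as pd


def top_expected(df, max_guests=None, n=5):
    """Top-n listings by expected price, optionally capped by guest count."""
    pool = df if max_guests is None else df.query("accommodates <= @max_guests")
    return pool.sort_values("expected_price", ascending=False).head(n)


def best_value(df, n=5):
    """Top-n listings whose expected price most exceeds the listed price."""
    gaps = df.assign(gap=df["expected_price"] - df["price"])
    return gaps.query("gap > 0").sort_values("gap", ascending=False).head(n)


listings = pd.DataFrame({
    "id": [24236332, 1852465, 2249308, 10800486],
    "accommodates": [2, 4, 6, 6],
    "price": [150, 250, 650, 300],
    "expected_price": [430.31, 343.46, 463.18, 454.73],
})
print(top_expected(listings, max_guests=4))
print(best_value(listings))
```

In this sample, listing 2249308 is excluded from the value ranking because its listed price (650) is above its expected price (463.18).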
As part of my analysis, and before using scikit-learn to predict nightly prices, I conducted some exploratory data analysis on my dataset. My analysis falls into a few themes and was largely dictated by the type of variable. Below I show code and analysis for time series data, numerical data, categorical variables and boolean features.
Observations:
The dataset does have some limitations and is quite extensive; to facilitate analysis I started by importing the necessary Python packages and libraries, then explored and ultimately cleaned the data.
To simplify the dataset, I dropped certain columns that did not prove helpful in predicting price. I also checked for missing data and dropped the columns in which a majority of the values were null.
Next, I checked the boolean and categorical columns to see whether any of them were worth including:
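The majority-null rule can be expressed in a few lines of pandas. This is a sketch under the assumption that "majority" means more than 50% missing; the function name, threshold parameter, and toy columns are illustrative.

```python
import pandas as pd


def drop_mostly_null(df, threshold=0.5):
    """Drop columns whose share of missing values exceeds `threshold`."""
    null_share = df.isna().mean()  # fraction of nulls per column
    to_drop = null_share[null_share > threshold].index
    return df.drop(columns=to_drop)


# Toy frame: 'license' is 75% null, 'price' is 25% null, 'beds' is complete.
raw = pd.DataFrame({
    "price": [100.0, 120.0, None, 90.0],
    "license": [None, None, None, "x"],
    "beds": [1, 2, 2, 1],
})
cleaned = drop_mostly_null(raw)
print(list(cleaned.columns))
```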
From the above, it can be seen that several columns only contain one category and can be dropped.
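One way to find and drop such single-category columns is to count distinct values per column. The sketch below assumes boolean flags are stored as the strings 't' and 'f'; the function name and the toy columns are illustrative.

```python
import pandas as pd


def drop_constant_columns(df):
    """Drop columns that hold at most one distinct non-null value."""
    n_unique = df.nunique(dropna=True)
    return df.drop(columns=n_unique[n_unique <= 1].index)


flags = pd.DataFrame({
    "has_availability": ["t", "t", "t"],    # single category: dropped
    "instant_bookable": ["t", "f", "t"],    # two categories: kept
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})
kept = drop_constant_columns(flags)
print(list(kept.columns))
```

Calling `value_counts()` on each remaining column is a quick way to inspect how the categories are distributed before deciding what to keep.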
After identifying those variables that I expected to prove most useful in my analysis, I still had to clean the data and convert some data to numerical data types.
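Two typical conversions of this kind are turning price strings into floats and turning boolean flags into 0/1. The sketch assumes prices arrive as strings such as '$1,386.00' and flags as 't'/'f'; both formats and the function names are assumptions for illustration.

```python
import pandas as pd


def price_to_float(prices):
    """Convert strings like '$1,386.00' to floats; unparseable values become NaN."""
    cleaned = prices.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


def flag_to_int(flags):
    """Map 't'/'f' flags to 1/0 (anything else becomes NaN)."""
    return flags.map({"t": 1, "f": 0})


prices = price_to_float(pd.Series(["$1,386.00", "$150.00", None]))
superhost = flag_to_int(pd.Series(["t", "f", "t"]))
print(prices.tolist(), superhost.tolist())
```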
Observations:
Multiple regression proved to be a much more reliable way of predicting unit prices, since it draws on several features from the dataset at once.