Pandas

Loading files and changing display options

The code below uses pandas to read in data from a CSV file, then prints an f-string that summarizes the size of the loaded DataFrame. I then change the display options to show a minimum of 20 rows and a maximum of 100 rows, and set the number of visible columns to the total number of columns in the dataset.

Figure (load-csv): Using pandas to read in a CSV file of IMDB movie data and changing view options
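Since the original code is only available as a screenshot, here is a minimal sketch of the same steps. The CSV contents and column names are made up for illustration, and the option names (`display.min_rows`, `display.max_rows`, `display.max_columns`) are my assumption about which pandas settings match the description above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the IMDB CSV file.
# With a real file this would be pd.read_csv("path/to/file.csv").
csv_text = """title,year,rating
Movie A,1999,8.1
Movie B,2005,7.4
Movie C,2012,6.9
"""
movies = pd.read_csv(io.StringIO(csv_text))

# Summarize the size of the loaded DataFrame with an f-string.
summary = f"Loaded {movies.shape[0]} rows and {movies.shape[1]} columns"
print(summary)

# Show 20 rows when a long frame is truncated, truncate only past 100 rows,
# and display every column in the dataset.
pd.set_option("display.min_rows", 20)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", len(movies.columns))
```

Tying `display.max_columns` to `len(movies.columns)` means no column is hidden behind an ellipsis, whatever the width of the dataset.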

The excellent pandas_profiling package (since renamed ydata-profiling) can create a profile report for a given DataFrame in HTML format.

The sample profile report for the popular Titanic survivors dataset succinctly summarizes the data set and serves as a great starting point for analyzing any large dataset. The example below categorizes the data by variable type, but the report contains much more.

Figure (titanic_profile)
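A rough sketch of how such a report might be generated is shown here. The helper name `build_profile_report`, the output file name, and the tiny DataFrame are all hypothetical; the import falls back to the older package name in case only that one is installed.

```python
import pandas as pd


def build_profile_report(df, output_path="profile_report.html"):
    """Write an HTML profile report for df and return the output path.

    Hypothetical helper: it needs ydata-profiling (or the older
    pandas_profiling) to be installed separately.
    """
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    report = ProfileReport(df, title="Profile report")
    report.to_file(output_path)
    return output_path


# Made-up data loosely shaped like the Titanic dataset.
passengers = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "survived": [0, 1, 1, 1],
})

# Uncomment to generate the report (this can take a little while):
# build_profile_report(passengers, "titanic_profile.html")
```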

Identifying and examining missing values and dropping null values

The code below counts the total number of missing values for each variable.

Figure (null-sum): Identifying and counting all n/a values by feature
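A minimal version of that count, on a small made-up DataFrame, could look like this. `isna()` marks each missing cell as True, and `sum()` adds those up column by column.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "country": ["USA", "Australia", None, "Brazil"],
    "year": [2001, 2005, 2010, 2015],
})

# Number of missing values per column (feature).
missing_counts = df.isna().sum()
print(missing_counts)
```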
Figure (missing-%)
Figure (drop_na)
Visualizing patterns in missing data

Calculating the number of missing values in each column, then sorting

Figure (shark-isna)
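As a sketch of this step (the column names and values are invented, not taken from the shark dataset), the per-column counts can be sorted so the worst columns come first. Taking the mean of `isna()` instead of the sum gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Made-up data: "species" is mostly missing, "age" partly, "year" not at all.
df = pd.DataFrame({
    "year": [2001, 2005, 2010, 2015],
    "age": [25, np.nan, 40, 31],
    "species": [np.nan, np.nan, "tiger", np.nan],
})

# Missing values per column, largest first.
missing_sorted = df.isna().sum().sort_values(ascending=False)
print(missing_sorted)

# The same information as a percentage of rows.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```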

Dropping columns with null values, and conditionally dropping them based on missing-data thresholds

Figure (shark-isna)
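Both variants can be sketched as follows. The 50% cutoff and the column names are assumptions chosen for illustration; the original threshold is not recoverable from the screenshot.

```python
import numpy as np
import pandas as pd

# Made-up data: "species" is 75% missing, "age" 25%, "year" complete.
df = pd.DataFrame({
    "species": [np.nan, np.nan, "tiger", np.nan],
    "age": [25, np.nan, 40, 31],
    "year": [2001, 2005, 2010, 2015],
})

# Option 1: drop every column that contains at least one null value.
no_null_cols = df.dropna(axis=1)

# Option 2: drop only columns whose share of missing values is above a cutoff.
max_missing_share = 0.5  # assumed threshold
keep = df.columns[df.isna().mean() <= max_missing_share]
trimmed = df[keep]

print(no_null_cols.columns.tolist())
print(trimmed.columns.tolist())
```

The threshold version keeps columns such as `age` that are mostly complete, which the strict `dropna(axis=1)` would throw away.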

Using a for loop to fill null values with the median values

Figure (null-forloop): Replacing null values with median values
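One way to write such a loop is sketched here on invented data. It only touches numeric columns, since a median is undefined for text, which is an assumption about what the original loop did.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two numeric columns and one text column.
df = pd.DataFrame({
    "age": [20, np.nan, 40],
    "length": [1.5, 2.5, np.nan],
    "species": ["white", None, "tiger"],
})

# Fill the nulls in each numeric column with that column's median.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

print(df)
```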

When dealing with messy datetime data, use errors='coerce' to convert non-dates into NaT values

Figure (shark-NaT): When a date column contains mixed data, errors='coerce' replaces every non-date value with NaT; dropping the null values then cleans the data
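A small sketch of that cleanup, using a made-up `date` column in place of the shark data:

```python
import pandas as pd

# Made-up column mixing real dates, junk text, and a missing value.
raw = pd.DataFrame({
    "date": ["2018-06-25", "not a date", "2017-01-03", None],
})

# Anything that cannot be parsed as a date becomes NaT instead of raising.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

# NaT counts as null, so dropna removes the rows that failed to parse.
clean = raw.dropna(subset=["date"])
print(clean)
```

Without `errors="coerce"`, the junk value would raise an exception and the conversion would stop at the first bad row.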

Formatting integers/floats in DataFrames through method-chaining

Figure (negative-red): Using style formatting to change negative values to red
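The idea can be sketched with the pandas Styler as below. The helper names `color_negative_red` and `style_frame`, the number format, and the data are all my own choices for illustration. The Styler needs jinja2 installed and only renders in a notebook or when exported to HTML.

```python
import pandas as pd


def color_negative_red(column):
    """Return one CSS string per cell: red text for negative values."""
    return ["color: red" if value < 0 else "" for value in column]


def style_frame(df):
    """Chain number formatting and negative-value coloring on numeric columns."""
    numeric_cols = df.select_dtypes(include="number").columns
    return (
        df.style
        .format("{:,.2f}", subset=numeric_cols)          # e.g. 1,234.50
        .apply(color_negative_red, subset=numeric_cols)  # applied column by column
    )


# Made-up data with positive and negative numbers.
returns = pd.DataFrame({
    "fund": ["A", "B", "C"],
    "change": [1234.5, -87.25, 0.0],
})

# In a notebook, evaluating style_frame(returns) displays the styled table.
```

Each step in the chain returns the Styler itself, which is what allows the formatting and coloring calls to be strung together.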

qcut and percentiles/quantiles

Figure (qcut): Using qcut for percentiles
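As a minimal sketch, `pd.qcut` splits a column into bins that each hold roughly the same number of rows. The scores and the quartile labels are invented for the example.

```python
import pandas as pd

# Made-up scores.
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# Four equal-sized bins (quartiles), labeled from lowest to highest.
quartile = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile.value_counts().sort_index())

# The cut points themselves are available through quantile().
print(scores.quantile([0.25, 0.5, 0.75]))
```

Unlike `pd.cut`, which uses equal-width intervals, `qcut` picks the bin edges from the data so the bins end up equally populated.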

Filtering DataFrames based on multiple conditions

Figure (filter): Filtering DataFrames by multiple conditions
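A short sketch of this kind of filter, on made-up movie data, is given here. Each condition goes in its own parentheses, and conditions are combined with `&` (and), `|` (or), and `~` (not) rather than the Python keywords.

```python
import pandas as pd

# Made-up movie data.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "year": [1994, 2008, 2015, 2019],
    "rating": [9.3, 9.0, 6.1, 8.6],
    "genre": ["Drama", "Action", "Comedy", "Drama"],
})

# Both conditions must hold: released in 2000 or later AND rated at least 8.5.
recent_high = movies[(movies["year"] >= 2000) & (movies["rating"] >= 8.5)]

# Either condition may hold: a drama OR rated below 7.
drama_or_low = movies[(movies["genre"] == "Drama") | (movies["rating"] < 7)]

# The same "and" filter written with query().
recent_high_q = movies.query("year >= 2000 and rating >= 8.5")

print(recent_high)
```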