Pandas
- Basic structures
- Indexing - loc, iloc
- Filtering
- Moving window / rolling technique
- Groupby, aggregation
import os, glob
import pandas as pd
Basic structures
Index
An ndarray-like structure that stores the axis labels. An ordered, sliceable set.
Series
A pandas Series is an Index plus an array of values. It behaves similarly to a dict.
Dataframe
A pandas DataFrame is a set of columns (Series) that share the same Index.
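A minimal sketch of the three structures (the values and labels below are made up for illustration):
# Index: ordered, sliceable axis labels
index = pd.Index(['a', 'b', 'c'])
# Series: Index + array of values, with dict-like label access
series = pd.Series([1, 2, 3], index=index)
series['b']  # behaves like a dict lookup
# DataFrame: columns (Series) sharing the same Index
pd.DataFrame({'value': series, 'tenfold': series * 10})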
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html
Visualization imports
%matplotlib inline
import matplotlib.pyplot as plt
TIP: make the visualizations more beautiful :-)
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = (24, 6) # redefine the default size of the plots
plt.rcParams['font.family'] = 'DejaVu Sans'
plt.style.use('ggplot')
Load the data and have a first peek
Pandas can load data from many different sources. The loading functions are named read_*,
where * is usually the name of the source format.
data_dir = 'data'
temperature = pd.read_csv(
os.path.join(data_dir, '20191119_clementinum_temperature.csv'),
names=['year', 'month', 'day', 'temperature_avg', 'temperature_max', 'temperature_min', 'rainfall', 'flag'],
header=1
)
temperature.head()
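Other loaders follow the same naming pattern, for example (the file names below are hypothetical and not part of our data set):
# excel_data = pd.read_excel(os.path.join(data_dir, 'example.xlsx'))
# json_data = pd.read_json(os.path.join(data_dir, 'example.json'))
# parquet_data = pd.read_parquet(os.path.join(data_dir, 'example.parquet'))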
Merging, joining
Sometimes you need to combine different data frames.
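A minimal sketch on two made-up frames: merge performs a SQL-style join on a key column, while concat stacks frames along an axis.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value_right': [20, 30, 40]})
# SQL-style join on the shared key column; how= can also be 'left', 'right' or 'outer'
left.merge(right, on='key', how='inner')
# stacking frames on top of each other, ignoring the original row labels
pd.concat([left, right], ignore_index=True, sort=False)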
Basic inspection methods
head, tail, info, describe
temperature.info()
temperature.describe()
temperature['flag'].unique().tolist()
Cleaning up the data
Let's get rid of the flag column and create a DatetimeIndex. Notice the method chaining.
TIP: It is also possible to chain custom-made functions. Explore the pandas pipe method for more information; a small sketch follows below.
temperature_cleaned = (
temperature
.drop('flag', axis='columns')
.set_index(pd.to_datetime(temperature[['year', 'month', 'day']]))
)
temperature_cleaned.tail(2)
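A minimal sketch of pipe with a hypothetical custom function:
def report_shape(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical helper: print the shape and pass the frame through unchanged
    print(df.shape)
    return df

temperature_cleaned.pipe(report_shape).head(2)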
How many years do we have available?
Now, we will use our knowledge of the data to gather some more basic information using the unique function.
available_years = len(temperature['year'].unique())
f"Years with at least one entry: {available_years}, " \
f"Oldest entry: {temperature['year'].min()}, " \
f"Newest entry: {temperature['year'].max()}"
Indexing and Filtering
To select a subset of rows we will do the following:
- create a boolean series describing our selection criteria
- use the boolean series to select only the rows where the series is True
interesting_year = 2018
is_inspected_year = temperature_cleaned['year'] == interesting_year
is_inspected_year.head()
temperature_2018 = temperature_cleaned[is_inspected_year]
temperature_2018.head(2)
It is possible to limit both rows and columns by using the loc[row_criteria, column_criteria] indexer.
rainfall_temperature_columns = ['temperature_avg', 'rainfall']
temperature_rainfall_2018 = temperature_cleaned.loc[is_inspected_year, rainfall_temperature_columns]
temperature_rainfall_2018.head(2)
Notice the simplicity of plotting the pandas data.
TIP: Also, notice the semicolon. It suppresses the text output of the last statement in the cell.
temperature_rainfall_2018.plot(title='Rainfall and average temperature in Prague in 2018');
plt.ylabel("°C / mm");
The boolean series can be combined using logical operators:
- & ... and
- | ... or
- ~ ... not
Note: due to operator precedence, each condition must be wrapped in parentheses when combined inline.
Let's practice this. We will visualize the snowy months for each year.
Snowy month := there is at least one day in the month where the minimum temperature is at or below zero and rainfall > 0.
Moreover, we want to leverage the alpha (aka opacity) matplotlib parameter to distinguish between months with more and fewer snowy days. (Months with more snowy days will appear darker.)
rained = temperature_cleaned['rainfall'] > 0
above_zero = temperature_cleaned['temperature_min'] > 0
snow_days = temperature_cleaned.loc[rained & ~above_zero]
snow_days['month'].plot(title="Snowy months in Prague", marker='o', linestyle='None', alpha=0.25, color='black');
plt.ylabel("Month");
Indexing - loc vs iloc
- loc: label-based selector
- iloc: position-based selector
unsorted_letters = pd.DataFrame(data=['A', 'B', 'C', 'D', 'E'], index=[4, 5, 1, 2, 3], columns=['letters'])
unsorted_letters
unsorted_letters.loc[:2]
unsorted_letters.iloc[:2]
letter_index = unsorted_letters.reset_index().set_index('letters')
letter_index
letter_index.iloc[:2]
Q: What will be the output of the following cell?
letter_index.loc[:2]
letter_index.loc[:'B']
Note: Label-based slicing is inclusive.
Rolling operations
The rolling operation (moving window) requires a window of size N and a function f: [x1, ..., xN] -> y.
Our example takes the median over the previous 10 * 365 data points (roughly 10 years of daily data). The moving median smooths the data and reveals long-term trends.
temperature_cleaned[['temperature_avg', 'temperature_min', 'temperature_max']].rolling(window=10*365).median().plot(title="10-year median of temperatures in Prague");
plt.ylabel('°C');
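Any custom window function f can be plugged in via rolling(...).apply. A small sketch (the peak-to-peak range is just an illustrative choice):
# custom window function: peak-to-peak range of the previous 365 days
yearly_range = temperature_cleaned['temperature_avg'].rolling(window=365).apply(
    lambda window: window.max() - window.min(), raw=True
)
yearly_range.plot(title='Rolling yearly peak-to-peak range of average temperature in Prague');
plt.ylabel('°C');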
Aggregations - Difference between days and months
We want to explore the temperatures during different months and weekdays. First of all, we will select our columns of interest.
TIP: Different ways to define a new column:
# providing a list of values
unsorted_letters["small_letters"] = ['a', 'b', 'c', 'd', 'e']
unsorted_letters
# vectorized calculation
unsorted_letters["both_letters"] = unsorted_letters['letters'] + unsorted_letters['small_letters']
unsorted_letters
# using assign method
unsorted_letters.assign(shifted=unsorted_letters['both_letters'].shift())
temperature_columns = ['temperature_avg', 'temperature_min', 'temperature_max']
weekday_temperature = temperature_cleaned[temperature_columns].copy()
weekday_temperature['weekday'] = weekday_temperature.index.dayofweek
weekday_temperature.head()
The next step is to group our data by the weekday column.
TIP: It is also possible to group by the index values; just use groupby(level=0), as sketched below.
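A small sketch of level-based grouping on a made-up frame with a repeated index:
# grouping by the index values (level 0) instead of a column
repeated = pd.DataFrame({'value': [1, 2, 3, 4]}, index=['a', 'a', 'b', 'b'])
repeated.groupby(level=0).sum()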
weekday_temperature_grouped = weekday_temperature.groupby('weekday')
weekday_temperature_grouped
Different aggregation functions can be called on the groupby object.
The weekday column contains the numbers 0-6 (0 ... Monday, 6 ... Sunday). If we use a mean aggregation, we can see the weekends are slightly colder than the work/school days. What a pity!
mean_weekday_temperatures = weekday_temperature_grouped.mean()
mean_weekday_temperatures
TIP: There is a way to access the actual grouped data.
weekday_temperature_grouped.get_group(0).head() # Monday data
TIP: It is possible to define a very specific combination of aggregation functions for each column.
import numpy as np
multiple_aggregation = weekday_temperature_grouped.agg({
"temperature_avg": [np.mean],
"temperature_min": [np.mean, np.min],
"temperature_max": [np.mean, np.max]
})
multiple_aggregation
Example of accessing the multi-level columns. Note that np.min shows up under its actual function name, amin.
multiple_aggregation['temperature_min']['amin']
Outlier detection
Q: What visualization will help me to spot outliers?
temperature_cleaned['temperature_avg'].plot.box(
figsize=(24,10),
showfliers=True
);
IQR outlier detection
IQR = quartile3 - quartile1
outlier := (x > quartile3 + 1.5 * IQR) or (x < quartile1 - 1.5 * IQR)
def is_iqr_outlier(x, quartile1, quartile3):
iqr = quartile3 - quartile1
return (x > quartile3 + 1.5 * iqr) or (x < quartile1 - 1.5 * iqr)
quartile1 = temperature_cleaned.temperature_avg.quantile(0.25)
quartile3 = temperature_cleaned.temperature_avg.quantile(0.75)
is_temperature_outlier = temperature_cleaned.temperature_avg.apply(
lambda temperature_avg: is_iqr_outlier(x=temperature_avg, quartile1=quartile1, quartile3=quartile3)
)
is_temperature_outlier.head()
temperature_cleaned[is_temperature_outlier]
Q: Could we avoid using the apply function in this case and be a little bit more elegant?
iqr_distance = 1.5 * (quartile3 - quartile1)
upper_bound = quartile3 + iqr_distance
lower_bound = quartile1 - iqr_distance
is_temperature_outlier_condition = (
(temperature_cleaned.temperature_avg > upper_bound) |
(temperature_cleaned.temperature_avg < lower_bound)
)
temperature_cleaned[is_temperature_outlier_condition]
Split - apply - combine
Motivation: I want to compare the distribution of temperatures between different months.
temperature_cleaned.boxplot(by='month', column='temperature_avg');
plt.title('Average temperature distribution for each month');
plt.suptitle('');
plt.ylabel('Temperature [°C]');
But how do we compare the distributions? We want to normalize the temperatures into the [0, 1] interval for each month.
Min-max scaling
x' = (x - min) / (max - min)
Split - apply - combine usage
- Split the data set by month.
- Apply min-max scaling for each group separately.
- Combine the groups back together.
def min_max_scale(x: pd.Series) -> pd.Series:
minimum = x.min()
maximum = x.max()
return (x - minimum) / (maximum - minimum)
month_temperatures = temperature_cleaned.copy()
month_temperatures['temperature_avg_scaled'] = temperature_cleaned.groupby('month').apply(min_max_scale)['temperature_avg']
month_temperatures.head()
month_temperatures.boxplot(by='month', column='temperature_avg_scaled');
plt.title('Scaled average temperature distribution for each month');
plt.suptitle('');
Pivoting (stack, unstack, pivot)
rainfall_per_day = temperature_cleaned.groupby(['month', 'day']).mean()
rainfall_per_day
rainfall_per_day['rainfall'].plot(title='Mean rainfall for each day of the year');
plt.ylabel('Rainfall [mm]');
rainfall_per_day_stacked = rainfall_per_day.stack()
rainfall_per_day_stacked
rainfall_per_day[['rainfall']].unstack()
temperature_heatmap = temperature_cleaned.pivot_table(values='temperature_avg', index='day', columns='month', aggfunc=np.mean)
temperature_heatmap
temperature_heatmap.style.background_gradient(cmap='BuPu', axis='columns')
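For completeness, the pivot method itself performs no aggregation and requires each index/column pair to be unique, so it raises an error on the full data set. A short sketch reusing the single-year subset from earlier (within one year, each (day, month) pair is unique):
# reshape without aggregation: one row per day, one column per month
temperature_2018.pivot(index='day', columns='month', values='temperature_avg')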
Performance comparison of different iteration approaches
row_size = 50000
numbers = pd.DataFrame.from_dict({'numbers': range(row_size)})
numbers
%%timeit -r 3 -n 2
summed_for = list()
for i in range(len(numbers)):
summed_for.append(numbers.numbers[i] + 1)
summed_for[:5]
%%timeit -r 3 -n 2
summed_iter = list()
for _, row in numbers.iterrows():
summed_iter.append(row.numbers + 1)
summed_iter[:5]
%%timeit -r 3 -n 2
summed_zip = list()
for a, b in zip(numbers.numbers, [1]*len(numbers)):
summed_zip.append(a + b)
summed_zip[:5]
%%timeit -r 3 -n 2
def add(x):
return x + 1
summed_apply = numbers.applymap(add)
summed_apply.numbers.tolist()[:5]
%%timeit -r 3 -n 2
summed_columns = numbers.numbers + 1
summed_columns.tolist()[:5]
Conclusion: The vectorized pandas column operations are by far the most efficient way to apply transformations.