Pandas

  • Basic structures
  • Indexing - loc, iloc
  • Filtering
  • Moving window / rolling technique
  • Groupby, aggregation
import os, glob
import pandas as pd

Basic structures

Index

An immutable ndarray that stores the axis labels; an ordered, sliceable set.

Series

A pandas Series is an Index plus an array of values. It is similar to a dict.

Dataframe

Pandas DataFrame is a set of columns (Series) that share the same Index.
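
A minimal sketch of the three structures (the values below are made up for illustration):

labels = pd.Index(['a', 'b', 'c'])                     # Index: the axis labels
series = pd.Series([1, 2, 3], index=labels)            # Series: Index + array of values
frame = pd.DataFrame({'x': series, 'y': series * 10})  # DataFrame: columns sharing the same Index
frame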

Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html

Visualization imports

%matplotlib inline
import matplotlib.pyplot as plt

TIP: make the visualizations more beautiful :-)

%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = (24, 6) # redefine the default size of the plots
plt.rcParams['font.family'] = 'DejaVu Sans'
plt.style.use('ggplot')

Load the data and have a first peek

Pandas is able to load data from many different sources. The loading functions are named read_* where * is usually the name of the source type.

data_dir = 'data'
temperature = pd.read_csv(
    os.path.join(data_dir, '20191119_clementinum_temperature.csv'),
    names=['year', 'month', 'day', 'temperature_avg', 'temperature_max', 'temperature_min', 'rainfall', 'flag'],
    header=1
)
temperature.head()

Merging, joining

Sometimes you need to combine different data frames.

https://pandas.pydata.org/pandas-docs/stable/merging.html
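
A minimal sketch of pd.merge on two tiny hand-made frames (the station data below is made up for illustration):

cities = pd.DataFrame({'city': ['Prague', 'Brno'], 'station_id': [1, 2]})
stations = pd.DataFrame({'station_id': [1, 2], 'altitude_m': [191, 237]})  # illustrative values
pd.merge(cities, stations, on='station_id', how='inner')  # join rows on the shared station_id key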

Basic inspection methods

head, tail, info, describe

temperature.info()
temperature.describe()
temperature['flag'].unique().tolist()

Cleaning up the data

Let's get rid of the flag column and set a datetime index. Notice the method chaining.

TIP: It is also possible to chain custom-made functions. Explore the pandas pipe method for more information; see the sketch after the following cell.

temperature_cleaned = (
    temperature
    .drop('flag', axis='columns')
    .set_index(pd.to_datetime(temperature[['year', 'month', 'day']]))
)

temperature_cleaned.tail(2)
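
A minimal sketch of such a chain with pipe (the from_year helper is hypothetical, made up for illustration):

def from_year(df, year):
    # hypothetical custom step: keep only the rows from the given year onward
    return df[df['year'] >= year]

temperature_cleaned.pipe(from_year, year=2000).head(2)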

How many years do we have available?

Now, we will use our knowledge of the data to acquire some more basic information by using the unique function.

available_years = len(temperature['year'].unique())

f"Years with at least one entry: {available_years}, " \
f"Oldest entry: {temperature['year'].min()}, " \
f"Newest entry: {temperature['year'].max()}"

Indexing and Filtering

To select a subset of rows we will do the following:

  • create a boolean series describing our selection criteria
  • use the boolean series to select only the rows where the series is True
interesting_year = 2018
is_inspected_year = temperature_cleaned['year'] == interesting_year
is_inspected_year.head()
temperature_2018 = temperature_cleaned[is_inspected_year]
temperature_2018.head(2)

It is possible to limit rows and also columns by using the loc[row_criteria, column_criteria] indexer.

rainfall_temperature_columns = ['temperature_avg', 'rainfall']
temperature_rainfall_2018 = temperature_cleaned.loc[is_inspected_year, rainfall_temperature_columns]
temperature_rainfall_2018.head(2)

Notice how simple it is to plot pandas data.

TIP: Also, notice the semicolon. It suppresses the text output of the last expression.

temperature_rainfall_2018.plot(title='Rainfall and average temperature in Prague in 2018');
plt.ylabel("°C / mm");

Boolean series might be combined by using logical operators:

  • & ... and
  • | ... or
  • ~ ... not
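
These operators bind more tightly than comparisons, so each condition has to be wrapped in parentheses when written inline (a minimal sketch; the 10-20 °C range is arbitrary, just for illustration):

# parentheses are required: & binds more tightly than the comparisons
mild_days = (temperature_cleaned['temperature_avg'] > 10) & (temperature_cleaned['temperature_avg'] < 20)
mild_days.head()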

Let's practice this. We will visualize the snowy months for each year.

Snowy month := there is at least one day in the month where the minimum temperature ≤ 0 and rainfall > 0.

Moreover, we want to leverage the alpha (aka opacity) matplotlib parameter to distinguish between months with more and fewer snowy days. (The months with more snowy days will appear darker.)

rained = temperature_cleaned['rainfall'] > 0
above_zero = temperature_cleaned['temperature_min'] > 0

snow_days = temperature_cleaned.loc[rained & ~above_zero]

snow_days['month'].plot(title="Snowy months in Prague", marker='o', linestyle='None', alpha=0.25, color='black');
plt.ylabel("Month");

Indexing - loc vs iloc

  • loc: label-based selector
  • iloc: position-based selector
unsorted_letters = pd.DataFrame(data=['A', 'B', 'C', 'D', 'E'], index=[4, 5, 1, 2, 3], columns=['letters'])
unsorted_letters
unsorted_letters.loc[:2]
unsorted_letters.iloc[:2]
letter_index = unsorted_letters.reset_index().set_index('letters')
letter_index
letter_index.iloc[:2]

Q: What will be the output of the following cell?

letter_index.loc[:2]
letter_index.loc[:'B']

Note: Label-based slicing is inclusive. The slice loc[:2] fails with a TypeError here, because the integer 2 cannot be compared with the string labels.

Rolling operations

The rolling operation (moving window) requires a window of size N and a function f: [x1, ..., xN] -> y
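
A minimal sketch on a tiny hand-made series, using sum as the function f:

pd.Series([1, 2, 3, 4]).rolling(window=2).sum()  # NaN, 3.0, 5.0, 7.0 - the first window is incomplete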

Our example computes a rolling median over the previous 10 * 365 data points (in our case roughly 10 years). We are using the moving median to smooth the data and see long-term trends.

temperature_cleaned[['temperature_avg', 'temperature_min', 'temperature_max']].rolling(window=10*365).median().plot(title="10-year median of temperatures in Prague");
plt.ylabel('°C');

Aggregations - Difference between days and months

We want to explore the temperatures during different months and weekdays. First of all, we will select our columns of interest.

TIP: Different ways to define a new column:

# providing a list of values
unsorted_letters["small_letters"] = ['a', 'b', 'c', 'd', 'e']
unsorted_letters
# vectorized calculation
unsorted_letters["both_letters"] = unsorted_letters['letters'] + unsorted_letters['small_letters']
unsorted_letters
# using assign method
unsorted_letters.assign(shifted=unsorted_letters['both_letters'].shift())
temperature_columns = ['temperature_avg', 'temperature_min', 'temperature_max']

weekday_temperature = temperature_cleaned[temperature_columns].copy()
weekday_temperature['weekday'] = weekday_temperature.index.dayofweek

weekday_temperature.head()

The next step is to group our data by the weekday column.

TIP: It is also possible to group by the index values, just use groupby(level=0)

weekday_temperature_grouped = weekday_temperature.groupby('weekday')
weekday_temperature_grouped

Different aggregation functions might be called on the groupby object.

The weekday column contains numbers 0-6 (0 ... Monday, 6 ... Sunday). If we use a mean aggregation, we can see the weekends are slightly colder than the work/school days. What a pity!

mean_weekday_temperatures = weekday_temperature_grouped.mean()
mean_weekday_temperatures

TIP: There is a way to access the actual grouped data.

weekday_temperature_grouped.get_group(0).head()  # Monday data

TIP: It is possible to define a very specific combination of aggregation functions for each column.

import numpy as np

multiple_aggregation = weekday_temperature_grouped.agg({
    "temperature_avg": [np.mean],
    "temperature_min": [np.mean, np.min],
    "temperature_max": [np.mean, np.max]
})
multiple_aggregation

Example of accessing the multi-level (MultiIndex) columns.

multiple_aggregation['temperature_min']['amin']
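
The same column can also be selected with a single tuple-based lookup (assuming the 'amin' label produced by np.min above):

multiple_aggregation.loc[:, ('temperature_min', 'amin')]  # the tuple addresses both column levels at once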

Outlier detection

Q: What visualization will help me to spot outliers?

temperature_cleaned['temperature_avg'].plot.box(
    figsize=(24,10),
    showfliers=True
);

IQR outlier detection

  • IQR = quartile3 - quartile1
  • outlier := (x > quartile3 + 1.5 * IQR) or (x < quartile1 - 1.5 * IQR)
def is_iqr_outlier(x, quartile1, quartile3):
    iqr = quartile3 - quartile1
    return (x > quartile3 + 1.5 * iqr) or (x < quartile1 - 1.5 * iqr)

quartile1 = temperature_cleaned.temperature_avg.quantile(0.25)
quartile3 = temperature_cleaned.temperature_avg.quantile(0.75)

is_temperature_outlier = temperature_cleaned.temperature_avg.apply(
    lambda temperature_avg: is_iqr_outlier(x=temperature_avg, quartile1=quartile1, quartile3=quartile3)
)

is_temperature_outlier.head()
temperature_cleaned[is_temperature_outlier]

Q: Could we avoid using the apply function in this case and be a little bit more elegant?

iqr_distance = 1.5 * (quartile3 - quartile1)
upper_bound = quartile3 + iqr_distance
lower_bound = quartile1 - iqr_distance

is_temperature_outlier_condition = (
    (temperature_cleaned.temperature_avg > upper_bound) |
    (temperature_cleaned.temperature_avg < lower_bound)
)

temperature_cleaned[is_temperature_outlier_condition]
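
An alternative sketch using Series.between, which is inclusive on both bounds by default, so its negation matches the strict inequalities above:

temperature_cleaned[~temperature_cleaned.temperature_avg.between(lower_bound, upper_bound)]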

Split - apply - combine

Motivation: I want to compare the distribution of temperatures across different months.

temperature_cleaned.boxplot(by='month', column='temperature_avg');
plt.title('Average temperature distribution for each month');
plt.suptitle('');
plt.ylabel('Temperature [°C]');

But how to compare the distributions? I want to normalize the temperatures into the [0, 1] interval for each month.

Min-max scaling

  • x' = (x - min) / (max - min)

Split - apply - combine usage

  1. Split data set on months.
  2. Apply min-max scaling for each group separately.
  3. Combine the groups back together.
def min_max_scale(x: pd.Series) -> pd.Series:
    minimum = x.min()
    maximum = x.max()
    
    return (x - minimum) / (maximum - minimum)

month_temperatures = temperature_cleaned.copy()

month_temperatures['temperature_avg_scaled'] = temperature_cleaned.groupby('month').apply(min_max_scale)['temperature_avg']

month_temperatures.head()
month_temperatures.boxplot(by='month', column='temperature_avg_scaled');
plt.title('Scaled average temperature distribution for each month');
plt.suptitle('');
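
TIP: groupby.transform is an alternative to apply for this use case; it returns a result aligned with the original index, so it can be assigned directly. A sketch reusing min_max_scale from above:

# the temperature_avg_scaled_t column name is ours, just for illustration
month_temperatures['temperature_avg_scaled_t'] = (
    temperature_cleaned.groupby('month')['temperature_avg'].transform(min_max_scale)
)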

Pivoting (stack, unstack, pivot)

Stack moves the innermost column level into the row index, unstack does the inverse, and pivot_table builds a new table from chosen values, index and columns.

rainfall_per_day = temperature_cleaned.groupby(['month', 'day']).mean()
rainfall_per_day
rainfall_per_day['rainfall'].plot(title='Mean rainfall per each day throughout the year');
plt.ylabel('Rainfall [mm]');
rainfall_per_day_stacked = rainfall_per_day.stack()
rainfall_per_day_stacked
rainfall_per_day[['rainfall']].unstack()
temperature_heatmap = temperature_cleaned.pivot_table(values='temperature_avg', index='day', columns='month', aggfunc=np.mean)
temperature_heatmap
temperature_heatmap.style.background_gradient(cmap='BuPu', axis='columns')

Performance comparison of different iteration approaches

row_size = 50000

numbers = pd.DataFrame.from_dict({'numbers': range(row_size)})
numbers
%%timeit -r 3 -n 2

summed_for = list()

for i in range(len(numbers)):
    summed_for.append(numbers.numbers[i] + 1)
summed_for[:5]
%%timeit -r 3 -n 2

summed_iter = list()

for _, row in numbers.iterrows():
    summed_iter.append(row.numbers + 1)
summed_iter[:5]
%%timeit -r 3 -n 2

summed_zip = list()

for a, b in zip(numbers.numbers, [1]*len(numbers)):
    summed_zip.append(a + b)
summed_zip[:5]
%%timeit -r 3 -n 2

def add(x):    
    return x + 1

summed_apply = numbers.applymap(add)
summed_apply.numbers.tolist()[:5]
%%timeit -r 3 -n 2

summed_columns = numbers.numbers + 1
summed_columns.tolist()[:5]

Conclusion: The vectorized pandas column operations are by far the most efficient way to apply transformations.