Intro#
pandas
is a popular Python library for processing tabular data. It has excellent documentation, but some questions might not be straightforward. In this section, I will discuss some complex concepts and things I tend to forget.
import numpy as np
import pandas as pd
from random import shuffle
Creating & loading#
Pandas have variety of method to create dataframe or load table as pandas dataframe. This sections consdiers options.
For more details check:
Basics data frame section of the official documentation - describes the ways how to define
pandas.DataFrame
s.Specific page on this website for more details on some use cases.
The following cell shows the most typical way for me to define a data frame. Dict of lists:
pd.DataFrame({
"a": [1, 2, 3],
"b": ["a", "b", "c"]
})
a | b | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | c |
As result there is a column for each item of the input dict
.
To show a contrast the following cell uses a much more complicated data structure to create a data frame. There is a dictionary of tuples
in the keys. Each element is a dictionary that also uses tuples as keys.
pd.DataFrame({
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
})
a | b | |||||
---|---|---|---|---|---|---|
b | a | c | a | b | ||
A | B | 1.0 | 4.0 | 5.0 | 8.0 | 10.0 |
C | 2.0 | 3.0 | 6.0 | 7.0 | NaN | |
D | NaN | NaN | NaN | NaN | 9.0 |
As a result, outer keys became a columns where elements of tuples define levels of MultiIndex
for columns. Inner keys define MultiIndex
for the index of the data frame.
Data selecting#
This section is about selecting subsets from the pandas data frames.
There are two main ways to select data from a data frame:
Use
loc
to specify selection selecting rules based on dataframe index.Use
iloc
to specify selection rules based on the order of records in the data frame.
Find out more in the corresponding page.
The following cell creates a series with shuffled index. So it’s easy to show the difference between loc
and iloc
.
index = list(range(1, 11))
shuffle(index)
test = pd.Series(list(range(1, 11)), index=index)
test
10 1
1 2
9 3
8 4
6 5
2 6
7 7
5 8
3 9
4 10
dtype: int64
The following cell applies iloc
with slice selected exactly elements that are stay in corresponding positions according to the order of the original series.
test.iloc[4:7]
6 5
2 6
7 7
dtype: int64
The following cells select data using loc
- values with matching indices are selected in the order they are listed in the loc
condition.
test.loc[[5, 6, 7]]
5 8
6 5
7 7
dtype: int64
Styles#
If you need a specific representation for your data in pandas, there is a feature that allows you to customize the representation of a pandas DataFrame.
Check the official guide from pandas and the specific page on this website.
The following cell demonstrates how different tools can change the appearance of a table. The original table is shown first, followed by the modified tables.
df = pd.DataFrame(
[[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]],
index=pd.Index(
['Tumour (Positive)', 'Non-Tumour (Negative)'],
name='Actual Label:'
),
columns=pd.MultiIndex.from_product(
[['Decision Tree', 'Regression', 'Random'],['Tumour', 'Non-Tumour']],
names=['Model:', 'Predicted:']
)
)
display(df)
cell_hover = { # for row hover use <tr> instead of <td>
'selector': 'td:hover',
'props': [('background-color', '#ffffb3')]
}
index_names = {
'selector': '.index_name',
'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
'selector': 'th:not(.index_name)',
'props': 'background-color: #000066; color: white;'
}
cell_color = pd.DataFrame(
[['true', 'false', 'true', 'false', 'true', 'false'],
['false', 'true', 'false', 'true', 'false', 'true']],
index=df.index,
columns=df.columns
)
(
df.style.format('{:.0f}')
.set_table_styles([cell_hover, index_names, headers])
.set_table_styles([
{'selector': 'th.col_heading', 'props': 'text-align: center;'},
{'selector': 'th.col_heading.level0', 'props': 'font-size: 1.5em;'},
{'selector': 'td', 'props': 'text-align: center; font-weight: bold;'},
], overwrite=False)
.set_table_styles([
{'selector': '.true', 'props': 'background-color: #e6ffe6;'},
{'selector': '.false', 'props': 'background-color: #ffe6e6;'},
], overwrite=False)
.set_td_classes(cell_color)
)
Model: | Decision Tree | Regression | Random | |||
---|---|---|---|---|---|---|
Predicted: | Tumour | Non-Tumour | Tumour | Non-Tumour | Tumour | Non-Tumour |
Actual Label: | ||||||
Tumour (Positive) | 38.0 | 2.0 | 18.0 | 22.0 | 21 | NaN |
Non-Tumour (Negative) | 19.0 | 439.0 | 6.0 | 452.0 | 226 | 232.0 |
Model: | Decision Tree | Regression | Random | |||
---|---|---|---|---|---|---|
Predicted: | Tumour | Non-Tumour | Tumour | Non-Tumour | Tumour | Non-Tumour |
Actual Label: | ||||||
Tumour (Positive) | 38 | 2 | 18 | 22 | 21 | nan |
Non-Tumour (Negative) | 19 | 439 | 6 | 452 | 226 | 232 |
Testing#
There is a pandas.testing
module that allows you to implement unit tests for pandas
objects. Check the corresponding documentation section for more details.
The following cell shows a typical case that helps explain why you should use pandas.testing
asserts instead of self-made comparisons of series. There are pandas.Series
objects that are essentially equal, but they contain empty values.
a = pd.Series([1, np.NaN, 3])
b = pd.Series([1, np.NaN, 3])
But any comparison operation involving np.NaN
values results in a False
output.
display(a == b)
display((a == b).all())
0 True
1 False
2 True
dtype: bool
False
Simply comparing pandas objects often requires writing a lot of additional code to handle such specific conditions. You can avoid that by using pd.testing
asserts, as demonstrated below for the series we created earlier.
pd.testing.assert_series_equal(a, b)
The following cell shows the comparison of two series and the message that would be shown in std err in case the series are different.
try:
pd.testing.assert_series_equal(
pd.Series([1, 2, 3]),
pd.Series([1, 4, 3])
)
except Exception as e:
print(e)
Series are different
Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]: [1, 2, 3]
[right]: [1, 4, 3]
Check like#
The check_like
parameter allows you to specify whether the test should check the order of columns or indices in the output.
Note: The meanings of the values can be a bit confusing:
True
: Means that an assertion won’t be raised if the datasets are the same but the order is different.False
: Means that an assertion will be raised ifleft
andright
have a different order.
The following cell creates datasets a
and b
. Generally, they are the same, but the rows and columns are in a different order.
a = pd.DataFrame(
{"a": [1, 2], "b": [3, 1]},
index=[1, 2]
)
b = pd.DataFrame(
{"b": [1, 3], "a": [2, 1]},
index=[2, 1]
)
display(a, b)
a | b | |
---|---|---|
1 | 1 | 3 |
2 | 2 | 1 |
b | a | |
---|---|---|
2 | 1 | 2 |
1 | 3 | 1 |
If you call the default assert_frame_equal
, it will raise a corresponding error.
try:
pd.testing.assert_frame_equal(a, b)
except Exception as e:
print(e)
DataFrame.index are different
DataFrame.index values are different (100.0 %)
[left]: Index([1, 2], dtype='int64')
[right]: Index([2, 1], dtype='int64')
At positional index 0, first diff: 1 != 2
But if check_like=True
, it will only check if the index/column combination has the same values in left
and right
.
pd.testing.assert_frame_equal(a, b, check_like=True)