Intro#

pandas is a popular Python library for processing tabular data. It has excellent documentation, but some questions might not be straightforward. In this section, I will discuss some complex concepts and things I tend to forget.

import numpy as np
import pandas as pd
from random import shuffle

Creating & loading#

Pandas have variety of method to create dataframe or load table as pandas dataframe. This sections consdiers options.

For more details check:

  • Basics data frame section of the official documentation - describes the ways how to define pandas.DataFrames.

  • Specific page on this website for more details on some use cases.


The following cell shows the most typical way for me to define a data frame. Dict of lists:

pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", "c"]
})
a b
0 1 a
1 2 b
2 3 c

As result there is a column for each item of the input dict.

To show a contrast the following cell uses a much more complicated data structure to create a data frame. There is a dictionary of tuples in the keys. Each element is a dictionary that also uses tuples as keys.

pd.DataFrame({
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
    ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
    ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
})
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0

As a result, outer keys became a columns where elements of tuples define levels of MultiIndex for columns. Inner keys define MultiIndex for the index of the data frame.

Data selecting#

This section is about selecting subsets from the pandas data frames.

There are two main ways to select data from a data frame:

  • Use loc to specify selection selecting rules based on dataframe index.

  • Use iloc to specify selection rules based on the order of records in the data frame.

Find out more in the corresponding page.


The following cell creates a series with shuffled index. So it’s easy to show the difference between loc and iloc.

index = list(range(1, 11))
shuffle(index)
test = pd.Series(list(range(1, 11)), index=index)
test
10     1
1      2
9      3
8      4
6      5
2      6
7      7
5      8
3      9
4     10
dtype: int64

The following cell applies iloc with slice selected exactly elements that are stay in corresponding positions according to the order of the original series.

test.iloc[4:7]
6    5
2    6
7    7
dtype: int64

The following cells select data using loc - values with matching indices are selected in the order they are listed in the loc condition.

test.loc[[5, 6, 7]]
5    8
6    5
7    7
dtype: int64

Styles#

If you need a specific representation for your data in pandas, there is a feature that allows you to customize the representation of a pandas DataFrame.

Check the official guide from pandas and the specific page on this website.


The following cell demonstrates how different tools can change the appearance of a table. The original table is shown first, followed by the modified tables.

df = pd.DataFrame(
    [[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]],
    index=pd.Index(
        ['Tumour (Positive)', 'Non-Tumour (Negative)'], 
        name='Actual Label:'
    ),
    columns=pd.MultiIndex.from_product(
        [['Decision Tree', 'Regression', 'Random'],['Tumour', 'Non-Tumour']], 
        names=['Model:', 'Predicted:']
    )
)

display(df)

cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #000066; color: white;'
}

cell_color = pd.DataFrame(
    [['true', 'false', 'true', 'false', 'true', 'false'],
    ['false', 'true', 'false', 'true', 'false', 'true']],
    index=df.index,
    columns=df.columns
)

(
    df.style.format('{:.0f}')
    .set_table_styles([cell_hover, index_names, headers])
    .set_table_styles([
        {'selector': 'th.col_heading', 'props': 'text-align: center;'},
        {'selector': 'th.col_heading.level0', 'props': 'font-size: 1.5em;'},
        {'selector': 'td', 'props': 'text-align: center; font-weight: bold;'},
    ], overwrite=False)
    .set_table_styles([
        {'selector': '.true', 'props': 'background-color: #e6ffe6;'},
        {'selector': '.false', 'props': 'background-color: #ffe6e6;'},
    ], overwrite=False)
    .set_td_classes(cell_color)
)
Model: Decision Tree Regression Random
Predicted: Tumour Non-Tumour Tumour Non-Tumour Tumour Non-Tumour
Actual Label:
Tumour (Positive) 38.0 2.0 18.0 22.0 21 NaN
Non-Tumour (Negative) 19.0 439.0 6.0 452.0 226 232.0
Model: Decision Tree Regression Random
Predicted: Tumour Non-Tumour Tumour Non-Tumour Tumour Non-Tumour
Actual Label:            
Tumour (Positive) 38 2 18 22 21 nan
Non-Tumour (Negative) 19 439 6 452 226 232

Testing#

There is a pandas.testing module that allows you to implement unit tests for pandas objects. Check the corresponding documentation section for more details.


The following cell shows a typical case that helps explain why you should use pandas.testing asserts instead of self-made comparisons of series. There are pandas.Series objects that are essentially equal, but they contain empty values.

a = pd.Series([1, np.NaN, 3])
b = pd.Series([1, np.NaN, 3])

But any comparison operation involving np.NaN values results in a False output.

display(a == b)
display((a == b).all())
0     True
1    False
2     True
dtype: bool
False

Simply comparing pandas objects often requires writing a lot of additional code to handle such specific conditions. You can avoid that by using pd.testing asserts, as demonstrated below for the series we created earlier.

pd.testing.assert_series_equal(a, b)

The following cell shows the comparison of two series and the message that would be shown in std err in case the series are different.

try:
    pd.testing.assert_series_equal(
        pd.Series([1, 2, 3]),
        pd.Series([1, 4, 3])
    )
except Exception as e:
    print(e)
Series are different

Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]:  [1, 2, 3]
[right]: [1, 4, 3]

Check like#

The check_like parameter allows you to specify whether the test should check the order of columns or indices in the output.

Note: The meanings of the values can be a bit confusing:

  • True: Means that an assertion won’t be raised if the datasets are the same but the order is different.

  • False: Means that an assertion will be raised if left and right have a different order.


The following cell creates datasets a and b. Generally, they are the same, but the rows and columns are in a different order.

a = pd.DataFrame(
    {"a": [1, 2], "b": [3, 1]},
    index=[1, 2]
)
b = pd.DataFrame(
    {"b": [1, 3], "a": [2, 1]},
    index=[2, 1]
)
display(a, b)
a b
1 1 3
2 2 1
b a
2 1 2
1 3 1

If you call the default assert_frame_equal, it will raise a corresponding error.

try:
    pd.testing.assert_frame_equal(a, b)
except Exception as e:
    print(e)
DataFrame.index are different

DataFrame.index values are different (100.0 %)
[left]:  Index([1, 2], dtype='int64')
[right]: Index([2, 1], dtype='int64')
At positional index 0, first diff: 1 != 2

But if check_like=True, it will only check if the index/column combination has the same values in left and right.

pd.testing.assert_frame_equal(a, b, check_like=True)