Intro

Intro#

pandas is a popular Python library for processing tabular data. It has excellent documentation, but some questions might not be straightforward. In this section, I will discuss some complex concepts and things I tend to forget.

import numpy as np
import pandas as pd
from random import shuffle

Creating & loading#

Pandas have variety of method to create dataframe or load table as pandas dataframe. This sections consdiers options.

For more details check:

Basics data frame section of the official documentation - describes the ways how to define pandas.DataFrames.
Specific page on this website for more details on some use cases.

The following cell shows the most typical way for me to define a data frame. Dict of lists:

pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", "c"]
})

	a	b
0	1	a
1	2	b
2	3	c

As result there is a column for each item of the input dict.

To show a contrast the following cell uses a much more complicated data structure to create a data frame. There is a dictionary of tuples in the keys. Each element is a dictionary that also uses tuples as keys.

pd.DataFrame({
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
    ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
    ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
})

		a			b
		b	a	c	a	b
A	B	1.0	4.0	5.0	8.0	10.0
	C	2.0	3.0	6.0	7.0	NaN
	D	NaN	NaN	NaN	NaN	9.0

As a result, outer keys became a columns where elements of tuples define levels of MultiIndex for columns. Inner keys define MultiIndex for the index of the data frame.

Data selecting#

This section is about selecting subsets from the pandas data frames.

There are two main ways to select data from a data frame:

Use loc to specify selection selecting rules based on dataframe index.
Use iloc to specify selection rules based on the order of records in the data frame.

Find out more in the corresponding page.

The following cell creates a series with shuffled index. So it’s easy to show the difference between loc and iloc.

index = list(range(1, 11))
shuffle(index)
test = pd.Series(list(range(1, 11)), index=index)
test

   1
    2
    3
    4
    5
    6
    7
    8
    9
   10
dtype: int64

The following cell applies iloc with slice selected exactly elements that are stay in corresponding positions according to the order of the original series.

test.iloc[4:7]

  5
  6
  7
dtype: int64

The following cells select data using loc - values with matching indices are selected in the order they are listed in the loc condition.

test.loc[[5, 6, 7]]

  8
  5
  7
dtype: int64

Styles#

If you need a specific representation for your data in pandas, there is a feature that allows you to customize the representation of a pandas DataFrame.

Check the official guide from pandas and the specific page on this website.

The following cell demonstrates how different tools can change the appearance of a table. The original table is shown first, followed by the modified tables.

df = pd.DataFrame(
    [[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]],
    index=pd.Index(
        ['Tumour (Positive)', 'Non-Tumour (Negative)'], 
        name='Actual Label:'
    ),
    columns=pd.MultiIndex.from_product(
        [['Decision Tree', 'Regression', 'Random'],['Tumour', 'Non-Tumour']], 
        names=['Model:', 'Predicted:']
    )
)

display(df)

cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #000066; color: white;'
}

cell_color = pd.DataFrame(
    [['true', 'false', 'true', 'false', 'true', 'false'],
    ['false', 'true', 'false', 'true', 'false', 'true']],
    index=df.index,
    columns=df.columns
)

(
    df.style.format('{:.0f}')
    .set_table_styles([cell_hover, index_names, headers])
    .set_table_styles([
        {'selector': 'th.col_heading', 'props': 'text-align: center;'},
        {'selector': 'th.col_heading.level0', 'props': 'font-size: 1.5em;'},
        {'selector': 'td', 'props': 'text-align: center; font-weight: bold;'},
    ], overwrite=False)
    .set_table_styles([
        {'selector': '.true', 'props': 'background-color: #e6ffe6;'},
        {'selector': '.false', 'props': 'background-color: #ffe6e6;'},
    ], overwrite=False)
    .set_td_classes(cell_color)
)

Model:	Decision Tree		Regression		Random
Predicted:	Tumour	Non-Tumour	Tumour	Non-Tumour	Tumour	Non-Tumour
Actual Label:
Tumour (Positive)	38.0	2.0	18.0	22.0	21	NaN
Non-Tumour (Negative)	19.0	439.0	6.0	452.0	226	232.0

Model:	Decision Tree		Regression		Random
Predicted:	Tumour	Non-Tumour	Tumour	Non-Tumour	Tumour	Non-Tumour
Actual Label:
Tumour (Positive)	38	2	18	22	21	nan
Non-Tumour (Negative)	19	439	6	452	226	232

Testing#

There is a pandas.testing module that allows you to implement unit tests for pandas objects. Check the corresponding documentation section for more details.

The following cell shows a typical case that helps explain why you should use pandas.testing asserts instead of self-made comparisons of series. There are pandas.Series objects that are essentially equal, but they contain empty values.

a = pd.Series([1, np.NaN, 3])
b = pd.Series([1, np.NaN, 3])

But any comparison operation involving np.NaN values results in a False output.

display(a == b)
display((a == b).all())

   True
  False
   True
dtype: bool

False

Simply comparing pandas objects often requires writing a lot of additional code to handle such specific conditions. You can avoid that by using pd.testing asserts, as demonstrated below for the series we created earlier.

pd.testing.assert_series_equal(a, b)

The following cell shows the comparison of two series and the message that would be shown in std err in case the series are different.

try:
    pd.testing.assert_series_equal(
        pd.Series([1, 2, 3]),
        pd.Series([1, 4, 3])
    )
except Exception as e:
    print(e)

Series are different

Series values are different (33.33333 %)
[index]: [0, 1, 2]
[left]:  [1, 2, 3]
[right]: [1, 4, 3]

Check like#

The check_like parameter allows you to specify whether the test should check the order of columns or indices in the output.

Note: The meanings of the values can be a bit confusing:

True: Means that an assertion won’t be raised if the datasets are the same but the order is different.
False: Means that an assertion will be raised if left and right have a different order.

The following cell creates datasets a and b. Generally, they are the same, but the rows and columns are in a different order.

a = pd.DataFrame(
    {"a": [1, 2], "b": [3, 1]},
    index=[1, 2]
)
b = pd.DataFrame(
    {"b": [1, 3], "a": [2, 1]},
    index=[2, 1]
)
display(a, b)

	a	b
1	1	3
2	2	1

	b	a
2	1	2
1	3	1

If you call the default assert_frame_equal, it will raise a corresponding error.

try:
    pd.testing.assert_frame_equal(a, b)
except Exception as e:
    print(e)

DataFrame.index are different

DataFrame.index values are different (100.0 %)
[left]:  Index([1, 2], dtype='int64')
[right]: Index([2, 1], dtype='int64')
At positional index 0, first diff: 1 != 2

But if check_like=True, it will only check if the index/column combination has the same values in left and right.

pd.testing.assert_frame_equal(a, b, check_like=True)