Data transformations#

import numpy as np
import pandas as pd

Stack/unstack#

This topic is well covered in the official documentation: stack and unstack functions.

Here I just want to show the idea that stack actually adds columns as a sublevel of the index. And unstack adds index as sublevel of columns.

One level example#

The simple example will help you understand the concept. The following cell defines pandas.DataFrame, which we will use as an example.

df_single_level_cols = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=['cat', 'dog'],
    columns=['weight', 'height']
)
df_single_level_cols
weight height
cat 0 1
dog 2 3

If we use the stack method, you’ll get series with indexes that use original columns as a sublevel of the index.

df_single_level_cols.stack(future_stack=True)
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

If we use the unstack method, we’ll get series that have an index where the top level is built from columns and the inner level is built from the index of the original dataframe.

df_single_level_cols.unstack()
weight  cat    0
        dog    2
height  cat    1
        dog    3
dtype: int64

Two level example#

This example is even more representative, because as a result we get not a series that is always displayed vertically, but a dataframe that preserves the original display. So the following cell creates and displays the dataframe we’ll use as an example.

ind = pd.MultiIndex.from_tuples([
    ('cat', 'wild'),
    ('cat', 'home'),
    ('dog', "wild"),
    ('dog', 'home')
])
col = pd.MultiIndex.from_tuples([
    ("width", "m"),
    ("width", "ft"),
    ("height", "m"),
    ("height", "ft")
])

example_df = pd.DataFrame(
    (
        np.arange(len(ind)*len(col))
        .reshape([len(ind), len(col)])
    ),
    index=ind,
    columns=col
)
example_df
width height
m ft m ft
cat wild 0 1 2 3
home 4 5 6 7
dog wild 8 9 10 11
home 12 13 14 15

Stack

So here is an example of a stack that actually just moved different levels of columns to the inner level of the index.

display(example_df.stack(level=0, future_stack=True))
display(example_df.stack(level=1, future_stack=True))
m ft
cat wild width 0 1
height 2 3
home width 4 5
height 6 7
dog wild width 8 9
height 10 11
home width 12 13
height 14 15
width height
cat wild m 0 2
ft 1 3
home m 4 6
ft 5 7
dog wild m 8 10
ft 9 11
home m 12 14
ft 13 15

Unstack

So here is an example of the unstack that moves different levels of the index to the inner level of the columns.

display(example_df.unstack(level=0))
display(example_df.unstack(level=1))
width height
m ft m ft
cat dog cat dog cat dog cat dog
home 4 12 5 13 6 14 7 15
wild 0 8 1 9 2 10 3 11
width height
m ft m ft
home wild home wild home wild home wild
cat 4 0 5 1 6 2 7 3
dog 12 8 13 9 14 10 15 11

Combining frames#

Sometimes there are situations where it is useful to create a data frame that contains all possible combinations of input rows and their corresponding columns concatenated.

Problem#

Imagine you have a task that involves applying a statistical model to evaluate the performance of different games for each user in a game service. To accomplish this, you need to examine all possible combinations of users and games to determine the best game for each user. In the cell below, we have created example tables for both users and games to illustrate this scenario.

import pandas as pd
from IPython.display import HTML 

user = pd.DataFrame({
    "user_name" : ["Fedor", "Sandra", "Mark"],
    "sex" : ["m", "w", "m"],
    "age" : [10, 23, 7]
})
game = pd.DataFrame({
    "game_name" : ["warcraft", "conter-strike", "valorant"],
    "genre" : ["rpg", "shooter", "shooter"]
})

display(HTML("<p style='font-size:15px'>Users</p>"))
display(user)
display(HTML("<p style='font-size:15px'>Games</p>"))
display(game)

Users

user_name sex age
0 Fedor m 10
1 Sandra w 23
2 Mark m 7

Games

game_name genre
0 warcraft rpg
1 conter-strike shooter
2 valorant shooter

Solution#

You can just use pandas.Merge function for getting all possble combinations. But you have to specify column that should be used as a key for merge. You can just specify same key for all records of both tables and specify it on the on argument of the pandas.Merge. So the realisation is showen below.

user["key"]=1;game["key"]=1
pd.merge(user, game, on="key").drop("key", axis = 1)
user_name sex age game_name genre
0 Fedor m 10 warcraft rpg
1 Fedor m 10 conter-strike shooter
2 Fedor m 10 valorant shooter
3 Sandra w 23 warcraft rpg
4 Sandra w 23 conter-strike shooter
5 Sandra w 23 valorant shooter
6 Mark m 7 warcraft rpg
7 Mark m 7 conter-strike shooter
8 Mark m 7 valorant shooter

As the result Fedor have maches with all games. And any other user have matches with all games.

Group by#

pd.dataframe.groupby() and pd.series.groupby() allow to perform operations by groups.

For more check:


The following cell shows a very basic example of the groupby data. There is a column that separates the dataset by groups: group; and a column under consideration: value.

example = pd.DataFrame(
    [
        ["a", 1],
        ["b", 2],
        ["b", 8],
        ["c", 2]
    ],
    columns=["group", "value"]
)
example
group value
0 a 1
1 b 2
2 b 8
3 c 2

So the following cell calculates sum of the values for each unique value in the group.

example.groupby("group").sum()
value
group
a 1
b 10
c 2