Data transformations

Data transformations#

import numpy as np
import pandas as pd

Stack/unstack#

This topic is well covered in the official documentation: stack and unstack functions.

Here I just want to show the idea that stack actually adds columns as a sublevel of the index. And unstack adds index as sublevel of columns.

One level example#

The simple example will help you understand the concept. The following cell defines pandas.DataFrame, which we will use as an example.

df_single_level_cols = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=['cat', 'dog'],
    columns=['weight', 'height']
)
df_single_level_cols

	weight	height
cat	0	1
dog	2	3

If we use the stack method, you’ll get series with indexes that use original columns as a sublevel of the index.

df_single_level_cols.stack(future_stack=True)

cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

If we use the unstack method, we’ll get series that have an index where the top level is built from columns and the inner level is built from the index of the original dataframe.

df_single_level_cols.unstack()

weight  cat    0
        dog    2
height  cat    1
        dog    3
dtype: int64

Two level example#

This example is even more representative, because as a result we get not a series that is always displayed vertically, but a dataframe that preserves the original display. So the following cell creates and displays the dataframe we’ll use as an example.

ind = pd.MultiIndex.from_tuples([
    ('cat', 'wild'),
    ('cat', 'home'),
    ('dog', "wild"),
    ('dog', 'home')
])
col = pd.MultiIndex.from_tuples([
    ("width", "m"),
    ("width", "ft"),
    ("height", "m"),
    ("height", "ft")
])

example_df = pd.DataFrame(
    (
        np.arange(len(ind)*len(col))
        .reshape([len(ind), len(col)])
    ),
    index=ind,
    columns=col
)
example_df

		width		height
		m	ft	m	ft
cat	wild	0	1	2	3
cat	home	4	5	6	7
dog	wild	8	9	10	11
dog	home	12	13	14	15

Stack

So here is an example of a stack that actually just moved different levels of columns to the inner level of the index.

display(example_df.stack(level=0, future_stack=True))
display(example_df.stack(level=1, future_stack=True))

			m	ft
cat	wild	width	0	1
	wild	height	2	3
	home	width	4	5
	home	height	6	7
dog	wild	width	8	9
	wild	height	10	11
	home	width	12	13
	home	height	14	15

			width	height
cat	wild	m	0	2
	wild	ft	1	3
	home	m	4	6
	home	ft	5	7
dog	wild	m	8	10
	wild	ft	9	11
	home	m	12	14
	home	ft	13	15

Unstack

So here is an example of the unstack that moves different levels of the index to the inner level of the columns.

display(example_df.unstack(level=0))
display(example_df.unstack(level=1))

	width				height
	m		ft		m		ft
	cat	dog	cat	dog	cat	dog	cat	dog
home	4	12	5	13	6	14	7	15
wild	0	8	1	9	2	10	3	11

	width				height
	m		ft		m		ft
	home	wild	home	wild	home	wild	home	wild
cat	4	0	5	1	6	2	7	3
dog	12	8	13	9	14	10	15	11

Combining frames#

Sometimes there are situations where it is useful to create a data frame that contains all possible combinations of input rows and their corresponding columns concatenated.

Problem#

Imagine you have a task that involves applying a statistical model to evaluate the performance of different games for each user in a game service. To accomplish this, you need to examine all possible combinations of users and games to determine the best game for each user. In the cell below, we have created example tables for both users and games to illustrate this scenario.

import pandas as pd
from IPython.display import HTML 

user = pd.DataFrame({
    "user_name" : ["Fedor", "Sandra", "Mark"],
    "sex" : ["m", "w", "m"],
    "age" : [10, 23, 7]
})
game = pd.DataFrame({
    "game_name" : ["warcraft", "conter-strike", "valorant"],
    "genre" : ["rpg", "shooter", "shooter"]
})

display(HTML("<p style='font-size:15px'>Users</p>"))
display(user)
display(HTML("<p style='font-size:15px'>Games</p>"))
display(game)

Users

	user_name	sex	age
0	Fedor	m	10
1	Sandra	w	23
2	Mark	m	7

Games

	game_name	genre
0	warcraft	rpg
1	conter-strike	shooter
2	valorant	shooter

Solution#

You can just use pandas.Merge function for getting all possble combinations. But you have to specify column that should be used as a key for merge. You can just specify same key for all records of both tables and specify it on the on argument of the pandas.Merge. So the realisation is showen below.

user["key"]=1;game["key"]=1
pd.merge(user, game, on="key").drop("key", axis = 1)

	user_name	sex	age	game_name	genre
0	Fedor	m	10	warcraft	rpg
1	Fedor	m	10	conter-strike	shooter
2	Fedor	m	10	valorant	shooter
3	Sandra	w	23	warcraft	rpg
4	Sandra	w	23	conter-strike	shooter
5	Sandra	w	23	valorant	shooter
6	Mark	m	7	warcraft	rpg
7	Mark	m	7	conter-strike	shooter
8	Mark	m	7	valorant	shooter

As the result Fedor have maches with all games. And any other user have matches with all games.

Group by#

pd.dataframe.groupby() and pd.series.groupby() allow to perform operations by groups.

For more check:

Special page on this website.
Corresponding page in the official documentation.

The following cell shows a very basic example of the groupby data. There is a column that separates the dataset by groups: group; and a column under consideration: value.

example = pd.DataFrame(
    [
        ["a", 1],
        ["b", 2],
        ["b", 8],
        ["c", 2]
    ],
    columns=["group", "value"]
)
example

	group	value
0	a	1
1	b	2
2	b	8
3	c	2

So the following cell calculates sum of the values for each unique value in the group.

example.groupby("group").sum()

	value
group
a	1
b	10
c	2