Data transformations#
import numpy as np
import pandas as pd
Stack/unstack#
This topic is well covered in the official documentation: stack and unstack functions.
Here I just want to show the idea that stack
actually adds columns as a sublevel of the index. And unstack
adds index as sublevel of columns.
One level example#
The simple example will help you understand the concept. The following cell defines pandas.DataFrame
, which we will use as an example.
df_single_level_cols = pd.DataFrame(
[[0, 1], [2, 3]],
index=['cat', 'dog'],
columns=['weight', 'height']
)
df_single_level_cols
weight | height | |
---|---|---|
cat | 0 | 1 |
dog | 2 | 3 |
If we use the stack
method, you’ll get series with indexes that use original columns as a sublevel of the index.
df_single_level_cols.stack(future_stack=True)
cat weight 0
height 1
dog weight 2
height 3
dtype: int64
If we use the unstack
method, we’ll get series that have an index where the top level is built from columns and the inner level is built from the index of the original dataframe.
df_single_level_cols.unstack()
weight cat 0
dog 2
height cat 1
dog 3
dtype: int64
Two level example#
This example is even more representative, because as a result we get not a series that is always displayed vertically, but a dataframe that preserves the original display. So the following cell creates and displays the dataframe we’ll use as an example.
ind = pd.MultiIndex.from_tuples([
('cat', 'wild'),
('cat', 'home'),
('dog', "wild"),
('dog', 'home')
])
col = pd.MultiIndex.from_tuples([
("width", "m"),
("width", "ft"),
("height", "m"),
("height", "ft")
])
example_df = pd.DataFrame(
(
np.arange(len(ind)*len(col))
.reshape([len(ind), len(col)])
),
index=ind,
columns=col
)
example_df
width | height | ||||
---|---|---|---|---|---|
m | ft | m | ft | ||
cat | wild | 0 | 1 | 2 | 3 |
home | 4 | 5 | 6 | 7 | |
dog | wild | 8 | 9 | 10 | 11 |
home | 12 | 13 | 14 | 15 |
Stack
So here is an example of a stack
that actually just moved different levels of columns to the inner level of the index.
display(example_df.stack(level=0, future_stack=True))
display(example_df.stack(level=1, future_stack=True))
m | ft | |||
---|---|---|---|---|
cat | wild | width | 0 | 1 |
height | 2 | 3 | ||
home | width | 4 | 5 | |
height | 6 | 7 | ||
dog | wild | width | 8 | 9 |
height | 10 | 11 | ||
home | width | 12 | 13 | |
height | 14 | 15 |
width | height | |||
---|---|---|---|---|
cat | wild | m | 0 | 2 |
ft | 1 | 3 | ||
home | m | 4 | 6 | |
ft | 5 | 7 | ||
dog | wild | m | 8 | 10 |
ft | 9 | 11 | ||
home | m | 12 | 14 | |
ft | 13 | 15 |
Unstack
So here is an example of the unstack
that moves different levels of the index to the inner level of the columns.
display(example_df.unstack(level=0))
display(example_df.unstack(level=1))
width | height | |||||||
---|---|---|---|---|---|---|---|---|
m | ft | m | ft | |||||
cat | dog | cat | dog | cat | dog | cat | dog | |
home | 4 | 12 | 5 | 13 | 6 | 14 | 7 | 15 |
wild | 0 | 8 | 1 | 9 | 2 | 10 | 3 | 11 |
width | height | |||||||
---|---|---|---|---|---|---|---|---|
m | ft | m | ft | |||||
home | wild | home | wild | home | wild | home | wild | |
cat | 4 | 0 | 5 | 1 | 6 | 2 | 7 | 3 |
dog | 12 | 8 | 13 | 9 | 14 | 10 | 15 | 11 |
Combining frames#
Sometimes there are situations where it is useful to create a data frame that contains all possible combinations of input rows and their corresponding columns concatenated.
Problem#
Imagine you have a task that involves applying a statistical model to evaluate the performance of different games for each user in a game service. To accomplish this, you need to examine all possible combinations of users and games to determine the best game for each user. In the cell below, we have created example tables for both users and games to illustrate this scenario.
import pandas as pd
from IPython.display import HTML
user = pd.DataFrame({
"user_name" : ["Fedor", "Sandra", "Mark"],
"sex" : ["m", "w", "m"],
"age" : [10, 23, 7]
})
game = pd.DataFrame({
"game_name" : ["warcraft", "conter-strike", "valorant"],
"genre" : ["rpg", "shooter", "shooter"]
})
display(HTML("<p style='font-size:15px'>Users</p>"))
display(user)
display(HTML("<p style='font-size:15px'>Games</p>"))
display(game)
Users
user_name | sex | age | |
---|---|---|---|
0 | Fedor | m | 10 |
1 | Sandra | w | 23 |
2 | Mark | m | 7 |
Games
game_name | genre | |
---|---|---|
0 | warcraft | rpg |
1 | conter-strike | shooter |
2 | valorant | shooter |
Solution#
You can just use pandas.Merge
function for getting all possble combinations. But you have to specify column that should be used as a key for merge. You can just specify same key for all records of both tables and specify it on the on
argument of the pandas.Merge
. So the realisation is showen below.
user["key"]=1;game["key"]=1
pd.merge(user, game, on="key").drop("key", axis = 1)
user_name | sex | age | game_name | genre | |
---|---|---|---|---|---|
0 | Fedor | m | 10 | warcraft | rpg |
1 | Fedor | m | 10 | conter-strike | shooter |
2 | Fedor | m | 10 | valorant | shooter |
3 | Sandra | w | 23 | warcraft | rpg |
4 | Sandra | w | 23 | conter-strike | shooter |
5 | Sandra | w | 23 | valorant | shooter |
6 | Mark | m | 7 | warcraft | rpg |
7 | Mark | m | 7 | conter-strike | shooter |
8 | Mark | m | 7 | valorant | shooter |
As the result Fedor
have maches with all games. And any other user have matches with all games.
Group by#
pd.dataframe.groupby()
and pd.series.groupby()
allow to perform operations by groups.
For more check:
Special page on this website.
The following cell shows a very basic example of the groupby data. There is a column that separates the dataset by groups: group
; and a column under consideration: value
.
example = pd.DataFrame(
[
["a", 1],
["b", 2],
["b", 8],
["c", 2]
],
columns=["group", "value"]
)
example
group | value | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
2 | b | 8 |
3 | c | 2 |
So the following cell calculates sum of the values for each unique value in the group
.
example.groupby("group").sum()
value | |
---|---|
group | |
a | 1 |
b | 10 |
c | 2 |