Inline column add (assign)

Contents

Inline column add (assign)#

Sometimes you need to create temporary columns for certain calculations and don’t need a new column in other calculations. For this purpose, it is very convenient to have a method that allows you to return the same database but with new columns. For such purposes you can use pandas.DataFrame.assign. See basic examples in pandas documentation.

import numpy as np
import pandas as pd

from IPython.display import HTML

Use case#

Let’s say you need to group your data frame by the sum of the columns. Example data frame generated in the following cell.

sample_size = 20
np.random.seed(10)

show_df = pd.DataFrame({
    "group col1" : np.random.randint(10, 14, sample_size),
    "group col2" : np.random.randint(10, 14, sample_size)
})
show_df
group col1 group col2
0 11 10
1 11 13
2 10 10
3 13 10
4 10 13
5 11 13
6 13 12
7 10 10
8 11 13
9 11 12
10 10 12
11 11 11
12 11 10
13 12 10
14 10 12
15 11 11
16 10 13
17 12 12
18 10 11
19 12 11

Some combinations of numbers can give the same sum. For example, 10 + 12 = 22 and 11 + 11 = 22. If you just groupby both columns it will think they are different combinations, but we need to aggregate by the sum of the columns.

The following cell shows that groupby both columns leads to wrong results.

show_df.groupby(["group col1", "group col2"])["group col1"].count().to_dict()
{(10, 10): 2,
 (10, 11): 1,
 (10, 12): 2,
 (10, 13): 2,
 (11, 10): 2,
 (11, 11): 2,
 (11, 12): 1,
 (11, 13): 3,
 (12, 10): 1,
 (12, 11): 1,
 (12, 12): 1,
 (13, 10): 1,
 (13, 12): 1}

You can perform it using temporary dataframe. But it cause some unnecessary code and variables. Like in cell below.

temp_df = show_df.copy()
temp_df["sum"] = temp_df["group col1"] + temp_df["group col2"]
display(HTML('<p style="font-size:15px">Temp frame</p>'))
display(temp_df)
display(HTML('<p style="font-size:15px">Result</p>'))
display(temp_df.groupby("sum")["group col1"].count().to_dict())

Temp frame

group col1 group col2 sum
0 11 10 21
1 11 13 24
2 10 10 20
3 13 10 23
4 10 13 23
5 11 13 24
6 13 12 25
7 10 10 20
8 11 13 24
9 11 12 23
10 10 12 22
11 11 11 22
12 11 10 21
13 12 10 22
14 10 12 22
15 11 11 22
16 10 13 23
17 12 12 24
18 10 11 21
19 12 11 23

Result

{20: 2, 21: 3, 22: 5, 23: 5, 24: 4, 25: 1}

But using assign method you can perform the same operation just using one line of the code. The exampe is in the cell below.

show_df.assign(
    sum = lambda row: row["group col1"] + row["group col2"]
).groupby("sum")["sum"].count().to_dict()
{20: 2, 21: 3, 22: 5, 23: 5, 24: 4, 25: 1}