Usage options

Usage options#

This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.

import pandas as pd
from IPython.display import HTML
A B C
0 a 2 10
1 a 1 20
2 b 3 30
3 b 4 40
4 c 6 50
5 c 5 60

Iterating#

You can iterate trow pandas.DataFrame.groupby retults. In each eteration you will get tuple of two values:

  • Value of the grouping variable for this iteration.

  • Sub-sampling from the original data set corresponding to the considered value of the grouping variable.


The following cell defines example.

df = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b', 'c', 'c'],
    'B': [2, 1, 3, 4, 6, 5],
    'C': [10, 20, 30, 40, 50, 60]
})

df
A B C
0 a 2 10
1 a 1 20
2 b 3 30
3 b 4 40
4 c 6 50
5 c 5 60

The following cell creates a groupby object and displays the result of the first iteration over it.

gb = df.groupby('A')
ans = next(iter(gb))
ans
('a',
    A  B   C
 0  a  2  10
 1  a  1  20)

The first item determines the group:

ans[0]
'a'

The second element contains the subset corresponding to the group considered by the iteration.

ans[1]
A B C
0 a 2 10
1 a 1 20

agg#

This is a way to apply aggregation functions using syntax {<var_name_1>:<aggregation_function_1>, <var_name_2>:<aggregation_function_2>, ...}.


The next cell defines a data frame containing the group and two variables.

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "variable1": [0, 7, 5, 5],
    "variable2": [0, 7, 5, 5]
})

The following cell illustrates how different aggregation functions must be applied to the different columns.

df.groupby("group").agg({"variable1": "max", "variable2": "sum"})
variable1 variable2
group
a 7 7
b 5 10

apply#

This method applies a specific function to each subset corresponding to a group.

Check more at the corresponding page.


The following cell defines a data frame that we will used as an example.

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "a": [1, 2, 3],
    "b": [7, 8, 1]
})

The following cell defines the example_function which prints the input it receives and uses it as an argument for the apply function.

def example_funtion(subdf):
    print("=========")
    print(subdf)
    return 5

res = df.groupby("group").apply(example_funtion, include_groups=False)
=========
   a  b
0  1  7
1  2  8
=========
   a  b
2  3  1

transform#

This is a function that allows you to get aggregations as pandas.Series/pandas.DataFrame indexed like the original pandas.DataFrame.


The following cell creates a pandas.DataFrame that will be used as an example.

df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [15, 10, 4, 5]
})
df
A B
0 a 15
1 a 10
2 b 4
3 b 5

Suppose you need to each line correspond the maximum value of B among the group determined by the A.

df["max B in A"] = (
    df.groupby("A")["B"].transform("max")
)
df
A B min B in A
0 a 15 15
1 a 10 15
2 b 4 5
3 b 5 5

expanding#

If you need count expanding (cumulating) values over groups use expanding method of the groupby objects.


The following cell defines and displays the data frame that will be used as examle.

example = pd.DataFrame({
    "groups": ["A", "A", "A", "B", "B"],
    "values": [1, 2, 7, 3, 4],
    "order": [3, 2, 1, 1, 2]
})
example
groups values order
0 A 1 3
1 A 2 2
2 A 7 1
3 B 3 1
4 B 4 2

Suppose you need to accumulate values across groups - each next record must be affected by the previous ones. The following cell shows how it can be done.

example.groupby("groups").expanding()["values"].sum()
groups   
A       0     1.0
        1     3.0
        2    10.0
B       3     3.0
        4     7.0
Name: values, dtype: float64

It follows the order of the input DataFrame. To expand over a specific column, you need to sort the DataFrame accordingly. The following cell shows a solution where the results expand by order.

example.sort_values("order").groupby("groups").expanding()["values"].sum()
groups   
A       2     7.0
        1     9.0
        0    10.0
B       3     3.0
        4     7.0
Name: values, dtype: float64

So each record of the result is influenced only by rows with lower order values.