Usage options#
This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.
import pandas as pd
from IPython.display import HTML
A | B | C | |
---|---|---|---|
0 | a | 2 | 10 |
1 | a | 1 | 20 |
2 | b | 3 | 30 |
3 | b | 4 | 40 |
4 | c | 6 | 50 |
5 | c | 5 | 60 |
Iterating#
You can iterate trow pandas.DataFrame.groupby
retults. In each eteration you will get tuple of two values:
Value of the grouping variable for this iteration.
Sub-sampling from the original data set corresponding to the considered value of the grouping variable.
The following cell defines example.
df = pd.DataFrame({
'A': ['a', 'a', 'b', 'b', 'c', 'c'],
'B': [2, 1, 3, 4, 6, 5],
'C': [10, 20, 30, 40, 50, 60]
})
df
A | B | C | |
---|---|---|---|
0 | a | 2 | 10 |
1 | a | 1 | 20 |
2 | b | 3 | 30 |
3 | b | 4 | 40 |
4 | c | 6 | 50 |
5 | c | 5 | 60 |
The following cell creates a groupby
object and displays the result of the first iteration over it.
gb = df.groupby('A')
ans = next(iter(gb))
ans
('a',
A B C
0 a 2 10
1 a 1 20)
The first item determines the group:
ans[0]
'a'
The second element contains the subset corresponding to the group considered by the iteration.
ans[1]
A | B | C | |
---|---|---|---|
0 | a | 2 | 10 |
1 | a | 1 | 20 |
agg
#
This is a way to apply aggregation functions using syntax {<var_name_1>:<aggregation_function_1>, <var_name_2>:<aggregation_function_2>, ...}
.
The next cell defines a data frame containing the group
and two variables.
df = pd.DataFrame({
"group": ["a", "a", "b", "b"],
"variable1": [0, 7, 5, 5],
"variable2": [0, 7, 5, 5]
})
The following cell illustrates how different aggregation functions must be applied to the different columns.
df.groupby("group").agg({"variable1": "max", "variable2": "sum"})
variable1 | variable2 | |
---|---|---|
group | ||
a | 7 | 7 |
b | 5 | 10 |
apply
#
This method applies a specific function to each subset corresponding to a group.
Check more at the corresponding page.
The following cell defines a data frame that we will used as an example.
df = pd.DataFrame({
"group": ["a", "a", "b"],
"a": [1, 2, 3],
"b": [7, 8, 1]
})
The following cell defines the example_function
which prints the input it receives and uses it as an argument for the apply
function.
def example_funtion(subdf):
print("=========")
print(subdf)
return 5
res = df.groupby("group").apply(example_funtion, include_groups=False)
=========
a b
0 1 7
1 2 8
=========
a b
2 3 1
transform
#
This is a function that allows you to get aggregations as pandas.Series
/pandas.DataFrame
indexed like the original pandas.DataFrame
.
The following cell creates a pandas.DataFrame
that will be used as an example.
df = pd.DataFrame({
"A": ["a", "a", "b", "b"],
"B": [15, 10, 4, 5]
})
df
A | B | |
---|---|---|
0 | a | 15 |
1 | a | 10 |
2 | b | 4 |
3 | b | 5 |
Suppose you need to each line correspond the maximum value of B
among the group determined by the A
.
df["max B in A"] = (
df.groupby("A")["B"].transform("max")
)
df
A | B | min B in A | |
---|---|---|---|
0 | a | 15 | 15 |
1 | a | 10 | 15 |
2 | b | 4 | 5 |
3 | b | 5 | 5 |
expanding
#
If you need count expanding (cumulating) values over groups use expanding
method of the groupby
objects.
The following cell defines and displays the data frame that will be used as examle.
example = pd.DataFrame({
"groups": ["A", "A", "A", "B", "B"],
"values": [1, 2, 7, 3, 4],
"order": [3, 2, 1, 1, 2]
})
example
groups | values | order | |
---|---|---|---|
0 | A | 1 | 3 |
1 | A | 2 | 2 |
2 | A | 7 | 1 |
3 | B | 3 | 1 |
4 | B | 4 | 2 |
Suppose you need to accumulate values
across groups - each next record must be affected by the previous ones. The following cell shows how it can be done.
example.groupby("groups").expanding()["values"].sum()
groups
A 0 1.0
1 3.0
2 10.0
B 3 3.0
4 7.0
Name: values, dtype: float64
It follows the order of the input DataFrame. To expand over a specific column, you need to sort the DataFrame accordingly. The following cell shows a solution where the results expand by order
.
example.sort_values("order").groupby("groups").expanding()["values"].sum()
groups
A 2 7.0
1 9.0
0 10.0
B 3 3.0
4 7.0
Name: values, dtype: float64
So each record of the result is influenced only by rows with lower order
values.