# `groupby` parameters

This page discusses the parameters of the `pandas.DataFrame.groupby` and `pandas.Series.groupby` methods.

In [2]:
import pandas as pd

## Groups as index

The `as_index` parameter specifies whether the grouping columns become the index of the output (`True`) or stay as regular columns (`False`).

---

The following cell creates a table that will serve as an example.

In [8]:
df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4]
})
df

Unnamed: 0,A,B
0,a,1
1,a,2
2,b,3
3,b,4


The two following cells aggregate by the "A" column, with both options for `as_index`.

In [9]:
df.groupby("A", as_index=True).sum()

Unnamed: 0_level_0,B
A,Unnamed: 1_level_1
a,3
b,7


In [10]:
df.groupby("A", as_index=False).sum()

Unnamed: 0,A,B
0,a,3
1,b,7


If `as_index=True`, the grouping column is placed in the index of the result. If `as_index=False`, it stays a regular column and the result gets a default integer index.
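As a minimal sketch of how the two options relate, the cell below rebuilds the example table and checks that moving the index of the `as_index=True` result back into a column gives the same table as `as_index=False`. The variable names `indexed`, `flat`, and `same` are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4]
})

# Group keys become the index of the result.
indexed = df.groupby("A", as_index=True).sum()

# Group keys stay as a regular column.
flat = df.groupby("A", as_index=False).sum()

# Moving the index back into a column should reproduce the flat table.
same = indexed.reset_index().equals(flat)
print(same)
```

In practice, `as_index=False` saves the extra `reset_index()` call when the result is going to be merged or exported as a flat table.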

## By external

You can group by an arbitrary array that is not a column of the dataframe being grouped by passing it to the `by` argument. The array must contain one label per row of the dataframe.

---

The following cell creates an example dataframe.

In [14]:
df = pd.DataFrame({
    "g": ['a', 'a', 'a', 'b', 'b'],
    "val": [1, 2, 3, 4, 5]
})
df

Unnamed: 0,g,val
0,a,1
1,a,2
2,a,3
3,b,4
4,b,5


The following cell groups the dataframe by a given list. Summing the string column `g` concatenates its values within each group.

In [8]:
df.groupby(["x", "x", "y", "y", "y"]).sum()

Unnamed: 0,g,val
x,aa,3
y,abb,12


The next cell shows how to use several external grouping arrays at the same time.

In [None]:
df.groupby([
    ["x", "x", "y", "y", "y"],
    ["m", "k", "m", "k", "m"]
]).sum()

Unnamed: 0,Unnamed: 1,g,val
x,k,a,2
x,m,a,1
y,k,b,4
y,m,ab,8


External arrays and columns of the dataframe can be mixed in a single `by` list.

In [13]:
df.groupby(
    [["x", "x", "y", "y", "y"], "g"]
).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,val
Unnamed: 0_level_1,g,Unnamed: 2_level_1
x,a,3
y,a,3
y,b,9
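The results above have unnamed index levels for the external keys. As a hedged sketch, one way to get a labelled index is to pass the external labels as a named `pd.Series`. The name `ext` is an assumption made for this example, not something from the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "val": [1, 2, 3, 4, 5]
})

# Hypothetical external labels; the name "ext" is chosen for illustration.
ext = pd.Series(["x", "x", "y", "y", "y"], name="ext")

# A Series key is aligned to df by index, and its name labels the result index.
result = df.groupby(ext)["val"].sum()
print(result.index.name)
print(result.to_dict())
```

Note that a `Series` key is matched to the dataframe by index labels rather than by position, so its index should correspond to the index of the dataframe being grouped.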


## `observed`

With the categorical datatype, a category can exist without ever appearing in the data. The `observed` parameter controls whether such unobserved categories are included in the `groupby` result (`False`) or only observed categories are used (`True`).

---

The following cell creates a dataframe whose column `A` has the categorical datatype and includes a category `lost_cat` that doesn't appear in any observation.

In [21]:
df = pd.DataFrame({
    "A": pd.Categorical(
        ['a', 'a', 'b', 'b'],
        categories=['a', 'b', 'lost_cat']
    ),
    "B": [1, 2, 3, 4]
})
df

Unnamed: 0,A,B
0,a,1
1,a,2
2,b,3
3,b,4


The key feature of this dataframe is that column `A` declares more categories than it uses. The following cell prints the `A` column of the dataframe:

In [22]:
df["A"]

0    a
1    a
2    b
3    b
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'lost_cat']

The `Categories` line mentions `lost_cat`, but that value never appears in the dataframe.

With `observed=True`, only categories that actually occur in the dataframe appear in the result.

In [23]:
df.groupby("A", observed=True).sum()

Unnamed: 0_level_0,B
A,Unnamed: 1_level_1
a,3
b,7


With `observed=False`, the result includes all categories declared for the categorical column, even those that do not appear in the data. The empty `lost_cat` group sums to 0.

In [None]:
df.groupby("A", observed=False).sum()

Unnamed: 0_level_0,B
A,Unnamed: 1_level_1
a,3
b,7
lost_cat,0
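The same difference can be seen with group sizes, which make the empty group easier to spot than a sum of 0. The sketch below rebuilds the example dataframe and passes `observed` explicitly in both calls, since the default value has differed between pandas releases. The names `all_cats` and `seen_only` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(
        ["a", "a", "b", "b"],
        categories=["a", "b", "lost_cat"]
    ),
    "B": [1, 2, 3, 4]
})

# Every declared category gets a row, including the one with no observations.
all_cats = df.groupby("A", observed=False).size()

# Only categories that occur in the data get a row.
seen_only = df.groupby("A", observed=True).size()

print(all_cats)
print(seen_only)
```

Keeping unobserved categories can be useful for reports that need a fixed set of rows, while `observed=True` avoids empty groups when several categorical keys are combined.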
