Columns by types#
Sometimes it’s really useful to be able to select columns with specific data types. For example, in machine learning it’s very common to process numeric and categorical columns in different ways.
The key function here is pandas.DataFrame.select_dtypes.
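As a minimal sketch of the idea, the snippet below splits a small frame into a numeric part and a non-numeric part using the include and exclude parameters of select_dtypes. The frame and its column names (age, income, city) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy_frame = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000.0, 52000.5, 61000.0],
    "city": ["Oslo", "Lima", "Kyiv"],
})

# Keep only numeric columns (integers and floats).
numeric_part = toy_frame.select_dtypes(include="number")
# Keep everything that is not numeric.
other_part = toy_frame.select_dtypes(exclude="number")

print(list(numeric_part.columns))  # ['age', 'income']
print(list(other_part.columns))    # ['city']
```

Each part can then go through its own preprocessing, for example scaling for the numeric columns and encoding for the others.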
Experimental frame#
The following cell creates a data frame with columns of different datatypes.
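import pandas as pd
import numpy as np
from IPython.display import HTML, display

np.random.seed(10)  # fixed seed so the frame is reproducible
sample_shape = (100, 20)  # (rows, columns)

# Column of standard normal floats
generate_numeric = lambda sample_size: np.random.normal(0, 1, sample_size)
# Column of random 10-letter lowercase strings
# (randint's upper bound is exclusive, so "z" is never drawn)
generate_str = lambda sample_size: [
    "".join(map(chr, np.random.randint(low=ord("a"), high=ord("z"), size=10)))
    for i in range(sample_size)
]

# Each column is randomly generated as either numeric or string
test_frame = pd.DataFrame({
    f"column{i}": np.random.choice([generate_numeric, generate_str])(sample_shape[0])
    for i in range(sample_shape[1])
})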
test_frame.head(5)
| | column0 | column1 | column2 | column3 | column4 | column5 | column6 | column7 | column8 | column9 | column10 | column11 | column12 | column13 | column14 | column15 | column16 | column17 | column18 | column19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | eparqrijak | -0.885237 | icveuvsnde | cbrwxbagsa | joojouhnsa | cmuhdgunmu | nmfjqsnnvq | 0.900117 | -1.269680 | 1.673854 | -0.981047 | wxejdwfwks | ywexlugyio | oueqgttxsx | eksymjwner | xyowiyxxfa | gxcccqjhsp | twhcjgunju | grccysdkoh | 0.114989 |
1 | iwetqeplwy | 0.322590 | bfkypaqefr | fjkhhelgvn | cfpvrmueij | crsvpfjehm | pfhardywns | 1.762525 | 1.519998 | 0.085414 | 1.083912 | pwtchygtok | dpvswtdmvg | lvsxcxvnpp | gwunqbkhdh | ubcrvcnpgs | mauwtacklp | yvfdtqbixn | xwrvapaqbs | 1.247267 |
2 | lbieortwnf | -0.449772 | siavscpewk | mhnjrhtnau | xaxeyfoxvc | tbsrqpmmew | txeqmsqaau | -1.020418 | 0.898286 | 1.805783 | 1.103664 | xitnomidxt | fgyktnnake | rnxbuhgnnt | bcjbulycwx | ovqimfwygl | jmdnepbygg | nrgyocwhqx | ytwxdqablj | -1.462726 |
3 | ntnwmbesnw | 0.790567 | ksqksaxcvf | xdfgbltktd | rwmjlcjiun | jeixeeqhgc | jaksucxefd | -0.017661 | -0.274982 | -0.077126 | -0.643199 | gddfxpsqfv | dnjqsbynyw | qxjyvrnewn | xnyptpulyq | epwspoqbyg | bsvjpmulwr | cwywpacbma | krhtlbypnu | -0.517101 |
4 | xylkyjpsqw | 1.690074 | rebhymnhun | rjbandmwys | qectgjuiwb | wcrrgxwbdl | nutkvjctxv | -1.259420 | -0.229280 | -0.698492 | -0.351349 | qooeitqcjb | udrqckibws | qacerqmmxk | bctmrcaypo | exnlqjeilv | jmcoelgcbu | ycqvdihklu | oqakhlnuux | 0.121970 |
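Before selecting anything, it can help to check which data type pandas inferred for each column. The sketch below does this on a small stand-in frame (small_frame and its column names are assumptions made so the block runs on its own; the same calls apply to test_frame).

```python
import numpy as np
import pandas as pd

# Stand-in frame with one numeric and one string column.
rng = np.random.default_rng(0)
small_frame = pd.DataFrame({
    "numbers": rng.normal(0, 1, 5),
    "strings": ["a", "b", "c", "d", "e"],
})

# dtypes is a Series that maps each column name to its dtype.
print(small_frame.dtypes)

# Flag numeric columns explicitly with a helper from pandas.api.types.
is_numeric = {
    column: pd.api.types.is_numeric_dtype(small_frame[column])
    for column in small_frame.columns
}
print(is_numeric)
```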
Select numeric columns#
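To select the numeric columns, call select_dtypes("number") on the data frame. The following cell does this and displays the head of the result together with the data type of each selected column to confirm that only numeric columns were kept.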
numeric_columns = test_frame.select_dtypes("number")
display(HTML("<b>Head</b>"))
display(numeric_columns.head())
display(HTML("<b>Data types</b>"))
display(pd.Series(numeric_columns.dtypes, name = "Data type").to_frame())
Head
| | column1 | column7 | column8 | column9 | column10 | column19 |
---|---|---|---|---|---|---|
0 | -0.885237 | 0.900117 | -1.269680 | 1.673854 | -0.981047 | 0.114989 |
1 | 0.322590 | 1.762525 | 1.519998 | 0.085414 | 1.083912 | 1.247267 |
2 | -0.449772 | -1.020418 | 0.898286 | 1.805783 | 1.103664 | -1.462726 |
3 | 0.790567 | -0.017661 | -0.274982 | -0.077126 | -0.643199 | -0.517101 |
4 | 1.690074 | -1.259420 | -0.229280 | -0.698492 | -0.351349 | 0.121970 |
Data types
| | Data type |
---|---|
column1 | float64 |
column7 | float64 |
column8 | float64 |
column9 | float64 |
column10 | float64 |
column19 | float64 |
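All numeric columns in the frame above happen to be float64, so the output does not show that "number" is broader than floats. The sketch below, on a hypothetical frame with made-up column names, suggests that "number" also matches integer columns, while passing a concrete dtype such as "float64" narrows the selection.

```python
import pandas as pd

# Hypothetical frame mixing integer, float and string columns.
mixed_frame = pd.DataFrame({
    "ints": [1, 2, 3],
    "floats": [0.1, -0.2, 0.3],
    "strings": ["abc", "def", "ghi"],
})

# "number" covers both integer and floating point columns.
all_numeric = mixed_frame.select_dtypes("number")
# A concrete dtype keeps only the columns of exactly that type.
floats_only = mixed_frame.select_dtypes("float64")

print(list(all_numeric.columns))  # ['ints', 'floats']
print(list(floats_only.columns))  # ['floats']
```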