Columns by types#

It is often useful to select only the columns that have a specific data type. For example, machine learning pipelines commonly preprocess numeric and categorical columns in different ways.

Experimental frame#

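The following cell creates a data frame whose columns have different data types: each column is randomly filled with either normally distributed floats or random 10-character strings.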

import numpy as np
import pandas as pd
from IPython.display import HTML, display

np.random.seed(10)

sample_shape = (100, 20)  # (rows, columns)

def generate_numeric(sample_size):
    # Floats drawn from a standard normal distribution.
    return np.random.normal(0, 1, sample_size)

def generate_str(sample_size):
    # Random 10-letter lowercase strings; `high` is exclusive, so "z" never appears.
    return [
        "".join(map(chr, np.random.randint(low=ord("a"), high=ord("z"), size=10)))
        for _ in range(sample_size)
    ]

# Each column is randomly generated as either numeric or string data.
test_frame = pd.DataFrame({
    f"column{i}": np.random.choice([generate_numeric, generate_str])(sample_shape[0])
    for i in range(sample_shape[1])
})

test_frame.head(5)
column0 column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 column17 column18 column19
0 eparqrijak -0.885237 icveuvsnde cbrwxbagsa joojouhnsa cmuhdgunmu nmfjqsnnvq 0.900117 -1.269680 1.673854 -0.981047 wxejdwfwks ywexlugyio oueqgttxsx eksymjwner xyowiyxxfa gxcccqjhsp twhcjgunju grccysdkoh 0.114989
1 iwetqeplwy 0.322590 bfkypaqefr fjkhhelgvn cfpvrmueij crsvpfjehm pfhardywns 1.762525 1.519998 0.085414 1.083912 pwtchygtok dpvswtdmvg lvsxcxvnpp gwunqbkhdh ubcrvcnpgs mauwtacklp yvfdtqbixn xwrvapaqbs 1.247267
2 lbieortwnf -0.449772 siavscpewk mhnjrhtnau xaxeyfoxvc tbsrqpmmew txeqmsqaau -1.020418 0.898286 1.805783 1.103664 xitnomidxt fgyktnnake rnxbuhgnnt bcjbulycwx ovqimfwygl jmdnepbygg nrgyocwhqx ytwxdqablj -1.462726
3 ntnwmbesnw 0.790567 ksqksaxcvf xdfgbltktd rwmjlcjiun jeixeeqhgc jaksucxefd -0.017661 -0.274982 -0.077126 -0.643199 gddfxpsqfv dnjqsbynyw qxjyvrnewn xnyptpulyq epwspoqbyg bsvjpmulwr cwywpacbma krhtlbypnu -0.517101
4 xylkyjpsqw 1.690074 rebhymnhun rjbandmwys qectgjuiwb wcrrgxwbdl nutkvjctxv -1.259420 -0.229280 -0.698492 -0.351349 qooeitqcjb udrqckibws qacerqmmxk bctmrcaypo exnlqjeilv jmcoelgcbu ycqvdihklu oqakhlnuux 0.121970
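
Before selecting anything, it can help to check which columns pandas treats as numeric. The following is a minimal sketch on a small hypothetical frame; small_frame and its column names are illustrative assumptions, not part of the frame above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: two numeric columns and one string column.
small_frame = pd.DataFrame({
    "a": np.array([0.1, 0.2, 0.3]),
    "b": ["x", "y", "z"],
    "c": np.array([1, 2, 3]),
})

# One dtype per column; string columns appear as object
# (or as a dedicated string dtype in newer pandas versions).
column_dtypes = small_frame.dtypes

# Flag each column as numeric or not.
is_numeric = column_dtypes.apply(pd.api.types.is_numeric_dtype)
print(is_numeric)
```

The same check applied to test_frame.dtypes should flag exactly the columns that were filled by generate_numeric.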

Select numeric columns#

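Call pandas.DataFrame.select_dtypes("number") to keep only the columns with a numeric data type. The following cell applies it to test_frame and displays both the head of the result and its data types, confirming that only numeric columns remain.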

# Keep only the columns with a numeric dtype.
numeric_columns = test_frame.select_dtypes("number")

display(HTML("<b>Head</b>"))
display(numeric_columns.head())
display(HTML("<b>Data types</b>"))
# `dtypes` is already a Series, so it converts to a one-column frame directly.
display(numeric_columns.dtypes.to_frame("Data type"))
Head
     column1    column7    column8    column9   column10   column19
0  -0.885237   0.900117  -1.269680   1.673854  -0.981047   0.114989
1   0.322590   1.762525   1.519998   0.085414   1.083912   1.247267
2  -0.449772  -1.020418   0.898286   1.805783   1.103664  -1.462726
3   0.790567  -0.017661  -0.274982  -0.077126  -0.643199  -0.517101
4   1.690074  -1.259420  -0.229280  -0.698492  -0.351349   0.121970

Data types
          Data type
column1     float64
column7     float64
column8     float64
column9     float64
column10    float64
column19    float64
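
The complementary selection, the non-numeric columns, can be obtained by passing the same selector to the exclude parameter of select_dtypes. The following is a minimal sketch; mixed_frame and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature frame with mixed column types.
mixed_frame = pd.DataFrame({
    "price": [1.5, 2.5, 3.5],
    "city": ["Oslo", "Lima", "Pune"],
    "count": [1, 2, 3],
})

# Keep numeric columns, as in the cell above.
numeric_part = mixed_frame.select_dtypes(include="number")

# Keep everything that is not numeric by excluding the same selector.
non_numeric_part = mixed_frame.select_dtypes(exclude="number")

print(list(numeric_part.columns))
print(list(non_numeric_part.columns))
```

Using exclude="number" rather than naming a specific string dtype keeps the split independent of how a given pandas version stores text columns.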