Columns by types

Sometimes it’s really useful to be able to select columns with specific data types. For example, in machine learning it’s very common to process numeric and categorical columns in different ways.

The key function here is pandas.DataFrame.select_dtypes.
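As a minimal sketch of how the method's include and exclude parameters behave, consider the small frame below. The frame toy_frame and its column names are hypothetical and are used for illustration only.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the parameters.
toy_frame = pd.DataFrame({
    "age": [25, 32, 47],
    "height": [1.75, 1.62, 1.80],
    "is_member": [True, False, True],
    "city": ["Oslo", "Lima", "Pune"],
})

# `include` keeps only the columns whose dtype matches.
floats_only = toy_frame.select_dtypes(include="float64")

# A list of dtypes keeps every column that matches any of them.
ints_and_bools = toy_frame.select_dtypes(include=["int64", "bool"])

# `exclude` drops the matching columns and keeps everything else.
# Boolean columns are not treated as "number", so "is_member" stays.
without_numbers = toy_frame.select_dtypes(exclude="number")

print(list(floats_only.columns))
print(list(ints_and_bools.columns))
print(list(without_numbers.columns))
```

The selected columns keep the order they had in the original frame.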

Experimental frame

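The following cell creates a data frame with columns of different data types: each of the 20 columns is randomly filled with either normally distributed floats or random 10-letter strings.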

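import pandas as pd
import numpy as np

from IPython.display import HTML, display

# Fix the seed so the generated frame is reproducible.
np.random.seed(10)

sample_shape = (100, 20)  # (rows, columns)


def generate_numeric(sample_size):
    # Standard normal floats.
    return np.random.normal(0, 1, sample_size)


def generate_str(sample_size):
    # Random 10-letter lowercase strings; `high` is exclusive,
    # so the letters range from "a" to "y".
    return [
        "".join(map(chr, np.random.randint(low=ord("a"), high=ord("z"), size=10)))
        for _ in range(sample_size)
    ]


# Each column is randomly filled by one of the two generators.
test_frame = pd.DataFrame({
    f"column{i}": np.random.choice([generate_numeric, generate_str])(sample_shape[0])
    for i in range(sample_shape[1])
})

test_frame.head(5)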

	column0	column1	column2	column3	column4	column5	column6	column7	column8	column9	column10	column11	column12	column13	column14	column15	column16	column17	column18	column19
0	eparqrijak	-0.885237	icveuvsnde	cbrwxbagsa	joojouhnsa	cmuhdgunmu	nmfjqsnnvq	0.900117	-1.269680	1.673854	-0.981047	wxejdwfwks	ywexlugyio	oueqgttxsx	eksymjwner	xyowiyxxfa	gxcccqjhsp	twhcjgunju	grccysdkoh	0.114989
1	iwetqeplwy	0.322590	bfkypaqefr	fjkhhelgvn	cfpvrmueij	crsvpfjehm	pfhardywns	1.762525	1.519998	0.085414	1.083912	pwtchygtok	dpvswtdmvg	lvsxcxvnpp	gwunqbkhdh	ubcrvcnpgs	mauwtacklp	yvfdtqbixn	xwrvapaqbs	1.247267
2	lbieortwnf	-0.449772	siavscpewk	mhnjrhtnau	xaxeyfoxvc	tbsrqpmmew	txeqmsqaau	-1.020418	0.898286	1.805783	1.103664	xitnomidxt	fgyktnnake	rnxbuhgnnt	bcjbulycwx	ovqimfwygl	jmdnepbygg	nrgyocwhqx	ytwxdqablj	-1.462726
3	ntnwmbesnw	0.790567	ksqksaxcvf	xdfgbltktd	rwmjlcjiun	jeixeeqhgc	jaksucxefd	-0.017661	-0.274982	-0.077126	-0.643199	gddfxpsqfv	dnjqsbynyw	qxjyvrnewn	xnyptpulyq	epwspoqbyg	bsvjpmulwr	cwywpacbma	krhtlbypnu	-0.517101
4	xylkyjpsqw	1.690074	rebhymnhun	rjbandmwys	qectgjuiwb	wcrrgxwbdl	nutkvjctxv	-1.259420	-0.229280	-0.698492	-0.351349	qooeitqcjb	udrqckibws	qacerqmmxk	bctmrcaypo	exnlqjeilv	jmcoelgcbu	ycqvdihklu	oqakhlnuux	0.121970

Select numeric columns

Call select_dtypes("number") on the data frame. The following cell does this and displays the head and the data types of the result to confirm that only numeric columns were selected.

numeric_columns = test_frame.select_dtypes("number")

display(HTML("<b>Head</b>"))
display(numeric_columns.head())
display(HTML("<b>Data types</b>"))
display(numeric_columns.dtypes.to_frame("Data type"))

Head

	column1	column7	column8	column9	column10	column19
0	-0.885237	0.900117	-1.269680	1.673854	-0.981047	0.114989
1	0.322590	1.762525	1.519998	0.085414	1.083912	1.247267
2	-0.449772	-1.020418	0.898286	1.805783	1.103664	-1.462726
3	0.790567	-0.017661	-0.274982	-0.077126	-0.643199	-0.517101
4	1.690074	-1.259420	-0.229280	-0.698492	-0.351349	0.121970

Data types

	Data type
column1	float64
column7	float64
column8	float64
column9	float64
column10	float64
column19	float64
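To connect this with the machine learning use case mentioned at the start, here is a hedged sketch of splitting a frame into numeric and non-numeric parts and processing each part differently. The frame, its column names, and the choice of standardization plus one-hot encoding are assumptions made for illustration, not something prescribed by this page.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame mixing numeric and string columns.
rng = np.random.default_rng(10)
frame = pd.DataFrame({
    "feature_a": rng.normal(0, 1, 5),
    "label": ["x", "y", "x", "z", "y"],
    "feature_b": rng.normal(0, 1, 5),
})

# Complementary selections: together they cover every column once.
numeric_part = frame.select_dtypes(include="number")
other_part = frame.select_dtypes(exclude="number")

# Numeric columns: standardize to zero mean and unit variance.
scaled = (numeric_part - numeric_part.mean()) / numeric_part.std()

# Non-numeric columns: one-hot encode each distinct value.
encoded = pd.get_dummies(other_part)

# Join the two processed parts back together column-wise.
processed = pd.concat([scaled, encoded], axis=1)
print(processed.columns.tolist())
```

Using exclude="number" for the second selection, rather than naming a string dtype, keeps the two parts complementary regardless of how the non-numeric columns are stored.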