Creating & loading#

Pandas provides a variety of methods to create a DataFrame or to load a table as a pandas DataFrame. This section considers the options.

Find out more about the ways to define a DataFrame in the basics section of the official documentation.

import pandas as pd

import docker

from sqlalchemy import create_engine
from sqlalchemy.engine import URL 

docker_client = docker.client.from_env()

SQL#

One of the most popular options in production is to load the results of SQL queries as a pandas.DataFrame. For this purpose, pandas provides the read_sql function, which requires a query to execute and an object that implements the connection to the database; typically this is a SQLAlchemy engine.


The following cell runs a docker container, so we can try to connect to the database by ourselves.

container = docker_client.containers.run(
    image="postgres:15.4",
    name="read_postgres_to_pandas",
    detach=True,
    environment={"POSTGRES_USER": "postgres", "POSTGRES_PASSWORD": "password"},
    remove=True,
    ports={"5432/tcp": 5432},
)

The following cell shows the creation of the sqlalchemy engine and loading some information from it.

url_object = URL.create(
    "postgresql+psycopg2",
    username="postgres",
    password="password",
    host="localhost",
    port=5432,
    database="postgres",
)
engine = create_engine(url_object)

df = pd.read_sql('SELECT * FROM information_schema.columns', con=engine)
display(df)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns
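read_sql also accepts a `params` argument for parameterized queries, which avoids interpolating values into the query string by hand. A minimal sketch, using an in-memory SQLite database so it runs without the Postgres container; the same argument works with the engine above, though the placeholder style (`?` vs `%s`) depends on the driver:

```python
import sqlite3

import pandas as pd

# Build a small in-memory database to query against
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# `params` fills the placeholder safely instead of formatting the string
df = pd.read_sql("SELECT * FROM users WHERE id = ?", con=conn, params=(1,))
print(df)
conn.close()
```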

Connection url#

You can simply pass a connection URL string as the con argument of the read_sql function. It appears to still use SQLAlchemy under the hood, but it can make things easier.


The following cell loads information_schema.columns as we did before, but uses con=<connection url>.

df = pd.read_sql(
    'SELECT * FROM information_schema.columns',
    con="postgresql+psycopg2://postgres:password@localhost:5432/postgres"
)
display(df)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns

The result is the same, but with much less code.
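For large result sets, read_sql also accepts a `chunksize` argument and then returns an iterator of DataFrames instead of one big frame, so rows can be processed in batches. A sketch using an in-memory SQLite database to stay self-contained; the parameter works the same way with a connection URL or engine:

```python
import sqlite3

import pandas as pd

# Populate a small table with 10 rows
conn = sqlite3.connect(":memory:")
pd.DataFrame({"x": range(10)}).to_sql("nums", conn, index=False)

# With chunksize, read_sql yields DataFrames of at most 4 rows each
chunks = pd.read_sql("SELECT * FROM nums", con=conn, chunksize=4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 4, 2]
conn.close()
```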

No SQLAlchemy#

It’s possible to query a Postgres database without SQLAlchemy: just pass the connection object from psycopg2 as the con parameter of the read_sql function.


The following example shows such a case.

import psycopg2

conn = psycopg2.connect(
    port="5432",
    dbname="postgres",
    user="postgres",
    password="password",
    host="localhost"
)

df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
display(df)
conn.close()
/tmp/ipykernel_81333/3364160090.py:11: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns

The expected result is returned, but pandas also emits a warning advising the use of SQLAlchemy.
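The one DBAPI2 driver that the warning text exempts is sqlite3 from the standard library: pandas supports sqlite3 connections directly, with no warning. A minimal check, turning warnings into errors to confirm the call runs silently:

```python
import sqlite3
import warnings

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")

# Escalate warnings to errors: read_sql must run silently for sqlite3
with warnings.catch_warnings():
    warnings.simplefilter("error")
    df = pd.read_sql("SELECT * FROM t", con=conn)
print(df)
conn.close()
```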

List tables#

You may face issues using the \dt command to list the tables available in the database: \dt is a psql client meta-command, not SQL, so it cannot be executed through read_sql. Query the appropriate system table instead; in Postgres it is pg_catalog.pg_tables.


The following cell tries to run pd.read_sql with \dt to list the tables available in the database.

try:
    pd.read_sql(r"\dt;", con=engine)
except Exception as e:
    print(e)
(psycopg2.errors.SyntaxError) syntax error at or near "\"
LINE 1: \dt;
        ^

[SQL: \dt;]
(Background on this error at: https://sqlalche.me/e/20/f405)

The following cell shows how to deal with it.

pd.read_sql(
    "SELECT * FROM pg_catalog.pg_tables",
    con=engine
)
schemaname tablename tableowner tablespace hasindexes hasrules hastriggers rowsecurity
0 pg_catalog pg_statistic postgres None True False False False
1 pg_catalog pg_type postgres None True False False False
2 pg_catalog pg_foreign_table postgres None True False False False
3 pg_catalog pg_authid postgres pg_global True False False False
4 pg_catalog pg_statistic_ext_data postgres None True False False False
... ... ... ... ... ... ... ... ...
63 pg_catalog pg_largeobject postgres None True False False False
64 information_schema sql_features postgres None False False False False
65 information_schema sql_implementation_info postgres None False False False False
66 information_schema sql_parts postgres None False False False False
67 information_schema sql_sizing postgres None False False False False

68 rows × 8 columns
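A database-agnostic alternative to querying pg_catalog.pg_tables is SQLAlchemy's inspect, which asks the dialect for its table names. A sketch against an in-memory SQLite engine so it is self-contained; calling inspect on the Postgres engine above should work the same way:

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"a": [1]}).to_sql("demo", engine, index=False)

# inspect() exposes dialect-independent schema introspection
tables = inspect(engine).get_table_names()
print(tables)  # ['demo']
```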