Creating & loading#

Pandas provides a variety of methods to create a DataFrame or to load a table as a pandas DataFrame. This section considers the options.

Find out more about the ways to define a DataFrame in the basics section of the official documentation.

import pandas as pd

import docker

from sqlalchemy import create_engine
from sqlalchemy.engine import URL 

docker_client = docker.client.from_env()

SQL#

One of the most popular options in production is to load the results of SQL queries as a pandas.DataFrame. For this purpose, pandas provides the read_sql function, which requires a query to execute and an object that implements the connection to the database; typically this is a SQLAlchemy engine.


The following cell runs a docker container, so we can try to connect to the database by ourselves.

container = docker_client.containers.run(
    image="postgres:15.4",
    name="read_postgres_to_pandas",
    detach=True,
    environment={"POSTGRES_USER": "postgres", "POSTGRES_PASSWORD": "password"},
    remove=True,
    ports={"5432/tcp": 5432},
)

The following cell shows the creation of the sqlalchemy engine and loading some information from it.

url_object = URL.create(
    "postgresql+psycopg2",
    username="postgres",
    password="password",
    host="localhost",
    port=5432,
    database="postgres",
)
engine = create_engine(url_object)

df = pd.read_sql('SELECT * FROM information_schema.columns', con=engine)
display(df)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns
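read_sql also accepts a `params` argument for parameterized queries, which avoids interpolating values into the query string by hand. A minimal sketch, using an in-memory SQLite database so it runs without the Postgres container; the same argument works with the engine above, though the placeholder style (`?` vs `%s`) depends on the driver:

```python
import sqlite3

import pandas as pd

# Build a small in-memory database to query against
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# `params` fills the placeholder safely instead of formatting the string
df = pd.read_sql("SELECT * FROM users WHERE id = ?", con=conn, params=(1,))
print(df)
conn.close()
```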

Connection url#

You can simply pass a connection URL string as the con argument of the read_sql function. It appears to still use SQLAlchemy under the hood, but it can make things easier.


The following cell loads information_schema.columns as we did before, but uses con=<connection url>.

df = pd.read_sql(
    'SELECT * FROM information_schema.columns',
    con="postgresql+psycopg2://postgres:password@localhost:5432/postgres"
)
display(df)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns

The result is the same, but with much less code.
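For large result sets, read_sql also accepts a `chunksize` argument and then returns an iterator of DataFrames instead of one big frame, so rows can be processed in batches. A sketch using an in-memory SQLite database to stay self-contained; the parameter works the same way with a connection URL or engine:

```python
import sqlite3

import pandas as pd

# Populate a small table with 10 rows
conn = sqlite3.connect(":memory:")
pd.DataFrame({"x": range(10)}).to_sql("nums", conn, index=False)

# With chunksize, read_sql yields DataFrames of at most 4 rows each
chunks = pd.read_sql("SELECT * FROM nums", con=conn, chunksize=4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 4, 2]
conn.close()
```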

No SQLAlchemy#

It’s possible to query a Postgres database without SQLAlchemy: just pass the connection object from psycopg2 as the con parameter of the read_sql function.


The following example shows such a case.

import psycopg2

conn = psycopg2.connect(
    port="5432",
    dbname="postgres",
    user="postgres",
    password="password",
    host="localhost"
)

df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
display(df)
conn.close()
/tmp/ipykernel_81333/3364160090.py:11: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable
0 postgres pg_catalog pg_init_privs privtype 4 None NO "char" NaN NaN ... NO None None None None None NO NEVER None YES
1 postgres pg_catalog pg_constraint conkey 19 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
2 postgres pg_catalog pg_constraint confkey 20 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
3 postgres pg_catalog pg_constraint conpfeqop 21 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
4 postgres pg_catalog pg_constraint conppeqop 22 None YES ARRAY NaN NaN ... NO None None None None None NO NEVER None YES
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2000 postgres information_schema schemata default_character_set_catalog 4 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2001 postgres information_schema schemata default_character_set_schema 5 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2002 postgres information_schema schemata default_character_set_name 6 None YES name NaN NaN ... NO None None None None None NO NEVER None NO
2003 postgres information_schema schemata sql_path 7 None YES character varying NaN 1.073742e+09 ... NO None None None None None NO NEVER None NO
2004 postgres information_schema sequences sequence_catalog 1 None YES name NaN NaN ... NO None None None None None NO NEVER None NO

2005 rows × 44 columns

The expected result is returned, but pandas also emits a warning advising the use of SQLAlchemy.
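The one DBAPI2 driver that the warning text exempts is sqlite3 from the standard library: pandas supports sqlite3 connections directly, with no warning. A minimal check, turning warnings into errors to confirm the call runs silently:

```python
import sqlite3
import warnings

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")

# Escalate warnings to errors: read_sql must run silently for sqlite3
with warnings.catch_warnings():
    warnings.simplefilter("error")
    df = pd.read_sql("SELECT * FROM t", con=conn)
print(df)
conn.close()
```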

List tables#

You may face issues using the \dt command to list the tables available in the database: \dt is a psql client meta-command, not SQL, so it cannot be executed through read_sql. Query the appropriate system table instead; in Postgres it is pg_catalog.pg_tables.


The following cell tries to run pd.read_sql with \dt to list the tables available in the database.

try:
    pd.read_sql(r"\dt;", con=engine)
except Exception as e:
    print(e)
(psycopg2.errors.SyntaxError) syntax error at or near "\"
LINE 1: \dt;
        ^

[SQL: \dt;]
(Background on this error at: https://sqlalche.me/e/20/f405)

The following cell shows how to deal with it.

pd.read_sql(
    "SELECT * FROM pg_catalog.pg_tables",
    con=engine
)
schemaname tablename tableowner tablespace hasindexes hasrules hastriggers rowsecurity
0 pg_catalog pg_statistic postgres None True False False False
1 pg_catalog pg_type postgres None True False False False
2 pg_catalog pg_foreign_table postgres None True False False False
3 pg_catalog pg_authid postgres pg_global True False False False
4 pg_catalog pg_statistic_ext_data postgres None True False False False
... ... ... ... ... ... ... ... ...
63 pg_catalog pg_largeobject postgres None True False False False
64 information_schema sql_features postgres None False False False False
65 information_schema sql_implementation_info postgres None False False False False
66 information_schema sql_parts postgres None False False False False
67 information_schema sql_sizing postgres None False False False False

68 rows × 8 columns
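A database-agnostic alternative to querying pg_catalog.pg_tables is SQLAlchemy's inspect, which asks the dialect for its table names. A sketch against an in-memory SQLite engine so it is self-contained; calling inspect on the Postgres engine above should work the same way:

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"a": [1]}).to_sql("demo", engine, index=False)

# inspect() exposes dialect-independent schema introspection
tables = inspect(engine).get_table_names()
print(tables)  # ['demo']
```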