Creating & loading#
Pandas offers a variety of methods to create a DataFrame or load a table as a pandas DataFrame. This section considers the options.
Find out more on the ways to define a data frame in the basic data frame section of the official documentation.
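As a minimal illustration before moving to SQL sources, a DataFrame can be built directly in memory, either from a dict of columns or from a list of row dicts. The data below is hypothetical, just to show the two equivalent constructors.

```python
import pandas as pd

# Hypothetical data: the same frame built two ways
from_columns = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
from_rows = pd.DataFrame([
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
])

print(from_columns.equals(from_rows))  # → True
```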
import pandas as pd
import docker
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
docker_client = docker.client.from_env()
SQL#
One of the most popular options in production is to load the results of SQL queries as a pandas.DataFrame. For this purpose, pandas provides a read_sql function that takes a query to be executed and an object that implements the connection to the database, typically a sqlalchemy engine.
The following cell runs a docker container, so we can try connecting to the database ourselves.
container = docker_client.containers.run(
image="postgres:15.4",
name="read_postgres_to_pandas",
detach=True,
environment={"POSTGRES_USER": "postgres", "POSTGRES_PASSWORD": "password"},
remove=True,
ports={"5432/tcp": 5432},
)
The following cell shows the creation of the sqlalchemy engine and loads some information through it.
url_object = URL.create(
"postgresql+psycopg2",
username="postgres",
password="password",
host="localhost",
port=5432,
database="postgres",
)
engine = create_engine(url_object)
df = pd.read_sql('SELECT * FROM information_schema.columns', con=engine)
display(df)
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | ... | is_identity | identity_generation | identity_start | identity_increment | identity_maximum | identity_minimum | identity_cycle | is_generated | generation_expression | is_updatable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | postgres | pg_catalog | pg_init_privs | privtype | 4 | None | NO | "char" | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
1 | postgres | pg_catalog | pg_constraint | conkey | 19 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
2 | postgres | pg_catalog | pg_constraint | confkey | 20 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
3 | postgres | pg_catalog | pg_constraint | conpfeqop | 21 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
4 | postgres | pg_catalog | pg_constraint | conppeqop | 22 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2000 | postgres | information_schema | schemata | default_character_set_catalog | 4 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2001 | postgres | information_schema | schemata | default_character_set_schema | 5 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2002 | postgres | information_schema | schemata | default_character_set_name | 6 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2003 | postgres | information_schema | schemata | sql_path | 7 | None | YES | character varying | NaN | 1.073742e+09 | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2004 | postgres | information_schema | sequences | sequence_catalog | 1 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2005 rows × 44 columns
Connection URL#
You can simply pass a connection URL string as the con argument of the read_sql function. It appears that SQLAlchemy is still used under the hood, but this can make things easier.
The following cell loads information_schema.columns as before, but uses con=<connection url>.
df = pd.read_sql(
'SELECT * FROM information_schema.columns',
con="postgresql+psycopg2://postgres:password@localhost:5432/postgres"
)
display(df)
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | ... | is_identity | identity_generation | identity_start | identity_increment | identity_maximum | identity_minimum | identity_cycle | is_generated | generation_expression | is_updatable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | postgres | pg_catalog | pg_init_privs | privtype | 4 | None | NO | "char" | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
1 | postgres | pg_catalog | pg_constraint | conkey | 19 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
2 | postgres | pg_catalog | pg_constraint | confkey | 20 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
3 | postgres | pg_catalog | pg_constraint | conpfeqop | 21 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
4 | postgres | pg_catalog | pg_constraint | conppeqop | 22 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2000 | postgres | information_schema | schemata | default_character_set_catalog | 4 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2001 | postgres | information_schema | schemata | default_character_set_schema | 5 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2002 | postgres | information_schema | schemata | default_character_set_name | 6 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2003 | postgres | information_schema | schemata | sql_path | 7 | None | YES | character varying | NaN | 1.073742e+09 | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2004 | postgres | information_schema | sequences | sequence_catalog | 1 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2005 rows × 44 columns
The result is the same, with much less code.
No SQLAlchemy#
It’s possible to query a Postgres database without using SQLAlchemy: just pass the connection object from psycopg2 to the con parameter of the read_sql function.
The following example shows such a case.
import psycopg2
conn = psycopg2.connect(
port="5432",
dbname="postgres",
user="postgres",
password="password",
host="localhost"
)
df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
display(df)
conn.close()
/tmp/ipykernel_81333/3364160090.py:11: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
df = pd.read_sql('SELECT * FROM information_schema.columns', con=conn)
table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | ... | is_identity | identity_generation | identity_start | identity_increment | identity_maximum | identity_minimum | identity_cycle | is_generated | generation_expression | is_updatable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | postgres | pg_catalog | pg_init_privs | privtype | 4 | None | NO | "char" | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
1 | postgres | pg_catalog | pg_constraint | conkey | 19 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
2 | postgres | pg_catalog | pg_constraint | confkey | 20 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
3 | postgres | pg_catalog | pg_constraint | conpfeqop | 21 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
4 | postgres | pg_catalog | pg_constraint | conppeqop | 22 | None | YES | ARRAY | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | YES |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2000 | postgres | information_schema | schemata | default_character_set_catalog | 4 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2001 | postgres | information_schema | schemata | default_character_set_schema | 5 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2002 | postgres | information_schema | schemata | default_character_set_name | 6 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2003 | postgres | information_schema | schemata | sql_path | 7 | None | YES | character varying | NaN | 1.073742e+09 | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2004 | postgres | information_schema | sequences | sequence_catalog | 1 | None | YES | name | NaN | NaN | ... | NO | None | None | None | None | None | NO | NEVER | None | NO |
2005 rows × 44 columns
As a result, we get the expected output, but also a warning that advises using sqlalchemy.
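Note that the warning concerns untested third-party DBAPI2 objects; as the message itself states, sqlite3 connections from the standard library are supported, so the same pattern works there without a warning. A small self-contained sketch, using a hypothetical users table in an in-memory database:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical "users" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# sqlite3 connections are accepted by read_sql directly
df = pd.read_sql("SELECT * FROM users", con=conn)
conn.close()
print(df)
```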
List tables#
You may face issues using the \dt command from psql to list the tables available in the database. Query the appropriate system table instead; for postgres it is pg_catalog.pg_tables.
The following code tries to run pd.read_sql with \dt to list the tables available in the database.
try:
pd.read_sql("\dt;", con=engine)
except Exception as e:
print(e)
(psycopg2.errors.SyntaxError) syntax error at or near "\"
LINE 1: \dt;
^
[SQL: \dt;]
(Background on this error at: https://sqlalche.me/e/20/f405)
<>:2: SyntaxWarning: invalid escape sequence '\d'
<>:2: SyntaxWarning: invalid escape sequence '\d'
/tmp/ipykernel_81333/3553204363.py:2: SyntaxWarning: invalid escape sequence '\d'
pd.read_sql("\dt;", con=engine)
The following cell shows you how you’re supposed to deal with it.
pd.read_sql(
"SELECT * FROM pg_catalog.pg_tables",
con = engine
)
schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity | |
---|---|---|---|---|---|---|---|---|
0 | pg_catalog | pg_statistic | postgres | None | True | False | False | False |
1 | pg_catalog | pg_type | postgres | None | True | False | False | False |
2 | pg_catalog | pg_foreign_table | postgres | None | True | False | False | False |
3 | pg_catalog | pg_authid | postgres | pg_global | True | False | False | False |
4 | pg_catalog | pg_statistic_ext_data | postgres | None | True | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | pg_catalog | pg_largeobject | postgres | None | True | False | False | False |
64 | information_schema | sql_features | postgres | None | False | False | False | False |
65 | information_schema | sql_implementation_info | postgres | None | False | False | False | False |
66 | information_schema | sql_parts | postgres | None | False | False | False | False |
67 | information_schema | sql_sizing | postgres | None | False | False | False | False |
68 rows × 8 columns