When working with large datasets, it's common to extract PostgreSQL tables to CSV files for further analysis and sharing. However, naively exporting a large table can crash your machine by consuming all available RAM.
In this article, I'll explain an efficient technique to export big PostgreSQL tables to CSV using Python, based on my experience as a Research Assistant analyzing criminal record relief data.
The Naive Approach
You can export a table like this:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cursor = conn.cursor()

# Execute a SELECT query
cursor.execute("SELECT * FROM employee")
# Fetch all rows into memory at once
rows = cursor.fetchall()

# Write rows to CSV
with open("employees.csv", "w") as f:
    for row in rows:
        f.write(",".join([str(cell) for cell in row]) + "\n")
However, fetchall() loads the entire table into memory, which will crash your machine for a big table!
Streaming Data in Chunks with psycopg2
The key is to stream the data in chunks from the database directly to disk. The psycopg2 module exposes PostgreSQL's COPY command through copy_expert(), which lets us do exactly that.
First, connect to the database and create a cursor:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()
Next, open the output CSV and use copy_expert() to stream the table into it in chunks:
with open('out.csv', 'w') as f:
    cur.copy_expert('COPY employee TO STDOUT WITH CSV HEADER', f)
The table is streamed to the CSV without loading everything into memory!
Putting It All Together
Here is the full script to efficiently export a large PostgreSQL table to a CSV:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

with open('out.csv', 'w') as f:
    cur.copy_expert('COPY employee TO STDOUT WITH CSV HEADER', f)

conn.close()
print("Table exported to CSV!")
By streaming the data in chunks directly to a file, we avoid RAM issues and efficiently export the table.
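Because copy_expert() streams into any writable file-like object, you are not limited to a plain file on disk. As one illustration (not part of the original script, and assuming the same mydb database and employee table), the export can be gzipped on the fly:
import gzip

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# gzip.open() in text mode returns a writable file-like object,
# so copy_expert() can stream the rows into it without buffering the table
with gzip.open('out.csv.gz', 'wt') as f:
    cur.copy_expert('COPY employee TO STDOUT WITH CSV HEADER', f)

conn.close()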
Extracting the Results of a SQL Query
The method above copies all the data from the employee table to a CSV file. If you want to export the results of a specific query instead, you need to modify it slightly, as shown below.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# List of departments from which we want all employees
dept_list = ['dept1', 'dept2']

query = """SELECT *
           FROM employee
           WHERE department IN %s"""
query_params = (tuple(dept_list),)

# mogrify() safely interpolates the parameters and returns the final query as bytes
query = cur.mogrify(query, query_params)
query = query.decode('utf-8')

with open('out.csv', 'w') as f:
    cur.copy_expert("COPY ({}) TO STDOUT WITH CSV HEADER".format(query), f)
print("Query results exported to CSV!")
This technique has helped me quickly extract large criminal record relief datasets from PostgreSQL databases for analysis. Let me know if you have any other tips and tricks!
Top comments (3)
You can also do this straight from psql:
\COPY my_table TO 'filename' CSV HEADER
Hey, could you also elaborate on how we can control the chunk size being processed here?
There is no built-in way to control the chunk size with copy_expert(); the COPY protocol itself focuses on efficient bulk data transfer. If you do want that control, you can tune the configuration on the PostgreSQL server, or iterate through your data in batches yourself.
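A minimal sketch of that second option uses a psycopg2 server-side (named) cursor, so rows are pulled from the server in fixed-size batches. The cursor name employee_export, the 10,000-row itersize, and the mydb/employee names are assumptions for illustration:
import csv

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")

# A named cursor is a server-side cursor: rows stay on the PostgreSQL
# server until they are fetched, so memory use is bounded by itersize
cur = conn.cursor(name='employee_export')
cur.itersize = 10000  # rows fetched per network round trip while iterating
cur.execute("SELECT * FROM employee")

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in cur:  # psycopg2 transparently fetches itersize rows at a time
        writer.writerow(row)

conn.close()
Note that unlike COPY ... WITH CSV HEADER, this sketch does not write a header row.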