This post covers two categories of BigQuery best practices—cost reduction and query performance.
Cost reduction
Avoid using SELECT *
. Specify the names of the desired columns instead. BigQuery stores data in columns, so when you specify columns, only that data is read.
Always price queries before running them. The price is given in bytes to be processed in the upper right corner of the query screen.
Use clustered and partitioned tables.
Use streaming inserts with caution. They can increase your costs drastically.
Materialize query results in stages. Among other advantages, by breaking down queries into smaller stages and processing data incrementally, BigQuery can allocate resources more efficiently. Only the necessary compute and storage resources are used at each stage of the query execution, reducing overall resource consumption.
Query performance
Filter on partitioned columns. This helps with cost too.
Denormalize your tables. Denormalizing data is a database design technique where redundant copies of data from one or more normalized tables are combined and stored together in a single table. Retrieving data from a single denormalized table is typically faster than joining multiple normalized tables. You can use nested or repeated columns to do this.
Use external data sources sparingly. When you query data from an external data source, such as Google cloud storage, there is inherent network overhead involved in transferring the data over the network to the compute resources where BigQuery queries are executed. This network transfer can introduce latency and impact query performance, especially for large datasets or when network bandwidth is limited.
You want to reduce data before performing a join. This means you filter or aggregate data to minimize the amount of data that needs to be processed during the join operation. This method should improve query performance and reduce resource usage by eliminating unnecessary data from the tables before executing the join.
The instructor also suggested avoiding the use of WITH clauses as prepared statements. Since I don't understand how to use a WITH clause, I'm safe from making this mistake.
Avoid oversharding the data. Sharding a table involves partitioning the data into smaller subsets or shards and distributing them across multiple storage locations or nodes to improve scalability, performance, and availability in distributed database systems. (Partitioning as we have already discussed is not a form of sharding.) But if you overshard, that means you create too many partitions and may incur a performance drop. Sometimes the complexity and overhead of sharding outweigh the potential performance benefits, especially in distributed systems with a large number of shards.
You should do ORDER BY last. Also optimize your JOIN patterns by placing the largest table first, then the smallest, and then order the rest of the tables (if any) by decreasing size. However, it's best to experiment with the order of the tables before executing the query, using the price given in bytes processed in the upper right of the query view.
This learning in public is partly for my learning, partly to advertise DataTalksClub's FREE data engineering zoomcamp. It's active now, but you can join anytime.
Top comments (0)