Intermediate
Table Partition¶
One of BigQuery's key features is its ability to partition a table into small chunks of data. In many cases, table partitions can improve query performance and reduce costs.
Consider a table of sales
| sale_id|sale_date | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
| 1|2022-07-26 | 1|G | 4.98|
| 6|NULL | 4|U | 2.18|
| 2|2022-07-26 | 2|I | 19.67|
| 4|2022-07-29 | 3|J | 4.73|
| 5|2022-07-30 | 4|W | 10.68|
| 3|2022-07-27 | 3|I | 7.01|
To fetch sales on 2022-07-27, BigQuery has to scan the entire sale_date
column. However, if we set up the table with a
daily partition on sale_date
, the records will be partitioned into chunks like this
| sale_id|sale_date | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
| 1|2022-07-26 | 1|G | 4.98|
| 2|2022-07-26 | 2|I | 19.67|
----------------------------------------------------
| 3|2022-07-27 | 3|I | 7.01|
----------------------------------------------------
| 4|2022-07-29 | 3|J | 4.73|
----------------------------------------------------
| 5|2022-07-30 | 4|W | 10.68|
----------------------------------------------------
| 6|NULL | 4|U | 2.18|
Now, BigQuery doesn't need to can the entire table to identify sales on a particular date.
Notes¶
- A table can have zero or one partition.
- Partitions can be set on sequential data types (Dates, Timestamps, or Integers).
- You can make a partition with a
PARTITION BY
clause. - A partition table can have up to 4,000 partitions.
Table Cluster¶
Another option to improve performance on a BigQuery table is to assign it a cluster. A cluster improves performance by
pre-sorting the data inside a table. For example, clustering the sales
table by sale_id
might order the data like this.
| sale_id|sale_date | customer_id|product | amount|
|-------:|:----------|-----------:|:-------|------:|
| 1|2022-07-26 | 1|G | 4.98|
| 2|2022-07-26 | 2|I | 19.67|
| 3|2022-07-27 | 3|I | 7.01|
| 4|2022-07-29 | 3|J | 4.73|
| 5|2022-07-30 | 4|W | 10.68|
| 6|NULL | 4|U | 2.18|
Now, fetching rows WHERE sale_id IN (1,4)
will be fast and efficient.
Notes
- Tables can have a partition and a cluster (even on the same field).
- A cluster can be set on a collection of multiple columns.
- A cluster can be defined with a CLUSTER BY
clause containing a clustering_column_list
.