Speed comparison

Introduction

The purpose of this page is to compare the speed of data transfer between the load method from this library:

and Python Client for Google BigQuery’s methods:

Set up

[2]:
# %%bash
# pip install google-pandas-load
# pip install google-cloud-bigquery-storage[fastavro]==0.*
# pip install pyarrow==0.*
[3]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup
[4]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'bucket_gpl'
local_dir_path = '/tmp/gpl_directory'

Next, set bq_client, bqstorage_client, table_ref and instantiate a loader.

[5]:
bq_client = bigquery.Client(
    project=project_id,
    credentials=None)

bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=None)

table_ref = bigquery.dataset.DatasetReference(
    project=project_id,
    dataset_id=dataset_id).table('z0')

gpl = LoaderQuickSetup(
    project_id=project_id,
    dataset_id=dataset_id,
    bucket_name=bucket_name,
    local_dir_path=local_dir_path)
[6]:
if not os.path.isdir(local_dir_path):
    os.makedirs(local_dir_path)

Download

The query below creates a 1.01 GB BigQuery table.

[7]:
query = """
select * from
(select 'Hello, ' as a from unnest(generate_array(1, 8000)))
cross join
(select 'World!' as b from unnest(generate_array(1, 8000)))
"""
[8]:
%%time
df = bq_client.query(query).to_dataframe()
CPU times: user 5min 6s, sys: 10.2 s, total: 5min 16s
Wall time: 24min 55s

Using bqstorage_client speeds up the download. See here for additional information.

[9]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
CPU times: user 3min 23s, sys: 24.5 s, total: 3min 48s
Wall time: 3min 5s

There is issue with the previous download: it used the cache query results !

[10]:
%%time
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache=False
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)
CPU times: user 3min 16s, sys: 23 s, total: 3min 39s
Wall time: 3min 21s

When executing a query with google_pandas_load.Loader.load(), the query is not caught because the method always creates a destination table. See here for more informations.

[11]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)
2019-12-03 17:22:09,828 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-12-03 17:22:33,698 - LoaderQuickSetup - DEBUG - Ended source to bq [23s, 0.0$]
2019-12-03 17:22:33,700 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-12-03 17:22:47,731 - LoaderQuickSetup - DEBUG - Ended bq to gs [14s]
2019-12-03 17:22:47,732 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-12-03 17:22:50,729 - LoaderQuickSetup - DEBUG - Ended gs to local [2s]
2019-12-03 17:22:50,729 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-12-03 17:23:07,737 - LoaderQuickSetup - DEBUG - Ended local to dataframe [17s]
CPU times: user 16.1 s, sys: 892 ms, total: 17 s
Wall time: 58.3 s

Upload

[12]:
N = 64*10**6
df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})
[13]:
%%time
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()
CPU times: user 15.7 s, sys: 1.49 s, total: 17.2 s
Wall time: 2min 32s
[13]:
<google.cloud.bigquery.job.LoadJob at 0x7fd6eae2d5c0>
[14]:
%%time
gpl.load(source='dataframe', destination='bq', data_name='z1', dataframe=df)
2019-12-03 17:25:48,468 - LoaderQuickSetup - DEBUG - Starting dataframe to local...
2019-12-03 17:26:58,614 - LoaderQuickSetup - DEBUG - Ended dataframe to local [70s]
2019-12-03 17:26:58,614 - LoaderQuickSetup - DEBUG - Starting local to gs...
2019-12-03 17:26:59,287 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]
2019-12-03 17:26:59,288 - LoaderQuickSetup - DEBUG - Starting gs to bq...
2019-12-03 17:28:41,625 - LoaderQuickSetup - DEBUG - Ended gs to bq [102s]
CPU times: user 1min 11s, sys: 482 ms, total: 1min 11s
Wall time: 2min 54s

Conclusion

The google_pandas_load.Loader.load() method executes much faster downloads and slightly slower uploads than those executed by the built-in methods from Python Client for Google BigQuery.