Speed comparison¶
Introduction¶
The purpose of this page is to compare the speed of data transfer between the the load method from this library:
and Python Client for Google BigQuery’s methods:
Set up¶
[2]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup
[3]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'bucket_gpl'
local_dir_path = '/tmp/gpl_directory'
Next, set bq_client, bqstorage_client, table_ref and instantiate a loader.
Credentials are inferred from the environment. Further information about how to authenticate to Google Cloud Platform with the Google Cloud Client Libraries for Python can be found here.
[4]:
bq_client = bigquery.Client(
project=project_id,
credentials=None)
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
credentials=None)
table_ref = bigquery.dataset.DatasetReference(
project=project_id,
dataset_id=dataset_id).table('s0')
gpl = LoaderQuickSetup(
project_id=project_id,
dataset_id=dataset_id,
bucket_name=bucket_name,
local_dir_path=local_dir_path)
[5]:
if not os.path.isdir(local_dir_path):
os.makedirs(local_dir_path)
Download¶
The query below creates a 600 MB BigQuery table.
[6]:
query = """
select * from
(select 'Hello, ' as a from unnest(generate_array(1, 4000)))
cross join
(select 'World!' as b from unnest(generate_array(1, 4000)))
"""
[7]:
%%time
df = bq_client.query(query).to_dataframe()
CPU times: user 1min 44s, sys: 2.69 s, total: 1min 47s
Wall time: 6min 48s
Using bqstorage_client speeds up the download. See here for additional information.
[8]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
CPU times: user 1min 28s, sys: 3.23 s, total: 1min 32s
Wall time: 1min 29s
There is issue with the previous download: it used the cache query results !
[9]:
%%time
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache=False
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)
CPU times: user 1min 24s, sys: 2.9 s, total: 1min 27s
Wall time: 1min 31s
When executing a query with google_pandas_load.Loader.load(), the query is not caught because the method always creates a destination table. See here for more informations.
[10]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)
2019-04-11 20:38:31,981 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-04-11 20:38:40,530 - LoaderQuickSetup - DEBUG - Ended query to bq [8s, 0.0$]
2019-04-11 20:38:40,531 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-04-11 20:38:46,004 - LoaderQuickSetup - DEBUG - Ended bq to gs [5s]
2019-04-11 20:38:46,006 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-04-11 20:38:47,118 - LoaderQuickSetup - DEBUG - Ended gs to local [1s]
2019-04-11 20:38:47,120 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-04-11 20:38:50,836 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]
CPU times: user 4.43 s, sys: 492 ms, total: 4.93 s
Wall time: 19.8 s
Upload¶
[11]:
N = 16*10**6
df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})
[12]:
%%time
# you may need to install pyarrow (pip install pyarrow)
# for this to work.
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()
CPU times: user 2.51 s, sys: 368 ms, total: 2.88 s
Wall time: 1min 30s
[12]:
<google.cloud.bigquery.job.LoadJob at 0x7f3f78a774a8>
[13]:
%%time
df = gpl.load(source='dataframe', destination='bq', data_name='s1', dataframe=df)
2019-04-11 20:40:24,349 - LoaderQuickSetup - DEBUG - Starting dataframe to local...
2019-04-11 20:40:41,306 - LoaderQuickSetup - DEBUG - Ended dataframe to local [16s]
2019-04-11 20:40:41,306 - LoaderQuickSetup - DEBUG - Starting local to gs...
2019-04-11 20:40:41,788 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]
2019-04-11 20:40:41,789 - LoaderQuickSetup - DEBUG - Starting gs to bq...
2019-04-11 20:41:09,435 - LoaderQuickSetup - DEBUG - Ended gs to bq [27s]
CPU times: user 17.3 s, sys: 120 ms, total: 17.5 s
Wall time: 45.4 s
Conclusion¶
The google_pandas_load.Loader.load() method executes faster downloads and faster uploads than those executed by the built-in methods from Python Client for Google BigQuery.
Looking at the basic mechanism, one could think that the use of the local folder diminishes the number of network calls, and thus it speeds up data transfers.