{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Speed comparison\n", "\n", "## Introduction\n", "\n", "The purpose of this page is to compare the speed of data transfer between the load method from this library: \n", "\n", "- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load)\n", "\n", "and Python Client for Google BigQuery’s methods: \n", "\n", "- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)\n", "- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)\n", "\n", "## Set up " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# %%bash \n", "# pip install google-pandas-load\n", "# pip install google-cloud-bigquery-storage[fastavro]==0.*\n", "# pip install pyarrow==0.*" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas\n", "from google.cloud import bigquery\n", "from google.cloud import bigquery_storage_v1beta1\n", "from google_pandas_load import LoaderQuickSetup" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "project_id = 'dmp-y-tests'\n", "dataset_id = 'tmp'\n", "bucket_name = 'bucket_gpl'\n", "local_dir_path = '/tmp/gpl_directory'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, set bq_client, bqstorage_client, table_ref and instantiate a loader." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "bq_client = bigquery.Client(\n", " project=project_id, \n", " credentials=None)\n", "\n", "bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(\n", " credentials=None)\n", "\n", "table_ref = bigquery.dataset.DatasetReference(\n", " project=project_id, \n", " dataset_id=dataset_id).table('z0')\n", "\n", "gpl = LoaderQuickSetup(\n", " project_id=project_id, \n", " dataset_id=dataset_id,\n", " bucket_name=bucket_name,\n", " local_dir_path=local_dir_path)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "if not os.path.isdir(local_dir_path):\n", " os.makedirs(local_dir_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download\n", "\n", "The query below creates a 1.01 GB BigQuery table." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "select * from \n", "(select 'Hello, ' as a from unnest(generate_array(1, 8000))) \n", "cross join \n", "(select 'World!' as b from unnest(generate_array(1, 8000)))\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5min 6s, sys: 10.2 s, total: 5min 16s\n", "Wall time: 24min 55s\n" ] } ], "source": [ "%%time\n", "df = bq_client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for additional information." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 23s, sys: 24.5 s, total: 3min 48s\n", "Wall time: 3min 5s\n" ] } ], "source": [ "%%time\n", "df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is issue with the previous download: it used the cache query results !" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 16s, sys: 23 s, total: 3min 39s\n", "Wall time: 3min 21s\n" ] } ], "source": [ "%%time\n", "job_config = bigquery.QueryJobConfig()\n", "job_config.use_query_cache=False\n", "df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When executing a query with [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load), the query is not caught because the method always creates a destination table. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more informations. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-12-03 17:22:09,828 - LoaderQuickSetup - DEBUG - Starting query to bq...\n", "2019-12-03 17:22:33,698 - LoaderQuickSetup - DEBUG - Ended source to bq [23s, 0.0$]\n", "2019-12-03 17:22:33,700 - LoaderQuickSetup - DEBUG - Starting bq to gs...\n", "2019-12-03 17:22:47,731 - LoaderQuickSetup - DEBUG - Ended bq to gs [14s]\n", "2019-12-03 17:22:47,732 - LoaderQuickSetup - DEBUG - Starting gs to local...\n", "2019-12-03 17:22:50,729 - LoaderQuickSetup - DEBUG - Ended gs to local [2s]\n", "2019-12-03 17:22:50,729 - LoaderQuickSetup - DEBUG - Starting local to dataframe...\n", "2019-12-03 17:23:07,737 - LoaderQuickSetup - DEBUG - Ended local to dataframe [17s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.1 s, sys: 892 ms, total: 17 s\n", "Wall time: 58.3 s\n" ] } ], "source": [ "%%time\n", "df = gpl.load(source='query', destination='dataframe', query=query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "N = 64*10**6\n", "df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.7 s, sys: 1.49 s, total: 17.2 s\n", "Wall time: 2min 32s\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-12-03 17:25:48,468 - LoaderQuickSetup - DEBUG - Starting dataframe to local...\n", "2019-12-03 17:26:58,614 - LoaderQuickSetup - DEBUG - Ended dataframe to local [70s]\n", "2019-12-03 17:26:58,614 - LoaderQuickSetup - DEBUG - Starting local to gs...\n", "2019-12-03 17:26:59,287 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]\n", "2019-12-03 17:26:59,288 - LoaderQuickSetup - DEBUG - Starting gs to bq...\n", "2019-12-03 17:28:41,625 - LoaderQuickSetup - DEBUG - Ended gs to bq [102s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 11s, sys: 482 ms, total: 1min 11s\n", "Wall time: 2min 54s\n" ] } ], "source": [ "%%time\n", "gpl.load(source='dataframe', destination='bq', data_name='z1', dataframe=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load) method executes much faster downloads and slightly slower uploads than those executed by the built-in methods from [Python Client for Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }