{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Speed comparison\n", "\n", "## Introduction\n", "\n", "The purpose of this page is to compare the speed of data transfer between the the load method from this library: \n", "\n", "- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load)\n", "\n", "and Python Client for Google BigQuery’s methods: \n", "\n", "- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.to_dataframe.html#google.cloud.bigquery.job.QueryJob.to_dataframe)\n", "- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.load_table_from_dataframe.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)\n", "\n", "## Set up " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas\n", "from google.cloud import bigquery\n", "from google.cloud import bigquery_storage_v1beta1\n", "from google_pandas_load import LoaderQuickSetup" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "project_id = 'dmp-y-tests'\n", "dataset_id = 'tmp'\n", "bucket_name = 'bucket_gpl'\n", "local_dir_path = '/tmp/gpl_directory'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, set bq_client, bqstorage_client, table_ref and instantiate a loader.\n", "\n", "Credentials are inferred from the environment. Further information about how to authenticate to Google Cloud Platform with the [Google Cloud Client Libraries for Python](https://googleapis.github.io/google-cloud-python/latest/index.html) can be found [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html?highlight=defaults)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "bq_client = bigquery.Client(\n", " project=project_id, \n", " credentials=None)\n", "\n", "bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(\n", " credentials=None)\n", "\n", "table_ref = bigquery.dataset.DatasetReference(\n", " project=project_id, \n", " dataset_id=dataset_id).table('s0')\n", "\n", "gpl = LoaderQuickSetup(\n", " project_id=project_id, \n", " dataset_id=dataset_id,\n", " bucket_name=bucket_name,\n", " local_dir_path=local_dir_path)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "if not os.path.isdir(local_dir_path):\n", " os.makedirs(local_dir_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download\n", "\n", "The query below creates a 600 MB BigQuery table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "select * from \n", "(select 'Hello, ' as a from unnest(generate_array(1, 4000))) \n", "cross join \n", "(select 'World!' as b from unnest(generate_array(1, 4000)))\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 44s, sys: 2.69 s, total: 1min 47s\n", "Wall time: 6min 48s\n" ] } ], "source": [ "%%time\n", "df = bq_client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for additional information." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 28s, sys: 3.23 s, total: 1min 32s\n", "Wall time: 1min 29s\n" ] } ], "source": [ "%%time\n", "df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is issue with the previous download: it used the cache query results !" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 24s, sys: 2.9 s, total: 1min 27s\n", "Wall time: 1min 31s\n" ] } ], "source": [ "%%time\n", "job_config = bigquery.QueryJobConfig()\n", "job_config.use_query_cache=False\n", "df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When executing a query with [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load), the query is not caught because the method always creates a destination table. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more informations. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-04-11 20:38:31,981 - LoaderQuickSetup - DEBUG - Starting query to bq...\n", "2019-04-11 20:38:40,530 - LoaderQuickSetup - DEBUG - Ended query to bq [8s, 0.0$]\n", "2019-04-11 20:38:40,531 - LoaderQuickSetup - DEBUG - Starting bq to gs...\n", "2019-04-11 20:38:46,004 - LoaderQuickSetup - DEBUG - Ended bq to gs [5s]\n", "2019-04-11 20:38:46,006 - LoaderQuickSetup - DEBUG - Starting gs to local...\n", "2019-04-11 20:38:47,118 - LoaderQuickSetup - DEBUG - Ended gs to local [1s]\n", "2019-04-11 20:38:47,120 - LoaderQuickSetup - DEBUG - Starting local to dataframe...\n", "2019-04-11 20:38:50,836 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.43 s, sys: 492 ms, total: 4.93 s\n", "Wall time: 19.8 s\n" ] } ], "source": [ "%%time\n", "df = gpl.load(source='query', destination='dataframe', query=query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "N = 16*10**6\n", "df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.51 s, sys: 368 ms, total: 2.88 s\n", "Wall time: 1min 30s\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "# you may need to install pyarrow (pip install pyarrow)\n", "# for this to work.\n", "bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-04-11 20:40:24,349 - LoaderQuickSetup - DEBUG - Starting dataframe to local...\n", "2019-04-11 20:40:41,306 - LoaderQuickSetup - DEBUG - Ended dataframe to local [16s]\n", "2019-04-11 20:40:41,306 - LoaderQuickSetup - DEBUG - Starting local to gs...\n", "2019-04-11 20:40:41,788 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]\n", "2019-04-11 20:40:41,789 - LoaderQuickSetup - DEBUG - Starting gs to bq...\n", "2019-04-11 20:41:09,435 - LoaderQuickSetup - DEBUG - Ended gs to bq [27s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 17.3 s, sys: 120 ms, total: 17.5 s\n", "Wall time: 45.4 s\n" ] } ], "source": [ "%%time\n", "df = gpl.load(source='dataframe', destination='bq', data_name='s1', dataframe=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load) method executes faster downloads and faster uploads than those executed by the built-in methods from [Python Client for Google BigQuery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html). \n", "\n", "Looking at the [basic mechanism](index.rst#The-basic-mechanism), one could think that the use of the local folder diminishes the number of network calls, and thus it speeds up data transfers. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }