google-pandas-load documentation
================================
Release v\ |release|.

.. image:: https://img.shields.io/pypi/l/google-pandas-load.svg
    :target: https://pypi.org/project/google-pandas-load/

.. image:: https://img.shields.io/pypi/pyversions/google-pandas-load.svg
    :target: https://pypi.org/project/google-pandas-load/

google-pandas-load is a wrapper library for transferring big data from A to B, where A and B are two distinct
locations chosen among BigQuery, Storage, a local folder and pandas.

It transfers data faster than the following methods of the `Python Client for Google BigQuery`_:

- `google.cloud.bigquery.job.QueryJob.to_dataframe()`_
- `google.cloud.bigquery.client.Client.load_table_from_dataframe()`_

See `Speed Comparison`_.

Acknowledgements
----------------
I am grateful to my employer Ysance_ for providing me with the resources to develop this library and for
allowing me to publish it.
Installation
------------
::

    $ pip install google-pandas-load

Quickstart
----------
Set up a loader.
In the following code, the credentials are inferred from the environment.
For further information about how to authenticate to Google Cloud Platform with the
`Google Cloud Client Library for Python`_, have a look
`here `__.

.. code-block:: python

    from google_pandas_load import LoaderQuickSetup

    gpl = LoaderQuickSetup(
        project_id='pi',
        dataset_id='di',
        bucket_name='bn',
        local_dir_path='/tmp',
        credentials=None)

Transfer data seamlessly from and to various locations:

.. warning::
    In general, data is moved, not copied! The precise behavior is stated `here `__.

.. warning::
    In general, before data is moved to a location, any existing data with the same name in that location
    is deleted. This ensures a clean space for the incoming data.
    The precise behavior is stated `here `__.

.. code-block:: python

    # Populate a dataframe with a query result.
    df = gpl.load(
        source='query',
        destination='dataframe',
        query='select 3 as x')

    # Apply a python transformation to the data.
    df['x'] = 2*df['x']

    # Upload the result to BigQuery.
    gpl.load(
        source='dataframe',
        destination='bq',
        data_name='a0',
        dataframe=df)

    # Extract the data to Storage.
    gpl.load(
        source='bq',
        destination='gs',
        data_name='a0')

    # The data is not in BigQuery anymore.
    # See warning above.

    # Download the data to the local folder
    # without deleting it in Storage.
    gpl.load(
        source='gs',
        destination='local',
        data_name='a0',
        delete_in_gs=False)

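
The two warnings above can be pictured with a purely local analogy. The sketch below is not the library's
implementation: ``move_data`` is a hypothetical helper, the file names are made up, and matching data by
file-name prefix is an assumption made for illustration. It only shows what "moved, not copied" and
"prior data with the same name is deleted" mean in practice.

.. code-block:: python

    import shutil
    import tempfile
    from pathlib import Path


    def move_data(data_name, source_dir, destination_dir):
        """Hypothetical helper: move the files named after data_name."""
        source_dir, destination_dir = Path(source_dir), Path(destination_dir)
        # Clean the destination: delete prior data having the same name.
        for stale in list(destination_dir.glob(data_name + '*')):
            stale.unlink()
        # Move (do not copy): the files leave the source folder.
        moved = []
        for path in sorted(source_dir.glob(data_name + '*')):
            target = destination_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target.name)
        return moved


    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp, 'source')
        destination = Path(tmp, 'destination')
        source.mkdir()
        destination.mkdir()
        (source / 'a0-000.csv').write_text('x\n6\n')
        (destination / 'a0-000.csv').write_text('x\n1\n')  # stale, replaced
        (destination / 'a0-001.csv').write_text('x\n2\n')  # stale, deleted
        (destination / 'b0-000.csv').write_text('x\n9\n')  # other data, kept

        moved = move_data('a0', source, destination)
        remaining_in_source = sorted(p.name for p in source.iterdir())
        now_in_destination = sorted(p.name for p in destination.iterdir())
        new_content = (destination / 'a0-000.csv').read_text()

After the call, the source folder is empty, the stale ``a0`` files are gone from the destination, and the
unrelated ``b0`` file is untouched.
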
Launch several load jobs simultaneously, with massive parallelization of the query_to_bq and bq_to_gs steps.
BigQuery makes this parallelization possible.

.. code-block:: python

    from google_pandas_load import LoadConfig

    # Build the load configs.
    configs = []
    for i in range(100):
        config = LoadConfig(
            source='query',
            destination='local',
            data_name='b{}'.format(i),
            query='select {} as x'.format(i))
        configs.append(config)

    # Launch all the load jobs
    # at the same time.
    gpl.mload(configs=configs)

Main features
-------------
- Transfer big data faster (see `Speed Comparison`_).
- Transfer data seamlessly from and to various locations.
- Launch several load jobs simultaneously.
- Massive parallelization of the cloud steps with BigQuery.
- Monitor query costs and step durations of load jobs.
The basic mechanism
-------------------
The library essentially chains data transfer functions from the `Google Cloud Client Library for Python`_
and from pandas_.
To download, the following functions are chained:
- `google.cloud.bigquery.client.query()`_
- `google.cloud.bigquery.client.extract_table()`_
- `google.cloud.storage.blob.Blob.download_to_filename()`_
- `pandas.read_csv()`_
To upload, the following functions are chained:
- `pandas.DataFrame.to_csv()`_
- `google.cloud.storage.blob.Blob.upload_from_filename()`_
- `google.cloud.bigquery.client.load_table_from_uri()`_
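
One way to picture the chaining is to order the locations along the download direction and walk from the
source to the destination one neighbour at a time. The sketch below is an illustration, not the library's
code: ``steps_between`` is a hypothetical helper, and only the step names ``query_to_bq`` and ``bq_to_gs``
appear in this documentation; the other names are formed by analogy.

.. code-block:: python

    # Locations ordered along the download direction.
    # 'query' can only be a source: nothing is loaded "into" a query.
    DOWNLOAD_ORDER = ['query', 'bq', 'gs', 'local', 'dataframe']


    def steps_between(source, destination):
        """Hypothetical helper: list the elementary steps to chain."""
        if source not in DOWNLOAD_ORDER or destination not in DOWNLOAD_ORDER:
            raise ValueError('unknown location')
        if source == destination:
            raise ValueError('source and destination must be distinct')
        if destination == 'query':
            raise ValueError("'query' can only be a source")
        i = DOWNLOAD_ORDER.index(source)
        j = DOWNLOAD_ORDER.index(destination)
        if i < j:
            # Download direction: walk the order left to right.
            path = DOWNLOAD_ORDER[i:j + 1]
        else:
            # Upload direction: walk the order right to left.
            path = DOWNLOAD_ORDER[j:i + 1][::-1]
        return ['{}_to_{}'.format(a, b) for a, b in zip(path, path[1:])]


    download_steps = steps_between('query', 'dataframe')
    # ['query_to_bq', 'bq_to_gs', 'gs_to_local', 'local_to_dataframe']

    upload_steps = steps_between('dataframe', 'bq')
    # ['dataframe_to_local', 'local_to_gs', 'gs_to_bq']

Under these assumptions, ``download_steps`` holds one step per function of the download list above, and
``upload_steps`` one step per function of the upload list.
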
Required packages
-----------------
- google-cloud-bigquery
- google-cloud-storage
- pandas
Table of Contents
-----------------

.. toctree::
    :maxdepth: 3

    Tutorial
    Speed_comparison
    API

.. _`Python Client for Google BigQuery`: https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html
.. _`Speed Comparison`: Speed_comparison.ipynb
.. _Ysance: https://www.ysance.com/data-services/fr/home/
.. _`google.cloud.bigquery.job.QueryJob.to_dataframe()`: https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.to_dataframe.html#google.cloud.bigquery.job.QueryJob.to_dataframe
.. _`google.cloud.bigquery.client.Client.load_table_from_dataframe()`: https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.load_table_from_dataframe.html#google.cloud.bigquery.client.Client.load_table_from_dataframe
.. _`Google Cloud Client Library for Python`: https://googleapis.github.io/google-cloud-python/latest/index.html
.. _pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
.. _`google.cloud.bigquery.client.query()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query
.. _`google.cloud.bigquery.client.extract_table()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.extract_table
.. _`google.cloud.storage.blob.Blob.download_to_filename()`: https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.download_to_filename
.. _`pandas.read_csv()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
.. _`pandas.DataFrame.to_csv()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
.. _`google.cloud.storage.blob.Blob.upload_from_filename()`: https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.upload_from_filename
.. _`google.cloud.bigquery.client.load_table_from_uri()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_uri