google-pandas-load documentation ================================ Release v\ |release|. .. image:: https://img.shields.io/pypi/l/google-pandas-load.svg :target: https://pypi.org/project/google-pandas-load/ .. image:: https://img.shields.io/pypi/pyversions/google-pandas-load.svg :target: https://pypi.org/project/google-pandas-load/ google-pandas-load is a wrapper library for transferring big data from A to B, where A and B are distinct and chosen between BigQuery, Storage, a local folder and pandas. This library enables faster data transfer than those performed by `Python Client for Google BigQuery`_'s methods: - `google.cloud.bigquery.job.QueryJob.to_dataframe()`_ - `google.cloud.bigquery.client.Client.load_table_from_dataframe()`_ See `Speed Comparison`_. Acknowledgements ---------------- I am grateful to my employer Ysance_ for providing me the resources to develop this library and for allowing me to publish it. Installation ------------ :: $ pip install google-pandas-load Quickstart ---------- Set up a loader. In the following code, the credentials are inferred from the environment. For further information about how to authenticate to Google Cloud Platform with the `Google Cloud Client Library for Python`_, have a look `here `__. .. code-block:: python from google_pandas_load import LoaderQuickSetup gpl = LoaderQuickSetup( project_id='pi', dataset_id='di', bucket_name='bn', local_dir_path='/tmp', credentials=None) Transfer data seamlessly from and to various locations: .. warning:: In general, data is moved, not copied! The precise behavior is stated `here `__. .. warning:: In general, before data is moved to any location, it will delete any prior existing data having the same name in the location. This ensures a clean space for the upcoming data. The precise behavior is stated `here `__. .. code-block:: python # Populate a dataframe with a query result. df = gpl.load( source='query', destination='dataframe', query='select 3 as x') # Apply a python transformation to the data. df['x'] = 2*df['x'] # Upload the result to BigQuery. gpl.load( source='dataframe', destination='bq', data_name='a0', dataframe=df) # Extract the data to Storage. gpl.load( source='bq', destination='gs', data_name='a0') # The data is not in BigQuery anymore. # See warning above. # Download the data to the local folder # without deleting it in Storage. gpl.load( source='gs', destination='local', data_name='a0', delete_in_gs=False) Launch simultaneously several load jobs with massive parallelization of the query_to_bq and bq_to_gs steps. This is made possible by BigQuery. .. code-block:: python from google_pandas_load import LoadConfig # Build the load configs. configs = [] for i in range(100): config = LoadConfig( source='query', destination='local', data_name='b{}'.format(i), query='select {} as x'.format(i)) configs.append(config) # Launch all the load jobs # at the same time. gpl.mload(configs=configs) Main features ------------- - Transfer big data faster (see `Speed Comparison`_). - Transfer data seamlessly from and to various locations. - Launch several load jobs simultaneously. - Massive parallelization of the cloud steps with BigQuery. - Monitor query costs and step durations of load jobs. The basic mechanism ------------------- This code essentially chains transferring data functions from the `Google Cloud Client Library for Python`_ and from pandas_. To download, the following functions are chained: - `google.cloud.bigquery.client.query()`_ - `google.cloud.bigquery.client.extract_table()`_ - `google.cloud.storage.blob.Blob.download_to_filename()`_ - `pandas.read_csv()`_ To upload, the following functions are chained: - `pandas.DataFrame.to_csv()`_ - `google.cloud.storage.blob.Blob.upload_from_filename()`_ - `google.cloud.bigquery.client.load_table_from_uri()`_ Required packages ----------------- - google-cloud-bigquery - google-cloud-storage - pandas Table of Contents ----------------- .. toctree:: :maxdepth: 3 Tutorial Speed_comparison API .. _`Python Client for Google BigQuery`: https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html .. _`Speed Comparison`: Speed_comparison.ipynb .. _Ysance: https://www.ysance.com/data-services/fr/home/ .. _`google.cloud.bigquery.job.QueryJob.to_dataframe()`: https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.to_dataframe.html#google.cloud.bigquery.job.QueryJob.to_dataframe .. _`google.cloud.bigquery.client.Client.load_table_from_dataframe()`: https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.load_table_from_dataframe.html#google.cloud.bigquery.client.Client.load_table_from_dataframe .. _`Google Cloud Client Library for Python`: https://googleapis.github.io/google-cloud-python/latest/index.html .. _pandas: https://pandas.pydata.org/pandas-docs/stable/index.html .. _`google.cloud.bigquery.client.query()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query .. _`google.cloud.bigquery.client.extract_table()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.extract_table .. _`google.cloud.storage.blob.Blob.download_to_filename()`: https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.download_to_filename .. _`pandas.read_csv()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html .. _`pandas.DataFrame.to_csv()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html .. _`google.cloud.storage.blob.Blob.upload_from_filename()`: https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.upload_from_filename .. _`google.cloud.bigquery.client.load_table_from_uri()`: https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_uri