google-pandas-load documentation
Release v1.0.0.
google-pandas-load is a wrapper library for transferring big data from A to B, where A and B are distinct and chosen among BigQuery, Storage, a local folder, and pandas.
This library enables faster data transfers than the methods of the Python Client for Google BigQuery.
See Speed Comparison.
Acknowledgements
I am grateful to my employer, Ysance, for providing me with the resources to develop this library and for allowing me to publish it.
Installation
$ pip install google-pandas-load
Quickstart
Set up a loader.
In the following code, the credentials are inferred from the environment. For further information about how to authenticate to Google Cloud Platform with the Google Cloud Client Library for Python, have a look here.
from google_pandas_load import LoaderQuickSetup

gpl = LoaderQuickSetup(
    project_id='pi',
    dataset_id='di',
    bucket_name='bn',
    local_dir_path='/tmp',
    credentials=None)
Transfer data seamlessly from and to various locations:
Warning
In general, data is moved, not copied! The precise behavior is stated here.
Warning
In general, before data is moved to a location, any existing data with the same name in that location is deleted. This ensures a clean space for the incoming data. The precise behavior is stated here.
# Populate a dataframe with a query result.
df = gpl.load(
    source='query',
    destination='dataframe',
    query='select 3 as x')

# Apply a python transformation to the data.
df['x'] = 2 * df['x']

# Upload the result to BigQuery.
gpl.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df)

# Extract the data to Storage.
gpl.load(
    source='bq',
    destination='gs',
    data_name='a0')

# The data is not in BigQuery anymore.
# See warning above.

# Download the data to the local folder
# without deleting it in Storage.
gpl.load(
    source='gs',
    destination='local',
    data_name='a0',
    delete_in_gs=False)
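The two warnings above (data is moved rather than copied, and the destination is cleaned first) can be pictured with a purely local analogy. The sketch below is not the library's implementation: the `transfer` helper, its `delete_in_source` flag (modeled loosely on `delete_in_gs`) and the `.csv` file naming are hypothetical and use only the Python standard library.

```python
import shutil
import tempfile
from pathlib import Path


def transfer(data_name, src_dir, dst_dir, delete_in_source=True):
    """Hypothetical helper: send <data_name>.csv from src_dir to dst_dir."""
    src = Path(src_dir) / (data_name + '.csv')
    dst = Path(dst_dir) / (data_name + '.csv')
    # Clean the destination: drop prior data having the same name.
    if dst.exists():
        dst.unlink()
    if delete_in_source:
        # Default behavior in this sketch: the data is moved.
        shutil.move(str(src), str(dst))
    else:
        # Opt-out: the data is copied and the source is kept.
        shutil.copy2(str(src), str(dst))
    return dst


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    (Path(a) / 'a0.csv').write_text('x\n6\n')
    (Path(b) / 'a0.csv').write_text('stale\n')

    # Copy: the stale destination file is replaced, the source survives.
    transfer('a0', a, b, delete_in_source=False)
    kept_in_source = (Path(a) / 'a0.csv').exists()
    content = (Path(b) / 'a0.csv').read_text()

    # Move: the source no longer holds the data afterwards.
    transfer('a0', a, b)
    moved_out = not (Path(a) / 'a0.csv').exists()
```

In this analogy, passing `delete_in_source=False` plays the role that `delete_in_gs=False` plays in the last call above.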
Launch several load jobs simultaneously, with massive parallelization of the query_to_bq and bq_to_gs steps. BigQuery makes this possible.
from google_pandas_load import LoadConfig

# Build the load configs.
configs = []
for i in range(100):
    config = LoadConfig(
        source='query',
        destination='local',
        data_name='b{}'.format(i),
        query='select {} as x'.format(i))
    configs.append(config)

# Launch all the load jobs
# at the same time.
gpl.mload(configs=configs)
Main features
Transfer big data faster (see Speed Comparison).
Transfer data seamlessly from and to various locations.
Launch several load jobs simultaneously.
Massive parallelization of the cloud steps with BigQuery.
Monitor query costs and step durations of load jobs.
The basic mechanism
This code essentially chains data transfer functions from the Google Cloud Client Library for Python and from pandas.
To download, the following functions are chained:
To upload, the following functions are chained:
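As a minimal sketch of the chaining idea, assume the locations form a line (query, bq, gs, local, dataframe) and that each hop between neighbours is one atomic step. Only `query_to_bq` and `bq_to_gs` are step names taken from this page; the other step names, the linear ordering, and the `chain_steps` helper are hypothetical and are not part of the library's API.

```python
# Locations ordered from the cloud side to the pandas side (assumed ordering).
LOCATIONS = ['query', 'bq', 'gs', 'local', 'dataframe']


def chain_steps(source, destination):
    """Return the names of the atomic steps linking source to destination."""
    i, j = LOCATIONS.index(source), LOCATIONS.index(destination)
    if i == j:
        raise ValueError('source and destination must be distinct')
    if destination == 'query':
        raise ValueError("'query' can only be a source")
    # Walk right to download, left to upload.
    direction = 1 if i < j else -1
    path = LOCATIONS[i:j + direction:direction]
    # Each pair of neighbouring locations gives one step name.
    return ['{}_to_{}'.format(a, b) for a, b in zip(path, path[1:])]


download = chain_steps('query', 'dataframe')
upload = chain_steps('dataframe', 'bq')
```

Under these assumptions, a download from a query to a dataframe passes through four steps and an upload from a dataframe to BigQuery through three, each of which would wrap one function from the client libraries or from pandas.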
Required packages
google-cloud-bigquery
google-cloud-storage
pandas