Loader
-
class google_pandas_load.Loader(bq_client=None, dataset_ref=None, bucket=None, gs_dir_path_in_bucket=None, local_dir_path=None, generated_data_name_prefix=None, max_concurrent_google_jobs=10, use_wildcard=True, compress=True, separator='|', chunk_size=268435456, logger=<Logger Loader (WARNING)>)
Bases: object
Wrapper for transferring big data between A and B, where A and B are distinct and chosen among a BigQuery dataset, a directory in a Storage bucket, a local folder and RAM (as a pandas.DataFrame).
The Loader bundles all the parameters that do not change often when executing load jobs during a workflow.
- Parameters
bq_client (google.cloud.bigquery.client.Client, optional) – Client to execute google load jobs.
dataset_ref (google.cloud.bigquery.dataset.DatasetReference, optional) – The dataset reference.
bucket (google.cloud.storage.bucket.Bucket, optional) – The bucket.
gs_dir_path_in_bucket (str, optional) – The path of the directory in the bucket.
local_dir_path (str, optional) – The path of the local folder.
generated_data_name_prefix (str, optional) – The prefix added to any generated data name in case the user does not give a name to the loaded data. It is a useful feature to quickly find loaded data when debugging the code.
max_concurrent_google_jobs (int, optional) – The maximum number of concurrent google jobs allowed to be launched by the BigQuery Client. Defaults to 10.
use_wildcard (bool, optional) – If set to True, data moving from BigQuery to Storage is split into several files whose basenames match a wildcard pattern. Defaults to True.
compress (bool, optional) – If set to True, data is compressed when moved from BigQuery to Storage or from pandas to the local folder. Defaults to True.
separator (str, optional) – The character which separates the columns of the data. Defaults to ‘|’.
chunk_size (int, optional) – The chunk size of a Storage blob created when data comes from the local folder. See here for more information. Defaults to 2**28.
logger (logging.Logger, optional) – The logger creating the log records of this class. Defaults to a logger called Loader.
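To make the separator and compress parameters concrete, here is a minimal sketch of a pipe-separated, compressed file written and read back with the standard library only. It does not use the Loader. The gzip format, the file name and the helper names are assumptions made for illustration; the documentation above only says that data is compressed and that columns are separated by the separator character.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical sample rows; note the comma inside a value, which is harmless
# because the column separator is '|'.
rows = [["id", "label"], ["1", "a,b"], ["2", "c"]]


def write_rows(path, rows, separator="|"):
    # Text-mode gzip stream; newline="" is what the csv module expects.
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f, delimiter=separator).writerows(rows)


def read_rows(path, separator="|"):
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.reader(f, delimiter=separator))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv.gz")  # assumed file name
    write_rows(path, rows)
    round_trip = read_rows(path)
```

A separator that rarely occurs in the data, such as the default '|', keeps values containing commas intact without quoting.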
Note
What is the data named data_name?
in BigQuery: the table in the dataset whose id is data_name.
in Storage: the blobs whose basename begins with data_name inside the bucket directory.
in local: the files whose basename begins with data_name inside the local folder.
This definition is motivated by the fact that BigQuery splits a big table into several blobs when extracting it to Storage.
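The prefix rule in this note can be illustrated for the local folder with a short sketch. It is not the library's implementation: the helper name and the file names are hypothetical, and only the rule "basename begins with data_name" is taken from the note.

```python
import os
import tempfile


def list_paths_with_prefix(folder, data_name):
    # Keep every file whose basename begins with data_name.
    return sorted(
        os.path.join(folder, basename)
        for basename in os.listdir(folder)
        if basename.startswith(data_name)
    )


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names: two pieces of "sales" and one unrelated file.
    for basename in ("sales-000000000000.csv.gz",
                     "sales-000000000001.csv.gz",
                     "other.csv.gz"):
        open(os.path.join(folder, basename), "w").close()
    matched = [os.path.basename(p)
               for p in list_paths_with_prefix(folder, "sales")]
```

Under this rule, a data name that is itself a prefix of another one (for instance sales and sales_2020) would match both sets of files, so distinct data names are best chosen so that none begins with another.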
-
list_blob_uris(data_name)
Return the list of the URIs of the Storage blobs forming the data named data_name in Storage.
-
list_local_file_paths(data_name)
Return the list of the paths of the files forming the data named data_name in the local folder.
-
load(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, infer_datetime_format=True, date_cols=None, timestamp_cols=None, bq_schema=None, delete_in_bq=True, delete_in_gs=True, delete_in_local=True)
Execute a load job whose configuration is specified by the arguments.
The data is transferred from source to destination. The valid values for the source and the destination are: ‘query’, ‘bq’, ‘gs’, ‘local’ and ‘dataframe’.
Downloading follows the path ‘query’ -> ‘bq’ -> ‘gs’ -> ‘local’ -> ‘dataframe’, while uploading goes in the opposite direction.
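The path above determines which intermediate steps a load job goes through. The following sketch derives them from a source and a destination. The function name is hypothetical and this is not the library's code; the step names follow the x_to_y pattern used for the duration attributes of xload.

```python
# Locations in download order, as listed above.
LOCATIONS = ["query", "bq", "gs", "local", "dataframe"]


def load_steps(source, destination):
    """Return the step names crossed when going from source to destination."""
    i, j = LOCATIONS.index(source), LOCATIONS.index(destination)
    if i == j:
        raise ValueError("source and destination must be distinct")
    stride = 1 if i < j else -1  # download goes forward, upload backward
    path = [LOCATIONS[k] for k in range(i, j + stride, stride)]
    return [f"{a}_to_{b}" for a, b in zip(path, path[1:])]


download = load_steps("query", "dataframe")
upload = load_steps("dataframe", "bq")
```

Here download lists four steps, from query_to_bq to local_to_dataframe, and upload lists dataframe_to_local, local_to_gs and gs_to_bq.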
Warning
In general, data is moved, not copied!
Once the load job has been executed, the data usually does not exist anymore in the source and in any transitional locations.
However two exceptions exist:
When source = ‘dataframe’, the dataframe is not deleted in RAM.
When destination = ‘query’, the data is not deleted in BigQuery, so that it still exists somewhere. Indeed, in this case, the load job returns a simple query which represents the data but does not contain it.
Use the delete_in_bq, delete_in_gs and delete_in_local parameters to control data deletion during the execution of the load job.
Warning
In general, pre-existing data is deleted!
Before new data is moved to any location, the loader will usually delete any prior data bearing the same name to prevent any conflict.
There is one exception:
When destination = ‘bq’ and the write_disposition parameter is set to ‘WRITE_APPEND’, new data is appended to the pre-existing data with the same name.
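As an illustration of the deletion flags from the first warning, the sketch below exports a table to the local folder while keeping the source table in BigQuery. Only the load keyword arguments come from the signature above. RecordingLoader is a hypothetical stand-in that records the call so the sketch runs without Google credentials; with a real Loader the same call would be made on the loader instance.

```python
class RecordingLoader:
    """Hypothetical stand-in: records the keyword arguments passed to load."""

    def __init__(self):
        self.calls = []

    def load(self, **kwargs):
        self.calls.append(kwargs)


def export_table_keeping_source(loader, data_name):
    # 'bq' -> 'local' passes through Storage. delete_in_bq=False keeps the
    # source table; delete_in_gs is left at its default (True), so the
    # transitional blobs in Storage are removed.
    return loader.load(
        source="bq",
        destination="local",
        data_name=data_name,
        delete_in_bq=False,
    )


loader = RecordingLoader()
export_table_keeping_source(loader, "sales")  # "sales" is an assumed name
```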
- Parameters
source (str) – one of ‘query’, ‘bq’, ‘gs’, ‘local’, ‘dataframe’.
destination (str) – one of ‘query’, ‘bq’, ‘gs’, ‘local’, ‘dataframe’.
data_name (str, optional) – The name of the data. If not passed, a name is generated by concatenating the generated_data_name_prefix of the loader, if any, the current timestamp and a random integer. This is useful when source = ‘query’ and destination = ‘dataframe’ because the user may not need to know the data_name.
query (str, optional) – A BigQuery Standard Sql query. Required if source = ‘query’.
dataframe (pandas.DataFrame, optional) – A pandas dataframe. Required if source = ‘dataframe’.
write_disposition (google.cloud.bigquery.job.WriteDisposition, optional) – Specifies the action that occurs if data named data_name already exists in BigQuery. Defaults to ‘WRITE_TRUNCATE’.
dtype (dict, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and dtype is one of its parameters.
parse_dates (list of str, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and parse_dates is one of its parameters.
infer_datetime_format (bool, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and infer_datetime_format is one of its parameters. Defaults to True.
date_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type DATE.
timestamp_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type TIMESTAMP.
bq_schema (list of google.cloud.bigquery.schema.SchemaField, optional) – The table’s schema in BigQuery. Used when destination = ‘bq’ and source != ‘query’. When source = ‘query’, the bq_schema is inferred from the query. If not passed and source = ‘dataframe’, falls back to an inferred value from the dataframe with google_pandas_load.LoadConfig.bq_schema_inferred_from_dataframe().
delete_in_bq (bool, optional) – If set to False, data going from or through BigQuery is not deleted in BigQuery. Defaults to True.
delete_in_gs (bool, optional) – If set to False, data going from or through Storage is not deleted in Storage. Defaults to True.
delete_in_local (bool, optional) – If set to False, data going from or through the local folder is not deleted in that folder. Defaults to True.
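The name generation described for data_name can be pictured with the sketch below. The exact format used by the library is not given here, so the timestamp layout, the range of the random integer, the "_" joiner and the function name are all assumptions.

```python
import random
import time


def generate_data_name(prefix=None):
    # Assumed layout: [prefix_]YYYYmmddHHMMSS_rand<integer>
    timestamp = time.strftime("%Y%m%d%H%M%S")
    suffix = f"rand{random.randint(0, 10**9)}"
    parts = ([prefix] if prefix else []) + [timestamp, suffix]
    return "_".join(parts)


with_prefix = generate_data_name("dbg")
without_prefix = generate_data_name()
```

A recognizable prefix such as dbg makes generated tables, blobs and files easy to spot when inspecting the dataset, the bucket directory or the local folder.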
- Returns
The result of the load job:
When destination = ‘query’, it returns the BigQuery standard SQL query: “select * from `project_id.dataset_id.data_name`”, where the project_id is the dataset’s one.
When destination = ‘dataframe’, it returns a pandas dataframe populated with the data specified by the arguments.
In all other cases, it returns None.
- Return type
str or pandas.DataFrame or NoneType
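When destination = ‘query’, the returned string can be embedded in a larger query. The sketch below rebuilds a string of the documented shape and wraps it in a subquery. The helper name and the identifiers my-project, my_dataset and sales are hypothetical.

```python
def table_query(project_id, dataset_id, data_name):
    # Same shape as the query described under Returns.
    return f"select * from `{project_id}.{dataset_id}.{data_name}`"


base = table_query("my-project", "my_dataset", "sales")
# The result is plain SQL, so it composes like any other subquery.
row_count_query = f"select count(*) as n from ({base})"
```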
-
mload(configs)
Execute several load jobs specified by the configurations. The prefix m means multi.
The BigQuery Client simultaneously executes the query_to_bq parts (resp. the bq_to_gs and gs_to_bq parts) of the configurations in batches of size max_concurrent_google_jobs.
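The batching by max_concurrent_google_jobs can be pictured as slicing the list of jobs into consecutive groups. This is only a sketch of the grouping: the function name is hypothetical, and how the library schedules and waits for each batch is not described here.

```python
def batches(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# 25 hypothetical jobs with the default max_concurrent_google_jobs=10.
job_ids = list(range(25))
batch_sizes = [len(batch) for batch in batches(job_ids, 10)]
```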
- Parameters
configs (list of google_pandas_load.LoadConfig) – See google_pandas_load.LoadConfig for the format of one configuration.
- Returns
A list of load results. The i-th element is the result of the load job configured by configs[i]. See google_pandas_load.Loader.load() for the format of one load result.
- Return type
list of (str or NoneType or pandas.DataFrame)
-
xload(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, infer_datetime_format=True, date_cols=None, timestamp_cols=None, bq_schema=None, delete_in_bq=True, delete_in_gs=True, delete_in_local=True)
It works like google_pandas_load.Loader.load() but also returns extra information about the data and the load job’s execution. The prefix x is for extra.
- Returns
An xload result res with the following attributes:
res.load_result (str or NoneType or pandas.DataFrame): The result of the load job.
res.data_name (str): The name of the loaded data.
res.duration (int): The load job’s duration in seconds.
res.durations (argparse.Namespace): A report providing the durations of each step of the load job. It has the following attributes:
res.durations.query_to_bq (int or NoneType): the duration in seconds of the query_to_bq part if any.
res.durations.bq_to_gs (int or NoneType): the duration in seconds of the bq_to_gs part if any.
res.durations.gs_to_local (int or NoneType): the duration in seconds of the gs_to_local part if any.
res.durations.local_to_dataframe (int or NoneType): the duration in seconds of the local_to_dataframe part if any.
res.durations.dataframe_to_local (int or NoneType): the duration in seconds of the dataframe_to_local part if any.
res.durations.local_to_gs (int or NoneType): the duration in seconds of the local_to_gs part if any.
res.durations.gs_to_bq (int or NoneType): the duration in seconds of the gs_to_bq part if any.
res.durations.bq_to_query (int or NoneType): the duration in seconds of the bq_to_query part if any.
res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.
- Return type
argparse.Namespace
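To show how such a result can be inspected, the sketch below builds a namespace by hand with the attributes listed above and keeps only the steps that actually ran. The values are invented for illustration (a ‘query’ to ‘gs’ job) and the helper name is hypothetical; a real result would come from xload.

```python
from argparse import Namespace

# Hand-built stand-in for an xload result; all numbers are made up.
res = Namespace(
    load_result=None,
    data_name="sales",
    duration=42,
    durations=Namespace(
        query_to_bq=30, bq_to_gs=12, gs_to_local=None,
        local_to_dataframe=None, dataframe_to_local=None,
        local_to_gs=None, gs_to_bq=None, bq_to_query=None,
    ),
    query_cost=0.01,
)


def executed_steps(durations):
    # Steps that did not run have a duration of None.
    return {step: seconds
            for step, seconds in vars(durations).items()
            if seconds is not None}


ran = executed_steps(res.durations)
```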
-
xmload(configs)
It works like google_pandas_load.Loader.mload() but also returns extra information about the data and the mload job’s execution.
- Parameters
configs (list of google_pandas_load.LoadConfig) – See google_pandas_load.LoadConfig for the format of one configuration.
- Returns
The xmload result res with the following attributes:
res.load_results (list of (str or NoneType or pandas.DataFrame)): A list of load results.
res.data_names (list of str): The names of the data. The i-th element is the data_name attached to configs[i], either given as an argument or generated by the loader.
res.duration (int): The mload job’s duration.
res.durations (argparse.Namespace): A report providing the duration of each step of the mload job.
res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.
res.query_costs (list of (float or NoneType)): The query costs in US dollars of the mload. The i-th element is the query cost of the load job configured by configs[i].
- Return type
argparse.Namespace
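Because the lists in an xmload result are aligned with configs, they can be zipped together. The sketch below uses a hand-built namespace with invented values, not a real result, and sums the per-job costs itself rather than assuming how res.query_cost is computed.

```python
from argparse import Namespace

# Hand-built stand-in for an xmload result; names and costs are made up.
res = Namespace(
    data_names=["a", "b", "c"],
    load_results=[None, None, None],
    query_costs=[0.25, None, 0.5],
)

# Cost per data name; None marks a job without a query_to_bq part.
cost_by_name = dict(zip(res.data_names, res.query_costs))
total_cost = sum(cost for cost in res.query_costs if cost is not None)
```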