Loader

class google_pandas_load.loader.Loader(bq_client=None, dataset_ref=None, bucket=None, gs_dir_path=None, local_dir_path=None, separator='|', chunk_size=268435456, logger=<Logger Loader (WARNING)>)[source]

Bases: object

Wrapper for transferring big data between A and B, where A and B are distinct and each is chosen among a BigQuery dataset, a directory in a Storage bucket, a local folder and the RAM (as a pandas.DataFrame).

The Loader bundles all the parameters that do not change often when executing load jobs during a workflow.

Parameters
  • bq_client (google.cloud.bigquery.client.Client, optional) – Client used to execute the BigQuery jobs.

  • dataset_ref (google.cloud.bigquery.dataset.DatasetReference, optional) – The dataset reference.

  • bucket (google.cloud.storage.bucket.Bucket, optional) – The bucket.

  • gs_dir_path (str, optional) – The path of the directory in the bucket.

  • local_dir_path (str, optional) – The path of the local folder.

  • separator (str, optional) – The character which separates the columns of the data. Defaults to ‘|’.

  • chunk_size (int, optional) – The chunk size (in bytes) of a Storage blob created when data comes from the local folder. See the google-cloud-storage documentation for more information. Defaults to 2**28.

  • logger (logging.Logger, optional) – The logger creating the log records of this class. Defaults to a logger called Loader.
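
A minimal construction sketch follows; the project, dataset, bucket and directory names are placeholders, and the separator, chunk_size and logger values shown are simply the documented defaults:

import logging

from google.cloud import bigquery, storage
from google_pandas_load.loader import Loader

bq_client = bigquery.Client(project='my-project')
dataset_ref = bigquery.DatasetReference('my-project', 'my_dataset')
bucket = storage.Client(project='my-project').bucket('my-bucket')

loader = Loader(
    bq_client=bq_client,
    dataset_ref=dataset_ref,
    bucket=bucket,
    gs_dir_path='gpl_dir',               # directory inside the bucket
    local_dir_path='/tmp/gpl',           # local staging folder
    separator='|',                       # default column separator
    chunk_size=2**28,                    # default blob chunk size
    logger=logging.getLogger('Loader'))  # default logger name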

property bq_client

The bq_client given as an argument.

Type

google.cloud.bigquery.client.Client

property bucket

The bucket given as an argument.

Type

google.cloud.storage.bucket.Bucket

property bucket_name

The name of the bucket given as an argument.

Type

str

property dataset_id

The id of the dataset_ref given as an argument.

Type

str

property dataset_name

The name of the dataset_ref given as an argument.

Type

str

property dataset_ref

The dataset_ref given as an argument.

Type

google.cloud.bigquery.dataset.DatasetReference

delete_in_bq(data_name)[source]

Delete the data named data_name in BigQuery.

delete_in_gs(data_name)[source]

Delete the data named data_name in Storage.

delete_in_local(data_name)[source]

Delete the data named data_name in local.

exist_in_bq(data_name)[source]

Return True if the data named data_name exists in BigQuery.

exist_in_gs(data_name)[source]

Return True if the data named data_name exists in Storage.

exist_in_local(data_name)[source]

Return True if the data named data_name exists in local.

property gs_dir_path

The gs_dir_path given as an argument.

Type

str

list_blob_uris(data_name)[source]

Return the list of the URIs of the Storage blobs forming the data named data_name in Storage.

list_blobs(data_name)[source]

Return the data named data_name in Storage as a list of Storage blobs.

list_local_file_paths(data_name)[source]

Return the list of the paths of the files forming the data named data_name in local.
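
As an illustration of these helpers, assuming a loader configured as above and data previously loaded under the name 'a0' (the name is only illustrative):

if loader.exist_in_gs('a0'):
    print(loader.list_blob_uris('a0'))         # gs:// URIs of the blobs
    loader.delete_in_gs('a0')                  # remove the staged blobs

if loader.exist_in_local('a0'):
    print(loader.list_local_file_paths('a0'))  # paths of the local files
    loader.delete_in_local('a0')               # remove the local files

if loader.exist_in_bq('a0'):
    loader.delete_in_bq('a0')                  # drop the BigQuery table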

load(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, date_cols=None, timestamp_cols=None, bq_schema=None)[source]

Execute a load job whose configuration is specified by the arguments.

The data is loaded from source to destination.

The valid values for source are ‘query’, ‘bq’, ‘gs’, ‘local’ and ‘dataframe’.

The valid values for the destination are ‘bq’, ‘gs’, ‘local’ and ‘dataframe’.

Downloading follows the path: ‘query’ -> ‘bq’ -> ‘gs’ -> ‘local’ -> ‘dataframe’.

Uploading follows the path: ‘dataframe’ -> ‘local’ -> ‘gs’ -> ‘bq’.

Note

What is the data named data_name?

  • in BigQuery: the table in the dataset whose name is data_name.

  • in Storage: the blobs whose basename begins with data_name inside the bucket directory.

  • in local: the files whose basename begins with data_name inside the local folder.

This definition is motivated by the fact that BigQuery splits a big table into several blobs when extracting it to Storage.

Note

Data is not renamed

Since renaming data identified by a prefix (see the previous note) raises too many difficulties, the choice has been made to keep its original name.

Warning

By default, pre-existing data is deleted!

Since data is not renamed (see the previous note), the loader deletes any prior data with the same name before loading the new data, in order to prevent any conflict.

To illustrate this process, consider the following load:

loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df)

Before populating a BigQuery table, the data goes through a local folder and Storage. If existing data named ‘a0’ is present in any of these three locations prior to the load job, it is erased first.

This default behaviour can only be modified for the BigQuery location, by changing the value of the write_disposition parameter.
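
For instance, to append to the pre-existing BigQuery table instead of overwriting it (a sketch; 'a0' and df are the same illustrative names as above):

loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df,
    write_disposition='WRITE_APPEND')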

Parameters
  • source (str) – one of ‘query’, ‘bq’, ‘gs’, ‘local’, ‘dataframe’.

  • destination (str) – one of ‘bq’, ‘gs’, ‘local’, ‘dataframe’.

  • data_name (str, optional) – The name of the data. If not passed, a name is generated by concatenating the current timestamp and a random integer. This is useful when source = ‘query’ and destination = ‘dataframe’ because the user may not need to know the data_name.

  • query (str, optional) – A BigQuery Standard SQL query. Required if source = ‘query’.

  • dataframe (pandas.DataFrame, optional) – A pandas dataframe. Required if source = ‘dataframe’.

  • write_disposition (google.cloud.bigquery.job.WriteDisposition, optional) – Specifies the action that occurs if data named data_name already exists in BigQuery. Defaults to ‘WRITE_TRUNCATE’.

  • dtype (dict, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and dtype is one of its parameters.

  • parse_dates (list of str, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and parse_dates is one of its parameters.

  • date_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type DATE.

  • timestamp_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type TIMESTAMP.

  • bq_schema (list of google.cloud.bigquery.schema.SchemaField, optional) – The table’s schema in BigQuery. Used when destination = ‘bq’ and source != ‘query’. When source = ‘query’, the bq_schema is inferred from the query. If source is one of ‘gs’ or ‘local’ and the bq_schema is not passed, it falls back to a schema inferred from the CSV files with google.cloud.bigquery.job.LoadJobConfig.autodetect. If source = ‘dataframe’ and the bq_schema is not passed, it falls back to a schema inferred from the dataframe.

Returns

The result of the load job:

  • When destination = ‘dataframe’, it returns a pandas dataframe populated with the data specified by the arguments.

  • In all other cases, it returns None.

Return type

pandas.DataFrame or NoneType
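
A sketch of the two most common uses (the query, the data name and the dataframe are illustrative):

# Download the result of a query into a pandas dataframe.
df = loader.load(
    source='query',
    destination='dataframe',
    query='SELECT 1 AS x')

# Upload the dataframe to a BigQuery table named 'a0'.
loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df)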

property local_dir_path

The local_dir_path given in the argument.

Type

str

mload(configs)[source]

Execute several load jobs specified by the configurations. The prefix m means multi.

The BigQuery client simultaneously executes the query_to_bq parts (resp. the bq_to_gs and gs_to_bq parts) of the configurations.

Parameters

configs (list of google_pandas_load.load_config.LoadConfig) – See google_pandas_load.load_config.LoadConfig for the format of one configuration.

Returns

A list of load results. The i-th element is the result of the load job configured by configs[i]. See google_pandas_load.loader.Loader.load() for the format of one load result.

Return type

list of (pandas.DataFrame or NoneType)
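
A sketch, assuming google_pandas_load.load_config.LoadConfig accepts the same arguments as google_pandas_load.loader.Loader.load() (see its documentation for the exact format):

from google_pandas_load import LoadConfig

configs = [
    LoadConfig(source='query', destination='dataframe',
               query='SELECT 1 AS x'),
    LoadConfig(source='query', destination='bq', data_name='a0',
               query='SELECT 2 AS y')]

results = loader.mload(configs)
df = results[0]             # dataframe built by the first config
assert results[1] is None   # a 'bq' destination returns None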

xload(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, date_cols=None, timestamp_cols=None, bq_schema=None)[source]

It works like google_pandas_load.loader.Loader.load() but also returns extra information about the data and the load job’s execution. The prefix x is for extra.

Returns

An xload result res with the following attributes:

  • res.load_result (pandas.DataFrame or NoneType): The result of the load job.

  • res.data_name (str): The name of the loaded data.

  • res.duration (int): The load job’s duration in seconds.

  • res.durations (argparse.Namespace): A report providing the durations of each step of the load job. It has the following attributes:

    • res.durations.query_to_bq (int or NoneType): the duration in seconds of the query_to_bq part if any.

    • res.durations.bq_to_gs (int or NoneType): the duration in seconds of the bq_to_gs part if any.

    • res.durations.gs_to_local (int or NoneType): the duration in seconds of the gs_to_local part if any.

    • res.durations.local_to_dataframe (int or NoneType): the duration in seconds of the local_to_dataframe part if any.

    • res.durations.dataframe_to_local (int or NoneType): the duration in seconds of the dataframe_to_local part if any.

    • res.durations.local_to_gs (int or NoneType): the duration in seconds of the local_to_gs part if any.

    • res.durations.gs_to_bq (int or NoneType): the duration in seconds of the gs_to_bq part if any.

  • res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.

Return type

argparse.Namespace
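
A sketch of accessing the extra information (attribute names as listed above; the query is illustrative):

res = loader.xload(
    source='query',
    destination='dataframe',
    query='SELECT 1 AS x')

df = res.load_result               # the loaded dataframe
print(res.data_name)               # given or generated data name
print(res.duration)                # total duration in seconds
print(res.durations.query_to_bq)   # duration of the query_to_bq step
print(res.query_cost)              # query cost in US dollars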

xmload(configs)[source]

It works like google_pandas_load.loader.Loader.mload() but also returns extra information about the data and the mload job’s execution.

Parameters

configs (list of google_pandas_load.load_config.LoadConfig) – See google_pandas_load.load_config.LoadConfig for the format of one configuration.

Returns

The xmload result res with the following attributes:

  • res.load_results (list of (pandas.DataFrame or NoneType)): A list of load results.

  • res.data_names (list of str): The names of the data. The i-th element is the data_name attached to configs[i], either given as an argument or generated by the loader.

  • res.duration (int): The mload job’s duration.

  • res.durations (argparse.Namespace): A report providing the durations of each step of the mload job.

  • res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.

  • res.query_costs (list of (float or NoneType)): The query costs in US dollars of the mload job. The i-th element is the query cost of the load job configured by configs[i].

Return type

argparse.Namespace
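
Continuing the mload sketch above:

xres = loader.xmload(configs)

print(xres.data_names)    # one data name per config
print(xres.duration)      # duration of the whole mload job
print(xres.query_costs)   # one query cost per config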