Loader

class google_pandas_load.loader.Loader(bq_client=None, dataset_ref=None, bucket=None, gs_dir_path=None, local_dir_path=None, separator='|', chunk_size=268435456, logger=<Logger Loader (WARNING)>)[source]

Bases: object

Wrapper for transferring big data between A and B, where A and B are distinct and each is chosen among a BigQuery dataset, a directory in a Storage bucket, a local folder and the RAM (as a pandas.DataFrame).

The Loader bundles all the parameters that do not change often when executing load jobs during a workflow.

Parameters
  • bq_client (google.cloud.bigquery.client.Client, optional) – Client used to execute the BigQuery jobs.

  • dataset_ref (google.cloud.bigquery.dataset.DatasetReference, optional) – The dataset reference.

  • bucket (google.cloud.storage.bucket.Bucket, optional) – The bucket.

  • gs_dir_path (str, optional) – The path of the directory in the bucket.

  • local_dir_path (str, optional) – The path of the local folder.

  • separator (str, optional) – The character which separates the columns of the data. Defaults to ‘|’.

  • chunk_size (int, optional) – The chunk size (in bytes) of a Storage blob created when data comes from the local folder. See the google-cloud-storage documentation for more information. Defaults to 2**28.

  • logger (logging.Logger, optional) – The logger creating the log records of this class. Defaults to a logger called Loader.
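
A minimal construction sketch follows; the project, dataset, bucket and directory names are placeholders, and the separator, chunk_size and logger values shown are simply the documented defaults:

import logging

from google.cloud import bigquery, storage
from google_pandas_load.loader import Loader

bq_client = bigquery.Client(project='my-project')
dataset_ref = bigquery.DatasetReference('my-project', 'my_dataset')
bucket = storage.Client(project='my-project').bucket('my-bucket')

loader = Loader(
    bq_client=bq_client,
    dataset_ref=dataset_ref,
    bucket=bucket,
    gs_dir_path='gpl_dir',               # directory inside the bucket
    local_dir_path='/tmp/gpl',           # local staging folder
    separator='|',                       # default column separator
    chunk_size=2**28,                    # default blob chunk size
    logger=logging.getLogger('Loader'))  # default logger name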

property bq_client

The bq_client given as an argument.

Type

google.cloud.bigquery.client.Client

property bucket

The bucket given as an argument.

Type

google.cloud.storage.bucket.Bucket

property bucket_name

The name of the bucket given as an argument.

Type

str

property dataset_id

The id of the dataset_ref given as an argument.

Type

str

property dataset_name

The name of the dataset_ref given as an argument.

Type

str

property dataset_ref

The dataset_ref given as an argument.

Type

google.cloud.bigquery.dataset.DatasetReference

delete_in_bq(data_name)[source]

Delete the data named data_name in BigQuery.

delete_in_gs(data_name)[source]

Delete the data named data_name in Storage.

delete_in_local(data_name)[source]

Delete the data named data_name in local.

exist_in_bq(data_name)[source]

Return True if the data named data_name exists in BigQuery.

exist_in_gs(data_name)[source]

Return True if the data named data_name exists in Storage.

exist_in_local(data_name)[source]

Return True if the data named data_name exists in local.

property gs_dir_path

The gs_dir_path given as an argument.

Type

str

list_blob_uris(data_name)[source]

Return the list of the URIs of the Storage blobs forming the data named data_name in Storage.

list_blobs(data_name)[source]

Return the data named data_name in Storage as a list of Storage blobs.

list_local_file_paths(data_name)[source]

Return the list of the paths of the files forming the data named data_name in local.
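
As an illustration of these helpers, assuming a loader configured as above and data previously loaded under the name 'a0' (the name is only illustrative):

if loader.exist_in_gs('a0'):
    print(loader.list_blob_uris('a0'))         # gs:// URIs of the blobs
    loader.delete_in_gs('a0')                  # remove the staged blobs

if loader.exist_in_local('a0'):
    print(loader.list_local_file_paths('a0'))  # paths of the local files
    loader.delete_in_local('a0')               # remove the local files

if loader.exist_in_bq('a0'):
    loader.delete_in_bq('a0')                  # drop the BigQuery table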

load(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, date_cols=None, timestamp_cols=None, bq_schema=None)[source]

Execute a load job whose configuration is specified by the arguments.

The data is loaded from source to destination.

The valid values for source are ‘query’, ‘bq’, ‘gs’, ‘local’ and ‘dataframe’.

The valid values for the destination are ‘bq’, ‘gs’, ‘local’ and ‘dataframe’.

Downloading follows the path: ‘query’ -> ‘bq’ -> ‘gs’ -> ‘local’ -> ‘dataframe’.

Uploading follows the path: ‘dataframe’ -> ‘local’ -> ‘gs’ -> ‘bq’.

Note

What is the data named data_name?

  • in BigQuery: the table in the dataset whose name is data_name.

  • in Storage: the blobs whose basename begins with data_name inside the bucket directory.

  • in local: the files whose basename begins with data_name inside the local folder.

This definition is motivated by the fact that BigQuery splits a big table into several blobs when extracting it to Storage.

Note

Data is not renamed

Since renaming data identified by a prefix (see the previous note) raises too many difficulties, the choice has been made to keep its original name.

Warning

By default, pre-existing data is deleted!

Since data is not renamed (see the previous note), the loader deletes any prior data with the same name before loading the new data, in order to prevent any conflict.

To illustrate this process, consider the following load:

loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df)

Before populating a BigQuery table, the data goes through a local folder and Storage. If existing data named ‘a0’ is present in any of these three locations prior to the load job, it is erased first.

This default behaviour can only be modified for the BigQuery location, by changing the value of the write_disposition parameter.
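
For instance, to append to the pre-existing BigQuery table instead of overwriting it (a sketch; 'a0' and df are the same illustrative names as above):

loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df,
    write_disposition='WRITE_APPEND')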

Parameters
  • source (str) – one of ‘query’, ‘bq’, ‘gs’, ‘local’, ‘dataframe’.

  • destination (str) – one of ‘bq’, ‘gs’, ‘local’, ‘dataframe’.

  • data_name (str, optional) – The name of the data. If not passed, a name is generated by concatenating the current timestamp and a random integer. This is useful when source = ‘query’ and destination = ‘dataframe’ because the user may not need to know the data_name.

  • query (str, optional) – A BigQuery Standard SQL query. Required if source = ‘query’.

  • dataframe (pandas.DataFrame, optional) – A pandas dataframe. Required if source = ‘dataframe’.

  • write_disposition (google.cloud.bigquery.job.WriteDisposition, optional) – Specifies the action that occurs if data named data_name already exists in BigQuery. Defaults to ‘WRITE_TRUNCATE’.

  • dtype (dict, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and dtype is one of its parameters.

  • parse_dates (list of str, optional) – When destination = ‘dataframe’, pandas.read_csv() is used and parse_dates is one of its parameters.

  • date_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type DATE.

  • timestamp_cols (list of str, optional) – If no bq_schema is passed, indicate which columns of a pandas dataframe should have the BigQuery type TIMESTAMP.

  • bq_schema (list of google.cloud.bigquery.schema.SchemaField, optional) – The table’s schema in BigQuery. Used when destination = ‘bq’ and source != ‘query’. When source = ‘query’, the bq_schema is inferred from the query. If source is one of ‘gs’ or ‘local’ and the bq_schema is not passed, it falls back to a schema inferred from the CSV files with google.cloud.bigquery.job.LoadJobConfig.autodetect. If source = ‘dataframe’ and the bq_schema is not passed, it falls back to a schema inferred from the dataframe.

Returns

The result of the load job:

  • When destination = ‘dataframe’, it returns a pandas dataframe populated with the data specified by the arguments.

  • In all other cases, it returns None.

Return type

pandas.DataFrame or NoneType
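
A sketch of the two most common uses (the query, the data name and the dataframe are illustrative):

# Download the result of a query into a pandas dataframe.
df = loader.load(
    source='query',
    destination='dataframe',
    query='SELECT 1 AS x')

# Upload the dataframe to a BigQuery table named 'a0'.
loader.load(
    source='dataframe',
    destination='bq',
    data_name='a0',
    dataframe=df)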

property local_dir_path

The local_dir_path given in the argument.

Type

str

mload(configs)[source]

Execute several load jobs specified by the configurations. The prefix m means multi.

The BigQuery client simultaneously executes the query_to_bq parts (resp. the bq_to_gs and gs_to_bq parts) of the configurations.

Parameters

configs (list of google_pandas_load.load_config.LoadConfig) – See google_pandas_load.load_config.LoadConfig for the format of one configuration.

Returns

A list of load results. The i-th element is the result of the load job configured by configs[i]. See google_pandas_load.loader.Loader.load() for the format of one load result.

Return type

list of (pandas.DataFrame or NoneType)
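
A sketch, assuming google_pandas_load.load_config.LoadConfig accepts the same arguments as google_pandas_load.loader.Loader.load() (see its documentation for the exact format):

from google_pandas_load import LoadConfig

configs = [
    LoadConfig(source='query', destination='dataframe',
               query='SELECT 1 AS x'),
    LoadConfig(source='query', destination='bq', data_name='a0',
               query='SELECT 2 AS y')]

results = loader.mload(configs)
df = results[0]             # dataframe built by the first config
assert results[1] is None   # a 'bq' destination returns None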

xload(source, destination, data_name=None, query=None, dataframe=None, write_disposition='WRITE_TRUNCATE', dtype=None, parse_dates=None, date_cols=None, timestamp_cols=None, bq_schema=None)[source]

It works like google_pandas_load.loader.Loader.load() but also returns extra information about the data and the load job’s execution. The prefix x is for extra.

Returns

An xload result res with the following attributes:

  • res.load_result (pandas.DataFrame or NoneType): The result of the load job.

  • res.data_name (str): The name of the loaded data.

  • res.duration (int): The load job’s duration in seconds.

  • res.durations (argparse.Namespace): A report providing the durations of each step of the load job. It has the following attributes:

    • res.durations.query_to_bq (int or NoneType): the duration in seconds of the query_to_bq part if any.

    • res.durations.bq_to_gs (int or NoneType): the duration in seconds of the bq_to_gs part if any.

    • res.durations.gs_to_local (int or NoneType): the duration in seconds of the gs_to_local part if any.

    • res.durations.local_to_dataframe (int or NoneType): the duration in seconds of the local_to_dataframe part if any.

    • res.durations.dataframe_to_local (int or NoneType): the duration in seconds of the dataframe_to_local part if any.

    • res.durations.local_to_gs (int or NoneType): the duration in seconds of the local_to_gs part if any.

    • res.durations.gs_to_bq (int or NoneType): the duration in seconds of the gs_to_bq part if any.

  • res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.

Return type

argparse.Namespace
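
A sketch of accessing the extra information (attribute names as listed above; the query is illustrative):

res = loader.xload(
    source='query',
    destination='dataframe',
    query='SELECT 1 AS x')

df = res.load_result               # the loaded dataframe
print(res.data_name)               # given or generated data name
print(res.duration)                # total duration in seconds
print(res.durations.query_to_bq)   # duration of the query_to_bq step
print(res.query_cost)              # query cost in US dollars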

xmload(configs)[source]

It works like google_pandas_load.loader.Loader.mload() but also returns extra information about the data and the mload job’s execution.

Parameters

configs (list of google_pandas_load.load_config.LoadConfig) – See google_pandas_load.load_config.LoadConfig for the format of one configuration.

Returns

The xmload result res with the following attributes:

  • res.load_results (list of (pandas.DataFrame or NoneType)): A list of load results.

  • res.data_names (list of str): The names of the data. The i-th element is the data_name attached to configs[i], either given as an argument or generated by the loader.

  • res.duration (int): The mload job’s duration.

  • res.durations (argparse.Namespace): A report providing the durations of each step of the mload job.

  • res.query_cost (float or NoneType): The query cost in US dollars of the query_to_bq part if any.

  • res.query_costs (list of (float or NoneType)): The query costs in US dollars of the mload job. The i-th element is the query cost of the load job configured by configs[i].

Return type

argparse.Namespace
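
Continuing the mload sketch above:

xres = loader.xmload(configs)

print(xres.data_names)    # one data name per config
print(xres.duration)      # duration of the whole mload job
print(xres.query_costs)   # one query cost per config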