feat: Added default AutoMLTabularTrainingJob column transformations #357
Conversation
chore: merge main into dev
Force-pushed from 40f354e to 435a98d
Force-pushed from 435a98d to 3300faa
raise ValueError("One of gcs_source or bq_source must be set.")

if gcs_source:
    dataset_metadata = {"input_config": {"gcs_source": {"uri": gcs_source}}}
These were wrong before
get_dataset_mock.return_value = gca_dataset.Dataset(
    display_name=_TEST_DISPLAY_NAME,
    metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
    metadata=_TEST_METADATA_TABULAR_GCS,
)
Mock with GCS source
get_dataset_mock.return_value = gca_dataset.Dataset(
    display_name=_TEST_DISPLAY_NAME,
    metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
    metadata=None,
)
Mock with no metadata
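Purely for illustration, a sketch of how the two mocks above differ; the constants and helper below are hypothetical, not the test module's actual definitions:

from unittest import mock

# Hypothetical test constants; the real test module defines its own values.
_TEST_DISPLAY_NAME = "my-tabular-dataset"
_TEST_METADATA_SCHEMA_URI_TABULAR = "gs://fake-schema-bucket/tabular.yaml"
_TEST_METADATA_TABULAR_GCS = {
    "input_config": {"gcs_source": {"uri": ["gs://my-bucket/data.csv"]}}
}


def _make_get_dataset_mock(metadata):
    """Return a mock get_dataset call yielding a Dataset-like object."""
    dataset = mock.MagicMock()
    dataset.display_name = _TEST_DISPLAY_NAME
    dataset.metadata_schema_uri = _TEST_METADATA_SCHEMA_URI_TABULAR
    dataset.metadata = metadata
    return mock.MagicMock(return_value=dataset)


# "Mock with GCS source" vs. "Mock with no metadata".
get_dataset_gcs_mock = _make_get_dataset_mock(_TEST_METADATA_TABULAR_GCS)
get_dataset_no_metadata_mock = _make_get_dataset_mock(None)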
sasha-gitg left a comment
Looks great. Tested column_names and it works nicely.
Left a few requested changes. Thanks!
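For context, a hedged example of exercising the column_names property the reviewer mentions; the project, location, and dataset ID below are made up:

from google.cloud import aiplatform

# Hypothetical project/location/dataset IDs, for illustration only.
aiplatform.init(project="my-project", location="us-central1")
ds = aiplatform.TabularDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890"
)

# column_names inspects the dataset's source to list its columns.
print(ds.column_names)

The snippets under review below are part of how those column names get read from the CSV header.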
header_line = line[:first_new_line_index]

# Split to make it an iterable
header_line = header_line.split("\n")
It may be safer to only include the first line, header_line.split("\n")[:1], to avoid possible parsing errors downstream.
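A self-contained sketch of that suggestion, with made-up CSV contents standing in for the downloaded header chunk:

import csv

# The downloaded chunk may end mid-row, so only the header line is trusted.
chunk = b"station_number,wban_number,year\n123,456,20"  # second row truncated
line = chunk.decode("utf-8")
first_new_line_index = line.find("\n")

header_line = line[:first_new_line_index]

# Split to make it an iterable; [:1] keeps at most the header row, avoiding
# parsing errors downstream if extra newlines ever slip through.
header_line = header_line.split("\n")[:1]

column_names = next(csv.reader(header_line))
print(column_names)  # ['station_number', 'wban_number', 'year']

With only one element in the list, next(csv.reader(header_line)) stays deterministic even if the chunk boundary shifts.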
training_task_definition = schema.training_job.definition.automl_tabular

if self._column_transformations is None:
Please log here that we are defaulting to 'auto' for all columns, since column_transformations was not provided.
Makes sense, will add.
INFO:google.cloud.aiplatform.training_jobs:No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
INFO:google.cloud.aiplatform.training_jobs:The column transformation of type 'auto' was set for the following columns: ['station_number', 'wban_number', 'year', 'month', 'day', 'num_mean_temp_samples', 'mean_dew_point', 'num_mean_dew_point_samples', 'mean_sealevel_pressure', 'num_mean_sealevel_pressure_samples', 'mean_station_pressure', 'num_mean_station_pressure_samples', 'mean_visibility', 'num_mean_visibility_samples', 'mean_wind_speed', 'num_mean_wind_speed_samples', 'max_sustained_wind_speed', 'max_gust_wind_speed', 'max_temperature', 'max_temperature_explicit', 'min_temperature', 'min_temperature_explicit', 'total_precipitation', 'snow_depth', 'fog', 'rain', 'snow', 'hail', 'thunder', 'tornado'].
@sasha-gitg Does this look okay or is it too verbose?
I thought it would be nice to show the names so the user can verify the columns.
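For reference, a hedged sketch of the defaulting-and-logging step under discussion; the function and variable names are assumptions rather than the SDK's exact internals:

import logging

_LOGGER = logging.getLogger(__name__)


def _default_column_transformations(column_names, target_column):
    """Build 'auto' transformations for every non-target column and log them."""
    _LOGGER.info(
        "No column transformations provided, so now retrieving columns from "
        "dataset in order to set default column transformations."
    )
    columns = [name for name in column_names if name != target_column]
    transformations = [{"auto": {"column_name": name}} for name in columns]
    _LOGGER.info(
        "The column transformation of type 'auto' was set for the following "
        "columns: %s.",
        columns,
    )
    return transformations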
Force-pushed from 8937abc to 9e66508
sasha-gitg left a comment
Looks great. Added a few comments. Thanks Ivan!
line = ""

try:
    logging.disable(logging.CRITICAL)
My preference is to be more precise by filtering only the module logs we're trying to suppress. Like so:
python-aiplatform/google/cloud/aiplatform/initializer.py, lines 172 to 176 at 857f63d
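The referenced snippet isn't rendered above; as a hedged sketch of the general idea, with a logger name that is an assumption rather than what initializer.py actually uses:

import logging

# Hypothetical module whose INFO output we want to silence temporarily.
noisy_logger = logging.getLogger("google.cloud.aiplatform.datasets.dataset")

previous_level = noisy_logger.level
noisy_logger.setLevel(logging.WARNING)
try:
    pass  # code whose log output should be suppressed goes here
finally:
    # Restore the original level so other callers see normal logging.
    noisy_logger.setLevel(previous_level)

Restoring the saved level in a finally block keeps the suppression scoped to the download step instead of disabling logging globally.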
from google.cloud.aiplatform import utils

from typing import List
import logging
Import should be up top with stdlib imports.
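A small illustration of the requested grouping (PEP 8 order: standard library imports first, then third-party/package imports):

# Standard library imports stay together at the top of the module.
import logging
from typing import List

# Package imports follow.
from google.cloud.aiplatform import utils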
* Bump google-cloud-storage min version to 1.32.0. This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes, a method introduced in 1.32.0.
* Bumped version in constraints-3.6.txt.
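A hedged sketch of that kind of mock, assuming google-cloud-storage >= 1.32.0; the bucket, file name, and CSV contents below are made up:

from unittest import mock

from google.cloud import storage

with mock.patch.object(storage.Blob, "download_as_bytes") as download_mock:
    # The patched method returns fake CSV bytes instead of reading from GCS.
    download_mock.return_value = b"col_a,col_b,col_c\n1,2,3\n"

    bucket = storage.Bucket(client=None, name="my-bucket")
    blob = storage.Blob(name="data.csv", bucket=bucket)

    header = blob.download_as_bytes().decode("utf-8").split("\n")[:1]
    assert header == ["col_a,col_b,col_c"]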
Testing code
fixes: https://b.corp.google.com/issues/181042526