Transforming data
You can create reusable and rule based data transformations as a part of your data pipeline. You can perform transformations as part of your data onboarding, or create reusable transformation data tasks. You can perform row-level transformations and create datasets that are either materialized as tables, or created as views that perform transformations on the fly.
You can perform explicit transformations per dataset, or create global rules that transform multiple datasets. You can also filter a dataset to create a subset of rows.
Creating a transformation data asset
The easiest way to create a transformation data task is to click ... on a storage data task and then selecting Transform data.
You can also click Add new in a data project and select Transform data. In this case you need to define which source data task to use.
-
Define your source data in Transform dataset.
Select source datasets and click Add to add them to Output.
You can add a SQL-based dataset by selecting a dataset, and clicking Add SQL-based dataset.
Tip noteYou can also add more datasets from other storage data assets by clicking Select source data. -
Make all required changes to the included datasets, such as transformations, filtering data, or adding columns.
For more information, see Managing datasets.
-
When you have added the transformations that you want, you can validate the datasets by clicking Validate datasets. If the validation finds errors, fix the errors before proceeding.
For more information, see Validating and adjusting the datasets.
-
Create a data model
Click Model to set the relationships between the included datasets.
For more information, see Creating a data model.
-
Click Prepare to prepare the data task and all required artifacts. This can take a little while.
-
When the status displays Prepared, you can run the data task .
Click ... and then Run.
The data task will now start creating datasets to transform the data.
Adding SQL-based datasets
You can include SQL-based tables in transformation tasks. A SQL-based table allows you to input a SQL SELECT query into a pipeline to define complex or simple transformations.
For more information, see Adding SQL-based datasets.
-
In Transform dataset, select the datasets to include in the query, and click Add SQL-based table.
Scheduling a transformation task
You can schedule a transformation task to be updated periodically. You can set a time based schedule, or set the task to run when input data tasks have completed running.
Click ... on a data task and select Scheduling to create a schedule. The default scheduling setting is inherited from the settings in the data project. For more information about default settings, see Transform default values.
You always need to set Scheduling to On to enable the schedule.
Time based schedules
You can use a time based schedule to run the task regardless of when the different input sources are updated.
-
Select At specific time in Run the data task.
You can set an hourly, daily, weekly or monthly schedule.
Event based schedules
You can use an event based schedule to run the task when input data tasks have completed running.
-
Select On specific event in Run the data task.
You can select if you want to run the task when any of the input tasks have completed successfully, or when any of a selection of input tasks have completed successfully.
Monitoring a transformation task
You can monitor the status and progress of a transformation task by clicking on Monitor.
For more information, see Monitoring a data task.
Transformation settings
You can set properties for the transformation data task .
-
Click Settings.
General settings
-
Database
Database to use in the data source.
-
Data task schema
You can change the name of the storage data task schema. Default name is store.
-
Internal schema
You can change the name of the internal storage schema. Default name is store__internal.
- Prefix for all tables and views
You can set a prefix for all tables and views created with this task.
Information noteYou must use a unique prefix when you want to use a database schema in several data tasks. -
Materialized
You can select to create tables (Materialized), or only create views that perform transformations on the fly (Non-materialized).
-
History
You can keep historical change data to let you easily recreate data as it looked at a specific point in time. You can use history views and live history views to see historical data.
Runtime settings
-
Parallel execution
You can set the maximum number of data connections for full loads to a number from 1 to 5.
-
Warehouse
The name of the cloud data warehouse.
Limitations
-
It is not possible to change data types in a transformation data task when the Non-materialized option is selected.
-
Non-materialized SQL-based data sets provide the full result set only, incremental filtering is not supported. Non-materialized SQL-based data sets also provide current data only and do not keep historical records.