Onboarding data

The first step of creating a data pipeline in a Qlik Talend Data Integration project is onboarding the data. This involves transferring the data from the data source and storing datasets in read-optimized format. You can update data with continuous change handling, or use scheduled reloads.

You set up onboarding in a single operation, but it is performed in two steps:

Landing the data

This involves transferring the data continuously from the on-premises data source to a landing area, using a Landing data task.

For more information, see Landing data from data sources.

You can also land data to a lakehouse, where the data is landed to cloud file storage. This is available for Snowflake projects, where the landing target is set to Cloud file storage.

For more information, see Landing data to a lakehouse.

Storing datasets

This involves reading the initial load of landing data or incremental loads, and applying the data in read-optimized format using a Storage data task.

For more information, see Storing datasets.

When you have onboarded the data, you can use the stored datasets in several ways.

You can use the datasets in an analytics app.
You can create transformations.
You can create a data mart.

Onboard data

You start onboarding data in a project. Datasets will be stored in the cloud data warehouse defined in the project. For more information about projects, see Creating a data pipeline project.

In your project, click Create and then Onboard data.

Tip: You can also click an existing source in the project, and then click Onboard data.
Add a Name and Description for the onboarding.

Click Next.
Select the source connection.

You can select an existing source connection or create a new connection to the source.

For more information, see Setting up connections to data sources.

Click Next.
Select data to load.

For more information, see Selecting data.

Click Next.

Settings is displayed, where you can select the update method and history settings.
In Update method, select which method to use to update data:
- Change data capture (CDC)
  
  If your data also contains views, or tables that do not support CDC, two data pipelines will be created: one pipeline with all tables that support CDC, and another pipeline with all other tables and views, using Reload and compare.
- Reload and compare
In History, select whether you want to replicate the history of previous data in addition to current data.

Click Next when you are ready.
If you are not using Data Movement gateway to access your data source, the following section will be displayed in the settings:

Replication scheduler
- Replicate data every: You can schedule how often to capture changes from the data source, and set a Start time and Start date. If the source datasets support CDC (Change data capture), only the changes to the source data will be replicated and applied to the corresponding target tables. If the source datasets do not support CDC (for example, views), changes will be applied by reloading all the source data to the corresponding target tables. If some of the source datasets support CDC and some do not, two separate sub-tasks will be created: one for reloading the datasets that do not support CDC, and the other for capturing the changes to datasets that do support CDC.
  
  The onboarding setup wizard allows you to schedule an hourly interval. After you have completed the onboarding wizard, you can explore different scheduling options, as described in Data replication task settings.
For information about minimum scheduling intervals according to data source type and subscription tier, see Minimum allowed scheduling intervals.
Preview the data tasks that are created to onboard data, and rename them if you prefer.

Tip: The names are used when naming database schemas in the storage data task. Consider using unique names to avoid conflicts with data tasks in other projects that use the same data platform.
Select whether you want to open any of the data tasks that are created, or return to the project.

When you are ready, click Finish.

The onboarding data tasks are now created. To start replicating data, you need to:

Prepare and run the landing data task.

For more information, see Landing data from data sources.
Prepare and run the storage data task.

For more information, see Storing datasets.
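
The way the Update method setting splits a mixed selection can be pictured with a short sketch. This is only an illustration of the rule described in the Settings step, not how the product implements it: the `SourceDataset` class, its `is_view` and `supports_cdc` fields, and the sample dataset names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDataset:
    # Hypothetical description of a selected source dataset.
    name: str
    is_view: bool
    supports_cdc: bool


def split_by_update_method(datasets):
    """Group datasets the way the Settings step describes for CDC.

    Tables that support CDC go into one pipeline; views and all other
    tables go into a second pipeline that uses Reload and compare.
    """
    plan = {"Change data capture (CDC)": [], "Reload and compare": []}
    for dataset in datasets:
        if dataset.supports_cdc and not dataset.is_view:
            plan["Change data capture (CDC)"].append(dataset.name)
        else:
            plan["Reload and compare"].append(dataset.name)
    return plan


# Hypothetical selection: two CDC-capable tables, one view, one table without CDC.
datasets = [
    SourceDataset("orders", is_view=False, supports_cdc=True),
    SourceDataset("orders_view", is_view=True, supports_cdc=False),
    SourceDataset("customers", is_view=False, supports_cdc=True),
    SourceDataset("legacy_log", is_view=False, supports_cdc=False),
]

plan = split_by_update_method(datasets)
for method, names in plan.items():
    print(method, "->", names)
```

If every selected dataset supports CDC, the second group stays empty, which corresponds to a single pipeline being created.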

Selecting data

You can select specific tables or views, or use selection rules to include or exclude groups of tables.

If the selection includes views, CDC is not supported.

Use % as a wildcard to define selection criteria for schemas and tables.

%.% defines all tables in all schemas.
Public.% defines all tables in the schema Public.
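
To make the wildcard examples above concrete, the following sketch evaluates a schema.table pattern against a small list of tables. It is a minimal illustration under assumptions, not the product's matching logic: it assumes that % matches any run of characters, that matching is case-sensitive, and that the pattern contains a single dot separating schema from table. The function names and the sample catalog are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern in which % matches any run of characters."""
    # Escape the literal pieces so that only % acts as a wildcard.
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts))


def matches_criteria(criteria: str, schema: str, table: str) -> bool:
    """Check a schema and table name against a 'schema.table' pattern."""
    # Assumption: the first dot separates the schema pattern from the table pattern.
    schema_pattern, table_pattern = criteria.split(".", 1)
    return (
        wildcard_to_regex(schema_pattern).fullmatch(schema) is not None
        and wildcard_to_regex(table_pattern).fullmatch(table) is not None
    )


# Hypothetical source catalog of (schema, table) pairs.
catalog = [("Public", "orders"), ("Public", "customers"), ("Sales", "invoices")]

all_tables = [(s, t) for s, t in catalog if matches_criteria("%.%", s, t)]
public_tables = [(s, t) for s, t in catalog if matches_criteria("Public.%", s, t)]

print(all_tables)
print(public_tables)
```

Here `%.%` selects every entry in the sample catalog, while `Public.%` selects only the two tables in the schema Public.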

Selection criteria gives you a preview based on your selections.

You can now either:

- Create a rule to include or exclude a group of tables based on the selection criteria.
  
  Click Add rule from selection criteria to create a rule, and select either Include or Exclude.
  
  You can see the rule under Selection rules.
- Select one or more datasets, and click Add selected datasets.
  
  You can see the added datasets under Explicitly selected datasets.

Selection rules only apply to the current set of tables and views, not to tables and views that are added in the future.

Related learning:

Using Qlik Cloud Data Integration to onboard and transform data
