Lake landing settings
The following Lake landing task settings apply to Qlik Open Lakehouse projects.
To configure settings for a Lake landing data task, open the landing task and click Settings in the toolbar.
The Settings: <Task-Name> dialog opens. The available settings are described below.
General
Update method
The landing task always starts with a Full Load. After the Full Load completes, you can keep the landed data up-to-date using one of the following methods:
- Change data capture (CDC)
The landed data is kept up-to-date using CDC (Change Data Capture) technology. CDC may not be supported by all data sources, and it does not capture DDL operations such as renaming columns or changes in metadata.
If your data also contains views or tables that do not support CDC, two data pipelines are created: one with all tables that support CDC, and another with all other tables and views, using Reload as the update method.
- Reload
All of the landed data is reloaded from the source. This is useful if your source does not support CDC, but can be used with any supported data source. You can schedule the reloads periodically.
Information note: This setting is not available when using a SaaS application Preview connector. Preview connectors are indicated by a button, both in the Create connection dialog and in the online help.
Folder to use
Select which folder to use when landing data to the staging area.
- Default folder
This creates a folder with the default name: <project name>/<data task name>.
- Root folder
Store data in the root folder of the storage.
- Folder
Specify a folder name to use.
Data uploading
File attributes
Configure the file settings for uploading data.
- Field delimiter
The field delimiter character used in the source data file. The default is ",".
- Null value
An optional default string to replace null values in the source data file.
- Record delimiter
The record delimiter character used in the source data file. The default is "\n".
- Quote character
The quote character used in the source data file. The default is """.
- Quote escape character
The quote escape character used in the source data file. The default is """.
- Maximum file size (KB)
The maximum size a file can reach before it is closed. Smaller files might be uploaded faster (depending on the network) and improve performance when used in conjunction with the parallel execution option. However, it is generally considered bad practice to clutter the database with small files.
- Compress files using
When selected, the CSV files are compressed using GZIP.
Information note: Requires Data Movement gateway 2023.5.16 or later.
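As a rough illustration of how these attributes map onto the staged CSV files, the sketch below uses Python's standard csv and gzip modules with the defaults listed above. It is not the product's internal writer; the file name and sample rows are made up.

```python
import csv
import gzip

# Hypothetical sample rows; None models a source NULL.
rows = [
    ["1", "Acme, Inc.", None],
    ["2", 'Says "hi"', "ok"],
]

# Defaults from the settings above: "," field delimiter, "\n" record delimiter,
# '"' as both quote and quote escape character, and GZIP compression when
# "Compress files using" is selected.
with gzip.open("landing_segment.csv.gz", "wt", newline="") as f:
    writer = csv.writer(
        f,
        delimiter=",",
        lineterminator="\n",
        quotechar='"',
        doublequote=True,   # escape quotes by doubling them, e.g. "" inside a field
        quoting=csv.QUOTE_MINIMAL,
    )
    for row in rows:
        # "Null value": replace source NULLs with an optional default string (empty here).
        writer.writerow(["" if v is None else v for v in row])
```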
Metadata
LOB columns
- Include LOB columns and limit column size to (KB)
You can choose to include LOB columns in the task and set the maximum LOB size. LOBs that are larger than the maximum size will be truncated.
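For example, with an 8 KB limit, truncation behaves roughly as in the following sketch (the limit value and function name are illustrative, not product code):

```python
LOB_LIMIT_KB = 8  # illustrative value for "limit column size to (KB)"

def truncate_lob(value: bytes) -> bytes:
    """LOBs larger than the configured maximum size are cut down to that size."""
    limit = LOB_LIMIT_KB * 1024
    return value[:limit]
```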
Full load
Performance tuning
- Maximum number of tables to load in parallel: Enter the maximum number of tables to load into the target at one time. The default value is 5 (see the sketch after this list).
- Transaction consistency timeout (seconds): Enter the number of seconds to wait for open transactions to close before beginning the Full Load operation. The default value is 600 (10 minutes). The full load will start after the timeout value is reached, even if there are transactions that are still open.
Information note: To replicate transactions that were open when Full Load started but were only committed after the timeout value was reached, you need to reload the target tables.
- Commit rate during full load: The maximum number of events that can be transferred together. The default value is 10000.
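To make the parallelism setting concrete, the following minimal sketch loads at most five tables at a time. It is not how the product schedules its loads; load_table and the table names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_TABLES_IN_PARALLEL = 5  # default for "Maximum number of tables to load in parallel"

def load_table(table_name: str) -> None:
    """Placeholder for a full load of a single table."""
    print(f"loading {table_name}")

tables = ["orders", "customers", "invoices", "items", "events", "returns"]

# At most MAX_TABLES_IN_PARALLEL full loads run at any one time;
# the remaining tables wait for a free slot.
with ThreadPoolExecutor(max_workers=MAX_TABLES_IN_PARALLEL) as pool:
    list(pool.map(load_table, tables))
```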
After full load completes
Create primary key or unique index: Select this option if you want to delay creation of the primary key or unique index on the data platform until after full load completes.
Error handling
Data error handling is supported with the Change Data Capture (CDC) update method only.
For data truncation errors: Select what you want to happen when a truncation occurs in one or more specific records. You can select one of the following from the list:
- Ignore: The task continues and the error is ignored.
- Suspend table: The task continues, but data from the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
For other data errors: Select what you want to happen when an error occurs in one or more specific records. You can select one of the following from the list:
- Ignore: The task continues and the error is ignored.
- Suspend table: The task continues, but data from the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
Escalate error handling when other data errors reach (per table): Select this check box to escalate error handling when the number of non-truncation data errors (per table) reaches the specified amount. Valid values are 1-10,000.
Escalation action: Choose what should happen when error handling is escalated. Note that the available actions are dependent on the action selected from the For other data errors drop-down list described above.
- Suspend table (default): The task continues, but data from the table with the error record is moved into an error state and its data is not landed.
- Stop task: The task is stopped and manual intervention is required.
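Conceptually, the escalation threshold works as a per-table counter of non-truncation data errors; once the counter reaches the configured amount, the escalation action replaces the normal action. The sketch below illustrates that reading with made-up names and values; it is not the product's code.

```python
from collections import defaultdict

ESCALATION_THRESHOLD = 50            # "Escalate error handling when other data errors reach (per table)"
ON_DATA_ERROR = "ignore"             # "For other data errors"
ESCALATION_ACTION = "suspend table"  # "Escalation action"

data_errors = defaultdict(int)       # non-truncation data errors seen per table

def handle_data_error(table: str) -> str:
    """Return the action to take for a data error in the given table."""
    data_errors[table] += 1
    if data_errors[table] >= ESCALATION_THRESHOLD:
        return ESCALATION_ACTION     # suspend table (default) or stop task
    return ON_DATA_ERROR             # ignore, suspend table, or stop task
```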
Table errors
Number of retries before returning a table error
This option allows you to control when the table error handling policy is triggered. By default, after a table error is encountered, the table is suspended or the task is stopped (according to the selected action) after three retries. Sometimes a table error occurs due to planned maintenance of the SaaS application. In these cases, the default number of retries might not be enough for the maintenance to be completed before the table error handling policy is triggered. This also depends on the task's scheduling interval, as a retry is performed each time the task runs. For example, if you schedule a task to run every hour and the SaaS application is taken offline for maintenance just as the task starts to run, the default of three retries allows the SaaS application to be offline for up to three hours without the table error handling policy being triggered. A longer maintenance window would require you to increase the number of retries (or change the scheduling) to prevent the table error handling policy from being triggered.
To sum up, if you are aware that your SaaS application undergoes periodic maintenance, best practice is to increase the number of retries according to the scheduling so that the maintenance can be completed without the table error handling policy being triggered, as illustrated in the sketch after the notes below.
- This option will only be shown for tasks configured with a Lite or Standard SaaS application connector.
- If you are using Data Movement gateway, version 2024.11.70 or later is required.
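Continuing the retry guidance above: the tolerated maintenance window is roughly the number of retries multiplied by the scheduling interval, so you can size the retry count from a known maintenance window. The helper below is a back-of-the-envelope illustration only.

```python
import math

def retries_needed(maintenance_hours: float, schedule_interval_hours: float) -> int:
    """Minimum retry count so a maintenance window does not trigger table error handling."""
    return math.ceil(maintenance_hours / schedule_interval_hours)

# A task scheduled hourly with the default of 3 retries tolerates roughly 3 hours of downtime;
# a 6-hour maintenance window needs at least 6 retries (or a wider schedule).
print(retries_needed(6, 1))  # 6
```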
When encountering a table error: Select one of the following from the drop-down list:
- Suspend table (default): The task continues, but data from the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
Escalate error handling when table errors reach (per table): Select this check box to escalate error handling when the number of table errors (per table) reaches the specified amount. Valid values are 1-10,000.
Escalation action: The escalation policy for table errors is set to Stop task and cannot be changed.
Environmental errors
- Maximum retry count: Select this option and then specify the maximum number of attempts to retry a task when a recoverable environmental error occurs. After the task has been retried the specified number of times, the task is stopped and manual intervention is required.
To never retry a task, clear the check box or specify "0".
To retry a task an infinite number of times, specify "-1".
- Interval between retry attempts (seconds): Use the counter to select or type the number of seconds that the system waits between attempts to retry a task. Valid values are 0-2,000.
- Increase retry interval for long outages: Select this check box to increase the retry interval for long outages. When this option is enabled, the interval between each retry attempt is doubled, until the Maximum retry interval is reached (and continues retrying according to the specified maximum interval).
- Maximum retry interval (seconds): Use the counter to select or type the number of seconds to wait between attempts to retry a task when the Increase retry interval for long outages option is enabled. Valid values are 0-2,000.
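Taken together, these settings amount to a capped exponential backoff when Increase retry interval for long outages is enabled. The sketch below shows that reading; the names and the use of EnvironmentError as a stand-in for a recoverable environmental error are illustrative, not a documented API.

```python
import time

MAX_RETRY_COUNT = 5            # "Maximum retry count" (-1 = retry forever, 0 = never retry)
RETRY_INTERVAL = 5             # "Interval between retry attempts (seconds)", 0-2,000
INCREASE_FOR_LONG_OUTAGES = True
MAX_RETRY_INTERVAL = 1800      # "Maximum retry interval (seconds)", 0-2,000

def run_with_retries(run_task) -> None:
    attempt = 0
    interval = RETRY_INTERVAL
    while True:
        try:
            run_task()
            return
        except EnvironmentError:                  # stand-in for a recoverable environmental error
            attempt += 1
            if MAX_RETRY_COUNT != -1 and attempt > MAX_RETRY_COUNT:
                raise                             # task stops; manual intervention required
            time.sleep(interval)
            if INCREASE_FOR_LONG_OUTAGES:
                # Double the wait each time, capped at the maximum retry interval.
                interval = min(interval * 2, MAX_RETRY_INTERVAL)
```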
Change processing tuning
Transaction offload tuning
Offload transactions in progress to disk if: Transaction data is usually kept in memory until it is fully committed to the source or target. However, transactions that are larger than the allocated memory or that are not committed within the specified time limit will be offloaded to disk.
- Total memory size for all transactions exceeds (MB): The maximum size that all transactions can occupy in memory before being offloaded to disk. The default value is 1024.
- Transaction duration exceeds (seconds): The maximum time that each transaction can stay in memory before being offloaded to disk. The duration is calculated from the time Qlik Talend Data Integration started capturing the transaction. The default value is 60.
Batch tuning
- Minimum number of changes per transaction: The minimum number of changes to include in each transaction. The default value is 1000.
Information note: The changes will be applied to the target either when the number of changes is equal to or greater than the Minimum number of changes per transaction value, or when the Maximum time to batch transactions before applying (seconds) value described below is reached, whichever occurs first. Because the frequency of changes applied to the target is controlled by these two parameters, changes to the source records may not immediately be reflected in the target records.
- Maximum time to batch transactions before applying (seconds): The maximum time to collect transactions in batches before declaring a timeout. The default value is 1.
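Put differently, a batch is applied to the target as soon as either the change count or the time limit is reached, whichever comes first. The following minimal sketch captures that rule with illustrative names; it is not product code.

```python
import time

MIN_CHANGES_PER_TRANSACTION = 1000   # "Minimum number of changes per transaction"
MAX_BATCH_SECONDS = 1                # "Maximum time to batch transactions before applying (seconds)"

def should_apply_batch(pending_changes: int, batch_started_at: float) -> bool:
    """Apply the batch when either threshold is met, whichever occurs first."""
    enough_changes = pending_changes >= MIN_CHANGES_PER_TRANSACTION
    timed_out = (time.time() - batch_started_at) >= MAX_BATCH_SECONDS
    return enough_changes or timed_out
```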
Miscellaneous tuning
- Statements cache size (number of statements): The maximum number of prepared statements to store on the server for later execution (when applying changes to the target). The default is 50. The maximum is 200.
- DELETE and INSERT when updating a primary key column: This option requires full supplemental logging to be turned on in the source database.
Schema evolution
Select how to handle the following types of DDL changes in the schema. When you have changed schema evolution settings, you must prepare the task again. The table below describes which actions are available for the supported DDL changes.
| DDL change | Apply to target | Ignore | Suspend table | Stop task |
|---|---|---|---|---|
| Add column | Yes | Yes | Yes | Yes |
| Rename column | No | No | Yes | Yes |
| Rename table | No | No | Yes | Yes |
| Change column data type | No | Yes | Yes | Yes |
| Create table (if you used a Selection rule to add datasets that match a pattern, new tables that match the pattern will be detected and added) | Yes | Yes | No | No |
Character substitution
You can substitute or delete source characters in the target database and/or you can substitute or delete source characters that are not supported by a selected character set.
All characters must be specified as Unicode code points.
- Character substitution will also be performed on the Control tables.
- Invalid values will be indicated by a red triangle in the top right of the table cell. Hovering your mouse cursor over the triangle will show the error message.
- Any table-level or global transformations defined for the task will be performed after the character substitution has been completed.
- Substitution actions defined in the Substitute or Delete Source Characters table are performed before the substitution action defined in the Substitute or Delete Source Characters Unsupported by the Selected Character Set table.
- Character substitution does not support LOB data types.
Substituting or deleting source characters
Use the Substitute or delete source characters table to define replacements for specific source characters. This may be useful when the Unicode representation of a character is different on the source and target platforms. For example, on Linux, the minus character in the Shift_JIS character set is represented as U+2212, but on Windows it is represented as U+FF0D.
| To | Do This |
|---|---|
| Define substitution actions | |
| Edit the specified source or target character | Click |
| Delete entries from the table | Click |
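To make the mapping concrete, the following Python sketch shows the kind of code-point substitution such a table describes, using the U+2212 to U+FF0D example above. It illustrates the concept only and is not how the product applies substitutions.

```python
# Map source code points to target code points (or to None to delete them).
# Example from above: the minus sign U+2212 becomes the fullwidth hyphen-minus U+FF0D.
substitutions = {0x2212: 0xFF0D}

source_value = "\u2212100"                     # "−100" containing U+2212
target_value = source_value.translate(substitutions)
print(target_value)                            # "－100", now using U+FF0D
```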
Substituting or deleting source characters unsupported by the selected character set
Use the Unsupported source characters by character set table to define a single replacement character for all characters not supported by the selected character set.
| To | Do This |
|---|---|
| Define or edit a substitution action | |
| Disable the substitution action | Select the blank entry from the Character Set drop-down list. |
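Conceptually, this is a fallback rule: any character the selected character set cannot represent is replaced with the single substitution character. The Python sketch below illustrates the idea; the function name, the shift_jis example, and the "?" replacement are assumptions, not product behavior.

```python
def substitute_unsupported(text: str, charset: str = "shift_jis", replacement: str = "?") -> str:
    """Replace every character the selected character set cannot encode with one replacement character."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)   # supported by the selected character set: keep as-is
            out.append(ch)
        except UnicodeEncodeError:
            out.append(replacement)
    return "".join(out)

print(substitute_unsupported("price: €100"))   # "price: ?100", since '€' is not in Shift_JIS
```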
More options
These options are not exposed in the UI as they are only relevant to specific versions or environments. Consequently, do not set these options unless explicitly instructed to do so by Qlik Support or product documentation.
To set an option, simply copy the option into the Add feature name field and click Add. Then set the value or enable the option according to the instructions you received.
Loading dataset segments in parallel
During full load, you can accelerate the loading of large datasets by splitting the dataset into segments, which will be loaded in parallel. Tables can be split by data ranges, all partitions, all subpartitions, or specific partitions.
For more information, see Landing dataset segments in parallel.
Scheduling CDC tasks when working without Data Movement gateway
Data Movement gateway is not supported with a Qlik Talend Cloud Starter subscription and is optional with other subscription tiers. When working without Data Movement gateway, you keep the target data up-to-date by setting a scheduling interval. The schedule determines how often the target datasets are updated with changes to the source datasets. Whereas the schedule determines the update frequency, the dataset type determines the update method. If the source datasets support CDC (Change data capture), only the changes to the source data are replicated and applied to the corresponding target tables. If the source datasets do not support CDC (for example, Views), changes are applied by reloading all of the source data to the corresponding target tables. If some of the source datasets support CDC and some do not, two separate sub-tasks are created: one for reloading the datasets that do not support CDC, and the other for capturing the changes to datasets that do support CDC. In this case, to ensure data consistency, it is strongly recommended to set the same schedule for both sub-tasks.
For information about minimum scheduling intervals according to data source type and subscription tier, see Minimum allowed scheduling intervals.
To change the scheduling:
- Open your data project and then do one of the following:
  - In tasks view, click on the data task and select Scheduling.
  - In pipeline view, click on the data task and select Scheduling.
  - Open the landing task and click the Scheduling toolbar button.
- Change the scheduling settings as needed and then click OK.