Early Access: The content on this website is provided for informational purposes only in connection with pre-General Availability Qlik Products.
All content is subject to change and is provided without warranty.
Skip to main content Skip to complementary content

Dataset architecture in a Qlik Open Lakehouse

When you generate datasets in a Qlik Open Lakehouse, storage tables, change tables, and views are generated automatically by Qlik Talend Data Integration. You can query these views to access current and historical data stored in Iceberg open table format.

An intuitive and guided user interface helps you build, model, and execute data pipelines. You can automatically generate schemas for operational data stores (ODS) and historical data stores (HDS) without manual coding.

Architecture for a Qlik Open Lakehouse data pipeline using Qlik Data Gateway - Data Movement. Data lands in S3, and is stored an Iceberg open lakehouse. Optionally, as shown below, tables can be mirrored to a data warehouse.

Schemas

Artifacts are generated in an internal schema and a data task schema.

  • The internal schema contains a physical table with multiple partitions.

  • The data task schema contains the views that you can use to consume the data.

    When a schema is associated with more than one data task, each data task must use a unique prefix for tables and views. You can set the prefix in the data task settings.

Only internal schemas are checked for naming conflicts. For other schemas, you must ensure there are no naming conflicts in the table names. The best practice is to name the internal schema the same name as the data task schema with _internal added. This ensures that every schema and prefix combination is unique.

Information noteAll tables and views are managed by Qlik Talend Data Integration. Do not alter the data using other tools.

Tables

For each source table, a table is created in the internal schema with the suffix _internal, using the format:

<INTERNAL_SCHEMA>.<TABLE_NAME>_internal

This table contains five partitions, each serving a distinct purpose in how the data is processed and stored. Each partition leverages a subset of the table’s columns based on its role in the data lifecycle.

Information noteThe table in the internal schema is subject to change without notice. You should primarily use views to consume data where possible.

The internal dataset is structured to efficiently manage incoming data from your source system, supporting real-time ingestion and historical tracking. The flow of data is managed through several specialized partitions, each described below:

  • Changes partition – Real-time ingestion

    All changes from the source system—inserts, updates, and deletes—are first appended to the Changes partition.

    • Acting as a raw changelog, the partition provides a complete feed of source activity before any transformations occur.

    • Each change is tagged with a type (I, U, or D), indicating how it is processed.

    • Events are captured from the source as they occur. The partition is updated in near real-time.

    • The Changes partition is available when the following settings are applied in the Lake landing data task.

      • In the General tab, the Update method is set to CDC.

      • Full load is enabled.

  • Asset_state partition - Tracking progress

    As changes are ingested, the Asset_state partition logs two key timestamps:

    • When changes were last appended to the Changes partition.

    • When changes were last applied to the Current partition.

    This provides full visibility into the synchronization pipeline and helps with monitoring or troubleshooting delays.

  • Current partition (ODS) – Latest replica of the source

    Changes from the Changes partition are periodically applied to the Current partition to maintain an up-to-date, q uery-optimized copy of the source data.

    • A background task runs automatically to apply those updates. This is based on the volume of accumulated changes, rather than a fixed schedule.

    • The initial full load is written directly to the Current partition.

    • This partition reflects the current state of the data and is designed for efficient querying.

  • Prior partition (HDS) – Historical data

    When an a record is updated or deleted in the Current partition, a copy of the previous version is written to the Prior partition. 

    • Records contain metadata, including change history and the date range when the data is valid.

    • This supports Type 2 slowly changing dimension (SCD2).

    • The Prior partition is available when Keep historical records and archive of change records is enabled in the General tab of the Storage data task settings.

    • When an updated version of a record enters the Current partition, the previous record is moved to the Prior partition for historical tracking.

  • Reload partition

    The Reload partition acts as a temporary staging area during manual or scheduled full reloads:

    • New data is first written to the Reload partition.

    • Regular maintenance tasks clear any pending changes before they are paused.

    • Data is compared with the Current partition, and only the differences are moved to the Current partition.

    • The Reload partition is cleared when the operation completes.

    This process ensure minimal disruption to the main dataset during full reload operations.

Current partition

Field Type Description
hdr__key_hash VARBINARY (20) Hash of all record primary keys. Hash format is SHA1. Columns are separated by a backspace character.
hdr__from_timestamp TIMESTAMP

Timestamp in UTC:

  • For data taken from a full load, this is the full load start time.

  • For a change coming from the Changes partition, this is the timestamp field of the record.

hdr__operation VARCHAR (1)

Latest operation of this record.

  • D - deleted from Changes partition.

  • U - updated from Changes partition.

  • I - inserted from the Changes partition.

  • L - inserted by the full load task.

  • d - deleted from compare and apply.

  • u - updated from compare and apply.

  • i - inserted from compare and apply.

hdr__inserted_timestamp TIMESTAMP UTC timestamp of the first time the key was added. When using full load, the start time of the full load.
hdr__modified_timestamp TIMESTAMP UTC timestamp of when the last update was made.

Prior partition

Field Type Description
hdr__key_hash VARBINARY (20) Hash of all record primary keys.
hdr__from_timestamp TIMESTAMP Timestamp in UTC.
hdr__to_timestamp TIMESTAMP Timestamp in UTC.
hdr__operation STRING (1)

Latest operation of this record.

  • D - deleted from Changes partition.

  • U - updated from Changes partition.

  • I - inserted from the Changes partition.

  • L - inserted by the full load task.

  • d - deleted from compare and apply.

  • u - updated from compare and apply.

  • i - inserted from compare and apply.

hdr__was_current_from_timestamp TIMESTAMP Timestamp in UTC for the first time record was current.
hdr__was_current_to_timestamp TIMESTAMP Timestamp in UTC for the last time record was current.

Changes partition

Field Type Description
hdr__change_identifier VARCHAR (50)

The change identifier is a string with two parts:

  • Batch run start timestamp in milliseconds since 1/1/1970 (10 digit string).

  • Change sequence from the replication gateway (35 characters).

hdr__operation VARCHAR (1)

Latest operation of this record.

  • D - deleted from the Changes partition.

  • U - updated from the Changes partition.

  • I - inserted from the Changes partition.

  • L - inserted by the full load task.

  • d - deleted from compare and apply.

  • u - updated from compare and apply.

  • i - inserted from compare and apply.

hdr__timestamp TIMESTAMP Timestamp in UTC.
hdr__key_hash BINARY (20) Hash of all record primary keys.
hdr__inserted_timestamp TIMESTAMP UTC timestamp of the time when the change was processed by Qlik.

Asset_state partition

Field Type Description
hdr__apply_change_identifier VARCHAR (50) The change identifier of the latest change applied to the Current partition.
hdr__copy_change_identifier VARCHAR (50) The change identifier of the latest change appended to the Changes partition.

Reload partition

The columns used by the Reload partition are identical to the Current partition.

Views

All views created are updated in near real time. The following views are available to simplify querying and reporting:

Current

The Current view reflects the most recent state of the data. This view represents a replica of the source table, updated in near-real time. It merges data from the Current and Changes partitions.

Name: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME>

No header columns are added to the table structure.

History

A History view is generated in the data asset schema for each selected source table when History is enabled in the data task settings. The History view merges data from the Prior and Changes partitions. It provides a complete timeline of changes, ideal for audit or historical analysis.

Name: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME>_<Suffix for history views>

The following header fields are added to the History view:

Field Type Description
hdr__key_hash BINARY (20) Hash of all record primary keys.
hdr__from_timestamp TIMESTAMP The timestamp when the change occurred in the source system. This reflects when the user originally made the change.
hdr__to_timestamp TIMESTAMP The timestamp when the change was reversed or updated in the source system.
hdr__store VARCHAR (10)

This indicates where the record resides:

  • CURRENT - if the record resides in the Current partition.

  • PRIOR - if the record resides in the Prior partition with historical data.

hdr__operation STRING (1)

Latest operation of this record.

  • D - deleted from the Changes partition.

  • U - updated from the Changes partition.

  • I - inserted from the Changes partition.

  • L - inserted by the full load task.

  • d - deleted from compare and apply.

  • u - updated from compare and apply.

  • i - inserted from compare and apply.

hdr__deleted BIT Indicates if the record is soft deleted, based on if hdr__operation is D or d.
hdr__was _current_from_timestamp TIMESTAMP The time (UTC) when this row entered the Current partition. This is typically when the storage job ran and the change was applied.
hdr__was _current_to_timestamp TIMESTAMP The time (UTC) when the row was removed from the Current table, for example, due to a newer version.

Information noteThe Current and History views rely on the Changes partition and are always near real-time. However, if you experience slow query performance, it may be due to a large number of changes that have not yet been applied into the Current or Prior partitions. Unmerged deltas require more processing at query time, making access slower.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!