Qlik Open Lakehouse architecture
Qlik Open Lakehouse provides a fully managed, end-to-end solution in Qlik Talend Cloud to ingest, process, and optimize data in an Iceberg-based lakehouse. This solution delivers low-latency query performance and efficient data operations at scale.
The Qlik Open Lakehouse architecture combines secure communication, scalable compute, and efficient data processing to deliver a modern lakehouse experience. Qlik Open Lakehouse leverages AWS-native components, including EC2, S3, and Kinesis.
Core components
The following components are required to create a Qlik Open Lakehouse.
Data Movement Gateway (CDC)
The data movement gateway runs in your on-premises or cloud environment. It captures changes from source systems, such as RDBMS, SAP, or mainframes, and sends the data to an Amazon S3 landing zone.
Network integration agent (EC2 instance)
The network integration agent is an EC2 instance that facilitates secure communication between Qlik services in the cloud and the lakehouse clusters within your environment. The agent is automatically deployed as an On-Demand Instance during the network integration process and is fully managed by Qlik. New versions are deployed automatically on release.
When the network integration functions correctly, a status of Connected is displayed in the Lakehouse clusters view in the Administration Activity Center. The status changes to Disconnected if connectivity issues arise.
Lakehouse cluster (EC2 Auto Scaling group)
The lakehouse cluster is a group of AWS EC2 instances responsible for data processing. The cluster instances coordinate and execute the workloads that process incoming data from the landing area and store the processed data in the target location in Iceberg format.
A lakehouse cluster with a single AWS Spot Instance is automatically created during the setup of your network integration. You can create and manage additional clusters to support your ongoing lakehouse requirements. When you configure a cluster, you grant Qlik permission to create, start, stop, scale, or roll back the servers to fulfill data processing requirements. Each cluster is associated with a single network integration, though multiple clusters can run within the same network integration. A single cluster can run many lakehouse tasks.
An AWS Spot Instance uses spare Amazon EC2 capacity at a lower cost than regular instances but can be interrupted by AWS with little notice. By default, Qlik provisions ephemeral Spot Instances for data processing. If insufficient Spot Instances are available in the AWS Spot market, Qlik automatically uses On-Demand Instances to ensure continuity, and reverts to Spot Instances when they become available. The lakehouse cluster technology is designed to transition gracefully between Spot and On-Demand Instances, moving jobs between nodes automatically, without manual intervention. In the cluster settings, you can configure how many Spot and On-Demand Instances the cluster should use. Using Spot Instances helps reduce the ongoing compute costs of your Qlik Open Lakehouse.
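The Spot/On-Demand split maps directly onto the AWS Auto Scaling concept of a mixed instances policy. The following boto3 sketch is illustrative only: Qlik provisions and manages the real cluster for you, and the group, launch template, subnet, and instance types below are placeholder assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Illustrative only: Qlik manages the actual Auto Scaling group.
# "lakehouse-cluster" and "lakehouse-launch-template" are placeholders.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="lakehouse-cluster",
    MinSize=1,
    MaxSize=8,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-0123456789abcdef0",  # placeholder subnet
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "lakehouse-launch-template",
                "Version": "$Latest",
            },
            # Several instance types widen the Spot pools AWS can draw from.
            "Overrides": [
                {"InstanceType": "m5.2xlarge"},
                {"InstanceType": "m6i.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            # Zero On-Demand base/percentage means capacity is requested
            # from the Spot market first, with On-Demand as the fallback.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```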
In addition to defining the number of Spot and On-Demand Instances to use, you can configure a scaling strategy that best suits the workload and budget for your project. The following scaling strategies can be applied to a cluster (a conceptual sketch follows the list):

- Low cost: Ideal for development or QA environments, and for workloads that do not depend on fresh, real-time data. Qlik strives to keep the cost as low as possible, which can result in occasional periods of high latency.
- Low latency: Designed for non-mission-critical workloads where near real-time data freshness is acceptable. While this strategy aims for low latency, brief latency spikes can occur.
- Consistent low latency: Suitable for production environments with high-scale data that requires real-time freshness. Qlik proactively scales the instances to ensure low latency, which can incur higher costs.
- No scaling: A good option for workloads that process a consistent volume of data. Select this option to retain a static number of instances with no automatic scaling and predictable costs.
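The strategies differ mainly in how aggressively capacity tracks backlog. The sketch below is purely conceptual (it is not Qlik's actual scaling algorithm); it shows one way a policy could trade processing lag against cost, with all thresholds invented for illustration.

```python
def target_instances(strategy: str, current: int, lag_seconds: float) -> int:
    """Conceptual illustration only -- not Qlik's actual scaling algorithm."""
    if strategy == "no_scaling":
        # Static cluster: predictable cost, no automatic scaling.
        return current
    if strategy == "low_cost":
        # Tolerate lag; only add capacity under sustained, severe backlog.
        if lag_seconds > 900:
            return current + 1
        return max(current - 1, 1)  # shrink aggressively to save cost
    if strategy == "low_latency":
        # React to moderate lag; brief spikes remain possible.
        return current + 1 if lag_seconds > 120 else current
    if strategy == "consistent_low_latency":
        # Scale proactively, keeping headroom at a higher cost.
        if lag_seconds > 30:
            return current + 2
        return current + 1 if lag_seconds > 10 else current
    raise ValueError(f"unknown strategy: {strategy}")
```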
Kinesis stream (Workload coordination)
Qlik requires a Kinesis stream to collate and relay the state of each server in the lakehouse cluster. The servers report the status of tasks and operational metrics, such as CPU and memory, directly to Kinesis, because the servers do not communicate with one another. Each server polls data from the Kinesis stream to discover information about the other servers within the cluster. This information exchange enables the synchronization of work.
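To make the publish/poll pattern concrete, here is a minimal boto3 sketch; the stream name, single-shard handling, and payload fields are illustrative assumptions, not Qlik's actual schema.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "lakehouse-coordination"  # placeholder stream name


# Each node publishes its own status; nodes never talk to each other directly.
def report_status(node_id: str, cpu: float, memory: float, tasks: list[str]) -> None:
    record = {"node": node_id, "cpu": cpu, "memory": memory,
              "tasks": tasks, "ts": time.time()}
    kinesis.put_record(StreamName=STREAM,
                       Data=json.dumps(record).encode(),
                       PartitionKey=node_id)


# Each node also polls the stream to learn what every other node reported.
def poll_cluster_state() -> list[dict]:
    shard = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]
    it = kinesis.get_shard_iterator(StreamName=STREAM,
                                    ShardId=shard["ShardId"],
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    out = kinesis.get_records(ShardIterator=it, Limit=100)
    return [json.loads(r["Data"]) for r in out["Records"]]
```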
Amazon S3 buckets
Amazon S3 buckets are used as follows:
- Landing data bucket: Raw CDC data lands in an S3 bucket prior to transformation.
- Configuration bucket: Stores metadata and configurations used by the lakehouse system.
- Iceberg table storage: Data is stored and optimized in Iceberg format tables.
High-level flow
Initial setup
- VPC and infrastructure provisioning - Configure a VPC in your AWS account along with subnets, S3 buckets, Kinesis streams, and IAM roles by following the instructions in the Qlik documentation (an illustrative sketch follows this list).
- Network integration configuration - The tenant admin creates a network integration in Qlik Talend Cloud using the previously provisioned infrastructure details.
- Deployment of Qlik components - Qlik automatically provisions the data-plane gateway and a lakehouse cluster within your VPC.
- Establish communication - The data-plane gateway securely establishes communication with Qlik Talend Cloud.
- Gateway deployment - Deploy a Data Movement Gateway (CDC) either on-premises or in your cloud environment, including the data-plane VPC.
- Ready to operate - When setup is complete, you can create and manage Qlik Open Lakehouse projects and tasks according to your access permissions.
Creating a Qlik Open Lakehouse project
The following task types are available:
Landing data task
- Source configuration - The data movement gateway is configured to capture changes from source systems, including RDBMS, SAP, mainframes, and more.
- Data landing - The CDC task continuously sends raw change data to the designated S3 landing bucket in your AWS account (a verification sketch follows).
Storage data task
- Register an Iceberg catalog connection, for example, AWS Glue Data Catalog.
- Define a storage task in Qlik Talend Cloud.
- Qlik Talend Cloud sends task definitions to the data-plane gateway.
- The data-plane gateway securely forwards the task instructions to the Qlik lakehouse cluster.
- The cluster continuously reads raw data from a landing bucket in S3, processes it, and writes the output to Iceberg tables in S3 (see the query sketch after this list).
- The lakehouse cluster automatically scales up or down based on load, according to predefined preferences in the lakehouse cluster settings.
- Monitoring data is sent to Qlik Talend Cloud, and logs and metrics are forwarded to Qlik.
Mirror data task
You can create external Iceberg tables to enable querying data stored in your data lake from Snowflake without duplication. This allows you to use the Snowflake analytics engine on top of Iceberg-managed data stored in formats such as Parquet on S3. Referencing external tables rather than duplicating data into Snowflake reduces storage costs, maintains a single source of truth, and ensures consistency between lakehouse and warehouse environments.
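A hedged sketch of the Snowflake side, using the Snowflake Python connector: it assumes a catalog integration and external volume have already been configured in Snowflake, and every identifier here is a placeholder. The created Iceberg table references the externally managed data instead of copying it.

```python
import snowflake.connector

# Placeholder credentials; use your account's authentication method.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="LAKE_DB", schema="PUBLIC",
)

# Assumes CATALOG_INT (catalog integration) and ICEBERG_VOL (external
# volume) were configured beforehand; the table data stays in S3.
conn.cursor().execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'ICEBERG_VOL'
      CATALOG = 'CATALOG_INT'
      CATALOG_TABLE_NAME = 'orders'
""")

# Query with the Snowflake engine; no data is duplicated into Snowflake.
for row in conn.cursor().execute("SELECT COUNT(*) FROM orders"):
    print(row)
```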
Communication between your network integration and Qlik Talend Cloud
The network integration establishes an outbound secured connection (HTTPS) to Qlik Talend Cloud. Upon successful acceptance, the connection is upgraded to a secure WebSocket (WSS). An additional, dedicated communication channel (WSS) is established between the network integration and Qlik Talend Cloud to receive lakehouse-specific task commands and controls. Periodically, the network integration establishes a secure connection (HTTPS) to Qlik Talend Cloud to send and receive data-related events. Metrics and logs are sent to Qlik from the lakehouse clusters.
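Conceptually, the outbound-only pattern resembles a WebSocket client dialing out over TLS. The sketch below, using the Python websockets library, is purely illustrative: the endpoint and messages are placeholders, and the real protocol between the network integration and Qlik Talend Cloud is internal to Qlik.

```python
import asyncio

import websockets  # third-party library: pip install websockets


async def control_channel() -> None:
    # Placeholder endpoint; the connection is always dialed outbound,
    # so no inbound firewall rules are required on your side.
    async with websockets.connect("wss://example.qlikcloud.invalid/control") as ws:
        await ws.send('{"type": "hello"}')  # announce ourselves
        async for message in ws:            # receive task commands
            print("command received:", message)


asyncio.run(control_channel())
```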
The following measures are taken to ensure your data is secure:
- All connections from your network integration to Qlik Talend Cloud are outbound. No inbound access is required.
- Metadata, commands, and control requests are transmitted over communication channels secured with HTTPS, creating an additional layer of encryption between the network integration and Qlik Talend Cloud.
- All data flows between resources that you own. Data is never sent to Qlik Talend Cloud. Only metadata, such as table and column names, is sent to Qlik Talend Cloud to allow task definitions.
- Logs and metrics are anonymized before they are sent to Qlik. Qlik uses this anonymized data to proactively support you if the logs or metrics indicate an issue.