Verifying datasets
Once the initial replication process is complete in Qlik Talend Cloud, the new destination schema will contain all data replicated from your sources. Any historical data replicated through Stitch will still exist in your data warehouse under a different schema. Some data that was replicated in the Stitch schema might not be available in the Qlik Talend Cloud target schema for any of the following reasons:
-
The historical data is no longer available.
-
Different connector versions leverage different APIs for an integration, and over time, data replicated through those APIs changes.
-
Source data is deleted over time.
The following verification queries are included to help you identify differences between the datasets that were replicated with Qlik Talend Cloud and those that were replicated with Stitch. Run these example queries against your Qlik Talend Cloud destination schema and your Stitch destination schema, and then compare and identify any differences between them.
Example query using hash row-level comparison
The easiest way to compare the output of each of the schemas (Stitch and Qlik Talend Cloud) is to use a hash row-level comparison to find mismatches in the data.
In the example below, the generated hash is based on a few key elements of the data, such as primary_key and a date row, but it can be extended to include more columns to compare as highlighted in column_1 and column_2 in the hash.
Do the following:
For each query, replace the values stitch_schema.stitch_table and qtc_schema.qtc_table with your Stitch schema name and Qlik Talend Cloud schema name.
-
Generate an MD5 Hash for tables in the existing Stitch schema.
SELECT id, HASH_MD5([primary_key] || '|' || [column_1] || '|' || [column_2] || '|' || [date]) AS hash_value FROM [stitch_schema.stitch_table];
-
Generate MD5 Hash for tables in the new Qlik Talend Cloud schema.
SELECT id, HASH_MD5([primary_key] || '|' || [column_1] || '|' || [column_2] || '|' || [date]) AS hash_value FROM [qtc_schema.qtc_table];
-
Compare the two hashes to identify data inconsistencies:
SELECT id FROM (SELECT id, HASH_MD5([primary_key] || '|' || [column_1] || '|' || [column_2] || '|' || [date]) AS hash FROM [stitch_schema.stitch_table) stitch JOIN (SELECT id, HASH_MD5([primary_key] || '|' || [column_1] || '|' || [column_2] || '|' || [date]) AS hash FROM [qtc_schema.qtc_table]) qtc ON stitch.id = qtc.id WHERE stitch.hash <> qtc.hash
The comparison will return data when there are differences across both schemas. You can extend this query to make the comparison more comprehensive by including more columns to hash.
Example query comparing the structural integrity across schemas
Alternatively, you can compare and verify datasets between Stitch and Qlik Talend Cloud by running a structural validation query on the migrated datasets. This type of query compares the structural integrity across the existing Stitch schema and the new Qlik Talend Cloud schema to identify potential differences.
To effectively run structural validation on the migrated datasets, you need to compare high-level information from both schemas together. This includes comparing row and column counts from each of the migrated tables, and comparing primary keys, null values, and foreign keys in each table.
Do the following:
For each query, replace the values stitch_schema.stitch_table and qtc_schema.qtc_table with your Stitch schema name and Qlik Talend Cloud schema name.
-
Compare the primary key columns that are generated in each schema using the following example queries.
-
Stitch
-
Qlik Talend Cloud
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '[stitch_schema.stitch_table]';
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '[qtc_schema.qtc_table]';
-
-
Compare rows and columns across each table.
-
Stitch
SELECT COUNT(*) FROM [stitch_schema.stitch_table]; -- From Stitch's target (for example, Snowflake)
-
Qlik Talend Cloud
SELECT COUNT(*) FROM [qtc_schema.qtc_table]; -- From Qlik Talend Cloud's target
-
-
Check for missing records that are not available in the Qlik Talend Cloud schema.
SELECT id FROM [stitch_schema.stitch _table] WHERE id NOT IN (SELECT id FROM [qtc_schema.qtc _table]);
Once you have the query reports, compare the results to identify any major changes that affect your dataset now that you have transitioned to Qlik Talend Cloud. If the reports show minimal differences, you can continue with the process of pointing your downstream reporting to the new schemas without any issue. If the reports show major differences with your new destination schema, you will need to adjust downstream reporting and modeling capabilities to work with the destination schema that will be kept up-to-date by the Qlik Talend Cloud replication tasks after the migration.