Question:
Hi all,
I am working on a data warehouse project where we receive order files on a monthly basis. Each file contains all orders for that month, and it grows daily as new orders are added.
Currently, we load this data into a raw table daily, but this leads to duplicate records in the cleansed table because the same month’s data is appended every day.
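To show where the duplicates come from, the daily load is essentially a straight append from the raw table into the cleansed table, roughly like this (table and column names here are simplified placeholders, not our real schema):

-- Simplified version of the current daily load: every run re-appends the
-- whole month-to-date file, so the cleansed table accumulates duplicates.
INSERT INTO dbo.OrdersCleansed (Customer, InvoiceDate, InvoiceNumber, SKU, Amount, IGST, Balance)
SELECT Customer, InvoiceDate, InvoiceNumber, SKU, Amount, IGST, Balance
FROM dbo.OrdersRaw
WHERE FileLoadDate = CAST(GETDATE() AS date);  -- today's copy of the cumulative file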
Challenges:
There is no unique order ID in the file.
The combination of Customer + Invoice Date + Invoice Number + SKU is not always unique, because the same SKU may appear multiple times per invoice.
Some fields, such as amounts, IGST, and balances, can change in subsequent daily files.
I also need to compare dates, mainly to identify and delete all records for a given month, e.g., when the end-of-month data arrives.
I want to move safely from one month to the next (e.g., October → November) without losing any data, taking time zones, daylight saving time, and other transitions into account.
Current idea:
For daily loads, delete all of that month's records from the cleansed table and reload the full file for the month. This ensures there are no duplicates and that the latest data is always loaded.
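To make this concrete, the rough shape I have in mind is below (again with placeholder names; @MonthStart would come from the file being processed):

-- Delete-and-reload for one month, in a single transaction so a failed
-- reload doesn't leave the month's data missing.
DECLARE @MonthStart date = '2024-10-01';                  -- first day of the file's month (illustrative)
DECLARE @NextMonthStart date = DATEADD(MONTH, 1, @MonthStart);

BEGIN TRANSACTION;

DELETE FROM dbo.OrdersCleansed
WHERE InvoiceDate >= @MonthStart
  AND InvoiceDate <  @NextMonthStart;                     -- half-open range covers the whole month

INSERT INTO dbo.OrdersCleansed (Customer, InvoiceDate, InvoiceNumber, SKU, Amount, IGST, Balance)
SELECT Customer, InvoiceDate, InvoiceNumber, SKU, Amount, IGST, Balance
FROM dbo.OrdersRaw
WHERE InvoiceDate >= @MonthStart
  AND InvoiceDate <  @NextMonthStart;

COMMIT TRANSACTION;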
Questions:
Is this approach reasonable for daily loads given a cumulative monthly file?
Are there better ways to handle this scenario in a data warehouse without a unique order key?
How can I efficiently compare date values in SQL to identify all records for a specific month, especially considering month-end transitions and timezone issues? (I've sketched the two predicate styles I'm weighing after this list.)
Any best practices for auditing, performance, or incremental loading in this scenario?
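For the date-comparison question, these are the two predicate styles I'm currently weighing (InvoiceDate is a DATE column; names and values are illustrative):

-- (a) Half-open range on the column itself, which I believe stays sargable.
DECLARE @MonthStart date = DATEFROMPARTS(2024, 10, 1);    -- month being processed (illustrative)

SELECT COUNT(*) AS RowsInMonth
FROM dbo.OrdersCleansed
WHERE InvoiceDate >= @MonthStart
  AND InvoiceDate <  DATEADD(MONTH, 1, @MonthStart);

-- (b) Wrapping the column in functions, which I suspect prevents index use.
SELECT COUNT(*) AS RowsInMonth
FROM dbo.OrdersCleansed
WHERE YEAR(InvoiceDate) = 2024
  AND MONTH(InvoiceDate) = 10;

Is (a) the right pattern, or is there a better way once time zones and month-end cutoffs come into play?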
I’m using Azure SQL, but general best practices are welcome.
Thanks in advance!