Best Approach to Avoid Duplicates When Loading Cumulative Monthly Orders Daily & Handling Month Transitions

Question

Best Approach to Avoid Duplicates When Loading Cumulative Monthly Orders Daily & Handling Month Transitions

YERNAIDU SIRAPARAPU 125

Question:

Hi all,

I am working on a data warehouse project where we receive orders files on a monthly basis. Each file contains all orders for that month, and the file grows daily as new orders are added.

Currently, we load this data into a raw table daily, but this leads to duplicate records in the cleansed table because the same month’s data is appended every day.

Challenges:

There is no unique order ID in the file.

The combination of Customer + Invoice Date + Invoice Number + SKU is not always unique, because the same SKU may appear multiple times per invoice.

Some fields like amounts, IGST, balances can change in subsequent daily files.

I also need to compare dates, mainly to identify and delete all records for a given month, i.e., end-of-month data.

I want to safely move from one month to the next (e.g., October → November) without losing any data, taking into account timezones, daylight savings, or other transitions.

Current idea: For daily loads, delete all records of the month in the cleansed table and reload the full file for that month. This ensures no duplicates and the latest data is always loaded.

Questions:

Is this approach reasonable for daily loads given a cumulative monthly file?

Are there better ways to handle this scenario in a data warehouse without a unique order key?

How can I efficiently compare date values in SQL to identify all records for a specific month, especially considering month-end transitions and timezone issues?

Any best practices for auditing, performance, or incremental loading in this scenario?

I’m using Azure SQL, but general best practices are welcome.

Thanks in advance! Question:

Hi all,

I am working on a data warehouse project where we receive orders files on a monthly basis. Each file contains all orders for that month, and the file grows daily as new orders are added.

Currently, we load this data into a raw table daily, but this leads to duplicate records in the cleansed table because the same month’s data is appended every day.

Challenges:

There is no unique order ID in the file.

The combination of Customer + Invoice Date + Invoice Number + SKU is not always unique, because the same SKU may appear multiple times per invoice.

Some fields like amounts, IGST, balances can change in subsequent daily files.

I also need to compare dates, mainly to identify and delete all records for a given month, i.e., end-of-month data.

I want to safely move from one month to the next (e.g., October → November) without losing any data, taking into account timezones, daylight savings, or other transitions.

Current idea:
For daily loads, delete all records of the month in the cleansed table and reload the full file for that month. This ensures no duplicates and the latest data is always loaded.

Questions:

Is this approach reasonable for daily loads given a cumulative monthly file?

Are there better ways to handle this scenario in a data warehouse without a unique order key?

How can I efficiently compare date values in SQL to identify all records for a specific month, especially considering month-end transitions and timezone issues?

Any best practices for auditing, performance, or incremental loading in this scenario?

I’m using Azure SQL, but general best practices are welcome.

Thanks in advance!

1 answer

Your answer

Answer 1

Unfortunately, some of these questions requires detailed knowledge about the data.

The scenario you describe sounds awful, and one would like to think that the underlying source should be able to produce a unique key.

But if you do not, deleting all data for the current month and loading today's file sounds reasonable, although it may increase the processing time. Then again, no matter how you do it, you will need to read the file of today into a staging table, and then trying merge it into the table would also take resources.

How can I efficiently compare date values in SQL to identify all records for a specific month, especially considering month-end transitions and timezone issues?

This question is not possible without knowledge about the data. However, one possibility is that you add a column with a default value. Say that for the first month you set the default to 1. When you are about to load the first file of the moth, you change the default value to 2. When you get the file the next day, you delete rows based on the column with the default value.

Rather than having a default value that you need to remember to change, you may be able to set it from your loading process. If the names of the files that you get ties the file to a certain month, you could work from that.

Share via

Best Approach to Avoid Duplicates When Loading Cumulative Monthly Orders Daily & Handling Month Transitions

1 answer

Your answer