Ever been faced with a source database that doesn’t have a ‘last modified date’ you can use to find new or changed data? It can be a nightmare to work out which records need loading into your data warehouse, particularly if you have a fact table whose records can change after they’re inserted – you end up resorting to things like checksums to work out whether an existing record needs updating.
Today it hit me that you can use replication to detect changes in a source system. The solution relies on replicating the source table to a new database and marking the records in the subscriber with a modification date. Transactional replication is pretty low cost on the publishing database, and if the source is busy or runs 24×7 it can be a good idea to replicate it out anyway and use the subscriber as the source for your data warehouse loads. It works really well if you are loading during working hours – perhaps you have to do an intraday load to get near real-time data into your warehouse, for example.
Here’s how it works. Create a publication on the source database that includes the tables and columns you need in order to build your warehouse tables. Do an initial synchronisation. You now have a subscriber that’s being constantly kept up-to-date with the source. Here’s a simple example of a source table.
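Something along these lines, say – only OrderNumber matters for this technique; the other columns are purely illustrative:

```sql
-- Illustrative source table; in a real OLTP system this would
-- be whatever order table the client already has.
CREATE TABLE dbo.Orders
(
    OrderNumber  int           NOT NULL PRIMARY KEY,
    CustomerId   int           NOT NULL,
    OrderValue   decimal(18,2) NOT NULL
);
```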
It has a primary key of OrderNumber. This equates to the primary key which you’d find in a real OLTP system.
On the subscribed copy of this table, add an extra column at the end of the table with a data type of ‘datetime’. You’ll have to allow nulls because the records currently in the table (sent from the publisher) won’t have a value for this column.
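In T-SQL, assuming the replicated table is called dbo.Orders (adjust to suit):

```sql
-- Run on the SUBSCRIBER database only. Must be nullable because
-- the rows already replicated from the publisher have no value for it.
ALTER TABLE dbo.Orders ADD ModifiedDate datetime NULL;
```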
CREATE TRIGGER UpdateModifiedDate
ON dbo.Orders  -- the replicated table on the subscriber
AFTER INSERT, UPDATE
AS
UPDATE o SET ModifiedDate = CURRENT_TIMESTAMP
FROM dbo.Orders AS o JOIN inserted AS i ON o.OrderNumber = i.OrderNumber
Now when a record is added to the source system by any process, the replicated INSERT fires the trigger on the subscriber and the new row is stamped with a modification date.
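For instance, with the illustrative dbo.Orders table, the distribution agent effectively runs something like this on the subscriber:

```sql
-- The replicated insert, as applied on the subscriber
-- (values are made up for illustration):
INSERT INTO dbo.Orders (OrderNumber, CustomerId, OrderValue)
VALUES (1001, 42, 150.00);
-- The trigger then stamps the row, so its ModifiedDate holds the
-- time the change landed on the subscriber rather than NULL.
```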
Updates are picked up too: a replicated UPDATE fires the same trigger, so ModifiedDate is refreshed whenever a source row changes.
So now you can use the ModifiedDate column as part of your extract management process in SSIS.
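A sketch of such an extract query – @LastExtractTime stands in for whatever high-water mark your SSIS package persists between runs:

```sql
-- Incremental extract: pull everything changed since the last load.
-- @LastExtractTime would come from your ETL metadata store (hypothetical).
SELECT OrderNumber, CustomerId, OrderValue, ModifiedDate
FROM dbo.Orders
WHERE ModifiedDate > @LastExtractTime;
```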
Good things about this solution:
- No changes on the client’s source database
- Minimal overhead on the production system
- Because the trigger is fired on a subscriber database, you aren’t adding a performance penalty to the OLTP system itself.
- Can easily be extended – you could have another column on the subscribed table which indicates if the change was an insert or an update. And you could track deletes the same way, by having a second table on the subscriber into which you record deleted order numbers using an ON DELETE trigger.
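A sketch of the delete-tracking extension, again assuming the illustrative dbo.Orders table:

```sql
-- Run on the subscriber: capture the keys of deleted rows so the
-- warehouse load can expire or soft-delete them.
CREATE TABLE dbo.DeletedOrders
(
    OrderNumber int      NOT NULL,
    DeletedDate datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
);
GO
CREATE TRIGGER TrackDeletedOrders
ON dbo.Orders
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- 'deleted' is the pseudo-table of removed rows; this handles
    -- multi-row deletes correctly.
    INSERT INTO dbo.DeletedOrders (OrderNumber)
    SELECT OrderNumber FROM deleted;
END
```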
And a couple of potentially bad things about this solution:
- On the initial synch you won’t have this column populated – but you could soon fix that with a quick UPDATE statement.
- If the subscription is ever re-initialised you will lose your ModifiedDate column, because the subscriber’s schema is rebuilt from the publisher’s, and you’ll have to re-add the column (and the trigger). But you’d know that all the records pushed in by the re-initialisation were already in your data warehouse, so you could ignore records that had a null in the column.
- As with any transactional replication, generating the initial snapshot can be a rather painful process consuming a huge amount of server resources – on a big table you really need to do this in an outage window.
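The backfill mentioned in the first point, for the illustrative dbo.Orders table, is a one-liner:

```sql
-- Run once on the subscriber after the initial synch: give every
-- already-replicated row a nominal modification date.
UPDATE dbo.Orders
SET ModifiedDate = CURRENT_TIMESTAMP
WHERE ModifiedDate IS NULL;
```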
Let me know if this sounds useful but you need help implementing it.