Practical Business Intelligence

June 30, 2010

Using Replication to Find Changed Source Data

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 9:35 pm

Ever been faced with a source database that doesn’t have a ‘last modified date’ which you can use to find new or changed data?  It can be a nightmare to find the records which you need to add to your data warehouse, particularly if you have a fact table which has records that can be changed after inserting – you have to do things like checksums to try to determine if you need to update an existing record.

Today it hit me that you can use replication to find out changes in a source.  The solution relies on replicating the source table to a new database, and marking the records in the subscriber with a modification date.  Transactional replication is pretty low cost on the publishing database, and if the source database is pretty busy or runs 24×7, it can be quite a good idea to replicate it out anyway, and use the subscriber as the source for your data warehouse loads.  Works really well if you are loading during working hours – perhaps you have to do an intraday load to get near real-time data into your warehouse for example.

Here’s how it works.  Create a publication on the source database that includes the tables and columns you need in order to build your warehouse tables.  Do an initial synchronisation.  You now have a subscriber that’s being constantly kept up-to-date with the source.  Here’s a simple example of a source table.

Published table

It has a primary key of OrderNumber.  This equates to the primary key which you’d find in a real OLTP system.

On the subscribed copy of this table, add an extra column at the end of the table with a data type of ‘datetime’.  You’ll have to allow nulls because the records currently in the table (send from the publisher) won’t have this column.

Replicated table with modification tracking column
With this column in place, you can add a trigger to the table for insert and update to add the timestamp.  The code you’d need would be something like:

CREATE TRIGGER UpdateModifiedDate
   ON  Sales
   FOR INSERT,UPDATE
AS
BEGIN

    update Sales
    set ModifiedDate = current_timestamp
    where OrderNumber = (select OrderNumber from inserted)

END

Now when a record is added to the source system by any process, in the replicated subscriber, you will get something like this:

Replicated records after an insert

Updates are respected too:

Replicated records after an update

So now you can use the Modified Date column as part of your extract management process in SSIS.

Good things about this solution:

  • No changes on the client’s source database
  • Minimal overhead on the production system
  • Because the trigger is fired on a subscriber database, you aren’t adding a performance penalty to the OLTP system itself.
  • Can easily be extended – you could have another column on the subscribed table which indicates if the change was an insert or an update.  And you could track deletes the same way, by having a second table on the subscriber into which you record deleted order numbers using an ON DELETE trigger.

And a couple of potentially bad things about this solution:

  • On the initial synch you won’t have this column populated – but you could soon fix that with a quick UPDATE statement.
  • If the subscription is ever re-initialised you will lose your Modified Date column because the schema will be synched back to the publisher and you will have to re-add the column.  But you’d know that all the records pushed in by the re-initialisation were already in your data warehouse so you could ignore records that had a null in the column.
  • As with any transactional replication, generating the initial snapshot can be a rather painful process consuming a huge amount of server resources – on a big table you really need to do this in an outage window.

Let me know if this sounds useful but you need help implementing it.

About these ads

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Shocking Blue Green Theme Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: