Practical Business Intelligence

December 8, 2010

Determining if an Email Address is Valid in SSIS

Filed under: ETL, Scripting, SSIS — Duncan Sutcliffe @ 10:47 am

Had a requirement to establish whether a provided string in a source system matched a valid pattern for email addresses today.  Not to actually try to send a mail to the address, just to check that it was theoretically valid.

There is a useful bit of regular expression code available from Microsoft at http://msdn.microsoft.com/en-us/library/01escwtf.aspx so all I had to do was work it into a script component.  Here’s how.

Add a script component to the data flow and set the properties you’ll need.  The input column will be the email address field you want to check, then add an output column with the DataType Boolean.

image

In the script itself, you need to add the regular expressions library to the script’s using statements:

using System; using System.Text.RegularExpressions;

Immediately after this, add a new class to actually parse the email address and use the regex to determine its validity:

public class RegexUtilities { public static bool IsValidEmail(string strIn) { // Return true if strIn is in valid e-mail format. return Regex.IsMatch(strIn, @"^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$"); } }

Finally, inside the ProcessInputRow add some code to execute the validation class on each row:

public override void Input0_ProcessInputRow(Input0Buffer Row) { string emailAddress; emailAddress = Row.EMAILADDRESS; bool valid = false; if (RegexUtilities.IsValidEmail(emailAddress)) valid = true; else valid = false; Row.ValidEmailAddress = valid; }

 

The field ValidEmailAddress will now be on the dataflow.

December 5, 2010

Extracting Source Records by Time Window

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 2:32 pm

More and more source systems seem to include 24 hour operation – for example retailers who might once have worked nine to five now accept orders via the Internet.  At the same time, data warehouses are getting closer to ‘real time’ information provision by running load procedures more than once per day.

So it’s more important than ever to make sure you set a defined time period for which your ETL is loading records, with a start and end time.  This period needs to be respected for every extract in a run, so for example you need to make sure you extract order headers and order lines which cover exactly the same time period.

Here’s the simplest but most reliable way I have been able to find to do this.

The first step in the SSIS defines a new load and sets the date/time range which the load covers.  Subsequent steps use these time periods when selecting source records.

Simplified SSIS package showing setting a time period range with an Execute SQL task(In a “real” solution I’d expect the first step to be done in your ‘master’ package and the results passed to subsequent child packages.)

The Execute SQL step does two things: creates a new load and sets the time window it is working with, then places the information in package variables.  The variables required are a load ID, a start date/time and a finish date/time:

image

In the meta-database which logs and controls the ETL runs, there is a simple table which mirrors these three values.

CREATE TABLE [dbo].[LoadWindowControl]( [LoadID] [int] identity NOT NULL, [LoadFrom] [datetime] NOT NULL, [LoadTo] [datetime] NOT NULL ) ON [PRIMARY]; GO

The SQL task calls a stored procedure:

CREATE PROCEDURE [dbo].[uspSetLoadWindow] @LoadID int output, @LoadFrom datetime output, @LoadTo datetime output AS BEGIN SET NOCOUNT ON; declare @StartTime datetime; select @StartTime = dateadd(s, 1, coalesce(max(LoadTo), 0)) from LoadWindowControl; declare @EndTime datetime; set @EndTime = convert(datetime, convert(varchar(30), CURRENT_TIMESTAMP, 120)); insert into LoadWindowControl (LoadFrom, LoadTo) values (@StartTime, @EndTime) select @LoadID = @@IDENTITY, @LoadFrom = @StartTime, @LoadTo = @EndTime END; GO

This is very simple – just adding a new record into the list of loads with an end date of “now” and a start date of the last end plus one second.  All three of these values are passed back as output parameters so the SSIS package can put them into variables.  Calling the proc from SSIS is easy:

image

And the parameter mapping inserts the right value to the right variable:

image

The first time this package runs, the table gets a record like this:

image

To check these values are being passed into the SSIS variables, here’s a quick look at the Locals window for anyone confused by these dates, remember I’m in the UK so the format is dd/mm/yyyy):

image

With these parameter values now populated them you can use them as part of the SQL command in each source.  For example:

image

image

For each package you can pass these variables through as configurations, or you could just pass through the load ID and retrieve the related dates using an Execute SQL task.

June 30, 2010

Using Replication to Find Changed Source Data

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 9:35 pm

Ever been faced with a source database that doesn’t have a ‘last modified date’ which you can use to find new or changed data?  It can be a nightmare to find the records which you need to add to your data warehouse, particularly if you have a fact table which has records that can be changed after inserting – you have to do things like checksums to try to determine if you need to update an existing record.

Today it hit me that you can use replication to find out changes in a source.  The solution relies on replicating the source table to a new database, and marking the records in the subscriber with a modification date.  Transactional replication is pretty low cost on the publishing database, and if the source database is pretty busy or runs 24×7, it can be quite a good idea to replicate it out anyway, and use the subscriber as the source for your data warehouse loads.  Works really well if you are loading during working hours – perhaps you have to do an intraday load to get near real-time data into your warehouse for example.

Here’s how it works.  Create a publication on the source database that includes the tables and columns you need in order to build your warehouse tables.  Do an initial synchronisation.  You now have a subscriber that’s being constantly kept up-to-date with the source.  Here’s a simple example of a source table.

Published table

It has a primary key of OrderNumber.  This equates to the primary key which you’d find in a real OLTP system.

On the subscribed copy of this table, add an extra column at the end of the table with a data type of ‘datetime’.  You’ll have to allow nulls because the records currently in the table (send from the publisher) won’t have this column.

Replicated table with modification tracking column
With this column in place, you can add a trigger to the table for insert and update to add the timestamp.  The code you’d need would be something like:

CREATE TRIGGER UpdateModifiedDate
   ON  Sales
   FOR INSERT,UPDATE
AS
BEGIN

    update Sales
    set ModifiedDate = current_timestamp
    where OrderNumber = (select OrderNumber from inserted)

END

Now when a record is added to the source system by any process, in the replicated subscriber, you will get something like this:

Replicated records after an insert

Updates are respected too:

Replicated records after an update

So now you can use the Modified Date column as part of your extract management process in SSIS.

Good things about this solution:

  • No changes on the client’s source database
  • Minimal overhead on the production system
  • Because the trigger is fired on a subscriber database, you aren’t adding a performance penalty to the OLTP system itself.
  • Can easily be extended – you could have another column on the subscribed table which indicates if the change was an insert or an update.  And you could track deletes the same way, by having a second table on the subscriber into which you record deleted order numbers using an ON DELETE trigger.

And a couple of potentially bad things about this solution:

  • On the initial synch you won’t have this column populated – but you could soon fix that with a quick UPDATE statement.
  • If the subscription is ever re-initialised you will lose your Modified Date column because the schema will be synched back to the publisher and you will have to re-add the column.  But you’d know that all the records pushed in by the re-initialisation were already in your data warehouse so you could ignore records that had a null in the column.
  • As with any transactional replication, generating the initial snapshot can be a rather painful process consuming a huge amount of server resources – on a big table you really need to do this in an outage window.

Let me know if this sounds useful but you need help implementing it.

Theme: Shocking Blue Green. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.