Practical Business Intelligence

December 8, 2010

Determining if an Email Address is Valid in SSIS

Filed under: ETL, Scripting, SSIS — Duncan Sutcliffe @ 10:47 am

Had a requirement to establish whether a provided string in a source system matched a valid pattern for email addresses today.  Not to actually try to send a mail to the address, just to check that it was theoretically valid.

There is a useful bit of regular expression code available from Microsoft at http://msdn.microsoft.com/en-us/library/01escwtf.aspx so all I had to do was work it into a script component.  Here’s how.

Add a script component to the data flow and set the properties you’ll need.  The input column will be the email address field you want to check, then add an output column with the DataType Boolean.

image

In the script itself, you need to add the regular expressions library to the script’s using statements:

using System; using System.Text.RegularExpressions;

Immediately after this, add a new class to actually parse the email address and use the regex to determine its validity:

public class RegexUtilities { public static bool IsValidEmail(string strIn) { // Return true if strIn is in valid e-mail format. return Regex.IsMatch(strIn, @"^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$"); } }

Finally, inside the ProcessInputRow add some code to execute the validation class on each row:

public override void Input0_ProcessInputRow(Input0Buffer Row) { string emailAddress; emailAddress = Row.EMAILADDRESS; bool valid = false; if (RegexUtilities.IsValidEmail(emailAddress)) valid = true; else valid = false; Row.ValidEmailAddress = valid; }

 

The field ValidEmailAddress will now be on the dataflow.

December 5, 2010

Extracting Source Records by Time Window

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 2:32 pm

More and more source systems seem to include 24 hour operation – for example retailers who might once have worked nine to five now accept orders via the Internet.  At the same time, data warehouses are getting closer to ‘real time’ information provision by running load procedures more than once per day.

So it’s more important than ever to make sure you set a defined time period for which your ETL is loading records, with a start and end time.  This period needs to be respected for every extract in a run, so for example you need to make sure you extract order headers and order lines which cover exactly the same time period.

Here’s the simplest but most reliable way I have been able to find to do this.

The first step in the SSIS defines a new load and sets the date/time range which the load covers.  Subsequent steps use these time periods when selecting source records.

Simplified SSIS package showing setting a time period range with an Execute SQL task(In a “real” solution I’d expect the first step to be done in your ‘master’ package and the results passed to subsequent child packages.)

The Execute SQL step does two things: creates a new load and sets the time window it is working with, then places the information in package variables.  The variables required are a load ID, a start date/time and a finish date/time:

image

In the meta-database which logs and controls the ETL runs, there is a simple table which mirrors these three values.

CREATE TABLE [dbo].[LoadWindowControl]( [LoadID] [int] identity NOT NULL, [LoadFrom] [datetime] NOT NULL, [LoadTo] [datetime] NOT NULL ) ON [PRIMARY]; GO

The SQL task calls a stored procedure:

CREATE PROCEDURE [dbo].[uspSetLoadWindow] @LoadID int output, @LoadFrom datetime output, @LoadTo datetime output AS BEGIN SET NOCOUNT ON; declare @StartTime datetime; select @StartTime = dateadd(s, 1, coalesce(max(LoadTo), 0)) from LoadWindowControl; declare @EndTime datetime; set @EndTime = convert(datetime, convert(varchar(30), CURRENT_TIMESTAMP, 120)); insert into LoadWindowControl (LoadFrom, LoadTo) values (@StartTime, @EndTime) select @LoadID = @@IDENTITY, @LoadFrom = @StartTime, @LoadTo = @EndTime END; GO

This is very simple – just adding a new record into the list of loads with an end date of “now” and a start date of the last end plus one second.  All three of these values are passed back as output parameters so the SSIS package can put them into variables.  Calling the proc from SSIS is easy:

image

And the parameter mapping inserts the right value to the right variable:

image

The first time this package runs, the table gets a record like this:

image

To check these values are being passed into the SSIS variables, here’s a quick look at the Locals window for anyone confused by these dates, remember I’m in the UK so the format is dd/mm/yyyy):

image

With these parameter values now populated them you can use them as part of the SQL command in each source.  For example:

image

image

For each package you can pass these variables through as configurations, or you could just pass through the load ID and retrieve the related dates using an Execute SQL task.

July 21, 2010

Portable SSIS Configs With a Single Script

Filed under: Configurations, Scripting, SSIS — Duncan Sutcliffe @ 9:09 am

Following from yesterday’s post, here’s how to achive SSIS configurations with a script instead of a SQL statement and a loop.

Database setup is exactly the same, and the two variables for environment and configuiration database connection string are set up the same, but the package needs just a script task.  The only property for the script which needs to be set is MaximumErrorCount which should be set to zero.

This script connects to the configuration database, selects all the variables that are valid for the current package and environment and applies them.  In the script code add another using for System.Data.SQLClient then add the following:

        public void Main()
        {
            // Create a variable dispenser and a variables collection
            VariableDispenser disp = this.Dts.VariableDispenser;
            Variables vars = default(Variables);

            String connString = null;
            String environmentName = null;
            String packageName = null;

            try
            {
                disp.LockForRead("User::strConfigDatabaseConnString");
                disp.LockForRead("User::strEnvironment");
                disp.LockForRead("System::PackageName");
                disp.GetVariables(ref vars);

                connString = vars["User::strConfigDatabaseConnString"].Value.ToString();
                environmentName = vars["User::strEnvironment"].Value.ToString();
                packageName = vars["System::PackageName"].Value.ToString();
                try
                {
                    // Connect to the configuration database
                    SqlConnection conn = new SqlConnection(connString);
                    conn.Open();

                    // Retrieve configuration values
                    String sql = String.Concat("select VarName, VarValue, VarType from ConfigVars where EnvironmentName = '",
                        @environmentName, "' and (PackageName = 'Sys' or PackageName = '", @packageName, "')");
                    SqlCommand command = new SqlCommand(sql, conn);

                    SqlDataReader reader = command.ExecuteReader();

                    // Loop through the reader
                    while (reader.Read())
                    {
                        String varName = reader["VarName"].ToString();
                        String varValue = reader["VarValue"].ToString();
                        String varType = reader["VarType"].ToString();

                        // Set the value for the variable
                        SetVariable(varName, varValue, varType);
                    }

                    // Tidy up
                    reader.Close();
                    reader.Dispose();
                    command.Dispose();
                    conn.Close();
                    conn.Dispose();

                }
                catch
                {
                    Dts.Events.FireWarning(0, "Set configuration variable values", "Unable to retrieve configuration values from database.", "", 0);
                }
            }
            catch
            {
                Dts.Events.FireWarning(0, "Set configuration variable values", "Unable to get configuration database connection string.", "", 0);
            }

            Dts.TaskResult = (int)ScriptResults.Success;
        }

        public void SetVariable(string varName, string varValue, string varType)
        {
            VariableDispenser disp = this.Dts.VariableDispenser;

            Variables vars = default(Variables);
            try
            {
                // Lock the variable we are going to provide a value for
                disp.LockOneForWrite(varName, ref vars);

                // Convert the value to the right type for the variable and assign it
                if (varType.ToLower() == "string")
                {
                    vars[0].Value = varValue;
                }
                else if (varType.ToLower() == "int32")
                {
                    vars[0].Value = Convert.ToInt32(varValue);
                }
                else if (varType.ToLower() == "datetime")
                {
                    // Add more strings to this array depending on the format of dates you have used in the config database
                    String[] sFormats = { "yyyy-mm-dd" };
                    DateTime dt = DateTime.ParseExact(varValue, sFormats, null, System.Globalization.DateTimeStyles.None);
                    vars[0].Value = dt;
                }
                else if (varType.ToLower() == "boolean")
                {
                    vars[0].Value = Convert.ToBoolean(varValue);
                }

                Boolean bReFire = true;
                Dts.Events.FireProgress(String.Concat("Set ", varName, " value to '", varValue, "'."), 100, 0, 0, "Set configuration variable values", ref bReFire);

            }
            catch (DtsRuntimeException ex)
            {
                Dts.Events.FireWarning(0, "Set configuration variable values", ex.Message, "", 0);
            }
        }

Very simple and very flexible.

July 20, 2010

A Portable Alternative to Configurations in SSIS

Filed under: Business Intelligence, Configurations, Scripting, SSIS — Duncan Sutcliffe @ 3:55 pm

How do you configure SSIS to work across various environments such as dev, test and production?  At the very least, you are going to have a different target server/database but there are possibly different sources, different directories used for RAW files, different logging systems, different operators to notify on error and any number of other changes to the system between environments,

Using Configurations is the normal way to set up your SSIS packages with the right values needed to make it behave correctly, but when you deploy your SSIS from development to test or live, you need to change those configurations.  There’s no easy mechanism for this.  Here’s a more flexible alternative which doesn’t use Integration Services’ built-in Configuration system, but lets you write your own mechanism.  It takes a bit of setting up but you only need to do it once – then you just reuse the template package.

The basic principle is that for every property which needs to be configured, you create a variable and use that variable as the property value in an expression.  For example, the connection manager for your source has an expression for ‘Connection String’.  So you create a variable named ‘@strSourceConnectionString’ and place that into the ConnectionString expression for the connection.  Then, the value for that variable is read from a database at runtime.  Crucially, you tune the SQL statement that retrieves variable values so that it retrieves the correct ones for your current environment.  This is what improves it over a standard Configuration.

The method:

Firstly, create a database table which will hold your variable values.  I would put this in the ETL management/audit database.  The columns you will need will be for the variable name, the variable value, the variable type, the environment and the package name (if you have variables with the same name but different values in each package).

Variable Values Table

These columns contain:

  • The name of the variable to be configured – must be the same as the variable in SSIS.
  • The value to be assigned to the variable.
  • The data type of the variable in SSIS (because you are going to need to convert the string value to the right data type before it can be assigned).
  • The environment for which this variable should be given this value.  You can put whatever you like in this column depending on how many environments you have and what they’re called.
  • The name of the package for which this variable should be given this value.  This would need to match the name of the SSIS package exactly because it’s going to be searched for using the System::PackageName variable.  You can also use a value such as ‘SYS’ or ‘ALL’ to indicate the same value for the variable is used in all packages.

Examples might be:

Variable Values

You can see that strStringVariable has three entries – for the package VariableTest it has different values for Dev and Prod environments, and for the package VariableTest2 it has a different value.  The other variables apply to ‘Sys’ packages which I’m going to use to mean all packages.

With this structure in place, we can move to SSIS.  Your package will need to know which environment you are currently in and where to find all the configuration variables; for this I think the best way is to create two variables for Environment and ConfigConnectionString, and save this as an XML configuration.  When you deploy the SSIS to a different environment, just copy the .dtsconfig file to the same location (I create a new folder on the root of C:) on the new server.  Then you only ever have to change one config value in one XML file for the environment to retrieve all the right configurations.

Creating XML configuration file

Now create an ADO.NET connection manager and set its ConnectionString expression to the variable for the config database.

Create three string variables named strVarName, strVarValue and strVarType.  Also create the variables which you are going to add values to, remembering they need to be named exactly the same as the value in the VarName column in your config database.

The next thing to do is retrieve all the variable values from the config database and assign them to the variables in the package.  This could be done with a single script (here’s how) but here I am demonstrating grabbing the values then assigning them in a loop – it requires a bit less understanding of C# scripting in SSIS.

Add an Execute SQL Task, a ForEach Loop and a Script Task inside the loop.

SSIS Package components

The SQL task uses the config database connection, and its SQL statement is something like:

select VarName, VarValue,VarType from Configs
where EnvironmentName = @envName and (PackageName in (@pacName, ‘Sys’))

Map the @envName parameter to your Environment variable and the @pacName to the System::PackageName variable.  Set the ResultSet property to FullResultSet and place the result set into an object variable named objVariables.

Properties for the ForEach loop are:

  • Enumerator: Foreach ADO Enumerator
  • ADO object source vasriable: User::objVariables
  • Variable Mappings: User::strVarName, User::strVarValue, User::strVarType

On the loop and the script, set the MaximumErrorCount property to zero.  This is because you will get errors if the variable you are trying to assign a value to doesn’t actually exist in the package – but you don’t want to crash the whole package because of this.

The basis of the script code you need is:

        public void Main()
        {

            // Create a variable dispenser and a variables collection
            VariableDispenser disp = this.Dts.VariableDispenser;
            Variables vars = default(Variables);
           
            string varName = null;
            string varValue = null;
            string varType = null;

            try
            {
                // Lock and assign variables
                disp.LockForRead(“User::strVarName”);
                disp.LockForRead(“User::strVarValue”);
                disp.LockForRead(“User::strVarType”);
                disp.GetVariables(ref vars);
                varName = String.Concat(“User::”, vars["User::strVarName"].Value.ToString());
                varValue = vars["User::strVarValue"].Value.ToString();
                varType = vars["User::strVarType"].Value.ToString();

                vars.Unlock();
            }
            catch
            {
                Dts.Events.FireWarning(0, “Set configuration variable values”, “Unable to lock variables.”, “”, 0);
            }

            // Call method which assigns values to SSIS variables
            SetVariable(varName, varValue, varType);

            // Finish
            Dts.TaskResult = (int)ScriptResults.Success;
        }

        public void SetVariable(string varName, string varValue, string varType)
        {
            VariableDispenser disp = this.Dts.VariableDispenser;

            Variables vars = default(Variables);
            try
            {
                // Lock the variable we are going to provide a value for
                disp.LockOneForWrite(varName, ref vars);

                // Convert the value to the right type for the variable and assign it
                if (varType.ToLower() == “string”)
                {
                    vars[0].Value = varValue;
                }
                else if (varType.ToLower() == “int32″)
                {
                    vars[0].Value = Convert.ToInt32(varValue);
                }
                else if (varType.ToLower() == “datetime”)
                {
                    // Add more strings to this array depending on the format of dates you have used in the config database
                    String[] sFormats = {“yyyy-mm-dd”};
                    DateTime dt = DateTime.ParseExact(varValue, sFormats, null, System.Globalization.DateTimeStyles.None);
                    vars[0].Value = dt;
                }

                Boolean bReFire = true;
                Dts.Events.FireProgress(String.Concat(“Set “, varName, ” value to ‘”, varValue, “‘.”), 100, 0, 0, “Set configuration variable values”, ref bReFire);

            }
            catch (DtsRuntimeException ex)
            {
                Dts.Events.FireWarning(0, “Set configuration variable values”, ex.Message, “”, 0);
            }

            // Unlock variables
            vars.Unlock();
        }
    }

Note that I have used a VariableDispenser where required, so you don’t need to add any variables to the ReadOnlyVariables or ReadWriteVariables.  This script only covers string, integer and datetime variables – if you’ve got other types you’ll need to add additional condition handlers to the if…elseif block to do the conversions required.

What happens when you run the package?  If you set the strEnvironmentName variable to ‘Dev’, you get this:

Variables assigned values for development

If you set the value to ‘Prod’ you get different values:
Variables assigned values for production

So, everything is completely portable between environments.

Let me know (Twitter: @DuncanSutcliffe) if this works for you or if it doesn’t.

June 30, 2010

Using Replication to Find Changed Source Data

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 9:35 pm

Ever been faced with a source database that doesn’t have a ‘last modified date’ which you can use to find new or changed data?  It can be a nightmare to find the records which you need to add to your data warehouse, particularly if you have a fact table which has records that can be changed after inserting – you have to do things like checksums to try to determine if you need to update an existing record.

Today it hit me that you can use replication to find out changes in a source.  The solution relies on replicating the source table to a new database, and marking the records in the subscriber with a modification date.  Transactional replication is pretty low cost on the publishing database, and if the source database is pretty busy or runs 24×7, it can be quite a good idea to replicate it out anyway, and use the subscriber as the source for your data warehouse loads.  Works really well if you are loading during working hours – perhaps you have to do an intraday load to get near real-time data into your warehouse for example.

Here’s how it works.  Create a publication on the source database that includes the tables and columns you need in order to build your warehouse tables.  Do an initial synchronisation.  You now have a subscriber that’s being constantly kept up-to-date with the source.  Here’s a simple example of a source table.

Published table

It has a primary key of OrderNumber.  This equates to the primary key which you’d find in a real OLTP system.

On the subscribed copy of this table, add an extra column at the end of the table with a data type of ‘datetime’.  You’ll have to allow nulls because the records currently in the table (send from the publisher) won’t have this column.

Replicated table with modification tracking column
With this column in place, you can add a trigger to the table for insert and update to add the timestamp.  The code you’d need would be something like:

CREATE TRIGGER UpdateModifiedDate
   ON  Sales
   FOR INSERT,UPDATE
AS
BEGIN

    update Sales
    set ModifiedDate = current_timestamp
    where OrderNumber = (select OrderNumber from inserted)

END

Now when a record is added to the source system by any process, in the replicated subscriber, you will get something like this:

Replicated records after an insert

Updates are respected too:

Replicated records after an update

So now you can use the Modified Date column as part of your extract management process in SSIS.

Good things about this solution:

  • No changes on the client’s source database
  • Minimal overhead on the production system
  • Because the trigger is fired on a subscriber database, you aren’t adding a performance penalty to the OLTP system itself.
  • Can easily be extended – you could have another column on the subscribed table which indicates if the change was an insert or an update.  And you could track deletes the same way, by having a second table on the subscriber into which you record deleted order numbers using an ON DELETE trigger.

And a couple of potentially bad things about this solution:

  • On the initial synch you won’t have this column populated – but you could soon fix that with a quick UPDATE statement.
  • If the subscription is ever re-initialised you will lose your Modified Date column because the schema will be synched back to the publisher and you will have to re-add the column.  But you’d know that all the records pushed in by the re-initialisation were already in your data warehouse so you could ignore records that had a null in the column.
  • As with any transactional replication, generating the initial snapshot can be a rather painful process consuming a huge amount of server resources – on a big table you really need to do this in an outage window.

Let me know if this sounds useful but you need help implementing it.

Older Posts »

The Shocking Blue Green Theme Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.