Practical Business Intelligence

May 23, 2011

Configuring Constrained Delegation in SharePoint 2010

Filed under: Business Intelligence, Kerberos, SharePoint — Duncan Sutcliffe @ 1:58 pm

I’ve been doing a *lot* of environment build and configuration work recently, specifically focusing on authentication and the right way to configure Kerberos so that all the business intelligence components of a SharePoint deployment can pass user credentials between themselves and down to data sources.  One of the key things for SharePoint 2010 is that you will need to configure constrained delegation, whereas in earlier versions you could get away with unconstrained delegation.

There will be a proper series of posts about Kerberos coming along (it’s a big subject with a lot of words to write – I teach a two day course if you are really interested), but in the interim here’s a pictorial guide to the delegation you will need to set up.

Constrained Delegation Pathways Map

The key points are:

  • Web site account delegates to itself, to any other web site accounts and to SSRS account over HTTP
  • SSRS account delegates to all report data sources’ accounts over MSOLAPSvc.3 and MSSQLSvc
  • Claims to Windows Token service account delegates to all PerformancePoint, Excel, Visio and InfoPath services data sources’ accounts over MSOLAPSvc.3 and MSSQLSvc
  • PerformancePoint, Excel, Visio and InfoPath services all delegate to all their required data sources’ accounts over MSOLAPSvc.3 and MSSQLSvc

January 13, 2011

Hiding Subreports from the List of Available Report Documents in SharePoint

Filed under: Business Intelligence, Reporting Services, SharePoint, SharePoint Integration — Duncan Sutcliffe @ 9:11 am

Does the SharePoint library which contains your SSRS reports look a bit like this?

SharePoint list with reports and subreports

The user is asked to look at a list of RDL files and guess which one is the top level report, and which are subreports which are only intended to be consumed inside a ‘container’ report or are used as the target of hyperlink actions inside another report.  But it’s easy to hide reports which are not intended for users to run individually by making use of SharePoint’s View features.  This example has screenshots from SharePoint 2010 but you can also do the same thing in MOSS 2007.

The goal is not to physically move the subreports, but to make them hidden from the list a user is presented with by categorising them in a managed way.  So, from the Library ribbon, click “Create Column”.

SharePoint list create column button

In the next dialog, fill in the following values:

  • Column Name: "Report Type"
  • Type of information: "Choice"
  • Require that this column contains information: "Yes"
  • Enforce unique values: "No"
  • Choices: "Report", "Subreport"
  • Display choices using: "Drop-Down Menu" or "Radio Buttons"
  • Allow ‘Fill-in’ choices: "No"
  • Default value: "Choice", "Report"
  • Add to default view: Checked

You should end up with a page something like this:

Add column to categorise as report or subreport

After you create the column, you will be returned to the library. As long as you checked the “Add to default view” box when you created the column, you should now have an additional column headed “Report Type”, with no values for any of the reports.  So, edit the properties of each report and set the value for the new “Report Type” property to either “Report” or “Subreport”.

Add report type property to the document in the SharePoint list

Once you have completed categorising your reports and subreports, create a new view by clicking the “Create View” button in the library ribbon.

SharePoint list create view button

Select “Standard view”, then create the view with the following settings:

  • View name: "Reports only"
  • Make this the default view: Checked
  • View audience: "Create a public view"

You can add in any additional columns or sorts as you wish, then in the “Filter” section, enter the values you need:

Add filter for reports only

After creating the view, you will be taken back to the report library but now you are only seeing the top level reports.

Library with reports only shown

December 7, 2010

Star Schema Modelling Errors–Dimensions with the Same Cardinality as the Fact Table

Filed under: Business Intelligence, Data Modelling — Duncan Sutcliffe @ 12:00 pm

What’s wrong with this data mart model?

image

The sales fact table has an associated dimension which describes the sale record.  There is a surrogate key relationship between the fact and dimension tables but there’s also the business key (SaleNumber) from the source system.

Why is this bad?

  • There will be a one-to-one relationship between the fact table and the dimension, which means the dimension is potentially going to be huge.  Joining between fact and this dimension will be slow at query time.
  • The dimension is very poorly built for Analysis Services.  If Status only has three possible values but those three are being repeated on every line instead of normalised into their own dimension, SSAS build times will be bigger than they should be.
  • If you have facts that get updated in the source, you are going to have to manage this in two places in the data warehouse.  A type 2 SCD with millions of rows in it like this one is going to be painfully slow to populate on your ETL runs.
  • Some of the attributes should have their own dimensions – status, payment method, order method, carrier – or maybe a junk dimension.  This will compress the dimension to the unique values of each of these fields rather than making it as big as the fact table itself.
  • The CarriageWeight field is meaningless as an attribute because no-one would ever want to analyse orders by their carriage weight.  So it should be included on the fact table, though in an SSAS cube its AggregationType would be set to None.
  • Delivery address fields could go in several places:
    • Into a delivery address dimension
    • Onto the fact table if they aren’t used for analysis purposes
    • Into the customer dimension if each customer only has one delivery address

So – dimensions which describe facts in a one-to-one way should be avoided.

December 5, 2010

Extracting Source Records by Time Window

Filed under: Business Intelligence, ETL, SSIS — Duncan Sutcliffe @ 2:32 pm

More and more source systems seem to include 24 hour operation – for example retailers who might once have worked nine to five now accept orders via the Internet.  At the same time, data warehouses are getting closer to ‘real time’ information provision by running load procedures more than once per day.

So it’s more important than ever to make sure you set a defined time period for which your ETL is loading records, with a start and end time.  This period needs to be respected for every extract in a run, so for example you need to make sure you extract order headers and order lines which cover exactly the same time period.

Here’s the simplest but most reliable way I have been able to find to do this.

The first step in the SSIS defines a new load and sets the date/time range which the load covers.  Subsequent steps use these time periods when selecting source records.

Simplified SSIS package showing setting a time period range with an Execute SQL task(In a “real” solution I’d expect the first step to be done in your ‘master’ package and the results passed to subsequent child packages.)

The Execute SQL step does two things: creates a new load and sets the time window it is working with, then places the information in package variables.  The variables required are a load ID, a start date/time and a finish date/time:

image

In the meta-database which logs and controls the ETL runs, there is a simple table which mirrors these three values.

CREATE TABLE [dbo].[LoadWindowControl]( [LoadID] [int] identity NOT NULL, [LoadFrom] [datetime] NOT NULL, [LoadTo] [datetime] NOT NULL ) ON [PRIMARY]; GO

The SQL task calls a stored procedure:

CREATE PROCEDURE [dbo].[uspSetLoadWindow] @LoadID int output, @LoadFrom datetime output, @LoadTo datetime output AS BEGIN SET NOCOUNT ON; declare @StartTime datetime; select @StartTime = dateadd(s, 1, coalesce(max(LoadTo), 0)) from LoadWindowControl; declare @EndTime datetime; set @EndTime = convert(datetime, convert(varchar(30), CURRENT_TIMESTAMP, 120)); insert into LoadWindowControl (LoadFrom, LoadTo) values (@StartTime, @EndTime) select @LoadID = @@IDENTITY, @LoadFrom = @StartTime, @LoadTo = @EndTime END; GO

This is very simple – just adding a new record into the list of loads with an end date of “now” and a start date of the last end plus one second.  All three of these values are passed back as output parameters so the SSIS package can put them into variables.  Calling the proc from SSIS is easy:

image

And the parameter mapping inserts the right value to the right variable:

image

The first time this package runs, the table gets a record like this:

image

To check these values are being passed into the SSIS variables, here’s a quick look at the Locals window for anyone confused by these dates, remember I’m in the UK so the format is dd/mm/yyyy):

image

With these parameter values now populated them you can use them as part of the SQL command in each source.  For example:

image

image

For each package you can pass these variables through as configurations, or you could just pass through the load ID and retrieve the related dates using an Execute SQL task.

October 15, 2010

Handling and Tuning Really Large Analysis Services Cubes

Filed under: Analysis Services, Business Intelligence, Tuning — Duncan Sutcliffe @ 4:04 pm

Until recently, the largest SSAS cube I had worked with was around 500gb – part of the BI solution at a big retailer.  But recently I built a cube that almost hit the terabyte mark – 911 gigabytes when I first deployed it against real data volumes.

Can Analysis Services even deal with this type of cube size?  What kind of hardware do you need in order to run it?  How do you tune a cube in the terabyte range? Just how quick can your query response times be?

In this case, the cube contains just under 14 billion (13, 680, 000, 000 to be exact) fact table records.  These can be drilled down to a time level of minute.  The cube users need to be able to drill from year down to minute, and across dimensions of client, fact location and geography.  And we need to be getting ‘instant’ response using PerformancePoint or Excel for free ad hoc analysis.  This all adds up to needing a very highly tuned cube.  It took about three days to reach a point I was beginning to be happy with the performance of the cube.  The only way to approach it was to think of a very large cube as simply a regular cube but with more data in it and follow a standard tuning procedure, then add more if required.

So, the approach is:

  1. Sort the dimensions out.  When you have this many fact records, the cube needs dimensions set up as efficiently as possible.
  2. Remove many-to-many structures if possible; optimise them if you really can’t live without them. (This was probably the biggest single optimisation.)
  3. Sort the partitions out. Partition on an attribute (or set) that will often be used in queries at the level you are partitioning. Use the slice property. Don’t use any attribute that’s used in a many-to-many relationship as a partition slice.
  4. Aggregate efficiently and appropriately.

Dimension Optimisation

I think this is probably the foundation of cube tuning.  All the aggregations in the world won’t really solve your tuning problems if they are having to work with badly built dimensions.

Here are my rules:

  • Don’t include any attributes that won’t be needed just because they are in the data warehouse dimension.
    For example, things like addresses are almost never needed.  By including these you are simply increasing the number of cells your cube has to contain – therefore the query space and the amount of work the engine has to do when loading up the cube from disk.
  • If there are columns you really must include but will be rarely queried and don’t form part of an attribute hierarchy, turn off Attribute Hierarchy Enabled and make them a property of the key.  They will still be available in Excel or whatever for any users that desperately need them.
  • Keep every attribute’s key small.  You may need an attribute key which has more than one column – if so use the minimum columns you can.  For example, just because your date hierarchy goes Year –> Quarter –> Month, the key for the month attribute doesn’t need to do the same.  Just Year and Month would do.
    For text attributes with lots of different values, seriously consider creating a numeric key column for them in the source dimension.
  • Design hierarchies properly and set up the attribute relationships correctly. Always put attributes into hierarchies if you possibly can – it might look to the end user that they get the right result by using attributes rather than hierarchies but it is nowhere near as efficient.
  • If an attribute is used in a hierarchy, hide it as an attribute.  Just to enforce the hierarchies you’ve designed on your users.  Even in a date dimension don’t let the year/quarter/month/week/etc attributes remain visible.  Put them into logical hierarchies and then hide the attribute.  Create ‘Month of Year’ type attributes and make those visible instead – and make sure you have both a numeric key and a text name for these types of attribute.
  • Make relationships ‘rigid’ if you can (dates will be a good candidate for example), but don’t compromise your ability to perform incremental refreshes just to do this.

Here’s an extreme example.  When the cube was first put together, there was a dimension that looked like this:

Poorly structured dimension

Almost everything there is never going to be used in any type of query.  In fact only one attribute is needed so the dimension should look like this:

Optimised dimension

Not only has the majority of attributes been deleted, but the remaining one has a rigid relationship.

Many to Many Dimensions

Effectively, a many-to-many is to cube performance what Kryptonite is to Superman. The additional amount of work required each time the many-to-many dimension is queried is considerable, and the amount of aggregations required to improve their performance will explode the size of your cube.  If you really must have a many-to-many, keep the intermediate measure group as small as possible by removing repeating rows (i.e. every row in it should be distinct).

One pitfall – if you use a many-to-many dimension as part of your partitioning strategy, you cannot set the slice property (it always produces an error on processing) of the partition.  Even worse, the engine won’t be able to determine which partitions contain the data it needs for query results (i.e. automatically slice) so every partition in the cube will get touched on every query.  These are probably facets of the same problem.

There’s a white paper on tuning many-to-many dimensions at http://blogs.msdn.com/b/sqlcat/archive/2007/12/21/new-best-practices-articles-published-analysis-services-many-to-many-dimensions-query-performance-optimization-techniques.aspx.  But my advice would be – avoid them at all costs on a cube this size because you are going to have so much data to query that their performance penalty is going to be be massive.

Partition Design

Every tip I’ve read on the Internet tells you that no partition should have more than twenty million records in it.  But as the screenshot below shows, my cube has eight or nine times that number per partition.

Partition Counts

Why not make the partitions smaller? Several reasons:

  • The measure group is already partitioned by month (a very common query unit), it’s just that each month has a lot of records. Partitioning by the next smallest time unit (day) would mean we’d need approx 30 times as many partitions which would be:
    • Difficult to manage
    • A lot of individual files for the cube to need to read data from
    • A problem for aggregation since the largest meaningful aggregate would be one day (remember aggregations are by partition)
  • Partitioning by another attribute isn’t practical because unlike dates, most other attributes would probably have a variable number of members (imagine partitioning by client for example) so you would have to dynamically add and remove partitions.
  • The cube partition scheme exactly maps to the data warehouse’s fact table partitioning scheme.  So loading a partition has a nice performance boost – the RDMBS just has to supply the entire contents of one database partition for each partition we process in the cube.  We actually have no indexes at all on the fact tables, which certainly speeds up ETL inserts.
  • Ultimately, the cube performs absolutely perfectly even with these very large partitions.

With the partitions designed and the queries added, make sure you set the Slice property on every partition.  Ignore the common myth that you do not need to set it on MOLAP partitions – it takes a second to set it, it provides a nice safeguard against errors in your partitioning statements, and it guarantees the engine will hit just the required partitions at query time.

Aggregations

On a cube this large, you really will need very good aggregations – aggregations that are specifically designed to support exactly the queries being executed.  This is because there is so much granularity level data that any query which can’t use an aggregation is going to be incredibly slow.  Experience with this terabyte cube showed the best approach was:

  • Find one frequently used attribute and set its aggregation usage to Full. The best candidate here is Month because:
    • That matches the partition strategy
    • There are a relatively small number of members (only 60 if there are five years’ of data) so the aggregation designer should respect your setting
    • Month is a good attribute for almost all queries.  You usually query based on a date range (this month, all months in this year, etc.)
  • Every other attribute in every dimension is just left as Default.
  • Use BIDS Helper to update all the counts in the cube. In a cube this size, be prepared to leave it setting counts overnight!
  • Although the aggregation designer still won’t produce particularly useful aggregations, I ran it anyway on all the measure groups.  Deciding how many aggregations you want it to produce is a tightrope walk – too few and it’s pointless, too many and you can start seeing gigabytes of aggregations being designed.  The rules I followed eventually were:
    • Use the ‘Until I click stop’ setting
    • Watch the size and number of aggregations being generated.  Aim for under 150 aggregations, or less than three or four gigabytes.

Usage Based Optimisations

This is the real place you are going to win with aggregations.  Using the designer, or designing your own (unless you really understand what you are doing) is just playing at it.

Once you’ve got your cube built, get query logging enabled straight away.  This is easy:

  • Create a database to hold the logs
  • Enable logging in the server’s properties and set the name of the table you will hold the logs in.

You want to use the logs in a structured way, so make sure you are doing this in development – don’t wait until the cube is in production and you start getting complaints from users about speed.  Users will be running all sorts of random queries which will be getting logged and affecting the optimisation recommendations.  That’s fine but at this stage you want to remain in control of everything that’s affecting your cube’s design.

With an empty query log (hint: it’s a database table so you can always TRUNCATE it to clear it between tuning runs, but SSAS will empty it for you each time you deploy a change to the cube), navigate to your dashboard and open every page in it. Run each of your reports. I.e. do one of every thing that will be a regular activity.  This will enable the optimisation wizard to put together a set of aggregations that support everything you have built.

At this point, I ran the usage based optimisation wizard against each measure group, accepted its suggestions and merged them into the existing aggregation designs.

Over time, of course, this cube would have further aggregations added based on real-world usage by users.

Results

How quick is the cube after all this work?  The following video should give an idea:

Promise: no smoke and mirrors, frames cut out, or anything like that.

And the hardware?  This is on a DL680 (4 x quad) with 64gb or memory.  Nothing special.

Older Posts »

Theme: Shocking Blue Green. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.