Practical Business Intelligence

October 15, 2010

Handling and Tuning Really Large Analysis Services Cubes

Filed under: Analysis Services, Business Intelligence, Tuning — Duncan Sutcliffe @ 4:04 pm

Until recently, the largest SSAS cube I had worked with was around 500 GB – part of the BI solution at a big retailer.  But recently I built a cube that almost hit the terabyte mark – 911 gigabytes when I first deployed it against real data volumes.

Can Analysis Services even deal with this type of cube size?  What kind of hardware do you need in order to run it?  How do you tune a cube in the terabyte range? Just how quick can your query response times be?

In this case, the cube contains just under 14 billion (13,680,000,000 to be exact) fact table records.  These can be drilled down to a time level of minute.  The cube users need to be able to drill from year down to minute, and across dimensions of client, fact location and geography.  And we need to get ‘instant’ responses using PerformancePoint or Excel for free ad hoc analysis.  This all adds up to needing a very highly tuned cube.  It took about three days to reach a point where I was beginning to be happy with the cube’s performance.  The only way to approach it was to treat a very large cube as simply a regular cube with more data in it: follow a standard tuning procedure, then add more tuning if required.

So, the approach is:

  1. Sort the dimensions out.  When you have this many fact records, the cube needs dimensions set up as efficiently as possible.
  2. Remove many-to-many structures if possible; optimise them if you really can’t live without them. (This was probably the biggest single optimisation.)
  3. Sort the partitions out. Partition on an attribute (or set) that will often be used in queries at the level you are partitioning. Use the slice property. Don’t use any attribute that’s used in a many-to-many relationship as a partition slice.
  4. Aggregate efficiently and appropriately.

Dimension Optimisation

I think this is probably the foundation of cube tuning.  All the aggregations in the world won’t really solve your tuning problems if they are having to work with badly built dimensions.

Here are my rules:

  • Don’t include any attributes that won’t be needed just because they are in the data warehouse dimension.
    For example, things like addresses are almost never needed.  By including these you are simply increasing the number of cells your cube has to contain – and therefore the query space and the amount of work the engine has to do when loading the cube from disk.
  • If there are columns you really must include but that will rarely be queried and don’t form part of an attribute hierarchy, set AttributeHierarchyEnabled to False and make them a member property of the key attribute.  They will still be available in Excel or whatever for any users who desperately need them.
  • Keep every attribute’s key small.  You may need an attribute key with more than one column – if so, use the fewest columns you can.  For example, just because your date hierarchy goes Year –> Quarter –> Month, the key for the month attribute doesn’t need to include all three levels – just Year and Month would do.
    For text attributes with lots of different values, seriously consider creating a numeric key column for them in the source dimension (there’s a sketch of this after the list).
  • Design hierarchies properly and set up the attribute relationships correctly.  Always put attributes into hierarchies if you possibly can – it might look to the end user as though they get the same result by querying the attributes directly, but it is nowhere near as efficient.
  • If an attribute is used in a hierarchy, hide it as an attribute, to enforce the hierarchies you’ve designed on your users.  Even in a date dimension, don’t let the year/quarter/month/week/etc. attributes remain visible.  Put them into logical hierarchies and then hide the attribute.  Create ‘Month of Year’ type attributes and make those visible instead – and make sure you have both a numeric key and a text name for these types of attribute.
  • Make relationships ‘rigid’ if you can (dates will be a good candidate for example), but don’t compromise your ability to perform incremental refreshes just to do this.
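
As an illustration of the numeric-key point above, here is a minimal T-SQL sketch (hypothetical table and column names) that adds a compact integer key for a wide text attribute in the source dimension; the cube attribute then uses the integer as its KeyColumn and the text as its NameColumn.

-- Hypothetical example: give a high-cardinality text attribute a small integer key.
ALTER TABLE dbo.DimProduct ADD ProductDescriptionKey int NULL;
GO

UPDATE d
SET    d.ProductDescriptionKey = k.NewKey
FROM   dbo.DimProduct AS d
JOIN  (SELECT ProductDescription,
              DENSE_RANK() OVER (ORDER BY ProductDescription) AS NewKey
       FROM   dbo.DimProduct
       GROUP  BY ProductDescription) AS k
  ON   k.ProductDescription = d.ProductDescription;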

Here’s an extreme example.  When the cube was first put together, there was a dimension that looked like this:

Poorly structured dimension

Almost everything there is never going to be used in any type of query.  In fact only one attribute is needed so the dimension should look like this:

Optimised dimension

Not only have the majority of attributes been deleted, but the remaining one has a rigid relationship.

Many to Many Dimensions

Effectively, a many-to-many is to cube performance what Kryptonite is to Superman.  The additional work required each time the many-to-many dimension is queried is considerable, and the number of aggregations required to improve its performance will explode the size of your cube.  If you really must have a many-to-many, keep the intermediate measure group as small as possible by removing repeating rows (i.e. every row in it should be distinct).
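
As a sketch of that last point (hypothetical table names – substitute your own intermediate and staging tables), rebuild the intermediate measure group table with only the distinct key combinations before processing:

-- Keep the many-to-many intermediate measure group as small as possible:
-- one row per distinct combination of keys, nothing else.
TRUNCATE TABLE dbo.FactSalesReasonBridge;

INSERT INTO dbo.FactSalesReasonBridge (SalesOrderKey, SalesReasonKey)
SELECT DISTINCT SalesOrderKey, SalesReasonKey
FROM   dbo.StageSalesReason;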

One pitfall – if you use a many-to-many dimension as part of your partitioning strategy, you cannot set the slice property (it always produces an error on processing) of the partition.  Even worse, the engine won’t be able to determine which partitions contain the data it needs for query results (i.e. automatically slice) so every partition in the cube will get touched on every query.  These are probably facets of the same problem.

There’s a white paper on tuning many-to-many dimensions at http://blogs.msdn.com/b/sqlcat/archive/2007/12/21/new-best-practices-articles-published-analysis-services-many-to-many-dimensions-query-performance-optimization-techniques.aspx.  But my advice would be: avoid them at all costs on a cube this size, because you are going to have so much data to query that their performance penalty is going to be massive.

Partition Design

Every tip I’ve read on the Internet tells you that no partition should have more than twenty million records in it.  But as the screenshot below shows, my cube has eight or nine times that number per partition.

Partition Counts

Why not make the partitions smaller? Several reasons:

  • The measure group is already partitioned by month (a very common query unit), it’s just that each month has a lot of records. Partitioning by the next smallest time unit (day) would mean we’d need approx 30 times as many partitions which would be:
    • Difficult to manage
    • A lot of individual files for the cube to need to read data from
    • A problem for aggregation since the largest meaningful aggregate would be one day (remember aggregations are by partition)
  • Partitioning by another attribute isn’t practical because unlike dates, most other attributes would probably have a variable number of members (imagine partitioning by client for example) so you would have to dynamically add and remove partitions.
  • The cube partition scheme maps exactly to the data warehouse’s fact table partitioning scheme (there’s a sketch of this after the list).  So loading a partition gets a nice performance boost – the RDBMS just has to supply the entire contents of one database partition for each partition we process in the cube.  We actually have no indexes at all on the fact tables, which certainly speeds up ETL inserts.
  • Ultimately, the cube performs absolutely perfectly even with these very large partitions.
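
For what it’s worth, here is a rough T-SQL sketch (hypothetical names, only a few boundary values shown) of the kind of monthly fact table partitioning the cube partitions map onto:

-- One database partition per month; each cube partition's source query simply
-- selects one month, so processing a cube partition reads exactly one database partition.
CREATE PARTITION FUNCTION pfMonth (int)
AS RANGE RIGHT FOR VALUES (20100101, 20100201, 20100301, 20100401);

CREATE PARTITION SCHEME psMonth
AS PARTITION pfMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactActivity
(
    DateKey     int    NOT NULL,  -- yyyymmdd surrogate key
    ClientKey   int    NOT NULL,
    LocationKey int    NOT NULL,
    Quantity    bigint NOT NULL
) ON psMonth (DateKey);           -- left as a heap: no indexes, so ETL inserts stay fast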

With the partitions designed and the queries added, make sure you set the Slice property on every partition.  Ignore the common myth that you do not need to set it on MOLAP partitions – it takes a second to set it, it provides a nice safeguard against errors in your partitioning statements, and it guarantees the engine will hit just the required partitions at query time.

Aggregations

On a cube this large, you really will need very good aggregations – aggregations that are specifically designed to support exactly the queries being executed.  This is because there is so much granularity level data that any query which can’t use an aggregation is going to be incredibly slow.  Experience with this terabyte cube showed the best approach was:

  • Find one frequently used attribute and set its aggregation usage to Full. The best candidate here is Month because:
    • That matches the partition strategy
    • There are a relatively small number of members (only 60 if there are five years of data) so the aggregation designer should respect your setting
    • Month is a good attribute for almost all queries.  You usually query based on a date range (this month, all months in this year, etc.)
  • Every other attribute in every dimension is just left as Default.
  • Use BIDS Helper to update all the counts in the cube. In a cube this size, be prepared to leave it setting counts overnight!
  • Although the aggregation designer still won’t produce particularly useful aggregations, I ran it anyway on all the measure groups.  Deciding how many aggregations you want it to produce is a tightrope walk – too few and it’s pointless, too many and you can start seeing gigabytes of aggregations being designed.  The rules I followed eventually were:
    • Use the ‘Until I click stop’ setting
    • Watch the size and number of aggregations being generated.  Aim for under 150 aggregations, or less than three or four gigabytes.

Usage Based Optimisations

This is the real place you are going to win with aggregations.  Using the aggregation designer, or designing your own aggregations by hand (unless you really understand what you are doing), is just playing at it.

Once you’ve got your cube built, get query logging enabled straight away.  This is easy:

  • Create a database to hold the logs
  • Enable logging in the server’s properties – set the query log connection string and the name of the table that will hold the logs.

You want to use the logs in a structured way, so make sure you are doing this in development – don’t wait until the cube is in production and you start getting complaints from users about speed.  Users will be running all sorts of random queries which will be getting logged and affecting the optimisation recommendations.  That’s fine but at this stage you want to remain in control of everything that’s affecting your cube’s design.

With an empty query log (hint: it’s a database table so you can always TRUNCATE it to clear it between tuning runs, but SSAS will empty it for you each time you deploy a change to the cube), navigate to your dashboard and open every page in it.  Run each of your reports.  That is, do one of everything that will be a regular activity.  This will enable the optimisation wizard to put together a set of aggregations that support everything you have built.
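
The query log really is just a table (OlapQueryLog by default), so it is easy to check what has been captured and to clear it between tuning runs – something like this, assuming a hypothetical database called SSASQueryLog and the default table name:

USE SSASQueryLog;

-- See what the usage based optimisation wizard will be working from.
SELECT MSOLAP_Database, MSOLAP_ObjectPath, MSOLAP_User, Dataset, StartTime, Duration
FROM   dbo.OlapQueryLog
ORDER  BY StartTime DESC;

-- Start the next tuning run from a clean slate.
TRUNCATE TABLE dbo.OlapQueryLog;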

At this point, I ran the usage based optimisation wizard against each measure group, accepted its suggestions and merged them into the existing aggregation designs.

Over time, of course, this cube would have further aggregations added based on real-world usage by users.

Results

How quick is the cube after all this work?  The following video should give an idea:

Promise: no smoke and mirrors, frames cut out, or anything like that.

And the hardware?  This is on a DL680 (4 × quad-core) with 64 GB of memory.  Nothing special.

September 30, 2010

Kerberos in a Clustered and Load Balanced BI Deployment

As if it weren’t hard enough getting Kerberos delegation to work properly in your average business intelligence distributed architecture, there is another level of knowledge required when you’re building an environment that’s got lots of resilience built in.

This post covers the following scenarios:

  • Clustered SQL Server
  • Clustered Analysis Services
  • Scale out and load balanced Reporting Services
  • Reporting Services integrated with SharePoint

I’ve been building just such an architecture in a lab environment using SQL Server 2008 R2 and SharePoint 2010, but the steps below should be applicable to previous versions as well.

The steps involved are:

  • Install SQL and SSAS clusters and set up DNS
  • Install SSRS for scale out deployment and set up network load balancer DNS
  • Setting SPNs
  • Trusting computers and service accounts for delegation
  • Configuring Reporting Services
  • Configuring SharePoint
  • Configuring Reporting Services/SharePoint integration

SQL and SSAS Clusters

The good news is you really don’t need to do anything that you wouldn’t normally do when setting up a cluster for either of these two services.  Create the cluster, install the services on the nodes, set up MSDTC, then create a DNS name that points at the cluster address.

Make sure each node in the cluster uses the same service account. For example, create a domain account called svc-SQL and use it as the service account for SQL on every node of the SQL cluster.  This is required, because the SPN to be set up later can only use one account for the cluster as a whole.  Do the same for Analysis Services.

SSRS Cluster

Here again there’s nothing out of the ordinary; the order I’d do it in is:

  • Provision the servers, configure the NLB to load balance across them, and create a DNS entry for the RS service on the NLB
  • Install Reporting Services but don’t configure it
  • Install the SharePoint binaries on each server and join them to the farm (without making them a Web Front End)
  • Run the RS configuration tool on the first server, create a new database in SharePoint mode, then complete the rest of the configuration (report server web service, etc.)
  • Run the RS configuration on each other server, pointing them at the same report server database.
  • After configuring all the RS servers, add all the servers to the scale-out deployment using the appropriate tab on the configuration tool.  You only need to do this on one of the servers.

Again, you need to make sure the web service is running under the same account on all servers.

Service Principal Names

First real challenge.  There are some key points to bear in mind:

  • You create SPNs for the cluster name, not for the individual nodes in the cluster. Don’t create SPNs for the individual nodes as this seems to break Kerberos.
  • You must specify the port number, even if you are using the default ports for SQL Server and Analysis Services.
  • You must create SPNs using the fully qualified domain name of the servers, but creating them for the NetBIOS name seems to be optional – your mileage may vary on this.

The SPNs you will need are for:

  • SQL cluster
  • SSAS cluster
  • RS cluster
  • SharePoint Central Admin
  • SharePoint web sites
  • SharePoint applications such as PerformancePoint if you are using them

The commands will follow this template:

setspn -S MSSQLSvc/[SQL cluster].[domain]:1433 [service account]
setspn -S MSOLAPSvc.3/[SSAS cluster].[domain]:2383 [service account]
setspn -S HTTP/[RS NLB].[domain]:80 [service account]
setspn -S HTTP/[central admin server or NLB].[domain]:[port] [service account]
setspn -S HTTP/[web front end or NLB].[domain]:[port] [service account]

Let’s build some examples of that.  In this scenario, the SQL cluster is called SQL, the SSAS cluster is called AS, the SSRS load balancer is called RSNLB, the SharePoint Central Admin box is SP01, and the SharePoint web front end load balancer is called WFENLB. The domain is mycompany.local and all services are on their default port, other than Central Admin which is on port 8080.  You’d need to run the following commands.

setspn -S MSSQLSvc/SQL.mycompany.local:1433 svc-SQL
setspn -S MSOLAPSvc.3/AS.mycompany.local:2383 svc-AS
setspn -S HTTP/RSNLB.mycompany.local:80 svc-RS
setspn -S HTTP/SP01.mycompany.local:8080 svc-SPFarm
setspn -S HTTP/WFENLB.mycompany.local:80 svc-SPPortal

Something to bear in mind: if you have created an alternative URL for your SharePoint site (such as http://intranet), you need to create the SPN for that name instead of the WFE load balancer name.

Trusting for Delegation

On the DC, ensure that each of the following are trusted for Kerberos delegation:

  • SharePoint Central Admin server
  • SharePoint web servers
  • Reporting Services servers
  • SharePoint Central Admin web application pool account
  • SharePoint web site application pool account
  • Reporting Services web app service account

You can’t trust accounts until an SPN has been set up – the ‘Delegation’ tab doesn’t even appear in their account’s properties – so you have to complete this step after the SPNs have been created.

You might find you need to trust the SQL and SSAS boxes and service accounts.  You also might find you need to trust the virtual network names for their clusters.

Configure Reporting Services

On each of the SSRS servers, locate the file rsreportserver.config and edit it:

  • Find the element <AuthenticationTypes> and ensure it contains the element <RSWindowsNegotiate/>.  Remove the element <RSWindowsNTLM/> if it exists.
  • Find the element <UrlRoot> and insert the web address of the SharePoint site (in the example above the correct value would be http://WFENLB or http://intranet).
  • Find the element <Hostname> and add the name of the SSRS cluster (in the example above the correct value would be RSNLB).  The <Hostname> element may not exist, so you will need to create it – it goes inside the <Service> element, the same as the <UrlRoot> element.  You would create a line such as <Hostname>RSNLB</Hostname>.

You must repeat this for every SSRS server, otherwise only some of the servers will actually work and your users will get ‘intermittent’ errors.

SharePoint

A trip to Central Admin should show that the RS servers have already been added to the farm.  Make sure Central Admin and all other site collections are set to use Windows Authentication in all the zones in which you intend to use SSRS.

SharePoint and Reporting Services Integration

SharePoint 2010 automatically installs the SSRS add-in as part of the pre-requisites.  For MOSS 2007, download the latest version of the add-in and ensure your entire SharePoint farm is service-packed.

Integration is then ready to be configured from Central Admin. In SP 2010 the config panel is at General Application Settings –> Reporting Services Integration.  The values you need are:

  • Report Server Web Service URL: the address of the SSRS web service as accessed through the network load balancer, for example http://RSNLB/reportserver
  • Authentication Mode: Windows Authentication
  • Credentials: The username and password of a domain account which is in the local Administrators group on the SSRS servers.  If there isn’t a single account that’s in this group on every one of your SSRS boxes, use an account which is in the Administrators group on the first SSRS box in the deployment.  You need to specify the username in the format [domain]\[user]

After the first server is integrated, you are given the chance to add each of the other SSRS servers in the scale-out deployment to the integration.  You must complete this for each server you have in the deployment.

Finally, you can click ‘Set server defaults’ to change any of the properties of the integration you want to alter.

Points to Bear in Mind

Not working?  Here’s what I check in my setup:

  • Kerberos needs port 88 open over both TCP and UDP in order to work.  Is your firewall blocking that port?  Try disabling the Windows firewall to check.
  • Are the clients using Internet Explorer 6 or above?  Is the browser seeing that the site is in the local intranet zone? Does the zone allow integrated Windows authentication?
  • Go back over the SPNs.  One mistake in your setspn command will have broken Kerberos completely.
  • Check all the accounts and computers are trusted.
  • Try navigating directly to the SSRS web service (i.e. not by opening a report through SharePoint’s interface).  If you can run a report through http://rsnlb/reportserver but get an error message when doing it through SharePoint, the probable cause is a SharePoint SPN or delegation permission, not Reporting Services.

June 11, 2010

A Better Date Dimension – Modelling in Analysis Services

Filed under: Analysis Services, Date Dimension, Reporting Services — Duncan Sutcliffe @ 6:33 pm

Hopefully you are continuing from part one of this post.

Adding date dimension tables to DSV

If you have run the script to create and populate your date, calendar and bridge tables, you can go ahead and add the tables to your Analysis Services project’s data source view.  The tables all have foreign keys so just add them and the DSV will sort out the relationships.

Date tables in DSV

Next, create two dimensions – one based on the date dimension and one based on the calendar dimension.  Set the Type of the calendar dimension to ‘Time’ and its ErrorConfiguration to ‘Default’.  On the calendar dimension create these attributes with the associated properties:

  • Calendar Key
    Name: ‘Date’
    Type: Calendar –> Date
    NameColumn: ActualDate
  • Calendar Name
    DefaultMember: [Calendar].[Calendar Name].&[1]
    IsAggregatable: False
    KeyColumns: CalendarOrder
  • Date Name
    Type: Calendar –> Date
  • Year
    Type: Calendar –> Years
    NameColumn: YearName
  • Quarter
    Type: Calendar –> Quarters
    KeyColumns: Quarter and Year
    NameColumn: QuarterName
  • Month
    Type: Calendar –> Months
    KeyColumns: Month, Quarter and Year
    NameColumn: MonthName

You might well want to use some of the other columns in the calendar table such as month of year.  My date dimensions are usually very rich so fill your boots and add whatever you think will help your end users.

Next, create a hierarchy of Calendar Name –> Year –> Quarter –> Month –> Date and set up the attribute relationships.

Calendar Hierarchy

Calendar Attribute Relationships

At this point you should be able to process the dimension.  You can then browse it to get something like this:

Multiple Calendars in Hierarchy

As you can see, the definition of what constitutes ‘2008’ differs based on the calendar you expand.

Bridging the Many-to-Many

Create a new measure group in your cube using the bridge table as its source.  A count measure will be added automatically.  This is required but useless so set its Visible property to False.

On the Dimension Usage tab, make sure the new measure group is connected to the Date and Calendar dimensions (the relationships in the DSV will do this properly when you add the dimensions to this tab).

Any measure groups that need to join to the calendar dimension must have a regular join to the date dimension and a many-to-many join to the calendar dimension through the bridging table measure group.

Joining many-to-many relationship

You can now process and test the cube.

Here is an example with a very simple sales measure group, showing two calendars being browsed:

Multiple Calendar Browsing

The TOTAL, and the total across the months, is the same – 75 – for each calendar.  This is because the calendar name attribute is set to be non-aggregatable.  But the months belong to different quarters.  For example, April is in Q2 in the Gregorian calendar, but it is the first month of Q1 in a fiscal calendar which starts on April 1.

You can also use the calendar name as a parameter in a Reporting Services report based on the cube – just make sure the parameter is set to allow single values only.

What if you also create reports from the relational database?  You now have to join to two extra tables every time you want to include a date, don’t you?  Well, no, you don’t, and I will explain why in my next post.

A Better Date Dimension

Filed under: Analysis Services, Date Dimension — Duncan Sutcliffe @ 5:04 pm

This requirement has come up a couple of times but I have only just thought of a solution.  How can your date dimension support lots of different calendars (standard January – December Gregorian, fiscal calendars starting in various months, a manufacturing calendar, etc.) without the table getting immensely wide?  And how can you make the calendar in use user-selectable at report run time?

For example, suppose you want to create a report that can be sent to many different customers, containing their activity with you, but each customer wants the data broken down by their own fiscal calendar.  You need to have multiple calendars and choose the one you want to categorise your facts by at run time rather than design time.  You want to be able to give the users a parameter like this:

Selecting a Fiscal Calendar in a Report

The standard way of having different calendars is to add more columns to the date dimension but this quickly becomes impractical – and it doesn’t solve the problem of letting a user choose which calendar to base a report on.

Well, how about making the relationship between facts and dates a many-to-many, so that each fact exists in lots of different calendars?  It’s a model that is supported in Analysis Services and fits in well with SQL.  It’s just a question of implementing it.

How to go about this?  Three tables are needed.

  • A Date dimension which the fact tables can join to.  This can be extremely simple – you only really need a date column and a surrogate key column. (Well, you could actually get away without the date column but let’s keep it to stay simple when assigning keys.)
    The fact tables join to it with a regular many-to-one relationship.  It behaves like a normal date dimension: there’s one record for each day.
  • A Calendars dimension.  This will have a record for each date – for each calendar you want to build.
    The calendars dimension contains all the attributes for each date that would normally go into the date dimension.
  • A bridging table between the two.

What we will end up with is more rows in our calendar dimension, not more columns.
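
To make the shape concrete, here is a minimal sketch of the three tables (hypothetical names, with only a handful of the calendar columns); the script mentioned below creates something much richer:

CREATE TABLE dbo.DimDate
(
    DateKey    int  NOT NULL PRIMARY KEY,  -- e.g. 20100615
    ActualDate date NOT NULL
);

CREATE TABLE dbo.DimCalendar
(
    CalendarKey  int         NOT NULL PRIMARY KEY,  -- one row per date, per calendar
    CalendarName varchar(50) NOT NULL,              -- 'Gregorian', 'Fiscal April-March', ...
    ActualDate   date        NOT NULL,
    [Year]       int         NOT NULL,
    [Quarter]    int         NOT NULL,
    [Month]      int         NOT NULL
);

CREATE TABLE dbo.BridgeDateCalendar
(
    DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
    CalendarKey int NOT NULL REFERENCES dbo.DimCalendar (CalendarKey),
    PRIMARY KEY (DateKey, CalendarKey)
);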

The best way to understand is, of course, to see an example.  So here is a script which creates all the tables, sets relationships between them and populates all three.  The script creates a detailed and fully populated date dimension model.

Next post: setting up the Analysis Services model.

June 4, 2010

Memory Settings on SQL Server and Analysis Services

Filed under: Analysis Services, Configuration, SQL Server — Duncan Sutcliffe @ 3:54 pm

I’ve recently been working in a development environment in which database services and Analysis Services share the same server.  We’re in the fortunate position of having a hugely powerful machine to run the services on (16 cores and 128 gigabytes of RAM), but it got me thinking about the different ways the two services manage memory, and how you can configure each of them to leave your server responsive under all circumstances.

First, a quick summary of the available options.

SQL Server

The two settings are Minimum Server Memory and Maximum Server Memory.

  • Minimum Server Memory is the least amount of memory SQL Server will use.  In effect, that memory will be reserved for its use.  The default value is zero.
  • Maximum Server Memory is the highest amount of memory SQL Server will use.  The default value is an unfeasibly high number.

Analysis Services

Memory is handled very differently in Analysis Services (predictably!).  Instead of setting usage values as absolute sizes, there are “percentage of total memory” parameters which control how Analysis Services works with memory.

  • LowMemoryLimit.  Analysis Services will consume as much memory as it needs until it has reached this point.  When it uses memory above this limit, it will begin to clean up its memory.  The default varies between versions but is either 65% or 75%.
  • TotalMemoryLimit.  When Analysis Services’s memory usage reaches this point, it starts to clean up memory much more aggressively.  The default value is 80%.

Points to Note

  • If SQL Server’s minimum server memory is set too low, other applications can snatch so much memory that SQL Server will become unresponsive.
  • If SQL Server’s maximum server memory is set too high, it can take more memory than the operating system and other applications can afford to give it – making other applications or the whole server unresponsive.

So – set these values after careful consideration.  Make sure you leave enough memory available for the server to accept inbound remote desktop connections in case of emergencies.  Make sure you leave enough memory available for any other services that run on the machine.  On the other hand, make sure you set the minimum memory high enough that SQL Server will continue to perform adequately under all the expected load you’re going to put on it.

On a dedicated SQL server, it’s common for minimum and maximum to be set to the same value – generally almost all the server’s total physical memory leaving just enough for the OS to stay upright when SQL’s under really heavy load.  But if the box is shared, you’ll probably want to have different values here – a minimum which is the least you can work with and a maximum which the server can grab if it really needs to.

  • If Analysis Services’s LowMemoryLimit is set too low, it will start clearing down memory before it needs to.  This will be a performance penalty.
  • If Analysis Services’s TotalMemoryLimit is set too low, it will start very aggressively clearing memory before it needs to.  This is a very serious performance penalty.
  • If Analysis Services’s TotalMemoryLimit is set too high, it will fail to clear down memory leading to unresponsiveness in the server or other applications.  Analysis Services doesn’t take any notice of the memory required by other applications on the machine, it just happily keeps on grabbing more memory within these limits.
  • If the values are set too close together, SSAS doesn’t spend enough time gently disposing memory before it has to start aggressively clearing it.

An Example

To illustrate, imagine a server with 16 gigabytes of memory, running SQL and Analysis Services, where you want to allocate memory relatively equally between them.  To be absolutely certain that not all the memory gets used, you could set:

  • SQL Server minimum to 6 gigabytes and maximum to 8 gigabytes.  So SQL can take half the total memory if it needs to and it’s available, but it can release a couple of gigabytes if required.
  • Analysis Services total memory to 60% (9.6 gigabytes) and low memory to 45%.

This means the most the two services should use between them is 15.6 gigabytes – SQL Server held at its 6 gigabyte minimum while Analysis Services is at its 9.6 gigabyte ceiling – even when SQL and SSAS are doing their hardest work.  Analysis Services has a 15% (2.4 gigabyte) band in which it will try to clear memory gently before it’s forced to get more aggressive.  Of course, 0.4 gigabytes isn’t a lot to leave for the OS to keep running, so I’d be keeping a very careful eye on what was going on if I were stressing the server.
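
For reference, the SQL Server half of that example is set with sp_configure (values in megabytes); the Analysis Services LowMemoryLimit and TotalMemoryLimit are changed in the server properties in Management Studio rather than through T-SQL:

-- Both memory options are advanced options, so expose them first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'min server memory (MB)', 6144;  -- 6 GB floor
EXEC sp_configure 'max server memory (MB)', 8192;  -- 8 GB ceiling
RECONFIGURE;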
