Until recently, the largest SSAS cube I had worked with was around 500GB – part of the BI solution at a big retailer. But the cube I have just built almost hit the terabyte mark – 911GB when I first deployed it against real data volumes.
Can Analysis Services even deal with this type of cube size? What kind of hardware do you need in order to run it? How do you tune a cube in the terabyte range? Just how quick can your query response times be?
In this case, the cube contains just under 14 billion (13,680,000,000 to be exact) fact table records. These can be drilled down to a time level of minute. The cube users need to be able to drill from year down to minute, and across dimensions of client, fact location and geography. And we need ‘instant’ responses in PerformancePoint or Excel for free ad hoc analysis. This all adds up to needing a very highly tuned cube. It took about three days to reach a point where I was beginning to be happy with the cube’s performance. The only workable approach was to treat a very large cube as simply a regular cube with more data in it: follow a standard tuning procedure, then add more tuning if required.
So, the approach is:
- Sort the dimensions out. When you have this many fact records, the cube needs dimensions set up as efficiently as possible.
- Remove many-to-many structures if possible; optimise them if you really can’t live without them. (This was probably the biggest single optimisation.)
- Sort the partitions out. Partition on an attribute (or set) that will often be used in queries at the level you are partitioning. Use the slice property. Don’t use any attribute that’s used in a many-to-many relationship as a partition slice.
- Aggregate efficiently and appropriately.
Dimension Design

I think this is probably the foundation of cube tuning. All the aggregations in the world won’t really solve your tuning problems if they have to work with badly built dimensions.
Here are my rules:
- Don’t include any attributes that won’t be needed just because they are in the data warehouse dimension.
For example, things like addresses are almost never needed. By including them you are simply increasing the number of cells your cube has to contain – and therefore the query space and the amount of work the engine has to do when loading the cube from disk.
- If there are columns you really must include but which will rarely be queried and don’t form part of an attribute hierarchy, turn off AttributeHierarchyEnabled and make them a property of the key attribute. They will still be available in Excel (or whatever) for any users who desperately need them.
- Keep every attribute’s key small. You may need an attribute key with more than one column – if so, use the minimum number of columns you can. For example, just because your date hierarchy goes Year –> Quarter –> Month, the key for the month attribute doesn’t need to do the same. Just Year and Month would do.
For text attributes with lots of different values, seriously consider creating a numeric key column for them in the source dimension.
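As a sketch of the numeric-key idea, assuming a hypothetical DimClient table with a free-text ClientName column (table and column names are illustrative, not from the original solution):

```sql
-- Hypothetical example: give a high-cardinality text attribute a compact
-- integer key so SSAS stores and compares 4-byte values instead of strings.
ALTER TABLE dbo.DimClient ADD ClientNameKey INT NULL;

-- Number the distinct names once, then reuse the same key for repeated values.
WITH NumberedNames AS (
    SELECT ClientName,
           ROW_NUMBER() OVER (ORDER BY ClientName) AS NameKey
    FROM   (SELECT DISTINCT ClientName FROM dbo.DimClient) d
)
UPDATE c
SET    c.ClientNameKey = n.NameKey
FROM   dbo.DimClient c
JOIN   NumberedNames n ON n.ClientName = c.ClientName;
```

In the cube, the attribute would then use ClientNameKey as its KeyColumn and ClientName as its NameColumn.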
- Design hierarchies properly and set up the attribute relationships correctly. Always put attributes into hierarchies if you possibly can – it might look to end users as though they get the right result by using attributes rather than hierarchies, but it is nowhere near as efficient.
- If an attribute is used in a hierarchy, hide it as an attribute, to enforce the hierarchies you’ve designed on your users. Even in a date dimension, don’t let the year/quarter/month/week/etc. attributes remain visible. Put them into logical hierarchies and then hide the attributes. Create ‘Month of Year’-type attributes and make those visible instead – and make sure these have both a numeric key and a text name.
- Make relationships ‘rigid’ if you can (dates will be a good candidate for example), but don’t compromise your ability to perform incremental refreshes just to do this.
Here’s an extreme example. When the cube was first put together, there was a dimension that looked like this:
Almost everything there is never going to be used in any type of query. In fact only one attribute is needed so the dimension should look like this:
Not only have the majority of attributes been deleted, but the remaining one has a rigid relationship.
Many to Many Dimensions
Effectively, a many-to-many is to cube performance what Kryptonite is to Superman. The additional work required each time the many-to-many dimension is queried is considerable, and the number of aggregations required to improve its performance will explode the size of your cube. If you really must have a many-to-many, keep the intermediate measure group as small as possible by removing repeating rows (i.e. every row in it should be distinct).
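One way to guarantee distinct rows is to point the intermediate measure group at a view rather than the raw bridge table. A minimal sketch, assuming a hypothetical bridge table between clients and client groups (names are illustrative):

```sql
-- Hypothetical bridge table FactClientGroupBridge may contain repeated
-- (ClientKey, ClientGroupKey) pairs; the view collapses them so the
-- many-to-many scan touches the smallest possible row set.
CREATE VIEW dbo.vFactClientGroupBridge
AS
SELECT DISTINCT
       ClientKey,
       ClientGroupKey
FROM   dbo.FactClientGroupBridge;
```

The intermediate measure group is then bound to dbo.vFactClientGroupBridge instead of the base table.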
One pitfall – if you use a many-to-many dimension as part of your partitioning strategy, you cannot set the partition’s slice property (it always produces an error on processing). Even worse, the engine won’t be able to determine which partitions contain the data it needs for query results (i.e. automatically slice), so every partition in the cube will get touched on every query. These are probably facets of the same problem.
There’s a white paper on tuning many-to-many dimensions at http://blogs.msdn.com/b/sqlcat/archive/2007/12/21/new-best-practices-articles-published-analysis-services-many-to-many-dimensions-query-performance-optimization-techniques.aspx. But my advice would be: avoid them at all costs on a cube this size, because you are going to have so much data to query that their performance penalty is going to be massive.
Partitions

Every tip I’ve read on the Internet tells you that no partition should have more than twenty million records in it. But as the screenshot below shows, my cube has eight or nine times that number per partition.
Why not make the partitions smaller? Several reasons:
- The measure group is already partitioned by month (a very common query unit); it’s just that each month has a lot of records. Partitioning by the next smallest time unit (day) would mean approximately 30 times as many partitions, which would be:
- Difficult to manage
- A lot of individual files for the cube to need to read data from
- A problem for aggregation since the largest meaningful aggregate would be one day (remember aggregations are by partition)
- Partitioning by another attribute isn’t practical because unlike dates, most other attributes would probably have a variable number of members (imagine partitioning by client for example) so you would have to dynamically add and remove partitions.
- The cube partition scheme maps exactly to the data warehouse’s fact table partitioning scheme. So loading a partition gets a nice performance boost – the RDBMS just has to supply the entire contents of one database partition for each partition we process in the cube. We actually have no indexes at all on the fact tables, which certainly speeds up ETL inserts.
- Ultimately, the cube performs absolutely perfectly even with these very large partitions.
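To illustrate how a monthly partition lines up with the warehouse partitioning, here is a sketch of what one partition’s source query might look like (the fact table name and the integer DateKey convention are assumptions, not the actual schema):

```sql
-- Hypothetical source query for a January 2011 cube partition.
-- The WHERE clause matches the warehouse's monthly partition boundaries,
-- so the RDBMS can satisfy it by handing back one whole table partition,
-- with no indexes needed on the fact table.
SELECT *
FROM   dbo.FactTransaction
WHERE  DateKey >= 20110101
  AND  DateKey <  20110201;
```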
With the partitions designed and the queries added, make sure you set the Slice property on every partition. Ignore the common myth that you do not need to set it on MOLAP partitions – it takes a second to set it, it provides a nice safeguard against errors in your partitioning statements, and it guarantees the engine will hit just the required partitions at query time.
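For reference, the Slice is just an MDX expression on the partition identifying the member(s) it contains. A sketch of the relevant fragment of an XMLA partition definition (IDs, names and the member key format are illustrative):

```xml
<!-- Illustrative partition fragment: the Slice names the single month
     this partition holds, so the engine can skip it for other months. -->
<Partition>
  <ID>FactTransaction_201101</ID>
  <Name>FactTransaction 2011-01</Name>
  <Slice>[Date].[Calendar].[Month].&amp;[2011]&amp;[1]</Slice>
</Partition>
```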
Aggregations

On a cube this large, you really will need very good aggregations – aggregations that are specifically designed to support exactly the queries being executed. There is so much granularity-level data that any query which can’t use an aggregation is going to be incredibly slow. Experience with this terabyte cube showed the best approach was:
- Find one frequently used attribute and set its aggregation usage to Full. The best candidate here is Month because:
- That matches the partition strategy
- There are a relatively small number of members (only 60 if there are five years of data), so the aggregation designer should respect your setting
- Month is a good attribute for almost all queries. You usually query based on a date range (this month, all months in this year, etc.)
- Every other attribute in every dimension is just left as Default.
- Use BIDS Helper to update all the counts in the cube. In a cube this size, be prepared to leave it setting counts overnight!
- Although the aggregation designer still won’t produce particularly useful aggregations, I ran it anyway on all the measure groups. Deciding how many aggregations to ask it for is a tightrope walk – too few and it’s pointless, too many and you can start seeing gigabytes of aggregations being designed. The rules I eventually followed were:
- Use the ‘Until I click stop’ setting
- Watch the size and number of aggregations being generated. Aim for under 150 aggregations, or less than three or four gigabytes.
Usage Based Optimisations
This is where you are really going to win with aggregations. Using the designer alone, or designing your own aggregations (unless you really understand what you are doing), is just playing at it.
Once you’ve got your cube built, get query logging enabled straight away. This is easy:
- Create a database to hold the logs
- Enable logging in the server’s properties and set the name of the table you will hold the logs in.
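For reference, these are the query log settings as they appear in the Analysis Server Properties dialog in SSMS; the connection string value here is illustrative, and you may want to drop the sample frequency to 1 while tuning so every query is captured:

```text
Log\QueryLog\CreateQueryLogTable      = true
Log\QueryLog\QueryLogConnectionString = (connection string to your logging database)
Log\QueryLog\QueryLogSampleFrequency  = 10    (logs 1 in every 10 queries by default)
Log\QueryLog\QueryLogTableName        = OlapQueryLog
```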
You want to use the logs in a structured way, so make sure you do this in development – don’t wait until the cube is in production and you start getting complaints from users about speed. By then, users will be running all sorts of random queries, which will be logged and will affect the optimisation recommendations. That’s fine later on, but at this stage you want to remain in control of everything that affects your cube’s design.
With an empty query log (hint: it’s a database table, so you can always TRUNCATE it to clear it between tuning runs – though SSAS will empty it for you each time you deploy a change to the cube), navigate to your dashboard and open every page in it. Run each of your reports. In other words, do one of everything that will be a regular activity. This enables the optimisation wizard to put together a set of aggregations that support everything you have built.
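It is worth eyeballing what was captured between runs. A quick sketch against the default log table (OlapQueryLog, with the columns SSAS creates for you; the schema prefix is an assumption):

```sql
-- Inspect the slowest captured queries before feeding them to the
-- usage based optimisation wizard.
SELECT TOP 20
       MSOLAP_User,
       Dataset,      -- the attribute-level subcube each query hit
       StartTime,
       Duration
FROM   dbo.OlapQueryLog
ORDER  BY Duration DESC;

-- Clear the log before the next structured tuning run:
-- TRUNCATE TABLE dbo.OlapQueryLog;
```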
At this point, I ran the usage based optimisation wizard against each measure group, accepted its suggestions and merged them into the existing aggregation designs.
Over time, of course, this cube would have further aggregations added based on real-world usage by users.
How quick is the cube after all this work? The following video should give an idea:
Promise: no smoke and mirrors, frames cut out, or anything like that.
And the hardware? This is on a DL680 (4 × quad-core) with 64GB of memory. Nothing special.