Friday, March 23, 2012

Pointers to the best documentation on star joins and the optimiser

Hi All,

We are just starting to do some testing on SQL Server EE with dimensional models.....we have had one or two problems we have been able to solve using the new performance dashboards etc.

However, as is inevitable, we are seeing strange behaviour of a query....in a star join it seems to be doing an eager spool and trying to spool the entire fact table to tempdb....hhmmm....

Rather than ask one question at a time, I'd prefer to read up on this properly.....we have DBAs who went to classes etc. at MSFT, and the client is some level of MSFT partner.

Could anyone point me to the best documentation for understanding the optimiser and how to influence it to get it to do the right thing in optimising plans for star joins?

Thanks

Peter

The star query optimization in SQL Server is an index-based optimization. The following resources cover the basics of query optimization for star queries, and also how to figure out whether the final query plan benefits from the star join optimization.

See Chapter 14 of the resource kit below and look for 'star queries'.

http://www.microsoft.com/technet/prodtechnol/sql/70/reskit/default.mspx?mfr=true

http://blogs.msdn.com/sqlqueryprocessing/archive/2007/04/09/how-to-check-whether-the-final-query-plan-is-optimized-for-star-join.aspx

In order to understand the spool issue further, if possible, please post the statistics XML you got from the query plan with the eager spool. Usually that is caused by an underestimation of the cardinalities: the optimizer believes that storing the data in tempdb and reusing it later is the cheaper solution.
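If it helps, a quick way to capture the statistics XML is to turn it on for the session before running the problem query (the table and column names below are just placeholders for your own):

SET STATISTICS XML ON;

SELECT d.calendar_month, SUM(f.sales_amount)
FROM FactSales AS f
JOIN DimDate AS d ON d.date_key = f.date_key
GROUP BY d.calendar_month;

SET STATISTICS XML OFF;

The plan XML returned alongside the results will show the Eager Spool operator and the estimated versus actual row counts feeding it.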

|||

Peter,

Also check out this discussion of index design for star schemas:

http://blogs.msdn.com/bi_systems/pages/164502.aspx.

Also, general advice about statistics and indexed views applies:

http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx

If you use partitioning, it's usually best to put an explicit date filter predicate on the partitioning column of the fact table, rather than rely only on filtering the fact table on date via a join with the date dimension. That way the optimizer can eliminate partitions at compile time.
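For example, something along these lines (table names, column names, and the date key format are only illustrative):

SELECT SUM(f.sales_amount)
FROM FactSales AS f
JOIN DimDate AS d ON d.date_key = f.date_key
WHERE d.calendar_year = 2006
AND f.date_key BETWEEN 20060101 AND 20061231;  -- explicit predicate on the partitioning column

The second predicate is logically redundant with the filter on the date dimension, but it is the one that lets the optimizer eliminate partitions at compile time.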

Finally, if you are getting poor plans due to bad cardinality estimates, consider splitting your query up -- filter the dimensions into temp tables, then join the temp tables to the fact table in a final query to produce the result.
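A rough sketch of that approach, with made-up table names:

-- materialize the filtered dimension keys first...
SELECT d.date_key
INTO #dates
FROM DimDate AS d
WHERE d.calendar_year = 2006;

SELECT c.customer_key
INTO #customers
FROM DimCustomer AS c
WHERE c.region = 'EMEA';

-- ...then join only the temp tables to the fact table; by this point the
-- optimizer has accurate row counts for the filtered key sets.
SELECT SUM(f.sales_amount)
FROM FactSales AS f
JOIN #dates AS d ON d.date_key = f.date_key
JOIN #customers AS c ON c.customer_key = f.customer_key;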

Another trick that sometimes helps is to replace the joins on the fact table with IN-lists, e.g.

select ..., sum(measure)
from Fact F
where F.dim_key1 in (1, 7, 11, ... 22)
and F.dim_key2 in (10990, 9887, ...)
group by ....

This sometimes lets the optimizer get better estimates for the selectivity of the filters induced by the joins with dimension tables.

Query hints, join hints, and index hints can also be used to tune your plan.
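For instance (again with made-up names; the specific hints chosen here are illustrative rather than recommendations):

SELECT SUM(f.sales_amount)
FROM FactSales AS f WITH (INDEX (ix_FactSales_date_key))  -- index hint
JOIN DimDate AS d ON d.date_key = f.date_key
WHERE d.calendar_year = 2006
OPTION (HASH JOIN, MAXDOP 4);  -- query hints

A join hint such as INNER HASH JOIN on an individual join can be used instead of the OPTION clause if you only want to influence one join.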

Eric

|||

Hi All,

we also found a good article here.. http://blogs.msdn.com/sqlcat/archive/2006/02/17/Partition-Elimination-in-SQL-Server-2005.aspx

We have implemented some tests for partitions and found that we can get the optimiser to go after the correct partitions. We have still not influenced the optimiser to do the star joins as cartesian products, and that may be because, from the stats point of view, the optimiser believes that the scan is going to cost less.

I did have a question about partitions....

I was thinking...or hoping...that it would be possible to specify two columns in the partitioning scheme. For reasons I won't go into, I want to be able to house many fact tables in a single table and partition by fact table number, and then also partition by the day dimension key to slice the data up into periods based on calendar months. Of course, the usual approach is to have a separate table for each fact table; then you only need one partitioning key and the table can be clustered on the day dimension id.

I am considering clustering on the fact table number and the day dimension id and rather hoping that SQL Server can figure this out...

But from my reading of the manual it looks like a single column is all I get for partitioning....and I am reluctant to make the partitioning key a calculated column because I figure that has got to make for a slower query.....though it is only a hunch....for example I could make the partitioning number the fact table number * 1,000,000 plus the day dim id....because I am not going to be here in 1M days to worry about key clashes...LOL!
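To illustrate what I have in mind, something like the following, using a persisted computed column as the single partitioning column (all names and boundary values are just for illustration):

CREATE PARTITION FUNCTION pf_fact_multi (int)
AS RANGE RIGHT FOR VALUES (1000001, 1000201, 2000001, 2000201);  -- fact_no * 1,000,000 + day_dim_id boundaries

CREATE PARTITION SCHEME ps_fact_multi
AS PARTITION pf_fact_multi ALL TO ([PRIMARY]);

CREATE TABLE FactCombined
(
    fact_table_no int   NOT NULL,
    day_dim_id    int   NOT NULL,
    measure       money NOT NULL,
    -- persisted computed column used as the single partitioning column
    part_key AS (fact_table_no * 1000000 + day_dim_id) PERSISTED NOT NULL
) ON ps_fact_multi (part_key);

That is the calculated-column route I am reluctant about, but it shows the shape of the multi-part key.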

Does anyone else have any insight on how this might be done in SQL Server?

Much appreciated.

Peter Nolan

www.peternolan.com

|||

Hi All,

the link

http://blogs.msdn.com/bi_systems/pages/164502.aspx.

now seems not to lead anywhere, and I recall some good tips there...I googled it to see if I could find the original content, but no luck...

Does anyone know why this link now says 'the forum you requested does not exist', and where the post went?

Thanks in Advance

Peter

|||

Hi All,

We have been doing some pretty extensive testing on star joins and I thought I would share what we can publicly here......those who want to contact me directly are welcome to do so at peter@peternolan.com, or post questions to this thread....

We have made star joins work reliably. What we did was remove the clustering, create partitioned tables, and put indexes containing many of the join columns on them. Once we were up over the 10M row mark in the fact tables the star join kicked in pretty reliably. While we were below 10M...and my early tests were on 5M row tables......the star joins would not kick in....and we never got them to kick in on unpartitioned fact tables.......

So, to repeat, the only way we got star joins to work on 2005 was partitioned fact tables of over 10M rows.

Although, it really does not seem to matter whether the star join kicks in.....

The other experiment we tried was to follow the DB2 optimiser documentation and set up the SQL Server tables exactly as recommended for DB2......mostly on the basis that if MSFT had copied the DB2 optimiser behavior then SQL Server would work the same way....

And so it was.....

So we found that the best way to set up a fact table is to create a partitioned, clustered fact table with aligned indexes, and to place an index on each of the fact table keys that will be used to join to dimension tables on a regular basis. We implemented multiple levels of partitioning via a multi-part 'partitioning column' that is a single integer.......
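As a rough illustration of the shape of that template (the names, types and boundary values below are placeholders, not our actual DDL):

CREATE PARTITION FUNCTION pf_fact (int)
AS RANGE RIGHT FOR VALUES (1000001, 1000201, 2000001);  -- illustrative boundaries on the multi-part key

CREATE PARTITION SCHEME ps_fact
AS PARTITION pf_fact ALL TO ([PRIMARY]);

CREATE TABLE FactTemplate
(
    part_key        int   NOT NULL,  -- the multi-part integer partitioning column
    day_dim_id      int   NOT NULL,
    customer_dim_id int   NOT NULL,
    product_dim_id  int   NOT NULL,
    measure         money NOT NULL
) ON ps_fact (part_key);

-- clustering on the partitioning column keeps the table itself aligned with the scheme
CREATE CLUSTERED INDEX cix_FactTemplate ON FactTemplate (part_key, day_dim_id) ON ps_fact (part_key);

-- one aligned nonclustered index per dimension key that is joined to regularly
CREATE NONCLUSTERED INDEX ix_FactTemplate_day      ON FactTemplate (day_dim_id)      ON ps_fact (part_key);
CREATE NONCLUSTERED INDEX ix_FactTemplate_customer ON FactTemplate (customer_dim_id) ON ps_fact (part_key);
CREATE NONCLUSTERED INDEX ix_FactTemplate_product  ON FactTemplate (product_dim_id)  ON ps_fact (part_key);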

We have implemented a template fact table and now we will go forward with all fact tables implemented in exactly the same way.......

Even when the optimiser picked the cartesian product plan for the unclustered fact tables, the clustered fact table based query still ran more quickly......

Of course, there is a very severe overhead for inserting into and updating clustered fact tables in SQL Server......but it looks like we have to 'pay the piper' on update to get the query performance we are after....

Best Regards

Peter Nolan

www.peternolan.com

(Alas, becoming far too much of an SQL Server DBA)
