Here is a script I wrote that analyzes datasets and returns all the minimal composite and unary keys that uniquely identify records. I wrote it because I frequently have to analyze client spreadsheets and non-normalized data tables.
On my desktop server it took about two minutes to analyze 2000 permutations of a table with 50 columns and 5000 records.
Please try it out for me and let me know if it chokes on anything, or if you see any ways it could be improved!Dude...still looking at it...
Only 2 minutes you say....
hmmmmmmmmmmmmmm
EDIT: I ran it against Nothwind And I got Nothing|||4 minutes 28 seconds on my desktop
Candidate Fields Permutations Checked
------ -------
61 1891
It only returned my identity column.|||Dooh..
It's for one table at a time...I thought you were doing an entire database
OK, I did Products in Northwind and it didn't pick ProductName as a Natural Key...
Candidate Fields Permutations Checked
10 66
Natural Keys Found
[UnitPrice], [UnitsInStock]
[SupplierID], [UnitPrice], [ReorderLevel]
[SupplierID], [UnitsInStock], [ReorderLevel]
[UnitPrice], [UnitsOnOrder], [ReorderLevel]|||Yowch!
Obviously a bug. I will look into it.
Thanks, Brett!|||It may be because there is no unique index on PoductName...but the data is all unique in the sample...
Go figure M$... I wonder if it was done on purpose to show the "Benefits" of Surrogate keys...
"An Apple is an Apple until it's renamed"|||No, its a bug in my script.
The point of the script is to find natural keys, whether or not they have been defined that way on the table. Otherwise, I'd just query sysindexes.
There is a flaw in the recursive logic which I need to track down. If I comment out a clause meant to eliminate redundant branches, the script returns the correct results, but if I don't eliminate redundant branches then I am reduced to testing every permutation, which is impractical.
The solution will probably occur to me in the car on the way home tonight.
Have a good weekend, all.|||I tried to challenge your logic by going after statistics info rather than doing "select count(distinct [fieldname])..." But got drawn into trying to come up with an algorythm:
Once you get all non-text/image columns into a temp table (I also eliminated sql_variant by doing nullif(prec, 0) is not null), instead of doing a cursor I was thinking to create permutations by doing "order by newid()" within the loop with "top @.number_of_qualifying_columns"... And of course, as usual, got distracted, never finished, etc.
Have you thought of that?|||I've tried two different methods of searching permutations. The first was "bottom up", starting with single columns and then adding from there, but required an exhaustive search.
I'm hoping that by using the "top down" approach that I posted I can identify and eliminate searching branches of permutations that are known not to contain natural keys, or that already contain a subset known to be a natural key.
The algorithm used to create the permutations is not as difficult or imporatant as the algorithm used to eliminate permutations.
I have an idea in the back of my head (which did occur to me in the car on the way home!), and I'm trying to come up with a way to implement it.
I think it is an interesting challenge which would also prove useful to solve, so I'm surprised I haven't seen it done before.
Anybody else here is welcome to take a shot at it! Just write a script that efficiently identifies all the unary and minimal composite keys in a table.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment