Friday 11 March 2011

How To Ease The Pain of SAS in Production - Part 1

One of the many toys we got when The Bank bought The Other Bank way back in the mists of time was the need to use SAS and Business Objects. Full-featured Business Objects might be neat, but what we got, really is a frustrating mess and is painfully slow. Then there's SAS. People make very good money using SAS. It's mainly used in financial services, retail analysis and pharmaceuticals. It's inventor is not quite as rich as Bill Gates, but close. They have made a decision to put the resource into incredibly fast implementations of complicated statistical methods and basic data handling, rather than into user interfaces and a slick scripting language and IDE. That's a choice.

SAS may be a great analytical tool, but it really sucks when used for production. People will tell you that it’s really good “once you get used to it”. Which is a different way of saying "it really sucks until you stop minding". Unfortunately, SAS gets used for production: these are some hints about dealing with it.

The main issue is this: if SAS is being used for production, you’re in a non-professional data environment. expect the random SAS tables you're going to be using to have duplicated records, no defined key field and every sort of data abuse imaginable and quite a few more that aren’t. People really think it’s cute to store ID numbers as strings with leading 0’s.

If in doubt, use a proc sort with the NODUPKEY option on what should be the key field. I call this a "brute force de-dupe" because that's what it does. You will need it to de-duplicate the results of your queries on these megabyte messes. (If you’re using a table without a key field, you’re on your own.) And always do a brute-force de-dup to get the final table you’re actually going to use.

If you’re importing data from other people’s workbooks, assume they will change the order of the fields, the field names and the number of fields every time they give you the report. To be fair, this applies in any environment, but you can’t do much about it in SAS.

If you’re going to modify a reference table with a SAS script, take a backup first.

Never write a script that ends by committing updates. You CANNOT rely on the script to a) run all the way through without errors, b) produce the output in a format you think it should, c) produce the data you thought you were going to get, d) not suddenly create duplicate records because someone did something you would never think of doing. Finish a production script with a stats query that looks for tell-tale null field values, brings back volume and value totals you can check and so on. Commit the changes to your master table with another script.

Put all your temporary working tables into the rwork library. That way you won’t create long-term clutter. Whatever you write will be more portable, as everyone has an rwork library. (Unless they don’t have a server – then use work.)

Part Two follows...

No comments:

Post a Comment