Hey everyone,
Let me introduce you to an algorithm I created while working on one of my modeling projects (to get the bigger picture, read more on credibility theory). Oftentimes at work, I’ll be asked to split a numeric dataset into (somewhat) homogeneous categories and then look for patterns and trends among the categories. For example (for confidentiality purposes, all the data/variables are made up), I might need to categorize an inventory of cars into three roughly equal categories by gas mileage (efficient, normal, inefficient). At first, the task looks easy – I’d simply divide the dataset into three groups and name the categories appropriately, say, 0-15 mpg, 16-25 mpg, and 26+ mpg. However, problems emerge if I have to divide the dataset equally (for a chi-square test statistic, maybe). If a large number of vehicles were rated at 26 mpg, moving a category boundary by 1 mpg could throw off the evenness of the categories, and as a result, I might need to change the number of categories or their boundary values.

And after I’ve painstakingly chosen appropriate categories, a boss or a client might, for whatever reason, ask me to redo the analysis with 8 categories, 50 categories, or maybe even 100 categories (or do all three). Or perhaps they’ll request a similar analysis on an unrelated dataset, or on an expanded dataset containing hundreds of thousands of records. When something like that happens, I realize it would be too cumbersome and time-consuming to continue the process manually, and I decide that such a procedure is a prime candidate for automation.
However, automation has its drawbacks. Creating an automated procedure can take a long time – possibly many times longer than a single iteration of the manual procedure I intend to automate – so I have to weigh the cost of the time sacrificed during the first few iterations against the benefit of faster iterations in the future. For example, the procedure might not be worth automating if I only intend to do it once, or under a tight deadline it might be faster to do it manually than to spend time automating it. In this particular case, I chose to automate the task of dividing a numeric dataset into n requested categories, making them as equal as reasonably possible – not mathematically, with the least possible deviation between the categories, but in a way that gives a “quick and dirty” answer for decision-making.
For this next example, I’ve chosen 300,000 S&P 500 index values (though the actual variable I was working with was much more interesting), and for the sake of simplicity, all the values are randomly generated and aren’t actual values of the index. Suppose I’ve received these index values from a data provider and I want to sort them into 30 categories as evenly as reasonably possible. The categories may not come out even – for instance, an upper bound (non-inclusive) of 100 for the first category might capture 3.5% of the index values, while a bound of 99 would capture only 3.1%, instead of the ideal 3.3% (one thirtieth).
Above, you’ll see the results of the algorithm. You can see on the upper left-hand side that this particular dataset came with invalid records (null values or negative numbers), but an unusually large number of values were coded as “-1”, which might mean something more than a simple error (I’d have to call the data provider to ask what it means), so I made a note of it above. The rest of the values have been sorted accordingly. The procedure went (roughly) as follows (a code sketch of these steps appears after the list):
1. Remove invalid data (negative and null values).
2. Divide the number of valid records by the number of desired categories, truncating if it doesn’t divide evenly (this is the approximate/raw interval size).
3. Create the first raw boundary (a non-inclusive upper bound) by finding the value of the record whose relative position (within the valid data) equals the raw interval size calculated in step 2.
4. Increment and decrement this raw boundary value by 1 and find the relative positions of those two values. If multiple records share the incremented value, take the one with the lowest relative position; if multiple records share the decremented value, take the one with the highest relative position.
5. Compare each of those two relative positions to the raw boundary position and keep the value whose position is closer. This becomes the category’s non-inclusive upper boundary.
6. Repeat from step 3, using the next multiple of the raw interval size, until all the upper boundaries have been chosen.
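To make the steps concrete, here’s a rough VBA sketch of the boundary selection. It isn’t the code I actually used (that lives in the Excel formulas below plus some VBA), it glosses over edge cases, and it assumes the invalid records have already been removed and the remaining whole-number values sorted ascending into a 1-based array; the function name and signature are just for illustration.

' Rough sketch of the boundary-picking steps above, not production code.
' Assumes vals(1 To n) holds the valid values, already sorted ascending,
' and that the values are whole numbers (mpg, index points, and so on).
Function PickUpperBounds(vals() As Long, numCats As Long) As Long()
    Dim n As Long, rawInterval As Long, k As Long
    Dim rawPos As Long, rawVal As Long
    Dim upPos As Long, downPos As Long
    Dim bounds() As Long

    n = UBound(vals)
    rawInterval = n \ numCats            ' step 2: truncated raw interval size
    ReDim bounds(1 To numCats)

    For k = 1 To numCats - 1
        rawPos = rawInterval * k         ' step 3: raw boundary position...
        rawVal = vals(rawPos)            ' ...and the value sitting there

        ' step 4: lowest relative position at or above rawVal + 1
        upPos = rawPos
        Do While upPos < n And vals(upPos) < rawVal + 1
            upPos = upPos + 1
        Loop

        ' step 4: highest relative position at or below rawVal - 1
        downPos = rawPos
        Do While downPos > 1 And vals(downPos) > rawVal - 1
            downPos = downPos - 1
        Loop

        ' step 5: keep whichever cut lands closer to the raw position;
        ' the bound is the value the category stops short of
        If upPos - rawPos < rawPos - downPos Then
            bounds(k) = vals(upPos)
        Else
            bounds(k) = rawVal
        End If
    Next k

    bounds(numCats) = vals(n) + 1        ' last category takes whatever is left
    PickUpperBounds = bounds
End Function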
I was able to accomplish this using mostly Excel formulas. Without going into detail, step 3 looks like this:
=IF(G2<=CATEGORYCOUNT,IF(G2=CATEGORYCOUNT,CEILING,INDEX(Data!B:B,$B$6+ROUND(INTERVALCOUNT*G2,0)+1)),0)
And the formula that counts the number of records in a category looks like this:
=IF(G6<=CATEGORYCOUNT,IF(G6=1,IF(L6-J6<J6-K6,L6,K6)-$B$6-1,IF(L6-J6<J6-K6,L6,K6)-$B$6-1-N5),0)
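In loop form, the counting step is nothing more than walking the sorted values and bumping a counter each time an upper bound is passed. Here’s a hedged companion to the sketch above, with the same assumptions (sorted 1-based array, bounds produced as above); it’s an illustration, not my worksheet logic line for line.

' Count how many values land in each category, given the non-inclusive
' upper bounds. Same assumptions as the earlier sketch.
Function CountPerCategory(vals() As Long, bounds() As Long) As Long()
    Dim counts() As Long
    Dim i As Long, k As Long

    ReDim counts(1 To UBound(bounds))
    k = 1
    For i = 1 To UBound(vals)
        ' move to the next category once the current bound has been reached
        Do While k < UBound(bounds) And vals(i) >= bounds(k)
            k = k + 1
        Loop
        counts(k) = counts(k) + 1
    Next i
    CountPerCategory = counts
End Function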
As you can see, the number of records in each category (column M) isn’t exactly even, but it’s pretty close. I was doing most of my data manipulation in MS Access, and the main purpose of the procedure was to generate query expressions with VBA, to be used in Access queries to categorize the data:
The above query expressions were generated automatically by a VBA subroutine that I wrote. I used nested IIf statements to build the expressions, and the reason there are 3 of them is that there’s a limit to how deeply you can nest IIf statements in an Access query, so I had to break the logic into 3 different queries. Expressions like these would be quite cumbersome to write by hand, which is why I automated the task.
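For anyone curious what that kind of generator looks like, here’s a stripped-down sketch of the idea rather than my actual subroutine: it walks an array of upper bounds, builds a nested IIf expression, and starts a new expression whenever the nesting approaches a limit you choose. The field name [IndexValue], the plain numeric labels, and the maxNest parameter are placeholders, not the names from my database.

' Sketch of a generator for nested IIf expressions, split into chunks so that
' no single expression nests too deeply. Names here are placeholders.
Sub WriteCategoryExpressions(bounds() As Long, maxNest As Long)
    Dim expr As String, closing As String
    Dim k As Long, depth As Long

    For k = 1 To UBound(bounds)
        expr = expr & "IIf([IndexValue]<" & bounds(k) & "," & k & ","
        closing = closing & ")"
        depth = depth + 1
        ' flush this expression and start a new one before hitting the limit
        If depth = maxNest Or k = UBound(bounds) Then
            Debug.Print expr & "Null" & closing
            expr = ""
            closing = ""
            depth = 0
        End If
    Next k
End Sub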
Some of you may have recognized (maybe immediately) that there ought to be a better way to do this. First off, I think it’s better to use a numeric value for sorting categories, since a prefix like “9991_”, used to order the category names in lexicographically ascending order, is more confusing than a simple 1, 2, 3, 4… sorted in numerically ascending order. In that case, I should have split the category name into two fields – one numeric and one string – and left out the numeric part when creating exhibits for printing.
Also, I need to find a workaround for the IIf statements, since working with 3 separate queries increases the probability of error in the analysis, and I think it would be better if I could organize the data using a single expression or query. I ought to be able to do this by using different software (maybe move to SQL Server instead of Access?), or by rewriting the expression in a way that doesn’t require IIf statements; one possible single-query approach is sketched below.
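One workaround I may try, sketched here with made-up table and field names: put the category boundaries in a small table of their own and let a non-equi join do the categorizing, which would replace all the nested IIf expressions with a single query. The SQL is an assumption about how I’d structure it, not something I’ve run against the real data.

' Sketch of the single-query idea: keep the category bounds in a small table
' and categorize with a non-equi join instead of nested IIf expressions.
' Table and field names (IndexData, Boundaries, and so on) are made up.
Sub BuildSingleCategoryQuery()
    Dim sql As String
    sql = "SELECT d.IndexValue, b.CategoryNum, b.CategoryLabel " & _
          "FROM IndexData AS d, Boundaries AS b " & _
          "WHERE d.IndexValue >= b.LowerBound AND d.IndexValue < b.UpperBound"
    Debug.Print sql   ' paste into a query's SQL view, or open it with CurrentDb
End Sub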
Also, I believe there’s somebody out there who’s already done this, and maybe someone else who’s written an algorithm that minimizes the sum of the squares of the deviations between the selected and ideal boundaries (or something like that). I just don’t know where I’d find such a procedure, if it has been done before.
Anyway, this procedure gave me a quick answer to meet my deadline, which is mostly what I needed. In the meantime, I’ll work on improvements to simplify the process and reduce error.