Thursday, June 21, 2012

Natural sorting

The other day, somebody asked on StackOverflow how to perform a "natural" sort of a list of strings containing both words and numbers.
For instance, given the following list of items:
  foo2 foo1 foo23 foo11 bar11 fo4
When ordered by the natural order of its elements it becomes:
  bar11 fo4 foo1 foo2 foo11 foo23
Roughly, the strings are tokenized into words and numbers, and then words are compared as words and numbers as integers.
When a word is compared with an integer (e.g. when comparing "foo12" and "45bar") a predefined order is used. The usual convention is to place numbers before words, but it could be the other way around if that fits your problem better.
Another point that should be considered is how to handle non-alphanumeric characters. For instance, how do "foo12", "foo-12", and "foo.12" compare?
All the responses to the StackOverflow question concentrated on providing a custom comparison function; likewise, all the pages found when searching for "natural sort" on Google seem to focus on the same approach.
But, as the author of the Sort::Key family of Perl modules, I had a different background, and so my initial approach to the problem was quite different. Let me explain briefly how sorting in Perl works...

 

Sorting in Perl (a digression)

In Perl, sorting is performed by a highly optimized mergesort function written in C that handles numeric and alphanumeric comparisons natively and that also supports using a Perl callback to compare the values. The problem is that using a custom Perl callback slows the function down to a crawl, because calling and running Perl code is several orders of magnitude slower than running a simple, optimized C function such as strcmp.
A simple workaround for this performance issue is to generate, for every element to be sorted, a key such that sorting the keys alphanumerically yields the same order as applying the custom comparison function to the original strings.
In order to generate the keys we still have to call Perl code, but it is called only O(N) times, whereas using a custom comparison function inside the mergesort requires calling Perl code O(N log N) times.
For an oversimplified example, given the integers (1, 0, 201, 72), we can generate the respective sorting keys ("001", "000", "201", "072"), that when sorted alphanumerically become ("000", "001", "072", "201"). Then we can reverse the mapping and obtain the list of integers sorted as (0, 1, 72, 201).
And that's roughly what my Perl modules do: they generate keys that can be sorted fast using a small set of predefined C functions.
There is a lot of folklore around this Perl technique and its variations, the most popular being the Schwartzian transform, whose learning is something of an initiation rite into the waters of intermediate Perl.
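To make the oversimplified integer example above concrete, here is a toy sketch of the same idea in C (just an illustration of the key-generation technique, not how the Perl modules are implemented): every value is mapped to a fixed-width, zero-padded key, the keys are sorted with a plain strcmp, and the mapping is then reversed.
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* toy illustration: sort integers through zero-padded string keys */
  static int cmp_keys(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
  }

  int main(void) {
    int values[] = { 1, 0, 201, 72 };
    enum { N = sizeof values / sizeof values[0] };
    char keys[N][4];                   /* three digits plus the NUL */
    char *sorted[N];
    for (int i = 0; i < N; i++) {
      snprintf(keys[i], sizeof keys[i], "%03d", values[i]);
      sorted[i] = keys[i];
    }
    qsort(sorted, N, sizeof sorted[0], cmp_keys);
    for (int i = 0; i < N; i++)
      printf("%d ", atoi(sorted[i]));  /* reverse mapping: 0 1 72 201 */
    printf("\n");
    return 0;
  }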

 

Generating sorting keys

Back to the StackOverflow question: after reading it, my thinking was that, right, you can write a fast natural comparison function in C (say, natcmp) and pass it to your favorite mergesort or quicksort implementation, but in any case this natcmp function is not going to be as fast as a raw strcmp. If we can do some preprocessing and then do less work inside the quicksort (let's assume we use quicksort), we can probably get a faster natural sort.
So, let's start by designing a way to generate a natural sorting key that can be compared lexicographically.
But first, as I explained above, there are several possible natural orderings, so we have to settle on a specific one.
For now, let's assume we want numbers compared as numbers and everything else compared in lexicographic order. For instance, we want "foo23" tokenized as ("foo", 23), "foo-23" as ("foo-", 23) and "foo..23" as ("foo..", 23). In the end, this just reduces to breaking the strings into substrings containing just digits and substrings containing everything else.
The simplest way to generate a sorting key for that order is probably to pad every numeric substring with zeros on its left up to the length of the longest numeric substring. For instance, given ("foo1", "foo23", "bar23456"), the longest numeric substring is "23456", whose length is 5, so we can pad every other numeric substring to that length, resulting in ("foo00001", "foo00023", "bar23456"). It is easy to see that when these keys are compared lexicographically, the result is the desired order. The drawback of this key generation is that the memory usage can be quite high.
A better way to generate the keys is to prefix every numeric substring with its length. For instance, ("1", "23", "123") become ("11", "223", "3123"). Again, it is easy to see that these also sort in the desired order... most of the time. It breaks in two cases. The first is when a numeric substring begins with one or more zeros: ("0001", "23") becomes ("40001", "223"), which does not sort correctly, as numerically "0001" is less than "23". But this problem is easy to work around: just check for that condition and drop the zeros at the left of the substring. That even works for the "0" string, which is converted to "0".
The other case where this key generation fails is when it finds numeric substrings longer than 9 digits. For instance, given ("1234567890", "123"), the keys generated are ("101234567890", "3123"), resulting in the wrong order.
Overcoming this problem requires a little trick: when the length of the numeric substring is longer than 8 digits, we prefix the substring with the digit "9" repeated int(length/9) times followed by the digit representing length%9. For instance, if we have a substring with 34 digits, then, as int(34/9) is 3 and 34%9 is 7, we prefix it with "9997".
In the worst case, the length of the keys generated by this method is 3/2 of that of the original string.
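To make the scheme concrete, here is a sketch of how such a key could be generated for a single string (a simplified illustration, not the exact code from the repository; the destination buffer is assumed to be large enough, and 2 * strlen(src) + 1 bytes always suffices):
  #include <ctype.h>
  #include <string.h>

  /* sketch: write the natural sorting key for src into dst */
  void natkey(const char *src, char *dst) {
    while (*src) {
      if (isdigit((unsigned char)*src)) {
        /* drop the leading zeros of the numeric substring */
        while (*src == '0') src++;
        /* measure the run of significant digits */
        size_t len = 0;
        while (isdigit((unsigned char)src[len])) len++;
        /* emit the length prefix: int(len/9) nines followed by len%9 */
        for (size_t i = 0; i < len / 9; i++) *dst++ = '9';
        *dst++ = (char)('0' + len % 9);
        /* copy the significant digits themselves */
        memcpy(dst, src, len);
        dst += len;
        src += len;
      }
      else *dst++ = *src++;   /* everything else is copied verbatim */
    }
    *dst = '\0';
  }
For instance, it maps "foo1" to "foo11", "foo0001" also to "foo11", "bar23456" to "bar523456", and the 10-digit number "1234567890" to "911234567890".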

 

Implementation technicalities

The most difficult decision I had to take while implementing the key generation was how to allocate memory for the keys. I didn't want to waste memory by allocating significantly more than strictly required, but I didn't want to call malloc once for every single key either, as that would degrade performance.
I considered preallocating memory for blocks of keys, estimating the required memory approximately or using some fixed maximum, and several other variations, but in every case I was able to come up with some edge case where it would perform badly.
In the end I decided to go for a two pass approach. In the first pass, the key lengths are computed and then the total required memory is allocated. In the second pass, the keys are calculated and stored in the already allocated memory.
The two-pass approach may seem more expensive, as the keys have to be generated twice, but estimating the key lengths would require some processing anyway, and on the second pass, as we know that enough memory is available, we don't need to bother checking for overflows.
Anyway, a possible improvement would be to perform the two-pass generation in blocks. That would require more calls to malloc (one per block) but, on the other hand, if the size of the blocks is carefully adjusted so that the data fits in the L2 cache, the key calculation would probably run quite a bit faster.
Another interesting point is how to do the reverse mapping from the sorted array of keys back to the original strings. My solution is to store a pointer to the corresponding original string just before every key. That way the reverse mapping is an O(N) task.
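The following sketch shows how those pieces could fit together (an outline under my own assumptions, not the actual code from the repository; natkey_len and natkey stand for an assumed function pair that computes a key's exact length and writes the key, along the lines of the sketch in the previous section):
  #include <stdlib.h>
  #include <string.h>

  size_t natkey_len(const char *src);        /* assumed: exact key length */
  void natkey(const char *src, char *dst);   /* assumed: writes the key   */

  static int cmp_keys(const void *a, const void *b) {
    return strcmp(*(char * const *)a, *(char * const *)b);
  }

  /* sorts the strings array in natural order, returns 0 on success */
  int natsort_by_keys(const char **strings, size_t n) {
    /* first pass: compute the total memory needed for all the keys */
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
      total += sizeof(const char *) + natkey_len(strings[i]) + 1;
    char *buffer = malloc(total);
    char **keys = malloc(n * sizeof *keys);
    if (!buffer || !keys) { free(buffer); free(keys); return -1; }
    /* second pass: lay out [original string pointer][key][NUL] entries */
    char *p = buffer;
    for (size_t i = 0; i < n; i++) {
      memcpy(p, &strings[i], sizeof strings[i]);
      p += sizeof strings[i];
      keys[i] = p;
      natkey(strings[i], p);
      p += strlen(p) + 1;
    }
    /* the keys sort with a plain strcmp */
    qsort(keys, n, sizeof *keys, cmp_keys);
    /* reverse mapping: read back the pointer stored just before each key */
    for (size_t i = 0; i < n; i++)
      memcpy(&strings[i], keys[i] - sizeof(const char *), sizeof strings[i]);
    free(keys);
    free(buffer);
    return 0;
  }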

 

A natcmp function

OK, so now we have our sorting function and we think it should be faster than the common approach, but how much faster? We need to run some benchmarks... but wait, where is the highly optimized natcmp function we are going to use for the comparison? Nowhere; we have to create our own...
We start with the simple, straightforward approach: loop over the two strings s1 and s2 in parallel; when either of the two characters being compared is not a digit, perform a normal comparison: if they are equal we advance to the next character, if they are not equal we return -1 or 1 as appropriate.
If the two characters are digits, we have to compare them numerically: first, discard any leading zeros, then check the lengths of the numeric substrings. If the lengths are different, we can conclude that the string with the shorter number is ordered before the other. If the two numeric substrings have the same length, we compare them character by character.
Here is a C implementation:
  int
  natcmp(const char *s1, const char *s2) {
    char c1, c2;
    while (1) {
      c1 = *(s1++);
      c2 = *(s2++);
      if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
        /* so, both c1 and c2 are digits */
        const char *e1, *e2;
        /* skip leading zeros */
        while (c1 == '0') c1 = *(s1++);
        while (c2 == '0') c2 = *(s2++);
        /* find the end of the numeric substrings */
        for (e1 = s1; (c1 >= '0') && (c1 <= '9'); c1 = *(e1++));
        for (e2 = s2; (c2 >= '0') && (c2 <= '9'); c2 = *(e2++));
        /* if the lengths are different we can conclude the result */
        if (e1 - s1 < e2 - s2) return -1;
        if (e1 - s1 > e2 - s2) return 1;
        /* otherwise we just compare them as strings */
        s1--; s2--; e1--;
        while (s1 < e1) {
          c1 = *(s1++);
          c2 = *(s2++);
          if (c1 < c2) return -1;
          if (c1 > c2) return 1;
        }
      }
      /* at least one of the characters is not a digit */
      else if (c1 < c2) return -1;
      else if (c1 > c2) return 1;
      else if (c1 == '\0') return 0;
    }
  } 
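As a usage note (the adapter and driver below are mine, not part of the post's code), such a comparison function can be plugged into the standard qsort through a small wrapper, because qsort hands the callback pointers to the array elements rather than the elements themselves:
  #include <stdio.h>
  #include <stdlib.h>

  static int natcmp_qsort(const void *a, const void *b) {
    return natcmp(*(const char * const *)a, *(const char * const *)b);
  }

  int main(void) {
    const char *items[] = { "foo2", "foo1", "foo23", "foo11", "bar11", "fo4" };
    size_t n = sizeof items / sizeof items[0];
    qsort(items, n, sizeof items[0], natcmp_qsort);
    for (size_t i = 0; i < n; i++)
      printf("%s ", items[i]);       /* bar11 fo4 foo1 foo2 foo11 foo23 */
    printf("\n");
    return 0;
  }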

 

A better natcmp function

We have a working natcmp function; the problem is that at every position it has to check whether the two characters are both digits or not. In relative terms, this is a lot of work.
Let's take an optimistic approach: ignore the fact that we may be inside a number and just look for the point where the two strings diverge. When we reach this point we know that to the left both strings are identical. There may be digits to the left of the divergence point or not; we don't know.
Well, at this point you have to realize that what is to the left doesn't really matter (most of the time). For instance, comparing ("foo123", "foo1245"), when we reach the divergence point we have ("3", "45") left; we can see that they are numbers, compare them numerically and conclude that the first goes before the second.
Errr, most of the time, right. The problem, again, is the leading zeros. If either of the characters at the divergence point is a zero, we have to look back and see if this is a number with leading zeros. If it is, we have to discard them. For instance, given ("foo0112", "foo0002"), at the divergence point we have ("112", "002") left; we see that one of the characters is a zero, so we look back and see that all the digits to the left are zeros, so we have to discard them. After that we have the strings ("112", "2"), which when compared as numbers result in the first going after the second.
Here is the C implementation:
  int
  natcmp_alt(const char *s1, const char *s2) {
    const char *start1 = s1;
    const char *start2 = s2;
    while (1) {
      char c1 = *(s1++);
      char c2 = *(s2++);
      if (c1 == c2) {
        if (c1 == '\0') return 0;
      }
      else {
        /* we are at the divergence point */
        if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
          /* both characters at the divergence point are digits */
          /* if any of them is a zero we have to perform the leading zeros check */
          if (c1 == '0') {
            const char *p = s1 - 2;
            while ((p >= start1) && (*p == '0')) p--;
            if ((p < start1) || (*p < '0') || (*p > '9'))
              while ((c1 = *(s1++)) == '0');
            if ((c1 < '0') || (c1 > '9'))
              return -1;
          }
          else if (c2 == '0') {
            const char *p = s2 - 2;
            while ((p >= start2) && (*p == '0')) p--;
            if ((p < start2) || (*p < '0') || (*p > '9'))
              while ((c2 = *(s2++)) == '0');
            if ((c2 < '0') || (c2 > '9'))
              return 1;
          }
          /* numerical comparison: first by length then by the diverging digit */
          while (1) {
            char d1 = *(s1++);
            char d2 = *(s2++);
            if ((d1 < '0') || (d1 > '9'))
              return ((c1 > c2) && ((d2 < '0') || (d2 > '9')) ? 1 : -1);
            if ((d2 < '0') || (d2 > '9'))
              return 1;
          }
        }
        /* they are not digits */
        return (c1 < c2 ? -1 : 1);
      }
    }
  }
 

 

The code

The full code discussed on this post is available from GitHub: https://github.com/salva/c-natsort.
It includes several variations of the comparison functions. The version using the key pre-generation algorithm also supports sorting while ignoring non-alphanumeric characters.
Some variations of the strcmp function used in the quicksort are also included. In theory, GCC provides a built-in for strcmp that should be blazing fast, but my benchmarks show that, depending on the data and the optimization flags, my variations are sometimes faster. Check for yourself and see.

 

Benchmarking

So let's see how the two approaches to natural sorting described above perform.
The script bm.pl is a driver that runs the natsort program with the different flags that select the sorting algorithm and implementation, measures the running time and generates a nice table with the results.
After benchmarking the different algorithms with different data sets, these are my findings:
  • Using the key pre-generation approach is usually a little faster, but not by much: the improvement typically ranges from 0% to 15%.
  • The alternative natcmp function is slightly faster than the simple version, by around 10%. 

 

Conclusion

For natural sorting of strings in C, there is not much point in using the key pre-generation approach, as it only gives a small performance improvement and, on the other hand, uses a lot more memory.
