April 23rd, 2010

Convenience features, part 3

The next release has several find-and-replace enhancements that probably won't seem huge at first glance but are really handy in practice.  These new features apply to both String.findReplace() and rexReplace().

First, a tiny one: the 'flags' argument is now optional, and ReplaceAll is the default if the argument is omitted.  This is a small thing, but 90% of the time you just want to do a global replacement, so this cuts down the typing in the common case.

Second, you'll be able to conduct a whole series of replacements in one shot, by specifying a list of patterns and a corresponding list of replacements.  For example, if you wanted to do a bunch of mappings of HTML to plain text, you could do something like this:

  s = s.findReplace(['<p>', '<br>', '<b>', '<i>'], ['\b', '\n', '', '']);

Each element of the pattern list corresponds to an element of the replacement list, so <p> is replaced by \b, <br> is replaced by \n, etc.

The algorithm is pretty robust, by the way, in contrast to similar features in certain other languages (I'm looking at you, PHP).  Replacements are by default carried out "in parallel", meaning that the function repeatedly looks for the leftmost match of any pattern in the list, replaces it, and then proceeds with the remainder of the string.  The important point is that replaced text is never re-scanned, which makes this genuinely different from doing a series of replacements, the way you had to in the past:

  s = s.findReplace('<p>', '\b').findReplace('<br>', '\n').findReplace('<b>', '').findReplace('<i>', '');

In this particular case this doesn't really matter, apart from the efficiency gain of not having to go through four separate searches and construct four separate strings.  But it does matter in cases where the replacement text from one pattern happens to contain search text from a later pattern.  Consider this:

  s = s.findReplace('a', 'b').findReplace('b', 'c');

That'll turn any 'a' in the original string into a 'c' in the final string, since the intermediate replacement of 'a' with 'b' will be re-scanned, and the second scan will change the b's to c's.  In contrast, this will simply turn a's to b's, and *existing* b's to c's:

  s = s.findReplace(['a', 'b'], ['b', 'c']);
 
In case you actually do want to do the replacement serially, there's a flag that says to do that.
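
To make the serial/parallel difference concrete outside TADS, here's a rough Python model of the two strategies.  This is only a sketch of the behavior described above, not the actual TADS implementation: the function names are made up for illustration, and the tie-breaking rule (the earlier pattern in the list wins when two patterns match at the same position) is an assumption.

```python
def find_replace_parallel(text, patterns, replacements):
    """Model of list-based replacement: scan left to right and never
    re-scan text that has already been replaced."""
    out = []
    pos = 0
    while pos < len(text):
        # Find the leftmost match among all patterns, starting at pos.
        best_idx, best_i = -1, None
        for i, pat in enumerate(patterns):
            if not pat:
                continue  # skip empty patterns so the scan always advances
            idx = text.find(pat, pos)
            # Assumption: on a tie, the earlier pattern in the list wins.
            if idx != -1 and (best_idx == -1 or idx < best_idx):
                best_idx, best_i = idx, i
        if best_idx == -1:
            break  # no pattern matches in the remainder
        out.append(text[pos:best_idx])       # untouched text before the match
        out.append(replacements[best_i])     # replacement, never re-scanned
        pos = best_idx + len(patterns[best_i])
    out.append(text[pos:])                   # whatever is left over
    return "".join(out)


def find_replace_serial(text, patterns, replacements):
    """Model of chained replacements: each pass re-scans the previous result."""
    for pat, rep in zip(patterns, replacements):
        text = text.replace(pat, rep)
    return text


print(find_replace_parallel("abc", ["a", "b"], ["b", "c"]))  # bcc
print(find_replace_serial("abc", ["a", "b"], ["b", "c"]))    # ccc
```

In the parallel model, the 'b' produced from the original 'a' is already in the output by the time the scan moves on, so the 'b' pattern never sees it.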

The third feature is that regular (non-regex) replacements can be done with case-insensitive matching, and case-insensitive matches (regular and regex) can "follow the case" of the match.  Following the case means that the replacement text will be converted to match the upper/lower case pattern of the matched text.  For example, if the pattern is 'hello' and you do a case-insensitive match on 'HELLO', the replacement text will be converted to all caps; if it matches 'hello', the result will be lower-case; if it's 'Hello', the result will have an initial capital.

  s = 'This is this test of THIS function'.findReplace('this', 'that', ReplaceIgnoreCase | ReplaceFollowCase);

The result will be 'That is that test of THAT function'.
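
The three cases described above (all lower, all caps, initial capital) can be modeled in a few lines of Python.  This is a sketch of those rules only: the helper names are hypothetical, and what happens for other mixed-case matches (here, the replacement is left unchanged) is an assumption rather than documented TADS behavior.

```python
import re


def follow_case(matched, replacement):
    """Adapt the replacement to the case pattern of the matched text."""
    if matched.islower():
        return replacement.lower()
    if matched.isupper():
        return replacement.upper()
    if matched[:1].isupper():
        # Initial capital: capitalize only the first letter.
        return replacement[:1].upper() + replacement[1:]
    # Assumption: any other mixed-case pattern leaves the replacement as is.
    return replacement


def replace_follow_case(text, pattern, replacement):
    """Case-insensitive literal replacement that follows the match's case."""
    return re.sub(
        re.escape(pattern),
        lambda m: follow_case(m.group(0), replacement),
        text,
        flags=re.IGNORECASE,
    )


print(replace_follow_case("This is this test of THIS function", "this", "that"))
# That is that test of THAT function
```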

The fourth feature is that the replacement text can now be given as a callback function rather than as simple text.  This is surprisingly powerful, and once you get the knack of it, it can really simplify code.  I've already applied it to a number of library functions that formerly had big gnarly loops that stepped through strings a character at a time; in some cases it gets big loops down to a line or two.

Here's an example that converts a string to title case.  The looping equivalent is rather tedious to write, but with a callback it's pretty easy:

titleCase(str)
{
   local r = new function(match, idx)
   {
       /* don't capitalize certain small words, except at the beginning */
       if (idx > 1 && ['a', 'an', 'of', 'the', 'to'].indexOf(match.toLower()) != nil)
           return match;

       /* capitalize the first letter */
       return match.substr(1, 1).toUpper() + match.substr(2);
   };
   return rexReplace('%<(<alphanum>+)%>', str, r);
}
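
For anyone who wants to check the logic outside TADS, here's a rough Python equivalent built on re.sub() with a callback.  It's a sketch, not a translation of the library code: the word pattern is simplified to ASCII letters and digits, and because Python match offsets are 0-based, the "not at the beginning" test becomes start() > 0 instead of idx > 1.

```python
import re

# Small words that stay as they are unless they start the string.
SMALL_WORDS = {"a", "an", "of", "the", "to"}


def title_case(s):
    def repl(m):
        word = m.group(0)
        # Don't capitalize certain small words, except at the beginning.
        if m.start() > 0 and word.lower() in SMALL_WORDS:
            return word
        # Capitalize the first letter and leave the rest unchanged.
        return word[:1].upper() + word[1:]

    # Simplification: a "word" here is a run of ASCII letters and digits.
    return re.sub(r"[A-Za-z0-9]+", repl, s)


print(title_case("the lord of the rings"))  # The Lord of the Rings
print(title_case("a tale of two cities"))   # A Tale of Two Cities
```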