Thursday, September 8, 2011

Regex vs. .NET’s TrimEnd

I recently had a scenario where I had to compare a large list of strings to a keyword, or in some instances a series of keywords, to find a match. This seemed fairly simple at first, but after implementing the code and testing the results I noticed that certain words were being ignored. For example, a keyword search for ‘card’ on the sentence “This card goes after that card.” would only return the first ‘card’ in the sentence. The reason was that the second ‘card’ was actually ‘card.’, with a period, in the list I was dealing with. I had no control over how the list was built, so some string manipulation was needed to scrub out the noise characters, similar to how SQL Server’s full-text search discards noise words (now referred to as stop words). It was decided that I would only scrub the noise characters at the end of each line and ignore them if they were mid-sentence.
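To make the problem concrete, here is a minimal sketch of how a trailing period can hide a match. The original matching code is not shown in this post, so the class name KeywordDemo, the method CountMatches, and the split-on-spaces approach are all assumptions made for illustration.

```csharp
using System;

public static class KeywordDemo
{
    // Hypothetical helper: counts whole-word matches of a keyword in a line.
    // When trimPunctuation is true, noise characters are removed from the
    // end of the line only (not mid-sentence) before the line is split.
    public static int CountMatches(string sentence, string keyword, bool trimPunctuation)
    {
        char[] noise = { ',', '.', ' ', ':', ';', '!', '-', '?' };
        string line = trimPunctuation ? sentence.TrimEnd(noise) : sentence;

        int count = 0;
        foreach (string token in line.Split(' '))
        {
            if (string.Equals(token, keyword, StringComparison.OrdinalIgnoreCase))
            {
                count++;
            }
        }
        return count;
    }
}
```

With the sentence from above, CountMatches("This card goes after that card.", "card", false) finds 1 match because the last token is "card.", while passing true finds 2.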

There are two obvious choices here: an old-school Regex pattern, or the TrimEnd method on .NET’s String class.

I was always under the impression that regular expressions would outperform .NET’s string operations, so this was an easy opportunity to put that theory to the test.

First I wrote the regular expression to clean off any trailing punctuation:

    // Requires: using System.Diagnostics; using System.Text.RegularExpressions;
    private string GetRegexValue(string sampleText)
    {
        Stopwatch s = Stopwatch.StartNew();
        // A punctuation character that is followed by the end of the
        // string or by a line break.
        string pattern = @"(\p{P})(?=\Z|\r\n)";
        string result = string.Empty;

        // i < 100000 gives exactly 100,000 iterations.
        for (int i = 0; i < 100000; i++)
        {
            result = string.Empty;
            result = Regex.Replace(sampleText, pattern, "");
        }

        lblRegexTime.Text = string.Format("{0} ms",
            s.ElapsedMilliseconds.ToString());

        return result;
    }



Next I wrote a similar function using string.TrimEnd:



    // Requires: using System.Diagnostics;
    private string GetTrimValue(string sampleText)
    {
        Stopwatch s = Stopwatch.StartNew();
        string result = string.Empty;
        // Characters to strip from the end of the string.
        char[] charsToTrim = { ',', '.', ' ', ':', ';', '!', '-', '?' };

        // i < 100000 gives exactly 100,000 iterations.
        for (int i = 0; i < 100000; i++)
        {
            result = string.Empty;
            result = sampleText.TrimEnd(charsToTrim);
        }

        lblTrimTime.Text = string.Format("{0} ms",
            s.ElapsedMilliseconds.ToString());
        return result;
    }
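It is worth noting that the two functions are similar but not strictly equivalent, so the timings compare slightly different work. The sketch below wraps the same pattern and the same character array in two small helpers so the outputs can be compared side by side; the class name TrimComparison and its method names are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimComparison
{
    // Same pattern and character set as the two functions above.
    private const string Pattern = @"(\p{P})(?=\Z|\r\n)";
    private static readonly char[] CharsToTrim =
        { ',', '.', ' ', ':', ';', '!', '-', '?' };

    // Removes a single punctuation character sitting directly before
    // the end of the string or before a \r\n line break.
    public static string WithRegex(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }

    // Removes every trailing character from the set, but only at the
    // very end of the whole string.
    public static string WithTrimEnd(string text)
    {
        return text.TrimEnd(CharsToTrim);
    }
}
```

For the input "card..." WithRegex returns "card.." (only the last period qualifies), while WithTrimEnd returns "card". For the two-line input "one.\r\ntwo!" WithRegex returns "one\r\ntwo", while WithTrimEnd returns "one.\r\ntwo" because it never looks at the end of the first line.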



To get a good idea of how the performance matched up, I ran both through a loop 100,000 times. The results were surprising.


[Image: timing results from the initial run]


After the initial run, both TrimEnd and the Regex are even faster.


[Image: timing results from a later run]
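A likely explanation for the faster later runs, though not something measured in this post, is one-time start-up cost: the JIT compiles each method the first time it is called, and the static Regex methods cache the parsed pattern. One way to keep that cost out of the numbers is an untimed warm-up call before the stopwatch starts. The helper below is a sketch under that assumption; the class name Timing and the method MeasureMilliseconds are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Calls the action once untimed as a warm-up, then times
    // `iterations` further calls and returns the elapsed milliseconds.
    public static long MeasureMilliseconds(Action action, int iterations)
    {
        action(); // warm-up call, deliberately excluded from the timing

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.ElapsedMilliseconds;
    }
}
```

It could be used as Timing.MeasureMilliseconds(() => sampleText.TrimEnd(charsToTrim), 100000), with the same call shape for the Regex.Replace version, so both candidates are measured by identical code.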


So what does this mean? Are string operations always faster than regular expressions? No. But it’s always a good idea to test before making any assumptions about which is faster.