Gem #25: How to Search Text

Let's get started…

The Ada standard defines several subprograms related to searching text in a string. In addition, GNAT provides additional packages to search. This gem will cover the various possibilities.

We are assuming that we are searching for text within a string that is in memory. To search text within a file, the simplest is to load the file into memory, and then search the resulting string.

We are also assuming we are dealing with a String type, not with an Unbounded_String or Bounded_String. In both cases, one can easily convert to a string. Predefined Ada subprograms exist for all three of these types, but the GNAT specific packages deal only with Strings. Of course, converting to a string is not necessarily efficient, and one should point to the GNAT-specific packages Ada.Strings.Unbounded.Aux which allow you to peek inside the unbounded string, but should obviously be used with care.

The text we are searching for (the pattern) can have various forms, however. It can be a fixed text which must be found exactly as is, or it can be a regular expression which matches a whole set of strings.

When searching for fixed text, the obvious candidates are the predefined Ada runtime subprograms from the Ada.Strings.Fixed package. The relevant subprograms all start with the prefix Index. These are reasonably optimized, even though they are not written specifically in assembly for maximum speed. Similar subprograms exist for bounded and unbounded strings.

Some advanced algorithms exist for searching fixed text within very long texts. These algorithms spend a bit of time studying the pattern initially, creating an internal representation (for instance a finite-state automaton), so that later searches are more efficient. One example of this is the Boyer-Moore algorithm. The initial computation can be more costly than the time it takes to do the search when the string is small, so using Index in such cases is the most efficient solution. However, as the size of the string grows, using a more advanced algorithm becomes interesting. The GNAT runtime does not currently provide such an implementation, but the GPS IDE has such a package which will likely be integrated into the GNAT runtime in the future. Let us know of your interest in such a package!

Regular expressions are a well known feature in several programming languages like Perl, and provide advanced search capabilities. They provide a syntax for describing a pattern that should be matched. For instance, "a.*b" would find all substrings starting with "a" and ending with "b", with zero or more characters in between. Likewise, "a[bc]d" would find all three-characters substrings starting with "a", ending with "d", and whose middle character is either "b" or "c". This gem is not a tutorial on how to write regular expressions, since for instance the Perl manual or various tutorials on the internet already cover the topic at length.

GNAT provides two packages for searching for regular expressions.

The one which has been around for longest is GNAT.Regexp. It is not a fully-general regular expression package, since it does not support, for instance, retrieving the substrings matched by parenthesis groups. It also tries to match the whole text; that is, "a[bc]d" will only match if the text itself is three characters long. It supports two versions of the regular expressions, the usual ones as described above and what are generally called "globbing" patterns, as used by the various command line shells on Unix. Therefore, GNAT.Regexp is most often used for searches related to file names.

Here is a short example on how to find all Ada source files in the current directory. As you will notice, the regular expression first needs to be compiled into a finite state automaton, and then provides very efficient pattern matching through the Match function.

with Ada.Directories; use Ada.Directories;
with GNAT.Regexp;     use GNAT.Regexp;

procedure Search_Files (Pattern : String) is
  Search : Search_Type;
  Ent    : Directory_Entry_Type;
  Re     : constant Regexp := Compile (Pattern, Glob => True);

  Start_Search (Search, Directory => ".", Pattern => "");
  while More_Entries (Search) loop
     Get_Next_Entry (Search, Ent);
     if Match (Simple_Name (Ent), Re) then
         ... --  Matched, do whatever
     end if;
  end loop;
  End_Search (Search);
end Search_Files;

Search_Files ("*.adb");

A second package that GNAT provides is GNAT.Regpat. It provides full handling of the usual regular expression constructs, but not some of the recent advances that languages like Perl have seen (look-ahead, unicode,...). A regular expression is first compiled into a byte-code which will provide faster matching. Note that this package is not as efficient as GNAT.Regexp, since some patterns will read characters, discover they do not match, then backtrack to an earlier position and try some other possibility. In practice, the efficiency is generally good enough.

One feature that this package provides over GNAT.Regexp is the capability, after a match, to know what parenthesis groups matched. For instance, in the regular expression "a (big|small) application", it is possible to know which of the two alternatives was matched, as in the following example:

with GNAT.Regpat;   use GNAT.Regpat;
with Ada.Text_IO;   use Ada.Text_IO;

procedure Search_Regexp (Pattern : String; Search_In : String) is
  Re      : constant Pattern_Matcher := Compile (Pattern);
  Matches : Match_Array (0 .. 1);

  Match (Re, Search_In, Matches);
  if Matches (0) = No_Match then
     Put_Line ("The pattern did not match");
     Put_Line ("The application really is "
               & Search_In (Matches (1).First .. Matches (1).Last));
  end if;
end Search_Regexp;

Search_Regexp ("a (big|small) application",
               "Example of a small application for searching");

Although regular expressions are quite powerful, and generally more than enough for most applications, they have limitations that only a full context free grammar can overcome. One classical example is that they cannot ensure that a string has the same number of opening and closing parenthesis, as in "(a(b))", when the number of parentheses is not known in advance.

GNAT provides the package GNAT.Spitbol.Patterns, which contains various subprograms that can be used to implement matching with context free grammars. SPITBOL itself is a full programming language, which GNAT can emulate through the subprograms in GNAT.Spitbol, but in this gem we only concentrate on the pattern matching capabilities.

Here is a small example on using GNAT.Spitbol.Patterns. The goal here is to check whether the given string starts with a balanced expression with respect to "[" and "{", that is if the first character is one of those, then there must be a similar closing character, and the substring in between is also properly balanced.

The language for such a pattern in BNF form would be:

  ELEMENT ::= <any character other than [] or {}>
              | '[' BALANCED_STRING ']'
              | '{' BALANCED_STRING '}'

which gets translated to the following spitbol program (see the documentation in GNAT.Spitbol.Patterns for more information on the format of the patterns). Note that this program does not force the end of the string to be properly balanced, just the beginning of it.

with GNAT.Spitbol.Patterns;  use GNAT.Spitbol.Patterns;

procedure Search_Spitbol is
  Element, Balanced_String : aliased Pattern;
  At_Start : Pattern;

  Element := NotAny ("[]{}")
        or ('[' & (+Balanced_String) & ']')
        or ('{' & (+Balanced_String) & '}');
  Balanced_String := Element & Arbno (Element);
  At_Start := Pos (0) & Balanced_String;
  Match ("[ab{cd}]",   At_Start); --  True
  Match ("[ab{cd}",    At_Start); --  False
  Match ("ab{cd}",     At_Start); --  True
end Search_Spitbol;