Gem #118: File-System Portability Issues and GNATCOLL.VFS

Let's get started...

One of the important issues to address when porting code from one system to another is that of file systems.

There are several aspects of the handling of file names that vary across file systems. For one, the file system might be case-sensitive or case-insensitive. This refers to whether casing of the name is relevant when accessing a file on the disk. In addition, the file system can be case-preserving or not. On some systems, file names are converted to upper case systematically when they are displayed (MS-DOS and VMS are in that category). Systems that do not preserve casing are always case-insensitive.

As a result, there are three categories of file systems: case-sensitive/case-preserving (most Unix file systems), case-insensitive/case-preserving (NTFS) and case-insensitive/case-destructive (FAT and VMS).

When running on a case-insensitive file system, applications should display file names with the same casing that the user used when creating them. However, many applications ported from Unix will simply convert all file names to lower case (to ensure uniqueness of file names internally) and thus have a display that is disturbing for the user. In practice, it has been our experience that file names should only be converted to lower case when comparing the names (for instance, to find out whether two names refer to the same file or when computing a hash). The rest of the time, the casing should always be preserved.
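The rule above (fold case only for comparison and hashing, never for display) can be sketched with the standard Ada library alone. This is a minimal illustration, not part of GNATCOLL: the helper names Same_File_Name and Name_Hash are hypothetical, and Ada.Characters.Handling.To_Lower only folds Latin-1 characters.

```ada
with Ada.Characters.Handling; use Ada.Characters.Handling;
with Ada.Containers;          use Ada.Containers;
with Ada.Strings.Hash;
with Ada.Text_IO;

procedure Compare_Names is

   --  Hypothetical helper: True if both names denote the same file on a
   --  case-insensitive file system. Only the comparison is case-folded.
   function Same_File_Name (Left, Right : String) return Boolean is
     (To_Lower (Left) = To_Lower (Right));

   --  Hypothetical helper: hash that agrees with Same_File_Name, so that
   --  names differing only in casing land in the same bucket.
   function Name_Hash (Name : String) return Hash_Type is
     (Ada.Strings.Hash (To_Lower (Name)));

   Original : constant String := "/tmp/Foo.txt";  --  as the user created it
   Typed    : constant String := "/tmp/foo.txt";  --  as typed later on

begin
   if Same_File_Name (Original, Typed)
     and then Name_Hash (Original) = Name_Hash (Typed)
   then
      --  Display the name with its original casing, never the folded one
      Ada.Text_IO.Put_Line ("Same file: " & Original);
   end if;
end Compare_Names;
```

The original string is kept untouched throughout; the lower-case form exists only as a temporary inside the two helpers.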

In truth, the introduction above is not quite precise enough: the attributes for casing really depend on the file system, not on the operating system (Windows, Linux, ...). For instance, your machine may have mounted a remote file system that has different properties (such as a Windows partition mounted on a Linux machine). Apple's OS X is a special case here, in that its default file system is case-insensitive (though case-preserving), but users can choose to format a volume as case-sensitive. All of this shows that testing the operating system you are running on is not enough in practice. Unfortunately, we haven't found a good way to test the file system dynamically (not to mention that it would be very expensive, since for each file one would have to determine which file system it is on).

Another difficulty regarding file names is that of the character set in which they are encoded. When a file name only contains ASCII characters, there are in general no difficulties with manipulating the name. However, it is valid on most systems to use accented characters in file names, and some file systems do not force an encoding: they just view the file name as a series of bytes, whose interpretation is left to the applications (Windows Explorer, terminals, etc.) that will display the name. In general, those applications will take into account the user's locale for the display. Other file systems always interpret file names as UTF-8. Again, for the application to get this exactly right would require testing the file system for each file, rather than simply testing the operating system and assuming some defaults.

Another issue is symbolic links. On a lot of file systems, a file can be accessed through different paths when using symbolic links. Although these links have no impact when opening and reading the file, they make it more complicated to check whether two paths refer to the same physical file. This can be done by checking each component of the path to see whether it is a link, and if so, resolving it so that the whole path ends up in a normalized form. This computation can be expensive (especially on slow or remote file systems), so its result should be cached when possible.
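As a rough sketch of this normalization outside of GNATCOLL, the GNAT run-time offers GNAT.OS_Lib.Normalize_Pathname, which can resolve links while normalizing a path. The paths below are hypothetical, and the helper name Canonical is ours. Each normalized form is computed once and kept in a constant, which is the simplest form of caching.

```ada
with Ada.Text_IO;
with GNAT.OS_Lib;

procedure Same_Physical_File is

   --  Hypothetical helper: normalized form of Path, with symbolic links
   --  resolved. This may require system calls for each path component.
   function Canonical (Path : String) return String is
     (GNAT.OS_Lib.Normalize_Pathname (Path, Resolve_Links => True));

   --  Resolve each path once and reuse the result afterwards
   A : constant String := Canonical ("/some/link/foo.txt");
   B : constant String := Canonical ("/some/dir/foo.txt");

begin
   if A = B then
      Ada.Text_IO.Put_Line ("Both paths denote the same file");
   else
      Ada.Text_IO.Put_Line ("The paths denote different files");
   end if;
end Same_Physical_File;
```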

The GNAT Components Collection (GNATCOLL) provides a useful package to abstract such aspects, namely GNATCOLL.VFS. This package provides several types that are used to manipulate files and their names:

   type Filesystem_String is new String;
   type Virtual_File is tagged private;

The first type above is intended as an initial replacement for the strings that are generally used to represent a file name. There is no conversion to or from Unicode. The intent is to remind users that a file name should not be treated as a string that can be displayed as is, but as a series of bytes that needs to be interpreted in the context of a specific character set (most often UTF-8, but also ISO-8859-1 and variants).

Use of the second type involves a bigger change for most applications: the idea is that it encapsulates and caches various information about a file and its name, and thus abstracts notions like case-sensitivity.

Let's consider some examples. We first need to get a representation for a file from the disk. For this, we can use one of the Create functions available in GNATCOLL.VFS. For instance:

   declare
      F : Virtual_File;
   begin
      F := Create ("/tmp/Foo.txt");
   end;

If we pass F to some subprogram that should display it in a GUI, for instance, we expect the name to appear exactly as "Foo.txt", and not as an all-lower-case version "foo.txt", even on a case-insensitive file system. To get the name, we could, for example, write:

   declare
      Name : constant Filesystem_String := F.Base_Name;
   begin
      Put_Line (+Name);
   end;

As noted earlier, the name should be considered as a series of bytes, the interpretation of which depends on your system. Most of the time, it is relatively safe to assume this is UTF-8. For such a case, GNATCOLL.VFS provides a "+" operator to convert the Filesystem_String to a String.
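For reference, the fragments shown so far can be assembled into a complete program along the following lines. This is a sketch that assumes GNATCOLL is available to the build, and the use of Display_Full_Name (which, to our knowledge, returns a String meant for display) is an addition to the fragments in this Gem.

```ada
with Ada.Text_IO;
with GNATCOLL.VFS; use GNATCOLL.VFS;

procedure Show_Name is
   --  The file need not exist: Create only builds the representation
   F    : constant Virtual_File := Create ("/tmp/Foo.txt");
   Name : constant Filesystem_String := F.Base_Name;
begin
   --  "+" reinterprets the bytes of the name as a String, which is
   --  reasonable when the name is known (or assumed) to be UTF-8.
   Ada.Text_IO.Put_Line (+Name);

   --  Assumption: Display_Full_Name yields the full path for display
   Ada.Text_IO.Put_Line (F.Display_Full_Name);
end Show_Name;
```

In both cases the casing chosen by the user ("Foo.txt") is what gets displayed.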

If we now create another instance of Virtual_File, we can test whether the two reference the same file. The result would be true on Windows, for instance, but not on Unix.

   declare
      F2 : constant Virtual_File := Create ("/tmp/foo.txt");
   begin
      if F = F2 then
         null;
      end if;
   end;

On Unix, we could create a symbolic link from "/temp" to "/tmp". If we want our application to support symbolic links properly and recognize that "/temp/foo.txt" and "/tmp/foo.txt" are indeed the same file, we need to tell GNATCOLL that we are ready to pay the performance penalty, by calling:

   GNATCOLL.VFS.Symbolic_Links_Support (True);

This support is turned off by default. When loading a big project in GPS for instance (with several thousand files) on a slow file system (ClearCase), not checking explicitly for symbolic links is at least an order of magnitude faster. That's why this is left as an explicit choice to the application.

However, GNATCOLL is clever enough to cache the result of the symbolic link resolution, as well as the normalized file name. So if you reuse a Virtual_File several times, it will not need to perform the system calls again.
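Putting the two previous points together, a program that opts in to symbolic link support and then compares two files might look like the sketch below. It assumes that "/temp" is a symbolic link to "/tmp", as in the scenario described above; the outcome of the comparison depends on your file system, so no particular output is promised.

```ada
with Ada.Text_IO;
with GNATCOLL.VFS; use GNATCOLL.VFS;

procedure Compare_Files is
begin
   --  Opt in to link resolution before creating the files; this is
   --  off by default because of its cost on slow file systems.
   Symbolic_Links_Support (True);

   declare
      F1 : constant Virtual_File := Create ("/temp/foo.txt");
      F2 : constant Virtual_File := Create ("/tmp/foo.txt");
   begin
      --  "=" compares the files themselves, not the raw strings
      if F1 = F2 then
         Ada.Text_IO.Put_Line ("Same file");
      else
         Ada.Text_IO.Put_Line ("Different files");
      end if;
   end;
end Compare_Files;
```

Keeping F1 and F2 around, rather than recreating them from strings for each comparison, is what lets the cached information be reused.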

The API in GNATCOLL.VFS is much more extensive than what we have seen above, and provides ways to test whether we have a directory, whether the file is writable, read the contents of a file efficiently into memory, get the list of files in a directory, and even modify files. In each of these cases, GNATCOLL will make sure it uses the proper form of the file name when communicating with the system.
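As an illustration of a few of these services, here is a short sketch that lists the regular files of a directory and reports whether each one is writable. The subprogram names (Is_Directory, Read_Dir with the Files_Only filter, Is_Writable, Display_Base_Name, Unchecked_Free) are given from memory of the GNATCOLL.VFS specification; check them against the version you have installed. The directory itself is a hypothetical choice.

```ada
with Ada.Text_IO;
with GNATCOLL.VFS; use GNATCOLL.VFS;

procedure List_Files is
   Dir   : constant Virtual_File := Create ("/tmp");
   Files : File_Array_Access;
begin
   if not Dir.Is_Directory then
      Ada.Text_IO.Put_Line ("Not a directory");
      return;
   end if;

   --  Assumption: Files_Only restricts the result to regular files
   Files := Dir.Read_Dir (Files_Only);

   for J in Files'Range loop
      Ada.Text_IO.Put_Line
        (Files (J).Display_Base_Name
         & (if Files (J).Is_Writable
            then " (writable)"
            else " (read-only)"));
   end loop;

   --  The array is allocated by Read_Dir and must be released
   Unchecked_Free (Files);
end List_Files;
```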

In fact, GNATCOLL.VFS also provides support for remote file systems (this is the basis of the remote mode in GPS), where network operations are performed transparently when you access a file, but this will be the subject of another Gem.

Converting an application to GNATCOLL.VFS is no small amount of work. But it provides a number of benefits in terms of portability and performance.