cache_basics.py   [plain text]


#! /usr/bin/python2.4

# Copyright 2007 Google Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301,
# USA.
# 

"""Classes enabling definition and composition of caches.

This file defines caches used to speed up the does-this-file-exist
test that forms the basis of the C preprocessor's include-file
handling, and takes most of its time.

When the preprocessor sees a line like "#include <foo/bar.h>" it looks
for a file named "bar.h" in many directories: /usr/include/foo/bar.h,
./foo/bar.h, and so forth.  More precisely, the preprocessor is given
a "search path", which is a list of directory-names.  (By default, the
search-path looks like ['/usr/include', '/usr/local/include', ...],
but it's often extended via gcc flags like -I, -isystem, -iprefix,
etc.)  To resolve a single #include like "#include <foo/bar.h>", the
preprocessor goes through every directory in the search path, running

   os.stat(os.path.join(current_working_dir, search_dir, 'foo/bar.h'))

until the stat call succeeds.  With dozens of search-dirs to look
through, dozens of #include lines per source file, and hundreds of
source files per compilation, this can add up to millions of stat
calls.  Many of these calls are exactly the same, so caching is a big
win.

The cache of stat calls takes a filename as input and produces a bool
as output, saying if the filename exists.  For reasons that will
become clear in a moment, we actually represent the input filename as
a triple that breaks the filename into its three components:

   1) currdir: the current working directory (usually os.path.absdir('.'))
   2) searchdir: an element of the search path (eg '/usr/include', 'base')
   3) includepath: the thing that comes after "#include" in source files
                   ("foo/bar.h" in our examples above).

Why do we break the input into three parts?  Consider what cache-lookups
we have to do for a single source file:
   cache[os.path.join(currdir, searchdir1, includepath1)]   # #include <ipath1>
   cache[os.path.join(currdir, searchdir2, includepath1)]   # #include <ipath1>
   cache[os.path.join(currdir, searchdir3, includepath1)]   # #include <ipath1>
   [etc...until the cache-lookup returns True]
   cache[os.path.join(currdir, searchdir1, includepath2)]   # #include <ipath2>
   cache[os.path.join(currdir, searchdir2, includepath2)]   # #include <ipath2>
   cache[os.path.join(currdir, searchdir3, includepath2)]   # #include <ipath2>
   [etc]

By having the key be a triple, we avoid all those unnecessary
os.path.join calls.  But even if we do this, we notice bigger fish
to fry: the Python interpreter still has to do a string-hash of
currdir for every lookup, and also has to string-hash searchdirX and
includepathX many times.  It would be much more efficient if we did
those hashes ourselves, reducing the number of string-hashes from
O(|search-path| * |#include lines|) to
O(|search-path| + |#include lines|).

This motivates (finally!) the data structures in this file.  We have
three string-to-number maps, for mapping each currdir, searchdir, and
includepath to a small integer.  We put that all together in a cache,
that takes a triple of integers as its key and produces True if the
file exists, False if it does not, or None if its status is unknown.

The String-to-number Map(s)
---------------------------
The basic map that converts a filepath-path -- a currdir, searchdir,
or includepath -- to a small integer is called MapToIndex.  MapToIndex
provides mapping in both directions:
   index: a dictionary mapping paths (strings) to indices in 1..N, and
   string: an array of size N + 1 that implements the reverse mapping

So:
   obj.string[obj.index[path_as_string]] == path_as_string
   obj.index[obj.string[path_as_number]] == path_as_number

Note we map from 1..N, and not 0..N-1, which leave us 0 free to use as
a synonym for None or False.

There are also classes that specialize MapToIndex for specific purposes.

DirectoryMapToIndex assumes the input is a directory, and in
particular a directory that does not have a slash at the end of it (eg
"/etc").  It adds the trailing slash before inserting into the map.
This is useful because it allows us to use + to join this directory
with a relative filename, rather than the slower os.path.join().

RelpathMapToIndex assumes the input is a relative filepath, that is,
one that does not start with /.  When combined with DirectoryMapToIndex
entries, + can be used as a fast alternative to os.path.join().

CanonicalMapToIndex is a MapToIndex that canonializes its input before
inserting it into the map: resolving symlinks, getting rid of ..'s,
etc.  It takes an absolute path as input.

Other Caches
------------
Besides the maps from strings to integers, there are three other caches.
One is the realpath-cache, that takes a filename and returns
os.path.realpath(filename).  We cache this because os.path.realpath()
is very slow.  This is called CanonicalPath.

The second cache, the DirnameCache, maps an arbitrary pathname to
dirname(pathname), that is, the directory the pathname is in.  The
input pathname is represented by a (currdir_idx, searchdir_idx,
includepath_idx) triple.  The output is likewise represented as a
number: an index into the DirectoryMapToIndex structure.

The third cache is called SystemdirPrefixCache.  It tells you, for a
given absolute filepath, whether it is prefixed by a systemdir (that
is, one of the searchdirs that's built into cpp, such as /usr/include).
This is useful to cache because there are several systemdirs, and it's
expensive to check them all each time.

Naming Conventions
------------------
  currdir: the current working dir.
  searchdir: an element of the search-path (places cpp looks for .h files).
  includepath: the string a source file #includes.
  realpath: a full filepath with all its symlinks resolved:
     os.path.realpath(os.path.join(currdir, searchdir, includepath))
  FOO_idx: the small integer associated with the string FOO.

  includepath_map: the map that takes includepaths to their idx and back
     (a RelpathMapToIndex).
  directory_map: the map that takes currdirs and searchdirs to their
     idx and back.  It also is used to store dirname(filepath) for arbitrary
     filepaths -- basically, anything we know is a directory (a
     DirectoryMapToIndex).
  realpath_map: the map that takes full filepaths to their idx and back,
     canonicalizing them first (by resolving symlinks) (a
     CanonicalMapToIndex).

  searchlist: a list of searchdirs.  In gcc/cpp documentation, this is
     called the "search path", but for consistency, in this code we reserve
     the name "path" to mean "filesystem component," never "list of dirs".
     (A list of strings).
  systemdir: a searchdir that's built into cpp, rather than set via -I.
     (A string.)

  resolved_filepath: given an includepath, and a (possibly implicit)
     currdir and searchlist, the resolved_filepath is
        os.path.join(currdir, searchdir, includepath)
     for the first searchdir in searchlist for which the joined string
     exists.  This path can be represented in many ways: 1) a string like
     "foo/bar/baz.h" (if so, this string has been canonicalized to resolve
     symlinks and the like); 2) an index into realpath_map associated with
     that string; 3) a triple of indices; or 4) a pair of indices plus an
     assumption that os.getcwd() == currdir.

Pair Represenation of Filepaths
-------------------------------
A file is uniquely determined by the triple
   (currdir_idx, searchdir_idx, includepath_idx)
For a single compilation unit, the code will often start with a
chdir(currdir).  After that, we often refer to a file by the pair
   (searchdir_idx, includepath_idx)
which might be either an absolute filename or relative to $PWD.

We refer to this pair as a filepath_pair.
TODO(csilvers): find a better name?

The function IsFilepathPair(x) tests whether x is a pair that could
plausibly have a searchdir_idx as its first element and an
includepath_idx as its second.

Tests
-----
This code is currently only tested by regression tests of modules
using this one.
"""

__author__ = "opensource@google.com (Nils Klarlund, Craig Silverstein)"

import os
import os.path
import sys

import basics
import headermap
import statistics
import compiler_defaults

DIR_ARRAY_SIZE = 500

# We currently use the stat and realpath of GNU libc stat and
# realpath. They are about an order of magnitude faster than their
# Python counterparts, even when called through the Python/C
# interface.

try:
  import distcc_pump_c_extensions
  _OsPathExists = distcc_pump_c_extensions.OsPathExists
  _OsPathIsFile = distcc_pump_c_extensions.OsPathIsFile
  _PathRealpath = distcc_pump_c_extensions.Realpath
  _path_realpath_works = True
except ImportError:
  _OsPathExists = os.path.exists
  _OsPathIsFile = os.path.isfile
  _PathRealpath = os.path.realpath
  # os.path.realpath might have some bugs.  TODO(csilvers): check that here
  _path_realpath_works = False

Debug = basics.Debug
DEBUG_TRACE = basics.DEBUG_TRACE
DEBUG_TRACE1 = basics.DEBUG_TRACE1
DEBUG_TRACE2 = basics.DEBUG_TRACE2
DEBUG_WARNING = basics.DEBUG_WARNING
NotCoveredError = basics.NotCoveredError


####
#### SIMPLE CACHES
####

class CanonicalPath(object):
  """Memoizing calculation of realpaths.  realpath(x) is the 'canonical'
  version of x, with all symbolic links eliminated.
  """

  def __init__(self):
    self.cache = {}

  def Canonicalize(self, filepath):
    """Find a really canonical path, possibly memoized.

    Arguments:
      filepath: a filepath (string)
    Returns:
      the realpath of filepath (string)

    The following is irrelevant if we always use the distcc_pump_c_extensions
    realpath function.
    ---
    Apparently, in some versions of Python 2.4 at least, realpath does
    *not* resolve the last component of a filepath if it is a link:
       https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1213894&group_id=5470
    Make up for that: follow that final link until a real realpath has
    been found.

    Also, realpath is not idempotent.

    Solution (?): turn filepath into abspath before applying realpath;
    then we can cache results as well (without worring about value of
    current directory).

    The final problem -- that os.path.realpath is very slow, at least
    an order of magnitude slower than the gnu libc one --- is solved
    through caching all uses through an object of the present class.
    """
    assert isinstance(filepath, str)
    try:
      return self.cache[filepath]
    except KeyError:
      if _path_realpath_works:
        r = _PathRealpath(filepath)
        self.cache[filepath] = r
        return r
      # Fix for os.path.realpath idempotencey bug (Python 2.4).
      filepath_ = os.path.abspath(filepath)
      filepath_ = _PathRealpath(filepath_)
      # Fix for os.path.realpath bug (Python 2.4): symlinks at end not
      # resolved.
      for unused_i in range(10):
        if not os.path.islink(filepath_):
          break
        filepath_ = os.path.join(os.path.dirname(filepath_),
                                 os.readlink(filepath_))
      else:
        raise NotCoveredError("Too many symlinks in '%s'." % filepath)
      self.cache[filepath] = filepath_
      return filepath_

class DirnameCache(object):
  """Cache the mapping from filepath pairs to index of their directory names.

  The key is a triple (currdir_idx, searchdir_idx, includepath_idx).  The
  value is

    (dir_idx, dir_realpath_idx)

  where dir_idx is the index of dirname of the corresponding filepath, which
  possibly is relative, and dir_realpath_idx is the realpath index of the
  absolute location of the dirname.  The value currdir_idx is of possible
  importance for deteterming dir_realpath_idx, but plays no role in determining
  dir_idx."""

  def __init__(self, includepath_map, directory_map, realpath_map):
    """Constructor.
      Arguments:
        includepath_map: the map used to construct the includepath_idx
           that will be passed in as arguments to Lookup().
        directory_map: the map used to construct both the currdir_idx
           and searchdir_idx that will be passed in as arguments to
           Lookup().  It's also the data structure that produces dir_idx.
        realpath_map: a string-to-int map of canonicalized filepaths
    """
    self.includepath_map = includepath_map
    self.directory_map = directory_map
    self.realpath_map = realpath_map
    self.cache = {}

  def Lookup(self, currdir_idx, searchdir_idx, includepath_idx):
    """Return the directory and realpath indices of the dirname of the input.

    Arguments:
      currdir_idx: the directory index of the current directory
      searchdir_idx: a directory_map index
      includepath_idx: an includepath index
    Returns:
      a pair (directory map index, realpath index)

    See class documentation.

    Example: if the strings of the arguments indices put together make
    '/usr/include/foo/bar.h', then this routine will insert '/usr/include/foo/'
    into self.directory_map, and then return the corresponding pair (directory
    index of /usr/include/foo/, real path index of /usr/include/foo/).  If the
    arguments put together form "foo.h", then the directory index returned is
    that of "", the current directory, and the realpath index is that of
    currdir.
    """
    try:
      return self.cache[(currdir_idx, searchdir_idx, includepath_idx)]
    except KeyError:
      full_path = basics.PathFromDirMapEntryAndInclude(
                                  self.directory_map.string[searchdir_idx],
                                  self.includepath_map.string[includepath_idx])
      assert full_path
      directory = os.path.dirname(full_path)
      dir_idx = self.directory_map.Index(directory)
      rp_idx = self.realpath_map.Index(
          os.path.join(self.directory_map.string[currdir_idx],
                       directory))
      self.cache[(currdir_idx, searchdir_idx, includepath_idx)] = (dir_idx,
                                                                   rp_idx)
      return (dir_idx, rp_idx)
      

class SystemdirPrefixCache(object):
  """A cache of information about whether a file exists in a systemdir.

  A systemdir is a searchdir that is built in to the C/C++
  preprocessor.  That is, when the preprocessor is figuring out what
  directory an #include is in, these are the directories it's
  hard-coded in to check (you can add other directories via -I).  This
  cache records, for a given filepath, whether it starts with a
  systemdir.  This is useful to identify whether the path is likely to
  correspond to a system include-file (such as stdio.h).  Such files are
  unlikely to change, and are likely to already exist on the distcc
  servers, both of which are useful things to know for optimization.

  For speed, users can access self.cache directly, rather than going
  through the StartsWithSystemdir API.  Be sure to call FillCache() to
  make sure the cache is populated, before accessing it!
  """

  def __init__(self, systemdirs):
    """Constructor.

    Argument:
      systemdirs: the list of system-directories the preprocessor
        uses.  It's a list of strings, probably extracted from the
        preprocessor itself.  Each systemdir should end in a slash.

    In practice, systemdirs will start empty, and later some routine
    (in parse_command.py) will magically fill it.  So be sure to wait
    for that before calling FillCache!
    TODO(csilvers): normalize this; ideally pass systemdirs in to FillCache.
    """
    self.systemdirs = systemdirs
    # self.cache[i] will be True, False, or None for not-yet-checked.
    self.cache = [None]

  def FillCache(self, realpath_map):
    """Ensures that there's a cache entry for every index in realpath_map.

    Argument:
     realpath_map: a string-to-int map of canonicalized filepaths we know.
     
    After this function is called, the cache entry is True iff
    realpath.startswith(systemdir) is True for any of the systemdirs
    passed in to our constructor.
    """
    if len(self.cache) >= realpath_map.Length():
      return    # we're already all full
    for realpath_idx in xrange(len(self.cache), realpath_map.Length()):
      realpath = realpath_map.string[realpath_idx]
      for systemdir in self.systemdirs:
        if realpath.startswith(systemdir):
          self.cache.append(True)
          break
      else:    # we get here if the for never 'break'ed
        self.cache.append(False)

    assert len(self.cache) == realpath_map.Length()

  def StartsWithSystemdir(self, realpath_idx, realpath_map):
    """Return True iff realpath starts with a systemdir.

    Arguments:
     realpath_idx: the index of the realpath we want to check.
     realpath_map: the map from realpath_idx to a string.

    Return True iff realpath.startswith(systemdir) for any of the
    systemdirs passed in to our constructor.  (For speed, you can
    access self.cache directly instead of calling this, but make
    sure FillCache() has been called first!)
    """
    self.FillCache(realpath_map)
    return self.cache[realpath_idx]


####
#### MAP_TO_INDEX AND ITS SPECIALIZATIONS
####

class MapToIndex(object):
  """Maps every object it sees to a unique small integer.  In
  practice, this class is used to map path-components (which are strings).
  """

  def __init__(self):
    """Constructor.

    Instance variables:
      map: a dictionary such that map[path] is the index of path
      string: a list satisfying: string[i] is the path such that map[path] = i
    """

    # Do not make the mistake of letting a real index be 0. (Hint:
    # because "if path:" then does not distinguish between 0 and None.)
    self.index = {None:None}
    self.string = [None]

  def _Invariant_(self):
    return len(self.index) == len(self.string)

  def Index(self, path):
    """Returns the index i > 0 of path."""
    assert self._Invariant_()
    try:
      return self.index[path]
    except KeyError:
      self.index[path] = len(self.string)
      self.string.append(path)
      return len(self.string) - 1

  def String(self, i):
    """Returns the path such that Index(path) == i."""
    assert self._Invariant_()
    assert 0 < i < self.Length()
    return self.string[i]

  def Length(self):
    """One more than the number of elements indexed."""
    assert self._Invariant_()
    return len(self.string)


class DirectoryMapToIndex(MapToIndex):
  """Like a normal MapToIndex, but assumes the keys are directories,
  and in particular, directories without a trailing slash (eg "/etc").
  It stores the directories in the map, but appends the trailing slash
  first.  This is another type of normalization, and useful for cheap
  path-joining (eg using + instead of os.path.join).
  """

  def Index(self, directory):
    """Return index d > 0 of normalized directory.
    Argument:
      directory: a string, either empty or not ending in '/'.

    The empty string is not changed, but other strings are stored with
    a '/' appended.
    """
    if directory != "" and directory != "/":
      assert directory[-1] != '/', directory
      directory = directory + '/'
    return MapToIndex.Index(self, directory)


class RelpathMapToIndex(MapToIndex):
  """Like a normal MapToIndex, but assumes the keys are relative
  filesystem paths, that is, filesystem paths not starting with /.
  This is useful for "cheap" normalization: this invariant ensures that
  os.path.join(some-directorymap-string, some-relpathmap-string) can
  be implemented using +.

  We actually do allow storing absolute paths if option
  --unsafe_absolute_includes is in use.  But, then, we're careful in Resolve
  (below) to bail out.
  """

  def Index(self, relpath, ignore_absolute_path_warning=False):
    """Return index d > 0 of relative path.
    Args:
      directory: a string not starting with /.
      ignore_absolute_path_warning: a Boolean

    The variable ignore_absolute_path_warning is set to True in order to
    override the requirement that filepaths are relative. This is useful for the
    compilation unit filepath and filepaths of -include's: they are permitted to
    be absolute because the command line can still be rewritten on the server.
    The server tweaks their location to become relative to the server root.
    """
    if os.path.isabs(relpath) and not ignore_absolute_path_warning:
      if basics.opt_unsafe_absolute_includes:
        Debug(DEBUG_WARNING,
              "absolute filepath '%s' was IGNORED"
              " (correctness of build may be affected)", relpath)
      else:
          raise NotCoveredError("Filepath must be relative but isn't: '%s'."
                                " Consider setting INCLUDE_SERVER_ARGS='--"
                                "unsafe_absolute_includes'."
                                % relpath,
                                send_email=False)
    # Now, remove leading "./" so as not to start an infinite regression when
    # say foo.c contains:
    #
    #   #include "./foo.c"
    #
    # which mighy seduce a recursive include analyzer down the forbidden path:
    #
    #   "foo.c", # "./foo.c", "././foo.c." etc.
    while relpath.startswith("./"):
      relpath = relpath[2:]
    return MapToIndex.Index(self, relpath)


class CanonicalMapToIndex(MapToIndex):
  """Like a normal MapToIndex, but assumes the keys are absolute
  filepaths, and canonicalizes them before inserting into the map.
  'Canonicalize' means to do the equivalent of os.path.realpath(),
  which mostly involves resolving symlinks in the filepath.
  """

  def __init__(self, canonicalize):
    """Constructor.
    Argument:
      canonicalize: an instance of the CanonicalPath cache."""
    MapToIndex.__init__(self)
    self.canonicalize = canonicalize

  def Index(self, filepath):
    """Return the realpath index r of filepath.  filepath should be
    an absolute filename.
    """
    return MapToIndex.Index(self, self.canonicalize(filepath))

    
def RetrieveDirectoriesExceptSys(directory_map, realpath_map,
                                 systemdir_prefix_cache, directory_idxs):
  """Calculate the set of non-system directories of an index list.

  Arguments:
    directory_map: a DirectoryMapToIndex cache
    realpath_map: a CanonicalMapToIndex cache
    directory_idxs: a list or tuple of directory_map indices
  Returns:
    the corresponding tuple of directories except for those whose
    realpath has a prefix that is a sysdir
    
  The directories in the returned list have their trailing '/'
  stripped.
  """
  result = []
  for dir_idx in directory_idxs:
    # Index the absolute path; this will let us know whether dir_idx is under a
    # default systemdir of the compiler.

    # NOTE: This does NOT handle the directory_map entries for framework
    # search paths that start with '*H'/'*P', but this routine is only called
    # from parse_command_test.py so we're letting it go.

    rp_idx = realpath_map.Index(os.path.join(
	os.getcwd(), directory_map.string[dir_idx]))
    systemdir_prefix_cache.FillCache(realpath_map)
    if not systemdir_prefix_cache.cache[rp_idx]:
      result.append(directory_map.string[dir_idx].rstrip('/'))
  return tuple(result)


####
#### THE STAT CACHES
####

class SimpleBuildStat(object):
  """Stat cache that works with strings, not indices."""

  def __init__(self):
    self.cache = {}

  def Lookup(self, filepath):
    """Returns true if filepath exists."""
    try:
      return self.cache[filepath]
    except KeyError:
      result = self.cache[filepath] = _OsPathExists(filepath)
      return result


class BuildStatCache(object):
  """A highly optimized mechanism for stat queries of filepaths,
  as represented by a triple of indexes: currdir_idx, searchdir_idx,
  filepath_idx.  Given this input, we can say whether a regular file
  represented by this triple exists on the filesystem, and if so,
  what its canonical pathname is: that is, the pathname after all
  symlinks have been resolved.

  The hash table is three-level structure:
   - build_stat[currdir_idx] contains an array for each includepath_idx
   - build_stat[currdir_idx][includepath_idx] is this array, and
   - build_stat[currdir_idx][includepath_idx][searchdir_idx] is either
      * False if os.path.join(currdir, searchdir, includepath) does not exist
      * True if it does
      * None when it is not known whether it exists or not
  In addition, we keep a parallel structure for the realpath, that lets us
  quickly map from a filepath to os.path.realpath(filepath).
   - real_stat[currdir_idx] contains an array for each fp
   - real_stat[currdir_idx][includepath_idx] is this array, and
   - real_stat[currdir_idx][includepath_idx][searchdir_idx] is either
      * realpath_idx, such that realpath_map.string[realpath_idx] =
           os.path.realpath(os.path.join(currdir, searchdir, includepath))
        when build_stat[currdir_idx][includepath_idx][searchdir_idx] = True
      * None, otherwise
  """

  def __init__(self, includepath_map, directory_map, realpath_map):
    self.build_stat = {}
    self.real_stat = {}
    self.includepath_map = includepath_map
    self.directory_map = directory_map
    self.realpath_map = realpath_map
    self.path_observations = []
    self.framework_fallback = 'FRAMEWORK_FALLBACK'

  def _Verify(self, currdir_idx, searchdir_idx, includepath_idx):
    """Verify that the cached result is the same as obtained by stat call.
    Prerequisite: we've done a chdir(currdir) before this call.
    """
    assert 1 <= includepath_idx < self.includepath_map.Length()
    assert 1 <= searchdir_idx < self.directory_map.Length()
    if __debug__: statistics.sys_stat_counter += 1

    # Since we know directory_map entries end in /, and includepaths don't
    # start with / (who does "#include </usr/include/string.h>"??), we can
    # use + instead of the more expensive os.path.join().
    # Make sure $PWD is currdir, so we don't need to include it in our stat().
    assert os.getcwd() + '/' == self.directory_map.string[currdir_idx]
    built_path = basics.PathFromDirMapEntryAndInclude(
                             self.directory_map.string[searchdir_idx],
                             self.includepath_map.string[includepath_idx])
    really_exists = built_path and _OsPathIsFile(built_path) or False
    cache_exists = self.build_stat[currdir_idx][includepath_idx][searchdir_idx]
    assert isinstance(cache_exists, bool)
    if cache_exists != really_exists:
      filepath = os.path.join(self.directory_map.string[currdir_idx],
                              basics.PathFromDirMapEntryAndInclude(
                                self.directory_map.string[searchdir_idx],
                                self.includepath_map.string[includepath_idx]) or
                              (self.directory_map.string[searchdir_idx] +
                               self.includepath_map.string[includepath_idx]))
      sys.exit("FATAL ERROR: "
               "Cache inconsistency: '%s' %s, but earlier this path %s." % (
        filepath,
        really_exists and "exists" or "does not exist",
        cache_exists and "existed" or "did not exist"))

  def WarnAboutPathObservations(self, translation_unit):
    """Print new paths found according to path observation expression option.

    Args:
      translation_unit: a string embedded in warning
    """
    for (includepath, relpath, realpath) in self.path_observations:
      Debug(DEBUG_WARNING,
            "For translation unit '%s',"
            " lookup of file '%s' resolved to '%s' whose realpath is '%s'.",
             translation_unit, includepath, relpath, realpath)
    self.path_observations = []

  def Resolve(self, includepath_idx, currdir_idx, searchdir_idx,
              searchlist_idxs, includer_dir_idx):
    """Says whether (currdir_idx, searchdir_idx, includepath_idx) exists,
    and if so what its canonicalized form is (with symlinks resolved).
    TODO(csilvers): rearrange the order of the arguments.

    Args:
      includepath_idx: The index of an includepath, from e.g. "#include <foo>"
      currdir_idx:     The index of the current working dir.  Note that we
                       require os.getcwd() == currdir before calling Resolve!
      searchdir_idx:   A single searchdir, which is prepended to searchlist,
                       or None to not prepend to the searchlist.
      searchlist_idxs: A list of directory indices.
      includer_dir_idx: The index of the directory containing the file that
                        #included the file being resolved.  Used for
                        subframework include fallbacks.

    Returns:
      1) (None, None) if, for all sl_idx in [searchdir_idx] + searchlist_idxs,
         os.path.join(currdir, sp, includepath) does not exist.
      2) ((sl_idx, includepath_idx), realpath_idx)
         if, for some sl_idx in [searchdir_idx] + searchlist_idxs,
         os.path.join(currdir, sp, includepath) does exist.  In this case,
         sl_idx is the index of the first searchlist entry for which the
         exists-test succeeds, and realpath_idx is the index into the
         realpath_map of os.path.join(currdir, sp, includepath).

    Again, we require as a prequesite that os.getcwd() must equal currdir:
       os.getcwd() + '/' == self.directory_map.string[currdir_idx]

    To support Apple-style subframework includes, if the include is not found
    in the directories identified by searchdir_idx or searchlist_idxs, and
    the include form appears suitable for a framework include, and the
    including file comes from a framework, this function will look inside
    the including framework's subframeworks directory for the include.  As the
    Apple compiler does, this behavior is treated only as a fallback, after
    the normal search lists are consulted.  Apple's code for handling
    framework includes is in:
    http://www.opensource.apple.com/darwinsource/DevToolsOct2008/gcc_42-5566/gcc/config/darwin-c.c

    This also supports header map lookups: if any sl_idx in [searchdir_idx] +
    searchlist_idxs corresponds to a header map file, that header map will
    be checked for includepath rather than searching the filesystem.
    """
    includepath = self.includepath_map.string[includepath_idx]
    if includepath.startswith('/'):
      # We really don't want to start exploring absolute includepaths; what's
      # the sl_idx to return for example? And what about the use of '+'
      # (as an optimization) below instead of os.path.join.
      return (None, None)
    dir_map_string = self.directory_map.string   # memoize the fn pointer
    build_stat = self.build_stat
    real_stat = self.real_stat
    if __debug__:
      dir_map = self.directory_map
      assert 0 < includepath_idx < self.includepath_map.Length()
      assert 0 < currdir_idx < dir_map.Length()
      assert 0 < includer_dir_idx < dir_map.Length()
      assert searchdir_idx is None or 1 <= searchdir_idx < dir_map.Length()
      for sl_idx in searchlist_idxs:
        assert sl_idx < dir_map.Length()
      assert os.getcwd() + '/' == dir_map_string[currdir_idx], (
	"'%s/' != '%s'" % (os.getcwd(),  dir_map_string[currdir_idx]))
      Debug(DEBUG_TRACE2, "Resolve: includepath: '%s', currdir: '%s', "
            "searchdir: '%s', searchlist: %s" %
	    (includepath,
	     dir_map_string[currdir_idx],
	     searchdir_idx and dir_map_string[searchdir_idx],
	     "  \n".join([dir_map_string[idx] for idx in searchlist_idxs])))
    try:
      # Locate the array (list) relative to currdir_idx and includepath_idx
      searchdir_stats = build_stat[currdir_idx][includepath_idx]
      # Locate the corresponding array of realpath names
      searchdir_realpaths = real_stat[currdir_idx][includepath_idx]
    except KeyError:    # We'll need to grow the relevant arrays
      currdir_stats = build_stat.setdefault(currdir_idx, {})
      currdir_realpaths = real_stat.setdefault(currdir_idx, {})
      searchdir_stats = currdir_stats[includepath_idx] = \
                        [None] * DIR_ARRAY_SIZE
      searchdir_realpaths = currdir_realpaths[includepath_idx] = \
                        [None] * DIR_ARRAY_SIZE

    # Try searchdir_idx if not None, then try every index in searchlist_idxs.
    # This inner loop may be executed tens of millions of times.
    # Do not try to form [searchdir_idx] + searchlist_idxs -- too expensive!
    for searchlist in (searchdir_idx and [searchdir_idx] or [],
                       searchlist_idxs,
                       self.framework_fallback):

      if searchlist is self.framework_fallback:
        # When the special framework_fallback object is reached, attempt to
        # do a subframework include.  The special token defers the "might
        # this be a framework include?" and "was it included by a framework?"
        # analysis until it might actually be needed.
        searchlist = []
        if '/' in includepath:
          # The #include has a slash in it, so it might be a framework
          # #include.
          includer_dir = dir_map_string[includer_dir_idx]
          dot_framework_slash = '.framework/'
          dot_framework_index = includer_dir.find(dot_framework_slash)
          if dot_framework_index != -1:
            # The includer was a framework, so look for a subframework include
            # on this pass.
            subframework_dir = includer_dir[0:dot_framework_index +
                                              len(dot_framework_slash)] + \
                               'Frameworks'
            # *H and *P are explained in ParseCommandArgs.IndexDirs.
            searchlist = [self.directory_map.Index('*H' + subframework_dir),
                          self.directory_map.Index('*P' + subframework_dir)]

      for sl_idx in searchlist:
        if __debug__:
          statistics.search_counter += 1
          statistics.build_stat_counter += 1
        try:
          # We expect that searchdir_stats[sl_idx] == False, because
          # we've usually seen sl_idx before for our includepath and
          # our currdir --- and includepath does not usually exist
          # relative to the sp directory.  We're optimizing for this
          # case of course. That should give us a rate of a couple of
          # million iterations per second (for this case).
          #
          # _Verify should never be called when sl_idx corresponds to a
          # header map.
          if searchdir_stats[sl_idx] == False:
            if __debug__: self._Verify(currdir_idx, sl_idx, includepath_idx)
            continue
          if searchdir_stats[sl_idx]:
            if __debug__: self._Verify(currdir_idx, sl_idx, includepath_idx)
            return ((sl_idx, includepath_idx), searchdir_realpaths[sl_idx])
        except IndexError:   # DIR_ARRAY_SIZE wasn't big enough; let's double
          searchdir_stats.extend([None] * max(sl_idx, len(searchdir_stats)))
          searchdir_realpaths.extend([None] * max(sl_idx, len(searchdir_stats)))

        # If we get here, result is not cached yet.
        sl_str = dir_map_string[sl_idx]

        # If this search "directory" is actually a header map, query the
        # header map.  Apple gcc does a file type check for this, we cheat
        # and look for "directories" whose names end in ".hmap", as that's
        # how Xcode always saves its header maps.  os.path.abspath causes
        # the header map to be taken as relative to currdir_idx because
        # os.getcwd() == currdir within this function.
        relpath = \
            sl_str.endswith('.hmap/') and \
            headermap.global_collection.Resolve(os.path.abspath(sl_str),
                                                includepath)
        if relpath:
          if relpath[-len(includepath):] == includepath:
            # When relpath ends with includepath, just strip includepath off
            # and use the preceding part of relpath as the directory that
            # contains the header.
            sl_str = relpath[0:-len(includepath)]
          else:
            # When relpath does not end with includepath, rewrite includepath
            # to just be the last component of relpath, and use the preceding
            # part of relpath as the directory that contains the header.
            # This invalidates includepath_idx!
            includepath = os.path.basename(relpath)
            sl_str = os.path.dirname(relpath) + '/'

          sl_idx = self.directory_map.Index(sl_str[:-1])

          # Trust the header map.  Don't check for the file on disk, and
          # don't touch searchdir_stats.  It's necessary to leave
          # searchdir_stats alone to avoid calling _Verify, and because
          # searchdir_stats is based on includepath_idx which may no longer
          # correspond to includepath.
          found = True
        else:
          if __debug__: statistics.sys_stat_counter += 1
          # We do not explictly take into account currdir_idx, because
          # of the check above that os.getcwd is set to current_dir.
          relpath = basics.PathFromDirMapEntryAndInclude(sl_str, includepath)
          found = relpath and _OsPathIsFile(relpath) or False
          searchdir_stats[sl_idx] = found

        if found:
          rpath = os.path.join(dir_map_string[currdir_idx], relpath)
          realpath_idx = searchdir_realpaths[sl_idx] = (
            self.realpath_map.Index(rpath))
          # This is the place to catch errant files according to user defined
          # regular expression path_observation_re.
          if basics.opt_path_observation_re:
            realpath = self.realpath_map.string[realpath_idx]
            if basics.opt_path_observation_re.search(realpath):
              self.path_observations.append((includepath, relpath, realpath))
          return ((sl_idx, includepath_idx), realpath_idx)

    if __debug__: Debug(DEBUG_TRACE2, "Resolve: failed")
    return (None, None)


class SetUpCaches(object):
  """Erect the edifice of caches.

  Instance variables:
    includepath_map: RelpathMapToIndex
    directory_map: DirectoryMapToIndex
    realpath_map: CanonicalMapToIndex

    canonical_path: CanonicalPath
    build_stat_cache: BuildStatCache
    dirname_cache: DirnameCache
    simple_build_stat: SimpleBuildStat

    client_root: a path such as /dev/shm/tmpX.include_server-X-1
                 (used during default system dir determination)

    IsFilepathIndex: test for filepath index
    IsDirectoryIndex: test for director index
    IsRealpathIndex: test for realpath index
    IsFilepathPair: test for filepath pair
  """

  def __init__(self, client_root):
     
    # A memoizing (caching) class to canonicalize a path: mostly by
    # resolving any symlinks in the path-component.
    self.canonical_path = CanonicalPath()

    # The index-map for includepath names: things seen after '#include'.
    self.includepath_map = RelpathMapToIndex()

    # The index-map for searchdir names and currdir as well.  Also used any
    # other time we have something we know is a directory (eg dirname(foo)).
    self.directory_map = DirectoryMapToIndex()

    # The index-map for realpaths: the full pathname of an include, with
    # symlinks resolved and such (hence the name realpath).
    self.realpath_map = CanonicalMapToIndex(self.canonical_path.Canonicalize)

    # A cache of the directory part of filepaths.  Note it uses the
    # directory_map to actually store the mapping.
    self.dirname_cache = DirnameCache(self.includepath_map, self.directory_map,
                                      self.realpath_map)

    # A cache of whether a realpath starts with a system searchdir or
    # not.  Note: at this time, system_dirs_default_all will be empty.
    # It will get filled via processing in parse_command.py.  This is
    # why we need to store the compiler_defaults instance, to make
    # sure "our" system_dirs_default_all is updated.
    # TODO(csilvers): get rid of this once prefix_cache TODO is cleaned up
    self.compiler_defaults = compiler_defaults.CompilerDefaults(
      self.canonical_path.Canonicalize, client_root)
    self.systemdir_prefix_cache = SystemdirPrefixCache(
      self.compiler_defaults.system_dirs_default_all)

    # The main caches, that say whether a file exists or not.  We have
    # two: a simple one that takes a filepath (string) as an argument,
    # and the complicated one that works with index-triples.
    self.simple_build_stat = SimpleBuildStat()
    self.build_stat_cache = BuildStatCache(self.includepath_map,
                                           self.directory_map,
                                           self.realpath_map)

    # Convenient function closures to test for various semantic datatypes.
    self.IsIncludepathIndex = (lambda x:
                                 isinstance(x, int)
                                 and 0 < x < self.includepath_map.Length())

    self.IsSearchdirIndex = (lambda x:
                               isinstance(x, int)
                               and 0 < x < self.directory_map.Length())

    self.IsCurrdirIndex = (lambda x:
                             isinstance(x, int)
                             and 0 < x < self.directory_map.Length())

    self.IsFilepathPair = (lambda x:
                             isinstance(x, tuple)
                             and len(x) == 2
                             and self.IsSearchdirIndex(x[0])
                             and self.IsIncludepathIndex(x[1]))

    self.IsRealpathIndex = (lambda x:
                              isinstance(x, int)
                              and 0 < x < self.realpath_map.Length())