Content-type: text/html

<HTML>
<HEAD>
<TITLE>Manual Pages - Search Help</TITLE>
<!-- Changed by: Michael Hamilton,  6-Apr-1996 -->
<!-- Note: this is not html, but requires preprocessing -->
</HEAD>
<BODY>
<H1>Manual Pages - Search Help</H1>

<A HREF="%cg/mansearch">Perform another search</A>
<BR>
<A HREF="%cg/man2html">Return to Main Contents</A>
<P>
<HR>
<P>
The full text index uses the <I>Glimpse</I> text indexing system.
The 
<A HREF="%cg/man2html?1+glimpse">glimpse(1)</A>
manual page documents glimpse in full.  This summary documents those
features of glimpse that are valid when searching through the manual pages.
<P>
<HR>

<H2>Search Options</H2>

The following search options must be at the start of the search string.

<DL COMPACT>

<DT><B>-</B> <I>#</I>
<DD>
<I>#</I> is an integer between 1 and 8
specifying the maximum number of errors
permitted in finding the approximate matches (the default is zero).
Generally, each insertion, deletion, or substitution counts as one error.
Since the index stores only lower case characters, errors of
substituting upper case with lower case may be missed.

<DT><B>-B</B>
<DD>
Best match mode.  (Warning: -B sometimes misses matches.  It is safer
to specify the number of errors explicitly.)
When -B is specified and no exact matches are found, glimpse
will continue to search until the closest matches (i.e., the ones
with minimum number of errors)
are found.
In general, -B may be slower than -#, but not by very much.
Since the index stores only lower case characters, errors of
substituting upper case with lower case may be missed.

<DT><B>-L <I>x</I> | <I>x</I>:<I>y</I> | <I>x</I>:<I>y</I>:<I>z</I></B>
<DD>
A non-zero value of <I>x</I> limits the number of matches
that will be shown.  
A non-zero value of <I>y</I> limits the number of man pages
that will be shown.
A non-zero value of <I>z</I> will only show pages that have
fewer than <I>z</I> matches.
For example, -L 0:10 will output all matches for the first 10 files that
contain a match.

<DT><B>-F</B> <I>pattern</I>
<DD>
The -F option provides a pattern that restricts the search results to
those filenames that match the pattern.
For example, <I>-F 8</I> effectively restricts matches to section 8.

<DT><B>-w</B>
<DD>
Search for the pattern as a word - i.e., surrounded by non-alphanumeric
characters.  For example, 
<I>-w -1 car</I> will match cars, but not characters and not
car10.
The non-alphanumeric characters <I>must</I>
surround the match; they cannot be counted as errors.
This option does not work with regular expressions.

<DT><B>-W</B>
<DD>
The default for Boolean AND queries is that they cover one record
(the default for a record is one line) at a time.  
For example, glimpse 'good;bad' will output all lines containing
both 'good' and 'bad'.
The -W option changes the scope of Booleans to be the whole file.
Within a file glimpse will output all matches to any of the patterns.
So, glimpse -W 'good;bad' will output all lines containing 'good' 
<I>or</I> 'bad', but only in files that contain both patterns.

<DT><B>-k</B>
<DD>
No symbol in the pattern is treated as a meta character. 
For example, <I>-k a(b|c)*d</I> will find  
the occurrences of a(b|c)*d whereas <I>a(b|c)*d</I>
will find substrings that match the regular expression 'a(b|c)*d'.
(The only exception is ^ at the beginning of the pattern and $ at the
end of the pattern, which are still interpreted in the usual way.  
Use \^ or \$ if you need them verbatim.)

</DL>
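
<P>
To make the error count used by <B>-</B><I>#</I> and <B>-B</B> concrete,
the following Python sketch counts the insertions, deletions, and
substitutions needed to turn one word into another.
It is an illustration only: the name <I>edit_errors</I> is made up for this
page, and the code compares two whole words, whereas glimpse looks for
approximate matches inside the text with its own algorithm.
<PRE>
```python
def edit_errors(pattern, word):
    """Count the insertions, deletions, and substitutions between two strings."""
    # previous[j] holds the error count between the pattern prefix processed
    # so far and the first j characters of word.
    previous = list(range(len(word) + 1))
    for i, p_char in enumerate(pattern, start=1):
        current = [i]
        for j, w_char in enumerate(word, start=1):
            substitution = previous[j - 1] + (p_char != w_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for candidate in ("car", "cars", "car10"):
        print(candidate, edit_errors("car", candidate))
```
</PRE>
Under this count, cars is one error away from car while car10 is two
errors away, which agrees with the <I>-w -1 car</I> example above.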

<P>
<HR>

<H2>Patterns</H2>

<I>Glimpse</I> 
supports a large variety of patterns, including simple
strings, strings with classes of characters, sets of strings, 
wild cards, and regular expressions (see <A HREF="#limitations">LIMITATIONS</A>).

<DL COMPACT>
<DT><B>Strings   </B><DD>
Strings are any sequence of characters, including the special symbols
`^' for beginning of line and `$' for end of line.
The following special characters
(`<B>$</B>', `<B>^</B>', `<B>*</B>', `<B>[</B>', `<B>|</B>',
`<B>(</B>', `<B>)</B>', `<B>!</B>', and `<B>\</B>'),
as well as the following meta characters special to glimpse (and agrep):
`<B>;</B>', `<B>,</B>', `<B>#</B>', `<B>&lt;</B>', `<B>&gt;</B>',
`<B>-</B>', and `<B>.</B>',
should be preceded by `\' if they are to be matched as regular
characters.  For example, \^abc\\ corresponds to the string ^abc\,
whereas ^abc corresponds to the string abc at the beginning of a
line.
<DT><B>Classes of characters</B><DD>
A list of characters inside [] (in order) corresponds to any character
from the list.  For example, [a-ho-z] is any character between a and h
or between o and z.  The symbol `^' inside [] complements the list.
For example, [^i-n] denotes any character in the character set except
the characters 'i' to 'n'.
The symbol `^' thus has two meanings, but this is consistent with
egrep.
The symbol `.' (don't care) stands for any symbol (except for the
newline symbol).
<DT><B>Boolean operations</B><DD>
<B>Glimpse</B>
supports an `AND' operation denoted by the symbol `;',
an `OR' operation denoted by the symbol `,',
or any combination.
For example,
<I>glimpse 'pizza;cheeseburger'</I> will output all lines containing
both patterns.
<I>glimpse -F 'gnu;\.c$' 'define;DEFAULT'</I>
will output all lines containing both 'define' and 'DEFAULT'
(anywhere in the line, not necessarily in order) in
files whose name contains 'gnu' and ends with .c.
<I>glimpse '{political,computer};science'</I> will match 'political science' 
or 'science of computers'.
<DT><B>Wild cards</B><DD>
The symbol '#' is used to denote a sequence 
of any number (including 0) 
of arbitrary characters (see <A HREF="#limitations">LIMITATIONS</A>).  
The symbol # is equivalent to .* in egrep.
In fact, .* will work too, because it is a valid regular expression
(see below), but unless this is part of an actual regular expression,
# will work faster. 
(Currently glimpse is experiencing some problems with #.)
<DT><B>Combination of exact and approximate matching</B><DD>
Any pattern inside angle brackets &lt;&gt; must match the text exactly even
if the match is with errors.  For example, &lt;mathemat&gt;ics matches
mathematical with one error (replacing the last s with an a), but
mathe&lt;matics&gt; does not match mathematical no matter how many errors are
allowed.
(This option is buggy at the moment.)
<DT><B>Regular expressions</B><DD>
Since the index is word based, a regular expression must match
words that appear in the index for glimpse to find it.
Glimpse first strips all non-alphabetic characters from the regular
expression, and searches the index for all remaining words.
It then applies the regular expression matching algorithm to the
files found in the index.
For example, <I>glimpse</I> 'abc.*xyz' will search the index
for all files that contain both 'abc' and 'xyz', and then
search directly for 'abc.*xyz' in those files.
(If you use glimpse -w 'abc.*xyz', then 'abcxyz' will not be found,
because glimpse will think that abc and xyz need to be matched as
whole words.)
The syntax of regular expressions in <B>glimpse</B> is in general the same as
that for <B>agrep</B>.  The union operation `|', Kleene closure `*',
and parentheses () are all supported.
Currently '+' is not supported.
Regular expressions are currently limited to approximately 30
characters (generally excluding meta characters).  Some options
(-d, -w, -t, -x, -D, -I, -S) do not 
currently work with regular expressions.
The maximal number of errors for regular expressions that use '*'
or '|' is 4. (See <A HREF="#limitations">LIMITATIONS</A>.)

</DL>
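
<P>
The two-step lookup described under <B>Regular expressions</B> can be
sketched in Python as follows.
This is a simplified model and not glimpse's own code: the names
<I>index_words</I> and <I>regex_search</I> are hypothetical, a dictionary of
file names and texts stands in for the indexed manual pages, only ASCII
letters are treated as alphabetic, and Python's regular expression syntax
is used in place of agrep's.
<PRE>
```python
import re


def index_words(pattern):
    """Return the words left after dropping non-alphabetic characters."""
    return re.findall(r"[A-Za-z]+", pattern)


def regex_search(pattern, files):
    """Two-step search: narrow down by index words, then run the regex."""
    wanted = [word.lower() for word in index_words(pattern)]
    hits = []
    for name, text in sorted(files.items()):
        indexed = set(index_words(text.lower()))
        # Step 1: every word from the pattern must occur inside some indexed word.
        if not all(any(w in entry for entry in indexed) for w in wanted):
            continue
        # Step 2: apply the regular expression to the candidate file itself.
        if re.search(pattern, text):
            hits.append(name)
    return hits


if __name__ == "__main__":
    pages = {
        "one.1": "abc comes before xyz here",
        "two.1": "xyz comes before abc here",
        "three.1": "only abc here",
    }
    print(regex_search("abc.*xyz", pages))
```
</PRE>
In this model the second file passes the index step, because it contains
both words, and is only rejected once the expression itself is applied.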

<HR>

<H2><A NAME="limitations">Limitations</A></H2>

The index of glimpse is word based.  A pattern that contains more than
one word cannot be found in the index.  The way glimpse overcomes this
weakness is by splitting any multi-word pattern into its set of words
and looking for all of them in the index.
For example, <B>glimpse 'linear programming'</B> will first consult the index
to find all files containing both <I>linear</I> and <I>programming</I>,
and then apply agrep to find the combined pattern.
This is usually an effective solution, but it can be slow for
cases where both words are very common, but their combination is not.
<P>

As was mentioned in the section on PATTERNS above, some characters
serve as meta characters for glimpse and need to be
preceded by '\' to search for them.  The most common
examples are the characters '.' (which stands for a wild card),
and '*' (the Kleene closure).
So, &quot;glimpse ab.de&quot; will match abcde, but &quot;glimpse ab\.de&quot;
will not, and &quot;glimpse ab*de&quot; will not match ab*de, but 
&quot;glimpse ab\*de&quot; will.
The meta character - is translated automatically to a hyphen
unless it appears between [] (in which case it denotes a range of
characters).
<P>
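When a search string has to be taken literally, the escaping rule can be
applied mechanically.
The Python sketch below does so for the characters listed under
<B>Strings</B> in the Patterns section; the name <I>escape_literal</I> is an
assumption made for this example and is not part of glimpse or of the
search page.
<PRE>
```python
# Characters that the Strings entry lists as special.  The two angle
# brackets are produced with chr() so that this listing stays valid HTML.
SPECIAL = set("$^*[|()!\\;,#-.") | {chr(60), chr(62)}


def escape_literal(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


if __name__ == "__main__":
    print(escape_literal("ab.de"))
    print(escape_literal("ab*de"))
```
</PRE>
The two calls produce the escaped forms ab\.de and ab\*de used in the
examples above.
<P>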

The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts
all patterns to lower case, finds the appropriate files,
and then searches the actual files using the original
patterns.
So, for example, <I>glimpse ABCXYZ</I> will first find all
files containing abcxyz in any combination of lower and upper
case, and then search these files directly, so only matches with the
right case will be found.
One problem with this approach is discovering misspellings
that are caused by wrong cases.
For example, <I>glimpse -B abcXYZ</I> will first search the
index for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are matches
with no errors, and will go to those files to search them
directly, this time with the original upper cases. 
If the closest match is, say AbcXYZ, glimpse may miss it,
because it doesn't expect an error.
Another problem is speed.  If you search for &quot;ATT&quot;, it will look
at the index for &quot;att&quot;.  Unless you use -w to match the whole word,
glimpse may have to search all files containing, for example, &quot;Seattle&quot;
which has &quot;att&quot; in it.
<P>
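The effect of the lower case index on speed can be illustrated with a
small Python model.
It is a sketch under stated assumptions, not a description of glimpse's
internals: the functions <I>build_index</I>, <I>candidate_files</I>, and
<I>search</I> are hypothetical, words are split on white space only, and the
<I>whole_word</I> flag merely imitates the role of -w.
<PRE>
```python
def build_index(texts):
    """Map each file name to the set of its words, folded to lower case."""
    return {name: set(text.lower().split()) for name, text in texts.items()}


def candidate_files(pattern, index, whole_word=False):
    """Pick files from the lower case index that may contain the pattern."""
    needle = pattern.lower()
    if whole_word:
        # Comparable to -w: an indexed word has to equal the pattern.
        return sorted(f for f, words in index.items() if needle in words)
    # Otherwise any indexed word that contains the pattern qualifies.
    return sorted(
        f for f, words in index.items() if any(needle in w for w in words)
    )


def search(pattern, texts, whole_word=False):
    """Look up candidates in lower case, then match the original case."""
    hits = []
    for name in candidate_files(pattern, build_index(texts), whole_word):
        haystack = texts[name].split() if whole_word else texts[name]
        if pattern in haystack:
            hits.append(name)
    return hits


if __name__ == "__main__":
    texts = {"a.1": "call ATT today", "b.1": "rain in Seattle"}
    print(candidate_files("ATT", build_index(texts)))
    print(candidate_files("ATT", build_index(texts), whole_word=True))
    print(search("ATT", texts))
```
</PRE>
Without the whole word restriction both files are candidates and must be
scanned, although only the first one contains ATT in the right case.
<P>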

There is no size limit for simple patterns and simple patterns
within Boolean expressions.
More complicated patterns, such as regular expressions,
are currently limited to approximately 30 characters.
Lines are limited to 1024 characters.
Records are limited to 48K, and may be truncated if they are larger
than that.
The limit of record length can be 
changed by modifying the parameter Max_record in agrep.h.
<P>

Glimpseindex does not index words longer than 64 characters.
<A NAME="lbAQ">&nbsp;</A>

<HR>
</BODY>
</HTML>