design.ms   [plain text]


.nr PS 12
.nr VS 14
.LP
.TL
Design of grohtml
.sp 1i
.SH
What is grohtml
.LP
Grohtml is a back end for groff which generates html.
The aim of grohtml is to produce respectible html given
fairly typical groff input.
.SH
Limitations of grohtml
.LP
Although basic text can be translated
in a straightforward fashion there are some areas where grohtml
has to try and guess text relationship. In particular whenever
grohtml encounters text tables and indented paragraphs or
two column mode it will try and utilize the html table construct
to preserve columns. Grohtml also attempts to work out which
lines should be automatically formatted by the browser.
Ultimately in trying to make reasonable guesses most of the time
it will make mistakes occasionally.
.PP
Tbl, pic, eqn's are also generated using images which may be
considered a limitation.
.SH
Overview of html.cc
.LP
This file briefly provides an overview of how html.cc operates.
The html device driver works as follows:
.IP (i) .5i
firstly it creates a linked list of all words on a page.
.IP (ii) .5i
it runs through the page and finds the left most margin. Later
on when generating the page it removes the margin.
.IP (iii) .5i
scans a page and builds two kinds of regions ascii text and graphical.
The graphical regions consist of tbl's, eqn's, pic's
(basically anything that cannot be textually displayed).
It will scan through a page to find lines (such as footer etc)
and places these into tiny graphical regions. Certain fonts
also are treated as a graphical region - as html has no easy
equivalent. For example Greek math symbols.
.LP
Finally all graphical regions are translated into png files and
all text regions into html text.
.PP
To give grohtml a sporting chance of accuratly deciding which
is a graphical region and which is text, the front end programs
tbl, eqn, pic have all been tweeked to encapsulate pictures, tables
and equations with the following lines:
.sp
.nf
\f[CR]\&.if '\\*(.T'html' \\X(graphic-start(\c

\&.if '\\*(.T'html' \\X(graphic-end(\c
\fP
.fi
.sp
these appear to grohtml as:
.sp
.nf
\f[CR]\&x X graphic-start

\&...

\&x X graphic-end\fP
.fi
.sp
.LP
In addition to graphic-start and graphic-end there are two
other "special characters" which are used.
.sp
\f[CR]\&x X index:N\fP
.sp
where N is a number. The purpose of this sequence is to stop
devhtml from automatically producing links to headings which
have a header level >N.
The line:
.sp
\f[CR]\&x X html:STRING\fR
.sp
.LP
allows a STRING to be passed through to the output file with
no processing whatsoever. Ie it allows users to include html
commands, via macro, such as:
.sp
\f[CR]\&.URL "Latest Emacs" "ftp://somewonderful.gnu.software"\fP
.sp
.LP
Where the URL macro bundles the info into STRING above.
For more info consult: \f[CR]tmac/tmac.arkup\fP.
.PP
While scanning through a page the html device copies headings and titles
into a list of links which are later written to the beginning
of the html document.
.SH
Table handling code
.LP
Provided that the -t option is not present when grohtml is run the grohtml
driver will attempt to find textual tables and generate html tables.
This allows .RS and .RE commands to operate with auto formatting. It also
should grohtml to process .2C correctly. However, the table handling code
has to examine the troff output and \fIguess\fR when a table starts and
finishes. It is well to know the limitations of this approach as it
sometimes makes the wrong decision.
.LP
Here are some of the rules that grohtml uses for terminating a html table:
.LP
.IP "(i)" .5i
A table will be terminated when grohtml finds line which is all in bold
font (it believes that this is a header which is outside of a table).
This might be considered incorrect behaviour especially if you use .2C
which generates a heading on the left column when the corresponding
right row is blank.
.IP "(ii)" .5i
A table is terminated when grohtml sees that the complete line is
has been spanned by words. Ie no gaps exist.
.IP "(nb)" .5i
the documentation about these rules is particularly incomplete and needs finishing
when time prevails.
.SH
To do
.LP
.IP (i) .5i
finish working out the max and min x, y, extents for splines.
.IP (ii) .5i
check and test thoroughly all the character descriptions in devhtml
(originally taken from devX100)
.IP (iii) .5i
improve tmac.arkup
.IP (vi) .5i
also improve documentation.
.IP (v) .5i
fix the bugs which are exposed by Eric Raymonds pic guide,
\fBMaking Pictures With GNU PIC\fR. It appears that grohtml becomes confused
about which sections of the document are text and which sections need
to be rendered as an image.
.IP (vi) .5i
it would be nice to modularise the source. A natural division might be
to extract the table handling code from html.cc into table.cc.
The table.cc could be expanded to recognise output from tbl and try
and generate html tables with lines/rules/boxes. The code as it stands
should cope with very simple plain text tables. But of course at present
it does not get a chance to do this because the output of gtbl is
bracketed by \fCgraphic-start\fR and \fCgraphic-end\fR.
.IP (vii) .5i
introduce anti aliasing for the images as mentioned by Werner.
.SH
Dependencies
.LP
Grohtml is dependent upon grops, gs which are invoked to
generate all png files. Png files are generated whenever a table, picture,
equation or line is encountered.