groves

This is precisely what the IETM community
faced in the late eighties and early nineties. The
answer was to convert everything to HTML to get the
job done. Now they have to convert again. Every time
that is done, the costs recur and the information gets a
little more damaged.

While it is hard for some to understand
terms like "notation", it is vital to understand what
Paul is saying. This is where everything stopped.
This is where CALS broke down. This is where the
war was lost and the shaky structure known as the
World Wide Web emerged a winner because no two
ISO committee heads could agree.

Study the history of CGM. Look at the relentless
duplication of the effort of SGML for reasons I
still cannot fathom. Note how SGML was blamed.
Look at what is happening
in other notations being invented today for graphics.
Note the deliberate duplication of effort spent trying
to keep separate syntaxes. Note how XML is blamed.

Study groves. Write the simplified explanations
if you need them. You do need them.

A collection of references on groves is in the "Topics" section of the SGML/XML Web Page, viz.,
http://www.oasis-open.org/cover/topics.html#groves

You can find out more about DSSSL and the SGML property set at James Clark's web site, http://www.jclark.com/dsssl

The grove is available as a DCOM object or a C++ object. You may make your own implementation if you want or change the interface. For more information: http://www.netfolder.com/DSSSL

The real reason groves were invented was to answer the question:
* what is the result of hyperlinking into an arbitrary media type?

And the picture clarifies it even more: http://www.cogsci.ed.ac.uk/~ht/grove.html

XML parser node structure based on a DOM point of view: http://rpmfind.net/veillard/XML/

That would be the Data::Grove (in libxml-perl) module and its only
current set of node classes, the XML::Grove module. If you'd like to
look at a very simple implementation of groves, those would be the
ones.

In Data::Grove, ``nodes'' are Perl objects (named, or ``blessed'',
hashes), node-lists are Perl arrays, and named-node-lists are
un-named, or ``anonymous'', hashes. Even though nodes are ``objects''
they are still accessed directly using Perl hash syntax (i.e. there
are no accessor methods). (The same could be done as easily in any
language, of course.)
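
As a rough illustration of "the same could be done in any language", here
is a minimal sketch of that layout in Python rather than Perl. It is an
analogue only, not the Data::Grove API: the helper names (make_node,
text_of) and the property names (contents, attributes, data) are
assumptions made up for this example. Nodes are plain dicts tagged with a
class name, node-lists are lists, and named-node-lists are dicts.

```python
# Python analogue of the layout described above (NOT the Data::Grove API).
# All names here are hypothetical and chosen only for illustration.

def make_node(node_class, **properties):
    """Build a node: a plain dict carrying a class tag plus its properties."""
    node = {"class": node_class}
    node.update(properties)
    return node

doc = make_node(
    "element",
    name="para",
    # named-node-list: a dict keyed by attribute name
    attributes={"id": make_node("attribute", name="id", value="p1")},
    # node-list: an ordinary list
    contents=[
        make_node("characters", data="Study "),
        make_node("element", name="em", attributes={},
                  contents=[make_node("characters", data="groves")]),
        make_node("characters", data="."),
    ],
)

def text_of(node):
    """Concatenate character data beneath a node using direct key access."""
    if node["class"] == "characters":
        return node["data"]
    return "".join(text_of(child) for child in node.get("contents", []))

print(text_of(doc))                      # Study groves.
print(doc["attributes"]["id"]["value"])  # p1
```

There are no accessor methods: every property is read with ordinary
dictionary syntax, which is the point being made about the Perl version.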

In the current implementation of Data::Grove the grove is ``passive''
-- the view, or ``grove plan'', of the grove is fixed at the time the
grove is built, there's no support for enforcing constraints, and
character data does not appear normalized.

This differs from other grove implementations which I would call
``active'' -- the view of the grove can be changed dynamically,
constraints are checked and/or enforced, and groves always appear
normalized. Grove plans were described in another message, but an
example is where you could choose to see entity reference nodes in the
grove or just the characters that replace the entity reference.
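
The entity-reference example can be sketched roughly as follows, in
Python. This is a toy under stated assumptions, not how any real grove
implementation works: the entity table, the node class names and the
apply_plan function are all hypothetical. The same underlying node-list is
shown under two views, one that keeps entity reference nodes and one that
replaces them with their characters and merges adjacent character data.

```python
# Hypothetical sketch: one underlying node-list, two "grove plans" (views).

ENTITIES = {"amp": "&", "copy": "\u00a9"}  # illustrative entity table

raw_contents = [
    {"class": "characters", "data": "Fish "},
    {"class": "entity-ref", "name": "amp"},
    {"class": "characters", "data": " chips"},
]

def apply_plan(contents, show_entity_refs):
    """Return the node-list as seen under the chosen plan.

    With show_entity_refs=False, each reference is replaced by its text and
    adjacent character nodes are merged, so the data appears normalized.
    """
    if show_entity_refs:
        return [dict(node) for node in contents]
    view = []
    for node in contents:
        if node["class"] == "entity-ref":
            node = {"class": "characters", "data": ENTITIES[node["name"]]}
        if (view and view[-1]["class"] == "characters"
                and node["class"] == "characters"):
            # Merge with the previous character node.
            merged = view[-1]["data"] + node["data"]
            view[-1] = {"class": "characters", "data": merged}
        else:
            view.append(dict(node))
    return view

print(len(apply_plan(raw_contents, True)))   # 3
print(apply_plan(raw_contents, False))
# [{'class': 'characters', 'data': 'Fish & chips'}]
```

In a "passive" grove this choice would be made once, when the grove is
built; an "active" grove would let the caller switch views at any time.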

Groves are often compared to DOM. In grove terms, DOM Level 1
presents two very specific grove plans, Core and HTML. DOM defines
these grove plans in terms of accessor methods for specific properties
and a fixed set of constraints. In contrast, groves act more like
generic containers. The API for accessing the generic containers is
left to the implementation, property and value constraints are defined
separately, and the user is allowed to choose which set of property
and value constraints to use at any particular time.

Simply put, the idea behind groves is to separate the definition of
properties from the API used to access those properties.
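
One way to picture that separation is the following Python sketch. The
two plans, their property names and the violations function are invented
for illustration and are not taken from DSSSL, HyTime or the DOM. The
node is a generic container; what counts as a legal property is described
in data and checked by a separate function, so the caller can pick a
different set of constraints without changing how nodes are accessed.

```python
# Hypothetical sketch: property definitions live in data, not in the API.

# Two illustrative property sets; the names are made up for this example.
CORE_PLAN = {
    "element": {"name", "contents"},
    "characters": {"data"},
}
LINKED_PLAN = {
    "element": {"name", "contents", "parent"},
    "characters": {"data", "parent"},
}

def violations(node, plan):
    """List the properties on a node that the chosen plan does not define."""
    allowed = plan.get(node["class"])
    if allowed is None:
        return ["unknown class: " + node["class"]]
    return sorted(key for key in node
                  if key != "class" and key not in allowed)

node = {"class": "element", "name": "para", "contents": [], "parent": None}

print(violations(node, CORE_PLAN))    # ['parent']
print(violations(node, LINKED_PLAN))  # []
```

The access syntax (node["parent"]) is identical under both plans; only
the definition of which properties are valid has changed.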

>Groves are going to turn out to be like Linux, which began with a very
>few people who had a vision that turned out to work.

I'll start by making it clear that I've been on the W3C DOM working group
for more than two years, so you know where my biases lie ... Nevertheless, I
speak as someone who has spent a fair amount of time wrestling with defining
abstract data models and APIs for XML documents and considering what the
grove paradigm offers.

There are three basic reasons why the DOM is not more groves-like. First, as
someone pointed out earlier in the "XML and Databases" thread, not enough
people outside the hard-core SGML/XML community understand the groves
paradigm, so there was no general familiarity that we felt we should
leverage. Second, the available documentation for groves (at least a
couple of years ago) was mainly the DSSSL spec, which is very difficult for
non-specialists to make sense out of. Third, there was a widespread
perception that the groves model implies, in DOM terms, that "every
character is a node", and people concerned about implementing the DOM API
felt strongly that this would lead to unacceptable footprint and run-time
overhead.

Groves may or may not become the next Linux; if this is going to happen, two
obstacles must be overcome.

Most importantly, someone is going to have to write a *clear* statement of
the paradigm, its power, why it's "the next big thing", etc. Sort of "Groves
for Dummies", or the "Grove Manifesto". I can't stress enough the
importance of writing this for a fairly general audience. I recall
conversations a couple of years ago with very smart technical marketing
people at large companies who are now big players in the XML world; the
level of frustration they expressed at trying to make sense out of things
like the DSSSL spec was quite memorable! I have not read the recent
attempts by groves advocates to offer tutorials, etc., so forgive me if
this has already been done. I frankly doubt it, because if a clear and
compelling case for the groves paradigm has been made, it hasn't come to the
world's attention.

Also, even if the "grove paradigm" is a fundamentally more powerful way of
looking at XML and other types of data than what is in wide use today, it's
unlikely to be adopted unless there is a clean migration path from familiar
APIs like ODBC/SQL, the W3C DOM, the (forthcoming?) JCP XML data binding
spec, etc. One of the most eye-opening aspects of my experience on the DOM
WG has been to understand that most users of Web scripting languages, Visual
Basic, etc. know very little about computer science. I began my DOM
"career" assuming that everyone who would be using such APIs understood tree
and graph data structures and would understand how nicely they represented
the types of things we were talking about. I was quickly set straight by my
colleagues from companies with larger customer bases: ARRAYS are about as
sophisticated a data structure as the typical Web scripter or VB programmer
can handle. [I *know* that someone reading this wants to flame me, but rest
assured that I don't like this notion any more than you do, and just about
every conceivable counter-argument has been raised, and very reluctantly
dismissed, by the DOM WG already.] So, I would *love* to see someone define
a grove API that extends the DOM, and/or to see the grove paradigm cleanly
incorporated into the Java Community Process XML data binding, and/or to see
a repository-friendly API that builds from ADO or JDO and incorporates
groves concepts. But don't expect the typical consumer of XML APIs to be
impressed by a groves API that offers a "new paradigm" that assumes that the
reader understands graph theory and data structures and builds up from there
with little reference to existing efforts.

Imagine that you had been trained for several years on the ODBC API. Now
you go to university and a professor starts talking about the
"relational database model." You might ask: "How is that different from
ODBC?" The answer is that ODBC is a particular representation of the
relational model for programming languages. The DOM can be considered a
particular representation of the grove model for programming languages.

But the ODBC versus relational model is not quite sufficient to capture
the difference. ODBC was built on top of the relational model. The DOM
people used the grove model "for ideas" but the DOM is not based on any
well-defined abstract model at all -- not the grove, not the information
set, not the relational model, not anything. This is reflected in its
odd mix of methods, properties and names: "getElementsByTagName",
"TagName", "NodeValue" etc. Its notion of iteration and state is famous
for being rather odd. The DOM is more like a pre-relational API in this
regard.

The major difference between the DOM and an API that could be built on
top of groves (such as that created by James Clark or by GroveMinder) is
that the DOM model is to add support for different data types one by one
through a central committee.

But the important thing to recognize is that the grove is not an API. It
can get lost in Steve's understandable focus on GroveMinder but the use
of the grove as an API is an implementation optimization (in the sense
of optimizing programmer time). The real reason groves were invented was
to answer the question:

* what is the result of hyperlinking into an arbitrary media type?

What are the properties of the abstract object returned? The grove
answers that question: the object has properties such as "parent"
(possibly null), "children" (possibly null), "containing entity" and so
forth.
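
To make the question concrete, here is a small Python sketch of linking
into something that is not XML: a CSV table. Everything in it is an
assumption for the sake of illustration: the Node class, the
table/row/cell node kinds, the file name and the index-path "address" are
hypothetical and come from no specification. The point is only that the
object a link resolves to exposes the same generic properties (parent,
children, containing entity) whatever the notation.

```python
# Hypothetical sketch: addressing into non-XML data and getting back a node
# with generic properties (parent, children, containing entity).

import csv
import io

class Node:
    def __init__(self, kind, value=None, parent=None, entity=None):
        self.kind = kind        # e.g. "table", "row", "cell"
        self.value = value
        self.parent = parent    # None for the root
        self.children = []      # stays empty for leaf nodes
        self.entity = entity    # name of the storage object it came from
        if parent is not None:
            parent.children.append(self)

def grove_from_csv(text, entity_name):
    """Build a table/row/cell tree from CSV text."""
    root = Node("table", entity=entity_name)
    for record in csv.reader(io.StringIO(text)):
        row = Node("row", parent=root, entity=entity_name)
        for field in record:
            Node("cell", value=field, parent=row, entity=entity_name)
    return root

def resolve(root, path):
    """Follow child indexes from the root; a stand-in for a link address."""
    node = root
    for index in path:
        node = node.children[index]
    return node

table = grove_from_csv("id,name\n1,CGM\n2,SGML\n", "notations.csv")
target = resolve(table, [2, 1])
print(target.kind, target.value, target.parent.kind, target.entity)
# cell SGML row notations.csv
```

A link processor written against these generic properties would not need
to know that the target started life as CSV rather than as markup.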

You cannot build a sophisticated hypertext system without answering that
question. This will become apparent after XLink, XPointer and RDF are
implemented. We'll start to see many divergences of behavior when links
are made into (e.g.) PDF, MPEGs, JPEGs and so forth. Over time we will
need to develop a framework for describing the correct results of links
in a generic way. Then we will reinvent groves.

Or not... we could keep doing things in an ad hoc manner forever, I
suppose. It would be expensive and inefficient but it is possible.

> You cannot build a sophisticated hypertext system without answering that
> question. This will become apparent after XLink, XPointer and RDF are
> implemented. We'll start to see many divergences of behavior when links
> are made into (e.g.) PDF, MPEGs, JPEGs and so forth. Over time we will
> need to develop a framework for describing the correct results of links
> in a generic way. Then we will reinvent groves.

I am just learning about groves now and am interested in your claim.
There's a note in the Property Set Requirements annex of the HyTime
specification:

NOTE 440 Property sets are designed to support the HyTime and DSSSL
processing and representation of notation-specific data by providing the
information needed by those processes. They are not intended as a general
model for making notation-specific data available for arbitrary
processing.

But you're saying that we should be able to use groves for other
arbitrarily structured data? I guess I can see that in a way, but I would
like to see example groves of things like relational databases.

Has anyone attempted to define a subset of ESIS for XML yet? I'm very
interested in seeing something like that, along with a corresponding
grove plan for XPath's (and therefore XSLT and XPointer/XLink's) data
model.

Having just struggled through the abomination that is the DOM Level 1 in
order to attempt to implement XPath, I can definitely see a need for
XML processing using efficient, tailored data models produced for each
kind of processing, and maybe groves are it.

If there's nothing like this yet I would like to collaborate with any
other interested parties in defining an XML property set and XPath grove
plan, and perhaps also a Java implementation for grove manipulation.

. . . Sean.