- Subject: Re: Is XML < SGML? For how long?...
- From: "W. Eliot Kimber" <eliot@isogen.com>
- Date: Tue, 17 Feb 1998 11:41:29 -0600
- Newsgroups: comp.text.sgml
- Organization: ISOGEN International Corp.
- References: <34E7279B.4AE64F07@cs.cmu.edu> <34E8B6DB.79C96365@cas.org> <davep-1602981953260001@209.61.104.118> <6cannd$1p2$2@mozart.no.where>
<Warning>Long Post</Warning>
pegri@sbox.tu-graz.ac.at wrote:
>
> David Peterson <davep@acm.org> wrote:
> > In article <34E8B6DB.79C96365@cas.org>, "Daniel R. Williamson"
> >> One item that comes to mind is that XML does *not* require
> >> a dtd, SGML does.
> > "Did". The WebSGML TC has changed that.
> I'm a SGML newbie, so forgive a stupid question: What's the use of
> SGML if you don't have DTD?
To answer this question, you have to understand the distinction between
a document type *definition* and a document type *declaration*. The
short answer is: there's a difference between the syntactic rules that
govern document *instances* and the more general rules (syntactic and
semantic) that govern *classes* of documents. Even if you don't
associate syntactic rules with a document instance, there are still
semantic rules that govern the class of document of which your document
is a member (even if there is exactly one document of that class). The
question then becomes: how is your document associated with the rules
that govern its class? Realize that SGML DOCTYPE declarations *do not
and cannot* associate document instances with classes of documents, even
though that's what they were intended to do and the way we all try to
use them (more about which at the end of this post).
The SGML standard defines "DTD" as "document type definition", by which
it means the set of rules (of all types) that govern the interpretation
and processing of a document. Many of these rules will not be expressed
or expressible by SGML declarations because they are policies or rules
that only humans can understand or interpret (for example, "a paragraph
should consist of a single thought").
By this definition, all documents have a DTD (that is, a set of
governing rules), even if the rules are "there are no constraints of any
sort on this document" and even if the rules exist only in the mind of
the document's author. In other words, even if you just "start typing
tags", you still have some sort of rule set that governs your creation
of the document, no matter how informal or arbitrary they might be.
By contrast, the DOCTYPE declaration (which is what most people mean
when they use the term "DTD") is nothing more than a definition of the
*syntactic* rules for an SGML document, that is, how the tags and
attributes can be specified in an SGML document string. These syntactic
rules may reflect the larger, semantic rules, but they do not, by
themselves, define the semantic rules (because they can't). In
particular, it is not possible, in the general case, to infer semantic
rules solely from the syntactic rules expressible by SGML DOCTYPE
declarations. A DOCTYPE declaration is a property of *the document*
that has the declaration, even if part of the DOCTYPE declaration is
stored in an external declaration subset--that subset is still a
*syntactic* component of the document that refers to it. External
DOCTYPE subsets do not have any sort of privileged status, even though
most users and tools associate a privileged status with them. Such
privileged status is in no way justified by the rules of ISO 8879.
The element and attribute list declarations that make up the bulk of a
DOCTYPE declaration serve two important purposes in SGML:
1. To enable validation of document instances against a set of
formally-defined syntactic rules.
2. To enable the use of markup minimization features, such as start tag
omission.
However, if you do not use markup minimization and do not care about
validation, then *you do not need explicit element and attribute list
declarations* in order to correctly parse an SGML document into a set of
elements.
Take a moment to fully understand the implications of this: explicit
declarations are only *required* when you want to validate or use
minimization. If you don't care about validation at parse time or don't
use minimization, then you don't need a DOCTYPE declaration.
XML does not provide any markup minimization features (except default
attribute values) and therefore no XML document is required to have
explicit declarations just to be parsed. There are many use scenarios
in which validation is not relevant (either because you truly don't care
or the validity of the documents has been assured at the time they were
produced or through some other means). When you don't care to validate,
why should your documents be forced to carry around a set of
declarations that don't add any value to you? They shouldn't and XML
(and now SGML with the WebSGML TC) correctly lets you omit the
declarations if you don't need them. If you do need them, you can have
them, but they are no longer required.
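To make the "no declarations needed just to parse" point concrete, here
is a minimal Python sketch. It assumes a stock XML parser (the standard
library's ElementTree) and a made-up sample document; neither is part of
any SGML toolset discussed in this post.

```python
import xml.etree.ElementTree as ET

# A fully tagged document with no DOCTYPE declaration at all
# (the content is invented for illustration).
doc = "<Book><Title>No declarations here</Title></Book>"

# The parser needs no element or attribute list declarations to
# recover the element structure.
root = ET.fromstring(doc)


def element_names(elem):
    """Return the element type names in document order."""
    return [e.tag for e in elem.iter()]


print(element_names(root))
```

Nothing here validates anything: the parser only recovers the element
tree, which is all that fully tagged, unminimized markup requires.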
But,
But,
But,
This raises a very important question: if a document has no DOCTYPE
declaration, how do I know what class of documents it is a member of?
That is, how do you know what set of larger syntactic and semantic rules
govern the document?
Good question. The recognition that DOCTYPE declarations are *purely
syntactic* reveals that, in fact, *you never had a way of knowing what
the rules for documents were by 8879-defined means*!
Think about this for a minute, because it's a very important but subtle
point: The SGML standard provides *no formal mechanism* for associating
a document instance with the semantic rules that govern it. No
mechanism. None. DOCTYPE declarations are not it.
Here's proof:
Observe this document:
<!DOCTYPE Book [
<!ELEMENT Book - - ANY >
]>
<Book></Book>
Now answer the following questions:
1. Is this document a Docbook document?
2. Is this document a valid SGML document?
The answer to the first question can only be "don't know". The answer to
question 2 is most certainly "yes".
It might be a Docbook document: I know that the Docbook DTD defines an
element type called "book", but I have no way of knowing, from the
information given in the document, that the document is *intended* to be
a Docbook document. There are many document types that have an element
type called "book". Could be any one of them or none of them.
Note that this document has no external DOCTYPE subset. This is
perfectly fine: no *conforming* SGML tool can require that documents use
an external DTD subset. Therefore, you cannot, I repeat *cannot*, depend
on the use of a particular external DTD subset by documents. In
addition, even when you do have an external DTD subset, there's no
guarantee that those declarations will even be used.
Observe this document:
<!DOCTYPE Foo PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [
<!ELEMENT Foo - - (#PCDATA) >
]>
<Foo>Is this a Docbook document?</Foo>
Now answer the following questions:
1. Is this document a Docbook document?
2. Is this document a valid SGML document?
The answer to question one must be "don't know" because there is nothing
in the document that associates it with the Docbook document type. The
use of the Docbook external DTD subset is a stronger clue that it
*might* be our intent that this be a Docbook document, but it's not
proof. In fact, we can see that this document doesn't actually use any
element types declared in the external DTD subset, which is a clue that
it might not be intended to be a Docbook document. But we still have no
way of knowing for sure.
The answer to question two is "yes" (assuming you can resolve the public
ID to a set of declarations that are themselves valid).
Note that the use of the external DTD subset provided as part of the
Docbook DTD's definition (remembering that DTD means *definition*, not
*declaration*) doesn't *in any way* imply that the rules of Docbook
govern the document. The instance, in this case, completely ignores the
declarations, choosing instead to declare the only element type it uses
in the internal DOCTYPE declaration subset. This is perfectly valid SGML
and anybody can do it who has write access to a document entity.
Have I made my point? DOCTYPE declarations tell you nothing about the
class of documents of which a document is a member. Nothing. At best,
the use of external declaration subsets referenced by public ID is a
*convention* and a weak one at that, because any author can circumvent
or subvert it at will (the non-conforming restrictions of some editing
tools notwithstanding--even ADEPT can be used as a very nice text
editor). SGML validation alone is clearly not sufficient: the Foo
document above is perfectly valid, even though anyone who knows the
rules of Docbook can see by inspection that it's not a valid Docbook
document. SGML validation won't tell you if a document follows the rules
for a class of documents: it only tells you if a document follows *its
own* rules, which the previous document clearly does.
This means that there is a BIG PROBLEM: for the last 12 years, something
we thought we were doing turns out not to be what we were doing at all.
We have all been living a lie. What to do?
The ISO-defined answer is SGML architectures as defined by Annex A.3 of
ISO/IEC 10744:1997 (Architectural Forms Definition Requirements, or AFDR
for short).
An SGML "architecture" is nothing more than a document type that is used
by reference rather than syntactically included in documents (as
external DTD subsets are). In other words, architectures are what we
thought DTDs were all these years.
This is good, because it means you can use your existing DTDs as
architectures *without changing the DTDs*. Using tools like the SP
parser from James Clark, you can get *exactly the same validation* with
architectures that you get from DOCTYPE declarations, *whether a
document has a DOCTYPE declaration or not*.
Think carefully about what this means: I can have documents with no
private DOCTYPE declarations *and still get all the validation I want*
at almost no extra cost.
All it costs to use this is a few lines in the document to declare the
use of the architecture. Free tools already exist that implement the
validation. Given this declaration, you get everything you thought you
had before *plus* a formal association of a document with the definition
of the class of documents of which the document is a member (that is,
the general rules that govern the class of documents, in other words the
document *type*).
As originally published, the AFDR provides a convention for using
notation declarations to declare the use of architectures. However, this
facility depends on the use of data ("notation") attributes, which XML
does not support. Therefore, in December, we submitted to ISO/IEC
JTC1/WG4 a proposed amendment to the AFDR Annex that provides an
alternative PI-based syntax that can be used with XML documents. This
proposal is WG4 document N1957, which you can find at the WG4 Web site,
"http://www.ornl.gov/sgml/wg4/docs".
The basic idea is a simple one: give your architecture (document type) a
globally-unique name (public ID or URN) and then invoke that name from
documents to assert conformance to the rules defined by the
architecture. That's all there is to it. The whole rest of the
architecture facility is about syntax and validation convenience, the
details of which are important to implementors but irrelevant for this
discussion, except to say that the AFDR provides a convenience feature
called "auto mapping" that simply says that if an element or attribute
in a document has the same name as one in the governing architecture,
then the element or attribute is automatically mapped to the
corresponding element or attribute in the architecture. This eliminates
the need to provide explicit mappings in all cases and allows you to
have no explicit mappings when the document's declarations and the
architecture declarations are identical (which they will be if you use
what was the document's external declaration subset as the architectural
DTD--they can be the same physical file).
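The automapping rule just described can be sketched in a few lines of
Python. This is an illustration of the idea only, not the AFDR's
normative algorithm: the function name, the "explicit" override table,
and the case-insensitive name comparison (standing in for default SGML
name folding) are all assumptions made for the example.

```python
def automap(doc_names, arch_names, explicit=None):
    """Map document element names to architecture element names.

    explicit is a hypothetical override table: a document name maps to an
    architecture name, or to None to block the mapping entirely.
    """
    explicit = explicit or {}
    # Assumption: names compare case-insensitively, as with default
    # SGML name folding.
    arch_by_key = {name.lower(): name for name in arch_names}
    mapping = {}
    for name in doc_names:
        if name in explicit:
            # An explicit entry always wins over automapping.
            if explicit[name] is not None:
                mapping[name] = explicit[name]
            continue
        match = arch_by_key.get(name.lower())
        if match is not None:
            # Same name in document and architecture: map automatically.
            mapping[name] = match
    return mapping


print(automap(["Book", "Foo"], ["book", "chapter"]))
```

With no overrides, "Book" maps to "book" and "Foo" maps to nothing;
passing explicit={"Foo": "book"} adds an explicit mapping, and
explicit={"Book": None} blocks the automatic one.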
This automapping facility is very important and I'll come back to it in
a bit. But first, a quick explanation of how the
document-to-architecture association works, because that's the really
important part of architectures, the thing that SGML has lacked and that
we've needed all these years.
First, as the owner of an architecture, define a public identifier for
it. For example, the Docbook DTD is clearly an architecture by the
foregoing definitions and could be named like so:
"-//HaL and O'Reilly//NOTATION DocBook Architecture//EN"
Note that this public ID identifies a *notation*, not a *DTD*. That's
because this name refers to the *whole set of rules* that govern Docbook
documents, not just the SGML-definable syntactic rules. The Docbook
declaration set, which we still need in order to enable SGML validation,
has the same public ID it always did:
"-//HaL and O'Reilly//DTD DocBook//EN"
And we will use both of them. But the defining name, the one that says
without any ambiguity that a document is a Docbook document, is the
NOTATION name.
To declare that a document is in fact a Docbook document, you can use a
declaration like this:
<?IS10744:arch
name="docbook"
public-id="-//HaL and O'Reilly//NOTATION DocBook Architecture//EN"
dtd-public-id="-//HaL and O'Reilly//DTD DocBook//EN"
doc-elem-form="book"
>
<!DOCTYPE Book [
<!ELEMENT Book - - ANY >
]>
<Book></Book>
The PI is an "architecture use declaration PI" (as defined in the
proposed Amendment 1, N1957). It says that the document claims
conformance to the architecture whose public id is the value of the
"public-id" attribute. The "dtd-public-id" attribute points to the
"architectural DTD", that is, the set of DTD declarations that define
the SGML-definable syntactic rules for the architecture. These
declarations are required in order to enable automapping (a form of
markup minimization) and syntactic validation of the document to the
architecture (the two reasons for which DTD declarations are required).
Note that the DOCTYPE declaration and instance did not change. This is
because in this case I'm using automapping to automatically associate
the document's Book element with Docbook's Book element. By the
AFDR-defined automapping rules, the Book element is taken to be mapped
to an architecture-defined element called "book" unless you explicitly
block the mapping (which I haven't done). It's also automatically mapped
to the "architectural document element form" because it is the document
element of the document and therefore must map to the architectural
document element, which is named by the "doc-elem-form" attribute.
This now establishes a formal, machine-processable association between
the local element type "Book" and the element type "Book" in the Docbook
architecture, something that the use of external DTD subsets alone
cannot do.
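For readers who want to see what "machine-processable" can mean here,
the following Python sketch pulls the pseudo-attributes out of an
architecture use declaration PI with regular expressions. It is a rough
illustration, not an implementation of N1957: the function name is made
up, the patterns accept either ">" or "?>" as the PI close, and only
double-quoted values are handled.

```python
import re

# Matches the PI shown above, closed with either ">" (SGML) or "?>" (XML).
PI_RE = re.compile(r"<\?IS10744:arch\s+(.*?)\??>", re.DOTALL)
# Matches name="value" pairs; only double-quoted values are handled.
ATTR_RE = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')


def arch_use_declarations(text):
    """Return one dict of pseudo-attributes per architecture use PI."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in PI_RE.finditer(text)]


doc = """<?IS10744:arch
 name="docbook"
 public-id="-//HaL and O'Reilly//NOTATION DocBook Architecture//EN"
 dtd-public-id="-//HaL and O'Reilly//DTD DocBook//EN"
 doc-elem-form="book"
?>
<Book></Book>
"""

decls = arch_use_declarations(doc)
print(decls[0]["public-id"])
```

A document with no such PI yields an empty list, which is the honest
"don't know" answer to the question of what class it belongs to.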
Finally, just to hammer home the point about DOCTYPE declarations being
optional, consider this version of the above document:
<?xml version="1.0"?>
<?IS10744:arch
name="docbook"
public-id="-//HaL and O'Reilly//NOTATION DocBook Architecture//EN"
dtd-public-id="-//HaL and O'Reilly//DTD DocBook//EN"
doc-elem-form="book"
?>
<Book></Book>
From the point of view of Docbook-aware processing operating at the
architectural level (that is, operating on the result of the mapping
from the document to the Docbook architecture), the two versions of the
document are identical: the formal associations are the same and the
ability to validate the documents against the Docbook rules is the
same.
Now back to automapping and why it's important.
Automapping is important because it lets you apply architecture-based
validation to any existing document without doing anything more than
adding the architecture use declaration. By doing this, you can prevent
the sort of subversion represented by the "Foo" example above. Here is
that example again:
<!DOCTYPE Foo PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [
<!ELEMENT Foo - - (#PCDATA) >
]>
<Foo>Is this a Docbook document?</Foo>
Again, I ask the question, is this document a Docbook document? Again,
the answer is "don't know". But now, add the above architecture use
declaration:
<?IS10744:arch
name="docbook"
public-id="-//HaL and O'Reilly//NOTATION DocBook Architecture//EN"
dtd-public-id="-//HaL and O'Reilly//DTD DocBook//EN"
doc-elem-form="book"
?>
<!DOCTYPE Foo PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [
<!ELEMENT Foo - - (#PCDATA) >
]>
<Foo>Is this a Docbook document?</Foo>
Ask the questions again: Is this a Docbook document? Yes. Is this a
valid SGML document? Yes. (The use of architectures doesn't affect SGML
parsing of the *instance*.) Is this a valid Docbook document? No.
How do I know that it's not valid? I use architectural validation
services.
Here's the result of processing the above document using the
architecture validation feature of NSGMLS:
F:\TEMP>nsgmls -c d:\adept\doctypes\catalog -Adocbook book.sgm
AID IMPLIED
ALANG IMPLIED
AREMAP IMPLIED
AROLE IMPLIED
AXREFLABEL IMPLIED
AFPI IMPLIED
ALABEL IMPLIED
(BOOK
)BOOK
D:\SP\BIN\NSGMLS.EXE:book.sgm:11:32:E: element "BOOK" unfinished in meta-DTD
Note what happened:
1. I requested that the document be processed with respect to its
architectural mapping to the Docbook architecture ("-Adocbook")
2. NSGMLS automatically mapped the document element to the architectural
document element form defined in the architecture use declaration
('doc-elem-form="book"'). This mapping is evidenced by the NSGMLS
output, which reflects the docbook-defined attributes and the Book
element start and end tags.
3. NSGMLS then validated the result of that mapping (foo->book) against
the architectural declarations, in which the Book element has required
subelements, which the document did not provide.
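The lines NSGMLS printed above follow its line-oriented output format:
an "A" line gives an attribute of the next element, "(" opens an element
and ")" closes it. As a rough Python sketch (the function name and the
event tuples are my own invention, and only these three line types are
handled), that output can be turned into start/end events like so:

```python
def parse_esis(lines):
    """Turn nsgmls-style output lines into (kind, name, attributes) events.

    Only "A" (attribute), "(" (start) and ")" (end) lines are handled;
    every other line type is skipped in this sketch.
    """
    events = []
    pending = {}  # attributes seen since the last start tag
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            parts = rest.split(" ", 2)
            name = parts[0]
            kind = parts[1] if len(parts) > 1 else ""
            value = parts[2] if len(parts) > 2 else None
            pending[name] = (kind, value)
        elif code == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif code == ")":
            events.append(("end", rest, {}))
    return events


SAMPLE = """AID IMPLIED
ALANG IMPLIED
AREMAP IMPLIED
AROLE IMPLIED
AXREFLABEL IMPLIED
AFPI IMPLIED
ALABEL IMPLIED
(BOOK
)BOOK
"""

events = parse_esis(SAMPLE.splitlines())
print(events[0][1], sorted(events[0][2]))
```

Run on the output above, this reports a BOOK element carrying the seven
Docbook-defined attributes, even though the instance only ever said
"Foo": the mapping to the architectural document element has already
happened by the time this output is produced.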
Thus, I was able to provide normal SGML-style validation for my Foo book
and determine that it was in fact not a valid Docbook document although
it asserted conformance to the Docbook architecture. And all I did was
add the architecture use declaration: I didn't change the document's
DOCTYPE declaration or instance in any way.
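The shape of that check can be imitated in a toy Python sketch. To be
loud about the assumptions: TOY_META_DTD is a made-up stand-in, not
Docbook's real content model; the function name and error text are
invented; and an XML parser is used only because the Foo instance
happens to be fully tagged. It is not how SP works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the architectural DTD: each architectural
# element type lists child types of which at least one must appear.
# This is NOT Docbook's real content model.
TOY_META_DTD = {"book": {"chapter"}}


def validate_against_architecture(doc_text, doc_elem_form, meta_dtd):
    """Check the document element against its architectural form."""
    root = ET.fromstring(doc_text)
    errors = []
    # The document element always maps to the architectural document
    # element form, whatever its local name is.
    required = meta_dtd.get(doc_elem_form, set())
    children = {child.tag.lower() for child in root}
    if required and not (children & required):
        errors.append(
            f'document element mapped to "{doc_elem_form}" '
            "lacks required content"
        )
    return errors


foo = "<Foo>Is this a Docbook document?</Foo>"
print(validate_against_architecture(foo, "book", TOY_META_DTD))
```

The Foo instance passes as a document in its own right but fails once
its document element is held to the rules of the architectural "book"
form, which is the distinction the NSGMLS run above demonstrates.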
This means that in any situation where you are currently requiring the
use of an external DOCTYPE declaration subset to try to enforce a common
set of rules, you can now do that reliably and unambiguously by using
the simple trick of adding the necessary architecture use declaration
and using your *existing* external DTD subset as the architectural DTD
(which is exactly what I did in the example above--the version of the
Docbook declaration set I used was the one shipped with ADEPT--I even
used ADEPT's out-of-the-box SGML Open catalog to resolve the public ID).
Think about what this means for "DTD-less documents": you can eat your
cake and have it. You can have completely declarationless documents to
which you can still apply complete SGML syntactic validation. Any
document that currently has an external DTD subset can instead use that
subset as an architectural DTD without changing any other declarations
or anything in the instance, without any loss of validation capability,
and with the added ability to know what the intended set of rules is.
This is very powerful and very important. It is, I think, key to the
success of XML where DTD-less documents are a significant advantage but
where you still need to know what *kind* of document you have received,
even if you don't care to validate it. Because you can't even depend on
the presence of an external DTD subset or even a DOCTYPE declaration,
the need for a formal way to associate documents with their governing
rules becomes even more urgent.
Finally, note that the use of architectures does not eliminate the need
for non-SGML ways to formally define some of the rules for documents
(e.g., "schemas" as that word is generally being used in the current XML
discussions). Far from it--it makes it even clearer that such formalisms
are needed as part of the total set of rules that make up an
architecture. We already know that DTD declarations alone are not
enough: they are limited to syntactic, not semantic, rules, and even the
syntactic constraints they define are a subset of the syntactic
constraints we'd like to define.
What architectures provide is a formal mechanism for giving a single,
universally-unique name to a bag of rules, which will include SGML DTD
declarations, other syntactic constraint specifications, and semantic
rules that can only be expressed in prose for human understanding. This
is what SGML document types *always were* but we had no good way to
give them formal names independent of the machine-readable components
that make up part of the rule definitions.
That's why it makes sense to have "DTD-less" documents in SGML: they
never did anything for you anyway except enable markup minimization and
syntactic validation, so the lack of them doesn't hurt much and, in any
case, the function of validation can be completely provided at the
architectural level at almost zero cost.
Final note on tools support:
The current version of SP (1.3, James' latest test version) doesn't yet
support the N1957 form of architecture use declaration. Here is the
version of the Foo document that you can use with NSGMLS today:
<!DOCTYPE Foo PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [
<?ArcBase docbook >
<!ENTITY docbook.meta-DTD PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN">
<!NOTATION docbook PUBLIC "-//HaL and O'Reilly//NOTATION DocBook Architecture//EN" >
<!ATTLIST #NOTATION docbook
arcDTD CDATA #FIXED "docbook.meta-DTD"
arcDocF NAME #FIXED "book"
>
<!ELEMENT Foo - - (#PCDATA) >
]>
<Foo>Is this a Docbook document?</Foo>
This version uses the notation declaration form of architecture use
declaration. It is semantically and functionally identical to the PI
shown above.
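Since the two forms carry the same four pieces of information, one can
be generated from the other. Here is a small Python sketch that emits
the notation-style declarations from the PI's values; the function name
is made up, and the output simply follows the pattern of the example
above rather than anything normative in the AFDR.

```python
def notation_form(name, public_id, dtd_public_id, doc_elem_form):
    """Build notation-style architecture use declarations.

    The layout copies the example in this post; the
    "<name>.meta-DTD" entity name is a convention taken from it.
    """
    entity = f"{name}.meta-DTD"
    return "\n".join([
        f"<?ArcBase {name} >",
        f'<!ENTITY {entity} PUBLIC "{dtd_public_id}">',
        f'<!NOTATION {name} PUBLIC "{public_id}" >',
        f"<!ATTLIST #NOTATION {name}",
        f'  arcDTD CDATA #FIXED "{entity}"',
        f'  arcDocF NAME #FIXED "{doc_elem_form}"',
        ">",
    ])


decls = notation_form(
    "docbook",
    "-//HaL and O'Reilly//NOTATION DocBook Architecture//EN",
    "-//HaL and O'Reilly//DTD DocBook//EN",
    "book",
)
print(decls)
```

The generated text is meant to be pasted into a document's internal
declaration subset, as in the Foo example above.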