What is an XML Document?

An XML document consists of three parts. Only the document instance must always be explicitly present: XML 1.0 lets you omit the XML declaration (although documents should include it), and the DOCTYPE declaration is optional:

1. The XML declaration: <?xml version="1.0"?>
2. The DOCTYPE declaration (which can be omitted)
3. The document instance (the tag stuff)
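
To make the three parts concrete, here is a minimal sketch of a complete document. The element name "greeting" and its content are made up for illustration; the comments mark where each part begins.

```xml
<?xml version="1.0"?>
<!-- 1. The XML declaration is the line above; nothing may precede it. -->
<!DOCTYPE greeting [
  <!-- 2. The DOCTYPE declaration; this one has only an internal subset. -->
  <!ELEMENT greeting (#PCDATA)>
]>
<!-- 3. The document instance: the root element and everything inside it. -->
<greeting>Hello, world</greeting>
```

Deleting the whole DOCTYPE declaration still leaves a well-formed document, which is what "can be omitted" means in the list above.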

The DOCTYPE declaration declares the name of the root element (the "document type") and declares the element types used in the instance. It also contains declarations of any entities used by the document. (Entities are either string macros or files; technically, abstract storage objects.)
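
As a rough illustration of the two kinds of entity, the sketch below declares one of each. The names "memo", "company", and "boilerplate" and the file "boilerplate.xml" are all hypothetical, and the file is assumed to hold plain text so that the result still matches the (#PCDATA) content model.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!-- Internal entity: a string macro. -->
  <!ENTITY company "Example Widgets, Inc.">
  <!-- External entity: a storage object named by a system identifier. -->
  <!ENTITY boilerplate SYSTEM "boilerplate.xml">
]>
<memo>&company; says: &boilerplate;</memo>
```

A processor replaces &company; with the quoted string and &boilerplate; with the contents of the named file, if it chooses to read external entities at all.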

Logically, the DOCTYPE declaration is a flat list of element type, attribute list, and entity declarations (and notation declarations, but I haven't talked about those). Physically, the declarations can be organized into two parts, one in the document's main file and one in an external file. The main file is the "document entity", that is, the file that contains the XML declaration, the DOCTYPE declaration (if there is one), and the root element.

Because these two parts of the DOCTYPE declaration make up the larger whole, they are both subsets. The one inside the document is the "internal" subset and the one outside the document is the "external" subset.

A typical DOCTYPE declaration looks like this:

<!DOCTYPE foo SYSTEM "myexternalsubset.dtd" [
<!-- This is the internal subset -->
<!ELEMENT foo (#PCDATA)> <!-- Declaration of element type 'foo' -->
]>

The external subset is the file named by the system identifier (a URI, very often just a relative filename) following the SYSTEM keyword.
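
For a sense of what such a file holds, here is a sketch of what "myexternalsubset.dtd" might contain. The contents are invented for illustration; the original example does not say what is in the file. Note that the external subset is just a run of declarations, with no <!DOCTYPE ...> wrapper around them.

```xml
<!-- Hypothetical contents of myexternalsubset.dtd. -->
<!-- 'foo' itself is declared in the internal subset, so only its attributes
     and an entity are declared here; declaring the element type a second
     time would make the document invalid. -->
<!ATTLIST foo status (draft|final) "draft">
<!ENTITY copyright "Copyright the author; all rights reserved.">
```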

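While logically there is no difference between the internal and external subsets, XML defines slightly different rules for how the internal and external subsets must be processed. For example, within the internal subset parameter entity references may appear only between declarations, not inside them, and conditional sections are allowed only in the external subset. A non-validating processor is also not required to read the external subset at all. This reflects common (although misguided, IMNSHO) practice and makes things easier for processors that don't do validation and therefore don't need to fetch and process the external declarations.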

One key rule about internal and external subsets is that the internal subset is always parsed *before* the external subset. Because the first declaration of an entity is the one that counts and later declarations of the same name are ignored, any entity declared in the internal subset takes precedence over an entity with the same name declared in the external subset. This allows a weak form of modularization of declaration sets: external DTD subsets can provide entities that are intended to be overridden by declarations in internal subsets.
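
A small sketch of that override pattern follows. The file name "defaults.dtd", the element "note", and the entity "product" are all hypothetical; the comment states what the external file is assumed to contain.

```xml
<?xml version="1.0"?>
<!-- Assumption: defaults.dtd (hypothetical) contains these two declarations:
       <!ELEMENT note (#PCDATA)>
       <!ENTITY product "Generic Product">
-->
<!DOCTYPE note SYSTEM "defaults.dtd" [
  <!-- Read first, so this binding is the one that sticks. -->
  <!ENTITY product "Acme Rocket Skates">
]>
<note>Thank you for buying &product;.</note>
```

Under those assumptions, &product; expands to "Acme Rocket Skates", and the later declaration in defaults.dtd is ignored. A document that wants the default simply leaves the entity out of its internal subset, although a processor that does not read the external subset will then not see the declaration at all.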