Text and Markup

This part of the HTML reference explains SGML syntax as it applies to HTML. For lexical issues, the purpose is to reduce the standard from the abstract system that is SGML to a concrete language, HTML. For structural issues, the purpose is to give you enough background to read the DTD.

Structured Text

An HTML document is a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:

    <HTML>
    <TITLE> A sample HTML document </TITLE>
    <H1> An Example of Structure </H1>
    Here's a typical paragraph.
    <P>
    <UL>
    <LI> Item one has an <A NAME=anchor> anchor </A>
    <LI> Here's item two.
    </UL>
    </HTML>

Some elements (e.g. P, LI) are "empty." They have no content. They show up as just a start tag.

For the rest of the elements, the content is a sequence of data characters and nested elements. The content must match the element's model group from its declaration in the DTD.

Using the example from above, the content of the UL element is the sequence "LI, #PCDATA, A, LI, #PCDATA". This matches the model group from the UL element declaration: "(#PCDATA|LI|A)+".
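As a rough illustration of the hierarchy and the model-group check described above, the following Python sketch builds the UL element from the example and tests its content sequence against "(#PCDATA|LI|A)+". The names Element, content_sequence, and matches_ul_model are hypothetical, chosen for this sketch only. The model group is hard-coded rather than read from the DTD.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """One node in the element hierarchy: a name, attributes, and content."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # Content is a sequence of data characters (str) and nested elements.
    content: List[Union[str, "Element"]] = field(default_factory=list)


def content_sequence(element):
    """Summarize content as element names, using #PCDATA for runs of text."""
    return [c.name if isinstance(c, Element) else "#PCDATA"
            for c in element.content]


def matches_ul_model(sequence):
    """Check a sequence against the model group (#PCDATA|LI|A)+."""
    allowed = {"#PCDATA", "LI", "A"}
    return len(sequence) >= 1 and all(item in allowed for item in sequence)


# The UL element from the sample document. LI is empty, so it has no content.
ul = Element("UL", content=[
    Element("LI"),
    " Item one has an ",
    Element("A", {"NAME": "anchor"}, [" anchor "]),
    Element("LI"),
    " Here's item two. ",
])

print(content_sequence(ul))
print(matches_ul_model(content_sequence(ul)))
```

A real SGML parser derives the allowed content from the element declarations in the DTD. The fixed set here covers this one model group only.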

Parsing Content Into Data and Markup

An HTML document is like a text file, except that some of the characters are interpreted as markup, rather than document content. The following table lists the special character sequences that separate data from markup in an HTML document.

SGML delimiters

CRO
Character Reference Open: "&#", when followed by a letter or a digit, signals a character reference. SGML idioms include things like "&#168;" and "&#SPACE;". It is not used in HTML.
ERO
Entity Reference Open: "&", when followed by a letter, signals an entity reference.
ETAGO
End Tag Open: "</", when followed by a letter, signals an end tag.
MDO
Markup Declaration Open: "<!", when followed by a letter or "--" or "[", signals one of several SGML markup declarations. The only purpose it serves in HTML is to introduce comments.
MSC
Marked Section Close: "]]", when followed by ">", signals the end of a marked section. While marked sections are not used by HTML, this sequence of characters is recognized and reported as an error by conforming SGML parsers.
PIO
Processing Instruction Open: "<?" signals a processing instruction. It is not used in HTML.
STAGO
Start Tag Open: "<", when followed by a letter, signals a start tag.
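
The recognition rules in the table can be sketched as a single regular expression, with each "when followed by" condition written as a lookahead. This is a simplified sketch, not a conforming parser: it scans plain text and ignores context, so it would also report delimiters inside tags or comments. The name find_delimiters is an assumption made for this example.

```python
import re

# One alternative per delimiter; lookaheads encode the "when followed by" rules.
DELIMITERS = re.compile(
    r"(?P<CRO>&#(?=[A-Za-z0-9]))"        # "&#" followed by a letter or digit
    r"|(?P<ERO>&(?=[A-Za-z]))"           # "&" followed by a letter
    r"|(?P<ETAGO></(?=[A-Za-z]))"        # "</" followed by a letter
    r"|(?P<MDO><!(?=[A-Za-z]|--|\[))"    # "<!" followed by a letter, "--" or "["
    r"|(?P<MSC>\]\](?=>))"               # "]]" followed by ">"
    r"|(?P<PIO><\?)"                     # "<?" with no further condition
    r"|(?P<STAGO><(?=[A-Za-z]))"         # "<" followed by a letter
)


def find_delimiters(text):
    """Return (delimiter name, offset) pairs in the order they occur."""
    return [(m.lastgroup, m.start()) for m in DELIMITERS.finditer(text)]


print(find_delimiters("<P>x &amp; y</P>"))
print(find_delimiters("a < b & c"))
```

In the second call, "<" and "&" are each followed by a space, so neither is recognized as a delimiter and both remain data characters.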

Normal Text: Parsed Character Data

In the DTD, the symbol PCDATA stands for parsed character data, the normal text characters in an HTML document.

The text consists of a stream of lines. The division into lines has no significance apart from indicating a word end.

All of the SGML delimiters listed in the table of delimiters are recognized in PCDATA.

Raw Text: Character Data

In the DTD, the symbol CDATA stands for character data, the text without markup in an SGML document. Only the end tag open delimiter is recognized in CDATA.
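
To illustrate the CDATA rule, the sketch below splits a string at the first end tag open delimiter ("</" followed by a letter) and treats everything before it as raw text, including characters such as "<" and "&". The function name split_cdata and the element name X in the usage line are hypothetical.

```python
import re

# End tag open: "</" followed by a letter.
ETAGO = re.compile(r"</(?=[A-Za-z])")


def split_cdata(text):
    """Split text into (character data, remainder starting at the end tag)."""
    match = ETAGO.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


data, rest = split_cdata("if (a < b && c) x;</X>after")
print(repr(data))
print(repr(rest))
```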

Tags

The characters in an SGML document are organized into a hierarchy of elements by the use of tags. Tags are set off from the data characters by angle brackets: '<' and '>'.

Names

The element name immediately follows "<". Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.
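
The name rule can be written as a short pattern. This is a minimal sketch assuming ASCII letters and digits. The helper names is_valid_name and same_name are not part of any standard API.

```python
import re

# A letter followed by up to 33 letters, digits, periods, or hyphens.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")


def is_valid_name(text):
    """True if the whole string is a well-formed name."""
    return NAME.fullmatch(text) is not None


def same_name(a, b):
    """Names are not case sensitive, so compare them after folding case."""
    return a.upper() == b.upper()


print(is_valid_name("H1"), is_valid_name("1H"), same_name("ul", "UL"))
```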

Attributes

Following the element name, whitespace and attributes are allowed. An attribute consists of a name, an equal sign, and a value. Spaces are allowed around the equal sign.

The value is either a token or a literal. A token is up to 34 letters, digits, periods, or hyphens. Tokens are case sensitive.

A literal is a string surrounded by single quotes or a string surrounded by double quotes. Entity references are processed inside attribute values as inside PCDATA. The length of an attribute value (after entity processing) is limited to 1024 characters.

Each attribute has a type, which puts constraints on the values it can have. For example, the NAME attribute of the A element is an ID. An ID is a name that must be unique among all IDs in the document.
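
The attribute syntax described above (a name, an equal sign with optional spaces, and a token or quoted literal) might be parsed along these lines. This sketch takes the text that follows the element name inside a start tag. It does not process entity references or check attribute types, so the 1024-character limit is applied to the raw value. The name parse_attributes is an assumption for illustration.

```python
import re

# name = value, where value is a quoted literal or a bare token.
ATTRIBUTE = re.compile(r"""
    \s*([A-Za-z][A-Za-z0-9.-]*)   # attribute name
    \s*=\s*                       # equal sign, spaces allowed around it
    (?: "([^"]*)"                 # literal in double quotes
      | '([^']*)'                 # literal in single quotes
      | ([A-Za-z0-9.-]+)          # token
    )
""", re.VERBOSE)


def parse_attributes(text):
    """Return a list of (NAME, value) pairs; raise ValueError if malformed."""
    attributes = []
    pos = 0
    while text[pos:].strip():
        match = ATTRIBUTE.match(text, pos)
        if match is None:
            raise ValueError(f"malformed attribute at offset {pos}")
        name, double_quoted, single_quoted, token = match.groups()
        if token is not None and len(token) > 34:
            raise ValueError("token longer than 34 characters")
        value = next(v for v in (double_quoted, single_quoted, token)
                     if v is not None)
        if len(value) > 1024:
            raise ValueError("attribute value longer than 1024 characters")
        # Names are not case sensitive; values keep their case.
        attributes.append((name.upper(), value))
        pos = match.end()
    return attributes


print(parse_attributes("NAME=anchor"))
print(parse_attributes(" name = 'a b' Title=\"x\""))
```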

Entities

To include characters that would otherwise be parsed as markup, you can use entity references to refer to those characters.

An entity reference is an ampersand, followed by a name, followed by a semicolon. No spaces are allowed within an entity reference. For example:

This is how you include a &lt;tag&gt; as data.
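
A small sketch of the same idea in Python: escape_data replaces the characters that could start markup with the entity references amp, lt, and gt, and is_entity_reference checks the ampersand-name-semicolon shape described above. Both function names are hypothetical, and the check covers the syntax only, not whether the entity is actually declared in the DTD.

```python
import re

# An ampersand, a name, and a semicolon, with no spaces allowed inside.
ENTITY_REFERENCE = re.compile(r"&[A-Za-z][A-Za-z0-9.-]*;")


def is_entity_reference(text):
    """True if the whole string has the form of an entity reference."""
    return ENTITY_REFERENCE.fullmatch(text) is not None


def escape_data(text):
    """Replace markup-significant characters with entity references."""
    # "&" must be replaced first so that later replacements are not re-escaped.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


print(escape_data("This is how you include a <tag> as data."))
```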

Comments

Comment declarations can be used to include information aimed at persons and tools that read the document in source form. This information will be ignored when the document is processed by an SGML parser.

Comments begin with the character sequence "<!--" and end with "--", which must be followed by '>'. (Technically, whitespace is allowed between the closing "--" and '>'.) They are only allowed in PCDATA.
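
The comment form described above can be approximated with a regular expression. This sketch handles only the simple case of one comment per declaration, with optional whitespace between the closing "--" and '>'. It does not check that the comment occurs in PCDATA. The name strip_comments is an assumption for this example.

```python
import re

# "<!--", the comment text, "--", optional whitespace, then ">".
COMMENT = re.compile(r"<!--(.*?)--\s*>", re.DOTALL)


def strip_comments(text):
    """Remove comment declarations, as a parser would ignore them."""
    return COMMENT.sub("", text)


print(strip_comments("before<!-- a note for source readers -->after"))
```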