Abstract

A document that uses polyglot markup is a stream of bytes that parses into identical document trees (with some exceptions, as noted in the Introduction) when processed as HTML and when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted compatibly, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case (normally lower case but occasionally camel case) for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on void elements, named entity references, and the use of scripts and style.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document summarizes design guidelines for authors who wish their XHTML or HTML documents to validate on both HTML and XML parsers. This specification is intended to be used by web authors, particularly authors who want to serve receivers which may have either (but not both) XML or HTML parsers available. This commonly arises in legacy systems and content syndication. Polyglot is one of several transition mechanisms from legacy XML to HTML5 and this document serves to describe it accurately.

No recommendation is made in this document or by the W3C regarding whether or not to publish polyglot content. In general, authors are encouraged to publish HTML content using HTML5 syntax and media types (either HTML syntax and text/html, or XHTML syntax and application/xhtml+xml).

This document is not a specification for user agents and creates no obligations on user agents. Note that this recommendation does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For user agent guidance and for these definitions, see [HTML5] and [RFC2854].

Please submit bugs for this document by using the W3C's public bug database (http://www.w3.org/Bugs/Public/) with the product set to HTML WG and the component set to HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff). If you cannot access the bug database, submit comments by email to the mailing list noted below.

This document was published by the HTML working group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-html@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction

This section is non-normative.

It is sometimes valuable to be able to serve HTML5 documents that are also well formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents that are both HTML5 documents and XML documents. It is recommended that these documents be served as either text/html (if the content is transmitted to an HTML-aware user agent) or application/xhtml+xml (if the content is transmitted to an XHTML-aware user agent). Other permissible MIME types are text/xml, application/xml, and any MIME type whose subtype ends with the four characters "+xml". [XML-MT]

1.1 Scope

This section is non-normative.

Polyglot markup is a robust, but entirely optional, profile of the HTML vocabulary. Not all web content needs to be authored in polyglot markup; it is primarily an option for authors who want to increase the robustness of their documents. Polyglot markup works best, and can be a beneficial option, in controlled environments and for authoring tools.

Polyglot markup is ideal for publishing when there is a strong desire to serve both HTML and XML tool chains without having to maintain dual copies of the content: one in HTML and a second in XHTML. In addition, a single polyglot markup output requires less infrastructure to produce than separate HTML and XHTML outputs for the same content. Polyglot markup is also beneficial when lightweight processes, such as quick testing or even hand-authoring, are applied to content intended to be published both as HTML and XHTML, especially if that content is not sent through a tool chain.

Note

XML-based HTML tools or systems intended for the most general contexts of use cannot depend on polyglot input: for maximum flexibility, such tools should use the technique of using an HTML parser that produces an XML-compatible DOM or event stream.

1.2 Robustness

Polyglot markup is a means to an end, robustness, and not a goal in itself. Authors do not need to understand these benefits in order to use and benefit from this syntax, but neither does anyone need to exaggerate them. For instance, polyglot markup does not add semantics. It does, however, work to preserve semantics, including during the authoring process. Polyglot markup also doesn’t ensure accessibility, as it does not add any requirements that other relevant specs have not already added. But it can work to preserve accessibility.

The motivation for polyglot markup, and the reason it exists as a specification, is its widely supported robustness. With robust (also known as conservative) markup, authors can maximize compatibility with current and future user agents and authoring tools. [WCAG20]

Polyglot markup seeks to define constraints on the serialization of a DOM tree in a robust manner that is likely to retain semantics when said serialization is reparsed using a variety of parsers, be they full-featured and bug-free HTML5 parsers, somewhat HTML-aware parsers, or even XML parsers.

For the most part, polyglot markup is just a pure deduction of the validity constraints and syntax requirements that HTML and XHTML dictate, many of which took polyglotness into consideration when they were added to HTML5. However, for reasons of robustness, the spec sometimes goes a little further than the principle of the lowest common denominator would have required.

For instance, included in the set of constraints on the serialization is the requirement to use the UTF-8 encoding. This requirement exists not only because of the documented benefits (the HTML-specific benefits are described in HTML5 [HTML5]), which in turn have led the HTML5 specification to recommend that all new documents use UTF-8, but also because UTF-8 is the sole encoding that every parser, be it an HTML parser or an XML parser, is required to support. Also, UTF-8 might in some situations be the sole HTML-conforming option, since it is one of only two encodings (the other being UTF-16, with its own, separate set of well-known issues) for which XML well-formedness rules don’t require the encoding to be explicitly declared. This in turn has the benefit that the XML encoding declaration, which is invalid in HTML anyway, can reliably be skipped without causing any side effects. If, for example, one opted to use the KOI8-R encoding, then, as a side effect of the HTML-conformance and XML well-formedness requirements, the author would be forced to rely on a higher-level protocol (such as MIME Content-Type) in order to support XML parsers. By requiring UTF-8, this side effect is avoided. And so, while not the only theoretical possibility, the choice of UTF-8 as the sole option is justified by the underlying principle of robustness.

Using robust syntax can enable documents to be parsed more reliably in less capable parsers. But even if the document can be expected to be parsed and validated by fully HTML5-conforming tools, polyglot markup adds robustness. As an example, when serialized as HTML, the closing tag for the p element is entirely optional and will be inferred if not present. Including closing tags, as required by XML and thus by polyglot markup, causes no harm beyond a minor increase in transfer size (an increase often mitigated by compression), and it allows validators to detect situations where the implicit closing rules don't match what the author intended.
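As an illustration of how explicit end tags let a strict parser catch mistakes, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree as a stand-in for any XML parser; the helper name is_well_formed and the sample strings are hypothetical and not part of this specification.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Explicit end tags, as polyglot markup requires: accepted by the XML parser.
closed = "<div><p>first</p><p>second</p></div>"
# End tags omitted, as the HTML syntax permits: rejected by the XML parser.
omitted = "<div><p>first<p>second</div>"

print(is_well_formed(closed))
print(is_well_formed(omitted))
```

An HTML parser would accept both strings and infer the missing end tags; the XML check only reports that the second string is not well-formed, which is the kind of signal a validator can surface to the author.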

Note

Polyglot markup is not defined as "robust markup" because the XML-based polyglot markup syntax is not the only way to increase robustness. For instance, an HTML validator or an authoring tool could require all tags to be closed even if this is not required by the HTML syntax. But then again, polyglot markup, being valid XML, has practical benefits which such a custom setup alone would not have.

2. Syntax

2.1 Principles

Polyglot markup results in:

Polyglot markup is not constrained:

Polyglot markup is scripted according to the rules of XML (does not use document.write, for example) and excludes HTML elements that are impossible to replicate in an XML parser (does not use the noscript element, for example). Polyglot markup triggers non-quirks mode in HTML parsers, as non-quirks mode is closest to XML-mode rendering, in regard to both DOM and CSS. Polyglot markup results in the same encoding and the same language in both HTML-mode and XML-mode.

Polyglot markup, itself being valid HTML5, supports extensibility as it is defined in Section 2.2.3 Extensibility of HTML5, so long as the extension does not violate the rules of polyglot markup. [HTML5] In addition, being well formed XML, polyglot markup can be extended when it is served as application/xhtml+xml.

3. Writing HTML documents

3.1 Processing instructions and the XML declaration

Processing Instructions and the XML Declaration are both forbidden in polyglot markup.

3.2 Specifying a document’s character encoding

Polyglot markup uses the UTF-8 character encoding, the only character encoding for which both HTML and XML require support. HTML requires UTF-8 to be explicitly declared to avoid fallback to a legacy encoding [HTML5]. For XML, UTF-8 is an encoding default. As such, character encoding MAY be left undeclared in XML with the result that UTF-8 is still supported [XML10].

Polyglot markup declares the UTF-8 character encoding in the following ways, which may be used separately or in combination (but note that there can only be a single HTML encoding declaration):

Note

Both XML and HTML parsers are required to support the byte order mark. The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8.

The W3C Internationalization (i18n) Group recommends to always include a visible encoding declaration in a document, because it helps developers, testers, or translation production managers to check the encoding of a document visually.
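The XML encoding default described in this section can be observed with a minimal Python sketch. It assumes xml.etree.ElementTree (backed by the expat parser) as a representative XML parser; the helper name parse_undeclared and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_undeclared(data: bytes):
    """Parse bytes that carry no XML encoding declaration.

    Returns the root element's text, or None if the bytes are rejected.
    """
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

source = "<p>caf\u00e9</p>"

# UTF-8 bytes: the XML default applies and the text is read back intact.
print(parse_undeclared(source.encode("utf-8")))
# ISO-8859-1 bytes without a declaration: not valid UTF-8, so parsing fails.
print(parse_undeclared(source.encode("iso-8859-1")))
```

The second call fails because a legacy encoding would need the XML encoding declaration, which polyglot markup does not use.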

3.3 The DOCTYPE

Polyglot markup uses a document type declaration (DOCTYPE) specified by section 8.1.1 of [HTML5]. In addition, the DOCTYPE conforms to the following rules:

Note

The string html SHOULD be in lowercase letters, in order to be both well-formed and valid XML; however, the string MAY be in mixed case or uppercase letters and still be well-formed XML.

Note that using about:legacy-compat in XML may yield unpredictable parsing results, depending on the XML processing pipeline.

Polyglot markup does not use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML.

3.4 Namespaces

The following rules apply to namespaces used in polyglot markup.

3.4.1 Element-level namespaces

[HTML5] introduces undeclared (native) default namespaces for the root HTML element, html, the root SVG element, svg, and the root MathML element, math. Polyglot markup declares the following default namespaces, when the markup languages are included in the document, to maintain XML-compatibility [XML10]:

  • <html xmlns="http://www.w3.org/1999/xhtml">
  • <math xmlns="http://www.w3.org/1998/Math/MathML">
  • <svg xmlns="http://www.w3.org/2000/svg">

Polyglot markup declares the default namespaces on the root HTML element, html, the root SVG element, svg, and the root MathML element math, and on any HTML elements used as children of SVG or MathML elements. Polyglot markup does not declare any other default or prefixed element namespace, because [HTML5] does not natively support the declaring of any other default or prefixed element namespace.
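The effect of the default namespace declarations on an XML parser can be sketched as follows. This is an illustrative Python example, assuming xml.etree.ElementTree as the XML parser; the helper name namespaces_of and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def namespaces_of(markup: str) -> set:
    """Return the namespace URIs of all elements ('' means no namespace)."""
    uris = set()
    for el in ET.fromstring(markup).iter():
        # ElementTree spells namespaced names as '{uri}local'.
        uris.add(el.tag[1:].split("}")[0] if el.tag.startswith("{") else "")
    return uris

declared = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    '<body><svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg></body>'
    "</html>"
)
undeclared = "<html><head><title>t</title></head><body><svg><circle r='1'/></svg></body></html>"

print(namespaces_of(declared))
print(namespaces_of(undeclared))
```

Without the xmlns declarations the XML parser places every element in no namespace, whereas an HTML5 parser assigns the HTML and SVG namespaces natively, so the two trees would differ.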

3.4.2 Attribute-level namespaces

[HTML5] introduces undeclared (native) support for attributes in the XLink namespace and with the prefix xlink:. Polyglot markup declares the XLink namespace on the HTML root element (html) or once on the foreign element where it is used (svg or math), to maintain XML-compatibility [XML10].

In polyglot markup, the xlink prefix uses the namespace declaration xmlns:xlink="http://www.w3.org/1999/xlink" before using the xlink prefix for the following attributes:

  • xlink:actuate
  • xlink:arcrole
  • xlink:href
  • xlink:role
  • xlink:show
  • xlink:title
  • xlink:type

Note that there are other prefixed attributes that can be used beyond xlink:href (such as xml:base). Polyglot markup does not declare these prefixes via xmlns. The prefixes are implicitly declared in XML and are automatically applied to the appropriate attributes in HTML.
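A short Python sketch can show why the xlink prefix needs a declaration while the xml prefix does not. It assumes xml.etree.ElementTree as the XML parser; the helper name attribute_names and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def attribute_names(markup: str):
    """Return the attribute names of the first child element as an XML
    parser sees them, or None if the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return sorted(root[0].attrib)

declared = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<a xlink:href="#x" xml:lang="en"/></svg>'
)
# Same markup, but the xlink prefix is never declared.
undeclared = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<a xlink:href="#x" xml:lang="en"/></svg>'
)

print(attribute_names(declared))
print(attribute_names(undeclared))
```

The first call lists both attributes with their namespace URIs, with no xmlns:xml declaration needed for xml:lang; the second returns None because the undeclared xlink prefix makes the document fail namespace-aware XML parsing.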

The namespaced attributes, such as xml:lang="" and xmlns="", are "namespaced" within XHTML, SVG and MathML. Thus, the rules for how they can be used in CSS selectors are governed by CSS namespaces. [CSS3NAMESPACE] For more on the issues related to attribute selectors and namespaces, with and without prefix, see the section on Scripting and styling polyglot markup.

3.5 Element syntax

Polyglot markup conforms to the following rules regarding elements.

3.5.1 Required elements and tags

HTML5’s concept of optional tags (start tags and/or end tags) covers elements that the HTML parser itself automatically adds to the DOM if the code doesn’t contain the tags for them. However, XML does not have a feature whereby elements with one or both tags omitted from the code (such as when the start and end tags of html are omitted) are added to the DOM. Omitting a tag in polyglot markup is therefore equivalent to producing a document that is not well-formed or, if both tags are omitted, to not adding the element at all. For this reason, polyglot markup does not operate with optional tags.

That polyglot markup doesn’t operate with optional tags may surprise, for example, someone not used to adding the tbody tags in their code or someone accustomed to omitting the end tag of the p element. However, the requirement to be complete with regard to tags is a key feature of polyglot markup that makes the code robust against subpar parsers and authoring surprises.

3.5.1.1 A minimal HTML document

Every polyglot markup document therefore contains an html, head, title, and body element, represented in the code with their tags. The html element is the root element. The head and body elements are children of the html element. The title element is a child of the head element. Therefore, the following source code would be the most basic polyglot markup document.

Example 4
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>
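As a rough check that the minimal document above yields the same element sequence in both syntaxes, the following Python sketch feeds it to two standard-library parsers. It assumes html.parser.HTMLParser and xml.etree.ElementTree as stand-ins; note that HTMLParser is only a tokenizer and does not implement the HTML5 tree-building rules, so this is an approximation rather than a conformance test. The names TagCollector, html_tags and xml_tags are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MINIMAL = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>"""

class TagCollector(HTMLParser):
    """Record start-tag names in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def html_tags(markup: str) -> list:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags

def xml_tags(markup: str) -> list:
    # Drop the '{namespace}' part that ElementTree puts before local names.
    return [el.tag.split("}")[-1] for el in ET.fromstring(markup).iter()]

print(html_tags(MINIMAL))
print(xml_tags(MINIMAL))
```

Both calls list the same four elements in the same order, which is the property the minimal document is meant to guarantee.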

3.5.1.2 Required tags examples

Whenever it uses a tr element, polyglot markup always wraps the tr element inside a tbody, thead, or tfoot element. In HTML, if a group of one or more adjacent tr elements is not explicitly wrapped inside a tbody, thead, or tfoot element, the HTML parser creates and wraps a new tbody element around the tr elements. XML parsers do not create the tbody element, thus offering the potential for creating different DOMs.

Correct:

Example 5
<table>
<tbody>
<tr>...

Incorrect:

Example 6
<table>
<tr>...

Whenever it uses col elements within a table element, polyglot markup explicitly uses a colgroup element surrounding groups of the col elements. In HTML, if a group of one or more adjacent col elements is not explicitly wrapped inside a colgroup element, the HTML parser creates and wraps a new colgroup element around the col elements. XML parsers do not create the colgroup element, thus offering the potential for creating different DOMs.

Correct:

Example 7
<table>
<colgroup>
<col/>...

Incorrect:

Example 8
<table>
<col>...
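The tbody and colgroup requirements above can be checked mechanically on the XML side. The following is a minimal Python sketch, assuming xml.etree.ElementTree and, for brevity, markup without namespace declarations; the names REQUIRED_PARENTS and unwrapped_elements are hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements that an HTML parser would wrap automatically, mapped to the
# wrappers polyglot markup writes out explicitly.
REQUIRED_PARENTS = {
    "tr": {"tbody", "thead", "tfoot"},
    "col": {"colgroup"},
}

def unwrapped_elements(markup: str) -> list:
    """Return (child, parent) tag pairs where an explicit wrapper is missing."""
    problems = []
    for parent in ET.fromstring(markup).iter():
        for child in parent:
            allowed = REQUIRED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                problems.append((child.tag, parent.tag))
    return problems

good = ("<table><colgroup><col/></colgroup>"
        "<tbody><tr><td>1</td></tr></tbody></table>")
bad = "<table><col/><tr><td>1</td></tr></table>"

print(unwrapped_elements(good))
print(unwrapped_elements(bad))
```

The XML parser reports the tree exactly as written, so the second table has col and tr as direct children of table, while an HTML parser would have inserted colgroup and tbody.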

3.5.2 Excluded elements and tags

The noscript element is non-conforming in XHTML, and therefore also in polyglot markup, due to the fact that XML has no mechanism by which to produce the effect it has in HTML. [HTML5]

Note

Elements with features designed for HTML alone are non-polyglot from the outset. Currently, all such elements are legacy elements, and all but noscript, which HTML5 forbids in XHTML alone, are also obsoleted by the HTML specification for both HTML and XHTML.

3.5.3 Case-sensitivity

The following apply to any usage of element names, attribute names, or attribute values in markup, script, or CSS. Polyglot markup uses lower case letters for all ASCII letters. For non-ASCII letters—such as Greek, Cyrillic, or non-ASCII Latin letters—polyglot markup respects case sensitivity as it is called for.

3.5.3.1 Element names

Polyglot markup uses the correct case for element names.

  • Polyglot markup uses lowercase letters for all HTML element names.
  • Polyglot markup uses lowercase letters for all MathML element names.
  • Polyglot markup uses lowercase letters for all SVG element names except the following, for which polyglot markup uses mixed case:
    • altGlyph
    • altGlyphDef
    • altGlyphItem
    • animateColor
    • animateMotion
    • animateTransform
    • clipPath
    • feBlend
    • feColorMatrix
    • feComponentTransfer
    • feComposite
    • feConvolveMatrix
    • feDiffuseLighting
    • feDisplacementMap
    • feDistantLight
    • feFlood
    • feFuncA
    • feFuncB
    • feFuncG
    • feFuncR
    • feGaussianBlur
    • feImage
    • feMerge
    • feMergeNode
    • feMorphology
    • feOffset
    • fePointLight
    • feSpecularLighting
    • feSpotLight
    • feTile
    • feTurbulence
    • foreignObject
    • glyphRef
    • linearGradient
    • radialGradient
    • textPath
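A tool that receives lowercased or otherwise miscased SVG element names could restore the required case with a lookup table. This is a minimal Python sketch built from the list above; the names SVG_MIXED_CASE and canonical_svg_name are hypothetical.

```python
# The mixed-case SVG element names listed in this section.
SVG_MIXED_CASE = [
    "altGlyph", "altGlyphDef", "altGlyphItem", "animateColor",
    "animateMotion", "animateTransform", "clipPath", "feBlend",
    "feColorMatrix", "feComponentTransfer", "feComposite",
    "feConvolveMatrix", "feDiffuseLighting", "feDisplacementMap",
    "feDistantLight", "feFlood", "feFuncA", "feFuncB", "feFuncG",
    "feFuncR", "feGaussianBlur", "feImage", "feMerge", "feMergeNode",
    "feMorphology", "feOffset", "fePointLight", "feSpecularLighting",
    "feSpotLight", "feTile", "feTurbulence", "foreignObject", "glyphRef",
    "linearGradient", "radialGradient", "textPath",
]

# Map each lowercased spelling to its required mixed-case spelling.
_CANONICAL = {name.lower(): name for name in SVG_MIXED_CASE}

def canonical_svg_name(name: str) -> str:
    """Return the polyglot spelling of an SVG element name."""
    lowered = name.lower()
    return _CANONICAL.get(lowered, lowered)

print(canonical_svg_name("FOREIGNOBJECT"))
print(canonical_svg_name("Circle"))
```

Names outside the list fall back to lowercase, matching the rule that all other SVG element names are written in lowercase letters.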

3.5.3.2 Attribute names

Polyglot markup uses the correct case for attribute names.

  • Polyglot markup uses lowercase letters in attribute names for all HTML elements.
  • Polyglot markup uses lowercase letters in attribute names for all MathML elements except the lowercase definitionurl, which polyglot markup changes to the mixed case definitionURL.
  • Polyglot markup uses lowercase letters in attribute names for all SVG elements except the following, for which polyglot markup uses mixed case:
    • attributeName
    • attributeType
    • baseFrequency
    • baseProfile
    • calcMode
    • clipPathUnits
    • contentScriptType
    • contentStyleType
    • diffuseConstant
    • edgeMode
    • externalResourcesRequired
    • filterRes
    • filterUnits
    • glyphRef
    • gradientTransform
    • gradientUnits
    • kernelMatrix
    • kernelUnitLength
    • keyPoints
    • keySplines
    • keyTimes
    • lengthAdjust
    • limitingConeAngle
    • markerHeight
    • markerUnits
    • markerWidth
    • maskContentUnits
    • maskUnits
    • numOctaves
    • pathLength
    • patternContentUnits
    • patternTransform
    • patternUnits
    • pointsAtX
    • pointsAtY
    • pointsAtZ
    • preserveAlpha
    • preserveAspectRatio
    • primitiveUnits
    • refX
    • refY
    • repeatCount
    • repeatDur
    • requiredExtensions
    • requiredFeatures
    • specularConstant
    • specularExponent
    • spreadMethod
    • startOffset
    • stdDeviation
    • stitchTiles
    • surfaceScale
    • systemLanguage
    • tableValues
    • targetX
    • targetY
    • textLength
    • viewBox
    • viewTarget
    • xChannelSelector
    • yChannelSelector
    • zoomAndPan

3.5.3.3 Attribute values

For characters in attribute values, polyglot markup maintains case consistency between markup, DOM APIs, and CSS when these attributes are used on HTML elements.

Polyglot markup maintains case consistency for values on the following attributes, which occur on MIME types, language tags, charsets, booleans, media queries, and keywords. Though not required, an easy way to maintain case-consistency is to use only lower case values for these attributes. Polyglot markup maintains case consistency for these values because, for the purpose of selector matching, attribute values in XML are all treated case sensitively; however, HTML treats the values of these attributes as case insensitive (See 4.14.1 Case-sensitivity, in the HTML5 specification). [HTML5]

  • accept
  • accept-charset
  • charset
  • checked
  • defer
  • dir
  • direction
  • disabled
  • enctype
  • hreflang
  • http-equiv
  • lang
  • media
  • method
  • multiple
  • readonly
  • rel (for values that do not contain a colon)
  • scope
  • selected
  • shape
  • target (keywords only; browsing context names are case-sensitive)
  • type (on a, link, object, script, or style elements)
  • type (on input)

Note that other specifications, such as RDFa, may place additional restrictions on the allowed values of certain attributes.

3.6 Element contents

For the different kinds of elements that HTML documents contain, polyglot markup conforms to the following contents rules.

3.6.1 Void elements

In the HTML syntax, void elements are elements that are always empty and never have an end tag. In polyglot markup, all elements listed as void in the HTML specification or in an extension spec MUST have the syntactic form of an XML empty-element tag (<foo/>). Other elements MUST NOT use the XML empty-element tag syntax.

Fig. 1 The void elements of the HTML specification at the time of writing.
area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr

Example: Polyglot markup uses the minimized tag syntax for void elements, e.g. <br/>, and does not use <br></br>.

Example: Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) polyglot markup does not use the minimized form. E.g. the document uses <p></p> and not <p/>.

Note

Elements in foreign content, such as MathML and SVG elements, may be either self-closing or contain content.
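The two void-element rules can be sketched as a small checker. The following Python example assumes the standard library's html.parser.HTMLParser, which reports <foo/> through handle_startendtag and <foo> through handle_starttag. It uses the void element list from Fig. 1 and, as a simplification, does not exempt foreign content such as SVG or MathML. The names VOID, VoidSyntaxChecker and void_problems are hypothetical.

```python
from html.parser import HTMLParser

# Void elements from Fig. 1.
VOID = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
}

class VoidSyntaxChecker(HTMLParser):
    """Collect violations of the polyglot void-element syntax rules."""
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Reached for <tag ...> written without the trailing slash.
        if tag in VOID:
            self.problems.append(f"<{tag}> should be written <{tag}/>")

    def handle_startendtag(self, tag, attrs):
        # Reached for <tag .../> written with the trailing slash.
        if tag not in VOID:
            self.problems.append(f"<{tag}/> should be written <{tag}></{tag}>")

def void_problems(markup: str) -> list:
    checker = VoidSyntaxChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems

print(void_problems("<p>a<br/>b</p><p></p>"))
print(void_problems("<p/><br>"))
```

The first call reports nothing; the second flags both the minimized p element and the br element written without a trailing slash.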

3.6.2 Raw text elements (script and style)

In polyglot markup, the contents of all elements listed as raw text elements in the HTML specification or in an extension spec, MUST conform to the extra requirements defined in this section.

Fig. 2 HTML5's list of raw text elements
script, style

In the HTML syntax, the contents of raw text elements are raw text: the HTML parser does not treat contained code that looks like tags (element tags and comment tags), character references, CDATA, and so on as such constructs, but as raw text. (See HTML5 for the exact rules.) In the XHTML syntax, however, the same constructs are treated as tags, character references, CDATA, and so on.

As a result, it is simpler in HTML than in XHTML for authors to comply with the requirement of the default MIME types of the raw text elements. On the other hand, by the use of CDATA, the raw text contents parsed as XHTML can be made even less semantic than the raw text data of HTML, leading to potential harm if the document is parsed as HTML.

Fig. 3 Overview of the differences in how HTML and XML parse raw text elements

| Ambiguous string | Info | HTML interpretation | XML interpretation inside <![CDATA[section]]> | XML interpretation outside <![CDATA[section]]> |
|---|---|---|---|---|
| < | LESS-THAN SIGN | uninterpreted (but see the </script and </style rows) | uninterpreted | interpreted (commences tags, comments, CDATA) |
| & | AMPERSAND | uninterpreted | uninterpreted | interpreted (commences character reference or entity) |
| <!-- | start of comment | partly uninterpreted | uninterpreted | interpreted |
| --> | end of comment | partly uninterpreted | uninterpreted | interpreted |
| <![CDATA[ | start of CDATA declaration | uninterpreted | uninterpreted | interpreted (begins CDATA block) |
| ]]> | end of CDATA declaration | uninterpreted | uninterpreted | interpreted (ends CDATA block) |
| cdata content | the content of CDATA sections | uninterpreted | | |
| </script | if occurring inside script element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | interpreted |
| </style | if occurring inside style element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | interpreted |
| <foo></bar> | all other tags, well-formed or not | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
| &#foo; | character references | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
| none of the above strings | any other string | uninterpreted | uninterpreted | uninterpreted |

Syntactically, the polyglot subset is found by either of the following:

  • limiting the content to safe content, that is, text that gets interpreted the same way in HTML and in XML; or
  • evening out the differences in constraints by wrapping the contents in a CDATA section. The CDATA code is then seen as text by the HTML parser (and can thus interfere with the scripting or styling language!), while the XML parser sees the content as text without markup semantics.

Limiting the contents to safe content requires more planning and control over the code, but can be said to be more robust than the CDATA option as it requires no extra, potentially breakable code to make the scripting or styling language work. The CDATA option on the other hand, gives more freedom and robustness against various errors that can happen because the author isn’t aware of the safe content limitations or because the code is inserted by a tool that is unable to guarantee that the content is safe.

3.6.2.1 The safe text content option

The safe text content option comes in two variants:

  • The external safe text content variant. This means including the scripts or stylesheets by linking to an external file rather than including all the code inline. External files are parsed as the respective script or stylesheet language, and are thus not limited by the safe text content restrictions.
    Fig. 4 Using external safe content.
    Example 9
    <!-- Ways to link to external scripts or stylesheets -->
    <script src="external.js" ></script>
    <link href="external.css" rel="stylesheet"/>
    <style>@import "external.css";</style>
  • The inline safe text content variant. This means abstaining from characters and constructs which HTML and XML interpret differently, namely the characters < and & as well as the CDATA end mark string, ]]>.
    Fig. 5 Using inline safe text content
    Example 10
    <!-- Unsafe content: < and & are not escaped
    This code is not XML well-formed. -->
    <style>q::before{content:"<";}</style>
    <script>var a = "&";</script>
    <!-- Unsafe content: < and & are escaped at markup language level.
    This code means different things in HTML vs XML -->
    <style>q::before{content:"&lt;";}</style>
    <script>var a = "&amp;";</script>
    <!-- Safe content: < and & escaped at scripting/stylesheet level -->
    <style>q::before{content:"\00003c";}</style>
    <script>var a = "\u0026";</script>

    For CSS, the inline safe text content option works well most of the time, as < and & are not key parts of CSS and are not used very often. But in JavaScript, & and < are key operators of the language, so one soon runs into trouble; there it is better to use external safe content.
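The inline safe text content rule lends itself to a simple automated check. The following Python sketch flags the three constructs named above in the text of a script or style element, and uses xml.etree.ElementTree to show what an XML parser makes of a markup-level escape. The names UNSAFE, unsafe_constructs and xml_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Constructs that HTML and XML interpret differently inside raw text elements.
UNSAFE = ("<", "&", "]]>")

def unsafe_constructs(raw_text: str) -> list:
    """Return the unsafe constructs found in script or style content."""
    return [s for s in UNSAFE if s in raw_text]

def xml_text(markup: str) -> str:
    """Return the element's text content as an XML parser reports it."""
    return ET.fromstring(markup).text

print(unsafe_constructs('q::before{content:"<";}'))
print(unsafe_constructs('var a = "&amp;";'))
print(unsafe_constructs("if (a[b[0]]>1) { run(); }"))
print(unsafe_constructs('var a = "\\u0026";'))

# In XML the markup-level escape is decoded before the script engine sees
# it; in HTML the same five characters would be passed through unchanged.
print(xml_text('<script>var a = "&amp;";</script>'))
```

Only the last unsafe_constructs call returns an empty list, because the escape is done at the scripting level, as in the final lines of the example above.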

Fig. 6 An example of inline safe text content in script
Example 11
<!-- The following example is polyglot markup because there are no
     ambiguous strings within the script element. -->
<script>document.body.appendChild(document.createElement("div"));</script>

Note

A workaround for using ambiguous strings is to include the properly escaped characters inside the src attribute of style or script tags.

3.6.2.2 The safe CDATA option

The safe CDATA option wraps the raw text content in a CDATA section, but instead of permitting any content (except the CDATA end mark string, ]]>), it permits only the subset that corresponds to the particular raw text element’s HTML constraints. See the "HTML interpretation" column in the parsing differences table above: all the cells with the text "uninterpreted" are also uninterpreted as CDATA and thus constitute the safe subset of CDATA.

But while CDATA evens out the constraints, it introduces a new problem: when consumed as HTML, the start and end marks of the CDATA section are seen by the script or stylesheet interpreter and can thus cause syntax errors or even halt the script and stylesheet execution. The way to deal with this is to comment out the CDATA start and end marks using the comment methods of the script or stylesheet language. Additionally, if, for example, script is used as a coding block container, it may be necessary to comment out even the scripting/styling comments by hiding them inside an XML comment.

3.6.2.2.1 Safe CDATA usage rules

These rules assume that CDATA is of limited use for CSS.

General rules:

  • The CDATA section is subject to HTML’s restrictions on <script>/<style>
  • Only one CDATA section permitted per raw text element
  • Before the CDATA section there can only be one node, preferably only one line of code, which may consist of whitespace, an XML comment, or a construct of the scripting/styling language (usually a comment of the scripting/styling language).
  • After the CDATA section: Same rules as for before the CDATA section.

The ]]> string:

  • is always commented out if <![CDATA[ is commented out.
  • is never commented out if <![CDATA[ is not commented out.
  • Example 12
    //]]>  </script>

The <![CDATA[ string can be handled in 3 ways:

  1. <![CDATA[ - without commenting it out.
    Example 13
    <script type="not-CSS-and-not-JS"><![CDATA[foo]]></script>
    • Important: Unpermitted for 'text/css' and 'text/javascript'!
    • Advantage: Can be useful for type="text/html" and templating in general. Svelte - saves bytes. Puristic.
    • Disadvantage: scripts might need to be tuned to support it.
  2. //<![CDATA[ - pure scripting language level commenting out. Comment starts in the node before the CDATA section:
    Example 14
    <script>//<![CDATA[ FOO; //]]></script>
    • Advantage: Well known in JavaScript. Much used.
    • Disadvantage: Less safe for templating since the comment could become treated as part of the template.
  3. <!--//--><![CDATA[ - Same as 2, but the scripting comment is hidden inside an XML comment.
    Example 15
    <script><!--//--><![CDATA[ FOO; //]]></script>
    • Advantage: Versatile.
      • "out of the box" compatible with use of script as container for templating
      • compatible w/JavaScript
      • compatible w/CSS (however rule 2 above prevents validity)
    • Disadvantages:
      • The JavaScript linter might not like it.
      • The scripting language must accept <!-- as syntactically legal (which JavaScript does)
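
The three ways of handling the CDATA start mark can be sketched as a small JavaScript helper. This is only an illustration under stated assumptions: the function name wrapRawText and its style argument are hypothetical, the code is placed on its own line for the two commented styles, and the checks cover only the two strings that would end the CDATA section or the script element early.

```javascript
// Hypothetical helper (not part of these guidelines): wraps the text of a
// script element in a single CDATA section, using one of the three styles
// listed above.
function wrapRawText(code, style) {
  // "]]>" would end the CDATA section early when the document is parsed as XML.
  if (code.includes("]]>")) {
    throw new Error('content must not contain "]]>"');
  }
  // "</script" would end the element early when the document is parsed as HTML.
  if (/<\/script/i.test(code)) {
    throw new Error('content must not contain "</script"');
  }
  switch (style) {
    case "bare":
      // Style 1: nothing is commented out. Not for text/javascript or text/css.
      return "<![CDATA[" + code + "]]>";
    case "js-comment":
      // Style 2: both marks are hidden in scripting-language comments.
      return "//<![CDATA[\n" + code + "\n//]]>";
    case "xml-comment":
      // Style 3: the scripting comment before the start mark is itself
      // hidden inside an XML comment.
      return "<!--//--><![CDATA[\n" + code + "\n//]]>";
    default:
      throw new Error("unknown style: " + style);
  }
}
```

In each style the end mark is commented out exactly when the start mark is, which matches the rule for the ]]> string given above.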

3.6.3 Escapable raw text elements

Escapable raw text elements are elements in which character references are permitted, but where the HTML parser treats elements as text rather than as markup.

  • title
  • textarea

Escapable raw text elements are subject to the same rules of safe text content, with the exception that polyglot character entities are permitted.

3.6.4 Foreign elements

The exact rules for foreign content elements are defined by the respective specifications.

3.6.5 Normal elements

Normal elements have no special restrictions other than those that normally apply to polyglot markup. But note that some elements, such as the iframe element, must be empty in polyglot markup, since this is a requirement that the HTML specification sets on iframe in the XHTML syntax.

3.7 Text

3.7.1 Newlines in textarea and pre elements

When polyglot markup uses either a textarea or pre element, the text within the element should not begin with a newline.
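
The reason for this rule is that an HTML parser drops a single newline that immediately follows the start tag of a textarea or pre element, whereas an XML parser keeps it. The following JavaScript sketch models that difference for plain text content. The function names are hypothetical, and character references and child elements are deliberately ignored.

```javascript
// Both parsers normalize CR LF and lone CR to LF before anything else.
function normalizeNewlines(raw) {
  return raw.replace(/\r\n?/g, "\n");
}

// Hypothetical model of the text an HTML parser keeps from the raw text
// that follows a <pre> or <textarea> start tag: one leading newline is dropped.
function textSeenByHtmlParser(raw) {
  const normalized = normalizeNewlines(raw);
  return normalized.startsWith("\n") ? normalized.slice(1) : normalized;
}

// Hypothetical model of the text an XML parser keeps: nothing is dropped.
function textSeenByXmlParser(raw) {
  return normalizeNewlines(raw);
}

// True when both parsers end up with the same text content.
function hasSameTextInBothParsers(raw) {
  return textSeenByHtmlParser(raw) === textSeenByXmlParser(raw);
}
```

Under this model, hasSameTextInBothParsers returns false exactly when the text begins with a newline.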

3.8 Attributes

Polyglot markup surrounds all attribute values with quotation marks. Polyglot markup surrounds attribute values with either single quotation marks or with double quotation marks.

Polyglot markup does not use directly typed newline characters within an attribute.

Within an attribute's value, polyglot markup represents tabs, line feeds, and carriage returns as numeric character references rather than by using literal characters. For example, within an attribute's value, polyglot markup uses &#x9; for a tab rather than the literal character '\t'. This is because of attribute-value normalization in XML [XML10]. The following example uses numeric character references (escaped characters) for the line feed, tab, and less-than characters within a srcdoc attribute.

Example 16
<iframe srcdoc="&lt;p>Hello &#x0A; &#x09; world!&lt;/p>" src="demo_iframe_srcdoc.htm"></iframe>
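
The escaping used in the example above can be sketched as a small JavaScript function. This is a minimal sketch that assumes the value will be placed between double quotation marks; the function name escapeAttributeValue is hypothetical.

```javascript
// Hypothetical helper: prepares a string for use as a double-quoted
// attribute value in polyglot markup.
function escapeAttributeValue(value) {
  return value
    .replace(/&/g, "&amp;")   // must run first so later references are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;")
    .replace(/\t/g, "&#x9;")  // tab
    .replace(/\n/g, "&#xA;")  // line feed
    .replace(/\r/g, "&#xD;"); // carriage return
}
```

Because the whitespace characters are written as numeric character references, XML attribute-value normalization does not turn them into spaces.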
Note

Because of attribute-value normalization in XML [XML10], polyglot markup does not use newline characters within an attribute. Practically speaking, for source code with newlines within attributes, the DOMs generated via XML and HTML will be different; however, the whitespace differences are of small consequence and have no behavioral impact on the page unless they are:

  • explicitly examined by JavaScript.
  • used in attributes whose content is rendered visually, such as the content of @alt.

Note that directly typed newline characters are explicitly not allowed in any attribute containing a URI.

See also Attribute Values.

3.8.1 Disallowed attributes

The following attributes are not allowed in polyglot markup. These attributes have effects in documents parsed as XML but do not have effects in documents parsed as text/html. The HTML5 spec therefore defines them as invalid in text/html documents. [HTML5]

  • xml:space
  • xml:base

Note that the xml:space and xml:base attributes are allowed on SVG and MathML elements.

3.8.2 Language attributes

When specifying the language mapping of an element, polyglot markup uses both the lang and the xml:lang attributes. Neither attribute is to be used without the other, and polyglot markup maintains identical values for both lang and xml:lang.
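
As a rough illustration, the pairing rule can be checked with a few lines of JavaScript. The function name and the plain-object representation of an element's attributes are assumptions made for this sketch only.

```javascript
// Hypothetical check: `attributes` is a plain object that maps the attribute
// names found on one element to their values.
function hasPolyglotLanguageAttributes(attributes) {
  const hasLang = "lang" in attributes;
  const hasXmlLang = "xml:lang" in attributes;
  if (!hasLang && !hasXmlLang) {
    return true; // no language specified on this element
  }
  // Both attributes must be present and carry identical values.
  return hasLang && hasXmlLang && attributes["lang"] === attributes["xml:lang"];
}
```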

The root element SHOULD always specify the language, or else HTML’s fallback language effect may step in and cause the language to vary depending on whether the document is consumed as XML (where the fallback language is not required to work) or consumed via file URI (where fallback language via external HTTP Content-Language would not work). Note that the internal http-equiv="Content-Language" meta element is non-conforming in HTML5. For more, see e.g. HTML5’s language determination rules.

3.8.3 Attributes with special considerations

The following attributes or their considerations require exceptions to the general rules for polyglot markup.

3.8.3.1 The id attribute

Polyglot markup does not contain any space characters within the value of an id attribute. This is because values for the id attribute may not contain space characters in HTML5. [HTML5]

3.9 Named entity references

Polyglot markup uses only the following named entity references:

  • &amp;
  • &lt;
  • &gt;
  • &quot;
  • &apos;

For entities beyond the previous list, polyglot markup uses character references. For example, polyglot markup uses &#xA0; instead of &nbsp;. Note that polyglot markup may use decimal values for escape characters (such as &#160; in the previous example); however, the Character Model for the World Wide Web recommends that content SHOULD use the hexadecimal form of character escapes rather than the decimal form when both are available. [CHARMOD]

Polyglot markup always uses character references for the less than sign (<) and ampersand (&) when they are used as characters, except when those characters appear inside a CDATA section.
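
The replacement of other named entity references by hexadecimal character references can be sketched in JavaScript as follows. The sketch assumes that the permitted named references are the five that XML predefines (amp, lt, gt, quot, apos). The function name is hypothetical, and the code point table is a small illustrative subset; a real tool would need the full HTML entity table.

```javascript
// Assumption: these five names are the permitted named entity references.
const PERMITTED_NAMES = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Illustrative subset only, mapping entity names to Unicode code points.
const CODE_POINTS = new Map([
  ["nbsp", 0xA0],
  ["copy", 0xA9],
  ["eacute", 0xE9],
  ["mdash", 0x2014],
]);

// Hypothetical helper: rewrites named references outside the permitted set
// as hexadecimal character references.
function toPolyglotReferences(markup) {
  return markup.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (PERMITTED_NAMES.has(name)) {
      return match; // already polyglot
    }
    if (!CODE_POINTS.has(name)) {
      throw new Error("no code point known for &" + name + ";");
    }
    return "&#x" + CODE_POINTS.get(name).toString(16).toUpperCase() + ";";
  });
}
```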

3.10 Comments

Polyglot markup does not begin a comment with either ">" or "->".

3.11 Scripting and styling polyglot markup

When applying JavaScript and CSS to polyglot markup, the goal is to get the same result whether consumed as HTML or as XML. It is therefore important to be aware of scripting and styling features that give different results in HTML vs XML. These issues come in addition to the polyglot usage rules for raw text elements.

3.11.1 JavaScript: innerHTML vs document.write()

Although document.write() and document.writeln() work in HTML, neither function works in XHTML. The polyglot alternative is the innerHTML property, which works for both HTML and XHTML.

Note

The innerHTML property takes a string. However, XML parsers will parse that string as XML in XHTML, while HTML parsers will parse that string as HTML in HTML. Because of this difference in parsing, the code that innerHTML inserts must follow the guidelines for polyglot markup so that the DOM generated by the XML parser does not differ from the DOM generated by the HTML parser.
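
A minimal sketch of a script that inserts a polyglot fragment through innerHTML might look as follows. The function names and the notice class are hypothetical. The fragment quotes its attribute value, closes the void br element with "/>", and escapes the ampersand and less-than characters, so that it is intended to parse the same way in both syntaxes.

```javascript
// Escapes text for use as element content in either syntax.
function escapeText(text) {
  return String(text).replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical helper: builds a fragment that follows the polyglot guidelines.
function buildNoticeMarkup(message) {
  return '<p class="notice">' + escapeText(message) + "<br/></p>";
}

// Works with any object that exposes an innerHTML property, such as an
// element in an HTML or XHTML document.
function showNotice(container, message) {
  container.innerHTML = buildNoticeMarkup(message);
}
```

In a page, a call might look like showNotice(document.getElementById("status"), "Saved & synced"), where the id "status" is an assumption of this example.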

3.11.2 CSS: Attribute selectors that require a namespace prefix

CSS allows authors to select elements by referencing their attributes using so-called attribute selectors: [attr]{rule:foo}. For the most part, attribute selectors can be used freely, since polyglot markup relies on default namespaces, which do not affect attributes. However, some of the attributes required by polyglot markup are namespaced, either by default (such as the xmlns attribute) or via a prefix that is namespaced by default (such as xml:, xmlns:, xlink:). Extension specifications might allow further namespaced attributes beyond those defined by the HTML specification. As a result, a selector such as [xmlns]{rule:foo} will only work in HTML; it will not work in XHTML, where xmlns is a namespace attribute. The same goes for prefixed attributes: even if one escapes the colon ([xml\:lang]{rule:foo}), such selectors will only work in HTML. The exception is the namespace declaration for the xlink: prefix, which is namespaced even in the HTML syntax and must thus be selected in a namespaced way in both syntaxes.

To be able to select namespaced attributes in XML, the attribute selector must include a namespace prefix. [SELECT]

For the unprefixed, namespaced attribute xmlns, a polyglot selector that works in both HTML and XML can be created by using the asterisk (*) for the namespace prefix, indicating that the selector is to match all attribute names without regard to the attribute's namespace:

Example 17
[*|xmlns]{color:lime}

For prefixed attributes, both the rules of polyglot markup and the HTML specification itself dictate that an xml:lang="foo" attribute must be accompanied by a corresponding lang="foo" attribute. In a conforming polyglot document, one can therefore use the same approach as for the xmlns attribute.

Example 18
[*|lang]{color:lime}
Note

However, the requirement of polyglot markup to use both xml:lang="foo" and lang="foo" means that even [lang]{color:lime} would work, in both XML parsers and HTML parsers.

When it comes to the xmlns:xlink attribute, which is required on polyglot svg elements, then, because svg is a foreign element in HTML/XHTML, the attribute is (unlike xml:lang) namespaced even in HTML. Hence, the only way, in HTML as well as in XML, to use this attribute in a selector is to declare the namespace of the xmlns: prefix in CSS:

Example 19

        @namespace xmlns "http://www.w3.org/2000/xmlns/";
        [xmlns|xlink]{border:dashed lime 3px}

In cases where the user agent does not support namespaces in CSS and/or in markup, it is necessary to use more than one selector. This could happen if the author declares namespaces, default or prefixed, which an extension specification permits, or if the user agent does not support attribute selectors with a CSS namespace prefix.

Example 20

            /*Selector for legacy user agents without support for namespace prefixed attribute selector:*/
            [xmlns],
            /*Selector for user agents with support for namespace prefixed attribute selector:*/
            [*|xmlns]
            {color:lime}
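
The matching behavior behind the selectors in this section can be modeled in a few lines of JavaScript. This is a simplified model rather than how any user agent is implemented: attributes and selectors are represented as hypothetical { namespace, localName } objects, a namespace of null stands for "no namespace", and "*" stands for the CSS any-namespace prefix.

```javascript
const XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/";
const XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace";

// Simplified model of namespace-aware attribute selector matching.
function selectorMatches(selector, attribute) {
  if (selector.localName !== attribute.localName) {
    return false;
  }
  if (selector.namespace === "*") {
    return true; // [*|name] ignores the attribute's namespace
  }
  return selector.namespace === attribute.namespace;
}

// The xmlns attribute on an html element, as seen after each kind of parsing.
const xmlnsInHtml = { namespace: null, localName: "xmlns" };
const xmlnsInXml = { namespace: XMLNS_NAMESPACE, localName: "xmlns" };

// The xml:lang attribute as seen by an XML parser, and the plain lang attribute.
const xmlLangInXml = { namespace: XML_NAMESPACE, localName: "lang" };
const langAttribute = { namespace: null, localName: "lang" };

const plainXmlns = { namespace: null, localName: "xmlns" }; // [xmlns]
const anyXmlns = { namespace: "*", localName: "xmlns" };    // [*|xmlns]
const anyLang = { namespace: "*", localName: "lang" };      // [*|lang]
```

In this model, plainXmlns matches only xmlnsInHtml, while anyXmlns matches both xmlnsInHtml and xmlnsInXml, and anyLang matches both xmlLangInXml and langAttribute.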

3.12 Templating restrictions

4. Example document

The following example code acts as polyglot markup and validates as either XHTML or as HTML. You can view the page live served as HTML, at http://dev.w3.org/html5/html-xhtml-author-guide/SamplePage.html and the same bytes served as XHTML, at http://dev.w3.org/html5/html-xhtml-author-guide/SamplePage.xhtml.

Note

The example document is served as 'text/html'. Some legacy user agents do not support SVG when it is served as 'text/html', as it is in this example. The example page could also be served as 'application/xhtml+xml' instead, with the file extension .html, maintaining adherence to polyglot markup and enabling the rendering of the SVG.

Example 21
<!DOCTYPE html>

<html id="SampleDoc" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

  <head>
    <title>A Sample Page Using Polyglot Markup</title>
    <meta charset='utf-8' />
        <!-- The HTML encoding declaration (meta element with the charset
             attribute) is used to declare the encoding for HTML parsers, in line with the section on
             Specifying a document’s character encoding -->
	<!-- The link element is self-closing as described in the section on Void Elements -->
	<!-- Style commands are included by linking to an external file rather than including them in-line,
	     as described in the section on The safe text content option for script and style elements.  -->
    <link type="text/css" rel="stylesheet" href="Sample.css"/>
  </head>

  <body>
<nav><p><strong>NB:</strong> These bytes are available served as <a href="SamplePage.xhtml">XHTML</a>
             and as <a href="SamplePage.html">HTML</a></p></nav>

    <h1>Sample Page Using Polyglot Markup</h1>
    <p>
      The source code for <a href="#SampleDoc">this document</a> uses <dfn id="sampleDef">polyglot markup</dfn>,
      a document that is a stream of bytes that parses into identical document trees
      (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.
      The source code for this document also contains additional comments about the use of
      <a href="#sampleDef">polyglot markup</a>.
    </p>

    <h2>Foreign Elements</h2>
    <p>
      The following shapes use SVG elements.
      <a href="#sampleDef">Polyglot markup</a> introduces undeclared (native) default namespaces
      for the root SVG element (<code>svg</code>) and respects the mixed-case element names and values
      when appropriate, as described in the section on Element-Level Namespaces, the section on Element Names
      and the section on Attribute Values.
    </p>

    <!-- <a href="#sampleDef">Polyglot markup</a> declares the xlink: namespace on the <svg> element to maintain XML-compatibility  -->
    <svg width="350" height="250" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
      <g>
        <title>Three SVG shapes</title>
        <desc>
          This SVG image contains an ellipse filled with a gradient that goes from white to blue as it moves outward from the center.
          A yellow rectangle with a black border overlaps the ellipse in the upper-left quadrant,
          and a red spiral on a white background overlaps the ellipse in the bottom-right quadrant.
          The red spiral is also a link to the example code for that SVG shape.
        </desc>
        <defs>
          <!-- Note that "radialGradient" and "myGradient" respect mixed-case values. -->
          <radialGradient id="myGradient" cx="50%" cy="50%" r="50%" fx="50%" fy="50%">
            <stop offset="0%" style="stop-color:rgb(200,200,200); stop-opacity:0"/>
            <stop offset="100%" style="stop-color:rgb(0,0,255); stop-opacity:1"/>
          </radialGradient>
        </defs>
      <ellipse cx="50%" cy="50%" rx="50%" ry="42%" style="fill:url(#myGradient)"/>
      <rect x="0" y="0" width="100" height="100" style="fill: yellow; stroke: black;"/>
      <a xlink:href="http://www.example.org/foo">
        <!--
          Note that the following attribute contains newlines which will produce a different DOM,
          but will not affect the way in which SVG functions in the least.
        -->
        <path transform="translate(60, -175)"
                 d="M153 334 C153 334 151 334 151 334 C151 339 153 344 156 344 C164 344 171 339 171 334
                    C171 322 164 314 156 314 C142 314 131 322 131 334 C131 350 142 364 156 364
                    C175 364 191 350 191 334 C191 311 175 294 156 294 C131 294 111 311 111 334
                    C111 361 131 384 156 384 C186 384 211 361 211 334 C211 300 186 274 156 274"
                 style="fill:white;stroke:red;stroke-width:2"/>
        </a>
      </g>
    </svg>
    <h2>Void Elements</h2>
    <!-- Given an empty instance of an element whose content model is not EMPTY (in this case, an empty paragraph)
    <a href="#sampleDef">polyglot markup</a> does not use the minimized form, as described in Section 6.4 Void Elements -->
    <p></p>
    <p>
      There is an empty <code>p</code> element before this paragraph.
      <a href="#sampleDef">Polyglot markup</a> uses <code>&lt;p>&lt;/p></code> and not <code>&lt;p/></code>.
    </p>
    <p>
      <a href="#sampleDef">Polyglot markup</a> treats certain elements as self-closing,
      void elements, such as the following <code>img</code> element.
    </p>
    <img height="48" width="72" alt="W3C" src="http://www.w3.org/Icons/w3c_home"/>
    <p>
      For more information, see the Void Elements section.
    </p>


    <h2>Required Elements</h2>
    <p>
      The following table uses the required <code>tbody</code> element, as described in the
      Required elements and tags section.
    </p>
    <table>
      <tbody>
        <tr>
          <th>Column One</th>
          <th>Column Two</th>
        </tr>
        <tr>
          <td>Row 1, Column 1</td>
          <td>Row 1, Column 2</td>
        </tr>
        <tr>
          <td>Row 2, Column 1</td>
          <td>Row 2, Column 2</td>
        </tr>
        <tr>
          <td>Row 3, Column 1</td>
          <td>Row 3, Column 2</td>
        </tr>
      </tbody>
    </table>

    <p>
      The following table makes use of the col element and therefore uses the then required <code>colgroup</code> element as a wrapper for the col elements, as described in the Required elements and tags section.
    </p>
    <table>
      <colgroup>
        <col style="background-color:silver"/>
        <col style="background-color:gray"/>
        <col style="background-color:yellow"/>
      </colgroup>
      <tbody>
        <tr>
          <th>ISBN</th>
          <th>Title</th>
          <th>Price</th>
        </tr>
        <tr>
          <td>3476896</td>
          <td>My first HTML</td>
          <td>$53</td>
        </tr>
        <tr>
          <td>1234567</td>
          <td>Intermediate Polyglot</td>
          <td>$49</td>
        </tr>
      </tbody>
    </table>

    <h2>Named Entity References</h2>
    <p>
      The paragraph you now read, uses the string <code>&amp;amp;</code> for ampersands (“&amp;”) and uses,
      as described in the section on Named entity references, the string <code>&amp;#xA0;</code>
      for a non-breaking space between the following two words: <i>“<a href="#sampleDef">polyglot&#xA0;markup</a>”</i>.
    </p>
  </body>
</html>

A. Acknowledgements

Many thanks to Robin Berjon, David Carlisle, Daniel Glazman, Richard Ishida, Tony Ross, Sam Ruby, Jonas Sicking, Leif Halvard Silli, Henri Sivonen, Manu Sporny, and Philip Taylor. Special thanks to the W3C TAG and the W3C Internationalization (i18n) Core Working Group.

B. References

B.1 Normative references

[CHARMOD]
Martin Dürst; François Yergeau; Richard Ishida; Misha Wolf; Tex Texin et al. Character Model for the World Wide Web 1.0: Fundamentals. 15 February 2005. W3C Recommendation. URL: http://www.w3.org/TR/charmod/
[CSS3NAMESPACE]
Elika Etemad; Anne van Kesteren. CSS Namespaces Module. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-namespace/
[HTML5]
Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Edward O'Connor; Silvia Pfeiffer. HTML5. 6 August 2013. W3C Candidate Recommendation. URL: http://www.w3.org/TR/html5/
[HTTP11]
R. Fielding et al. Hypertext Transfer Protocol - HTTP/1.1. June 1999. RFC. URL: http://www.ietf.org/rfc/rfc2616.txt
[RFC2854]
D. Connolly; L. Masinter. The 'text/html' Media Type (RFC 2854). June 2000. RFC. URL: http://www.rfc-editor.org/rfc/rfc2854.txt
[SELECT]
Tantek Çelik; Elika Etemad; Daniel Glazman; Ian Hickson; Peter Linss; John Williams et al. Selectors Level 3. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-selectors/
[WCAG20]
Ben Caldwell; Michael Cooper; Loretta Guarino Reid; Gregg Vanderheiden et al. Web Content Accessibility Guidelines (WCAG) 2.0. 11 December 2008. W3C Recommendation. URL: http://www.w3.org/TR/WCAG20/
[XML-MT]
M. Murata; S. St.Laurent; D. Kohn. XML Media Types (RFC 3023). January 2001. RFC. URL: http://www.ietf.org/rfc/rfc3023.txt
[XML10]
Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml