Handling Content Model Changes


1. Context

The distinction between Transitional and Strict document types is
something of an anomaly in the lineage of XHTML document types (after
1.0, doctypes no longer come in flavors; instead, modularization is
used to let document authors vary their elements).  The transition from
Transitional to Strict is usually quite straightforward, as W3C usually
deprecates attributes or elements, which are easily handled using tag
and attribute transforms.
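
As a rough illustration of what tag and attribute transforms do, here
is a minimal sketch in Python.  It is not HTML Purifier's PHP
implementation: the tables TAG_TRANSFORMS and ATTR_TRANSFORMS and the
functions prepend_css and transform are hypothetical names chosen for
this example, and the two mappings shown are only a small sample.

```python
# Hypothetical sketch, not HTML Purifier's API.
# A deprecated tag maps to (replacement tag, CSS to add);
# a deprecated attribute maps to the CSS property that replaces it.
TAG_TRANSFORMS = {
    "center": ("div", "text-align:center;"),
    "u": ("span", "text-decoration:underline;"),
}
ATTR_TRANSFORMS = {
    "bgcolor": "background-color",
}


def prepend_css(attrs, css):
    """Put new CSS in front of any existing inline style."""
    attrs["style"] = css + attrs.get("style", "")


def transform(tag, attrs):
    """Return a (tag, attrs) pair with deprecated markup rewritten."""
    attrs = dict(attrs)  # work on a copy, leave the caller's dict alone
    if tag in TAG_TRANSFORMS:
        tag, css = TAG_TRANSFORMS[tag]
        prepend_css(attrs, css)
    for old_name, css_property in ATTR_TRANSFORMS.items():
        if old_name in attrs:
            value = attrs.pop(old_name)
            prepend_css(attrs, f"{css_property}:{value};")
    return tag, attrs
```

Each transform is a local rewrite of one element or one attribute, which
is why deprecations of this kind are easy to handle.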

However, for three elements, <blockquote>, <body> and <address>, W3C
elected to also change the content model.  <blockquote> and <body>
originally accepted both inline and block elements, but in the strict
doctype they only allow block elements.  With <address>, the situation
is inverted: <p> tags are now forbidden from appearing within this tag.


2. Current situation

Currently, HTML Purifier treats <blockquote> specially during Tidy mode
using a custom ChildDef class, StrictBlockquote.  StrictBlockquote
operates similarly to Required, except that when it encounters an
inline element, it will wrap it in a block tag (as specified by
%HTML.BlockWrapper; the default is <p>).  The naming suggests it can
only be used for <blockquote>s, although it may be possible to
generalize it to work on other cases of this nature (this would be of
little practical application, as no other element in XHTML 1.1 or
earlier has a block-only content model).
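
The wrapping behavior described above can be sketched as follows.  This
is an illustrative Python model, not the StrictBlockquote PHP class: a
child is either a text string or a (tag, children) tuple, BLOCK_TAGS is
a deliberately short sample list, and grouping a run of consecutive
inline nodes into a single wrapper is an assumption of this sketch.

```python
# Hypothetical model: a child is a text string or a (tag, children) tuple.
BLOCK_TAGS = {"p", "div", "blockquote", "ul", "ol", "pre", "table"}


def is_block(node):
    """True for element nodes whose tag is in the sample block list."""
    return isinstance(node, tuple) and node[0] in BLOCK_TAGS


def wrap_inline_children(children, wrapper="p"):
    """Wrap each run of consecutive inline nodes in one wrapper element.

    `wrapper` plays the role of %HTML.BlockWrapper (default <p>).
    """
    result = []
    run = []  # inline nodes collected since the last block element
    for node in children:
        if is_block(node):
            if run:
                result.append((wrapper, run))
                run = []
            result.append(node)
        else:
            run.append(node)
    if run:
        result.append((wrapper, run))
    return result
```

For example, the children `["Hello ", ("b", ["world"]), ("p", ["para"])]`
become one wrapper <p> holding the text and the <b>, followed by the
original <p>.  Whitespace-only text is not treated specially here.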

Tidy currently contains no custom, lenient implementation for
<address>.  If one were to be written, it would likely operate on the
principle that, when a <p> tag is encountered, its start and end tags
are each replaced with a <br /> tag (the contents of <p>, being inline,
are not an issue).  There is no prior work with this sort of operation.
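
Since no such implementation exists, the following is only a sketch of
the principle, written in Python rather than as a ChildDef.  The node
model (text strings and (tag, children) tuples) and the function name
unwrap_paragraphs are assumptions made for this example.

```python
# Hypothetical model: a child is a text string or a (tag, children) tuple.
BR = ("br", [])


def unwrap_paragraphs(children):
    """Replace each <p> child with <br />, its contents, then <br />."""
    result = []
    for node in children:
        if isinstance(node, tuple) and node[0] == "p":
            result.append(BR)       # stands in for the <p> start tag
            result.extend(node[1])  # inline contents are kept unchanged
            result.append(BR)       # stands in for the </p> end tag
        else:
            result.append(node)
    return result
```

A real implementation would probably also need to decide what to do
with adjacent <br /> tags produced by consecutive paragraphs; this
sketch leaves them in place.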


3. Outside applicability

There are a number of other elements that have restrictive content
models, such as <ul> or <span> (the latter is restrictive in that it
does not allow block elements).  In the former case, an errant node is
eliminated completely; in the latter case, the text of the node is
preserved (as the parent node does allow PCDATA).  Custom content model
implementations are probably not the best way of handling these cases;
node bubbling should be implemented instead.
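
The two existing behaviors contrasted above (not node bubbling, which
this document does not specify) can be modelled roughly as below.  This
is a hypothetical Python sketch: the node model, the sample BLOCK_TAGS
list and the function names are all assumptions, and reducing an errant
block to its flattened text is a simplification.

```python
# Hypothetical model: a child is a text string or a (tag, children) tuple.
BLOCK_TAGS = {"p", "div", "blockquote", "ul", "ol", "pre", "table"}


def text_of(node):
    """Concatenate all text found inside a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])


def fix_ul_children(children):
    """<ul> allows only <li> and no PCDATA: errant nodes are dropped."""
    return [n for n in children if isinstance(n, tuple) and n[0] == "li"]


def fix_span_children(children):
    """<span> allows PCDATA: an errant block is reduced to its text."""
    result = []
    for node in children:
        if isinstance(node, tuple) and node[0] in BLOCK_TAGS:
            result.append(text_of(node))
        else:
            result.append(node)
    return result
```

In this model a stray <p> inside <ul> disappears along with its text,
while a stray <div> inside <span> loses its tags but keeps its text.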

    vim: et sw=4 sts=4