File: htmlpurifier-4.3.0/docs/ref-html-modularization.txt
Lines: 357

The Modularization of HTMLDefinition in HTML Purifier

WARNING: This document was drafted before the implementation of this
system, and some implementation details may have evolved over time.

HTML Purifier uses the modularization of XHTML
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
of HTMLDefinition in a more manageable and extensible fashion. Rather
than have one super-object, HTMLDefinition is split into HTMLModules,
each of which is responsible for defining elements, their attributes,
and other properties (for more in-depth coverage, see
/library/HTMLPurifier/HTMLModule.php's docblock comments). These
modules are managed by HTMLModuleManager.

Modules that we don't support but could support are:

    * 5.6. Table Modules
          o 5.6.1. Basic Tables Module [?]
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.21. Name Identification Module [deprecated]

These modules would be implemented as "unsafe":

    * 5.2. Core Modules
          o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
          o 5.5.1. Basic Forms Module
          o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module

We will not be using W3C's XML Schemas or DTDs directly due to the lack
of robust tools for handling them (the main problem is that the current
parsers are usually PHP 5 only and solely validating, not correcting).

This system may be generalized and ported over for CSS.

== General Use-Case ==

The outward API of HTMLDefinition has been largely preserved, not
only for backwards compatibility but also by design. Instead of changing
that API, HTMLDefinition can be retrieved "raw", in which case it loads
a structure that closely resembles the modules of XHTML 1.1. This
structure is very dynamic, making it easy to make cascading changes to
global content sets or remove elements in bulk.

However, once HTML Purifier needs the actual definition, it retrieves
a finalized version of HTMLDefinition. Finalizing the definition
involves processing the modules into a form that is optimized for
multiple calls. This final version is immutable and, even if it were
editable, would be extremely hard to change.

So, some code taking advantage of the XHTML modularization may look
like this:

<?php
$config = HTMLPurifier_Config::createDefault();
$def = $config->getHTMLDefinition(true); // raw, still-editable definition
$def->addElement('marquee', 'Block', 'Flow', 'Common');
$purifier = new HTMLPurifier($config);
// $html is assumed to hold the untrusted input markup
$purifier->purify($html); // now the definition is finalized
?>

== Inclusions ==

One of the nice features of HTMLDefinition is that piggy-backing off
of global attribute and content sets is extremely easy to do.

=== Attributes ===

HTMLModule->elements[$element]->attr stores attribute information for
the specific attributes of $element. This is quite close to the final
API that HTML Purifier interfaces with, but there's an important extra
feature: attr may also contain an array at index zero.

<?php
HTMLModule->elements[$element]->attr[0] = array('AttrSet');
?>

Rather than mapping the attribute key 0 to an AttrDef (as every other
key does), this array names the attribute collections that should be
merged into this element's attribute array.

Furthermore, the value in an attribute key, attribute value pair need
not be a fully fledged AttrDef object. It can also be a string, which
signifies an AttrDef that is looked up from a centralized registry,
AttrTypes. This allows more concise attribute definitions that look
more like W3C's declarations, as well as offering a centralized point
for modifying the behavior of one attribute type. And, of course, the
old method of manually instantiating an AttrDef still works.

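The string lookup described above can be sketched in a few lines. This
is a standalone illustration, not HTML Purifier's code: the $attrTypes
array stands in for the AttrTypes registry, closures stand in for
AttrDef objects, and resolveAttrTypes() is a hypothetical helper name.

```php
<?php
// Hypothetical stand-in for the AttrTypes registry: maps a type name to
// a validator. Plain closures are used in place of real AttrDef objects.
$attrTypes = array(
    'Text'   => function ($value) { return is_string($value); },
    'Number' => function ($value) {
        return preg_match('/^[0-9]+$/', (string) $value) === 1;
    },
);

// Resolve one element's attribute definitions: string values are looked
// up in the registry, anything else is assumed to be an already-built
// definition and is passed through untouched.
function resolveAttrTypes(array $attr, array $registry)
{
    $resolved = array();
    foreach ($attr as $name => $def) {
        if ($name === 0) {
            // Index 0 lists inclusions, not an attribute; keep it as-is.
            $resolved[0] = $def;
            continue;
        }
        if (is_string($def)) {
            if (!isset($registry[$def])) {
                throw new InvalidArgumentException("Unknown attribute type: $def");
            }
            $def = $registry[$def];
        }
        $resolved[$name] = $def;
    }
    return $resolved;
}

// A manually built definition still works alongside the string shorthand.
$custom = function ($value) { return $value === 'ltr' || $value === 'rtl'; };
$attr = resolveAttrTypes(
    array('title' => 'Text', 'width' => 'Number', 'dir' => $custom),
    $attrTypes
);
```

Because every 'Number' attribute shares one registry entry, replacing
that entry changes the behavior of all of them at once, which is the
"centralized point" the text refers to.
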
=== Attribute Collections ===

Attribute collections are stored and processed in the AttrCollections
object, which is responsible for performing the inclusions signified
by the 0 index. These attribute collections are mutable too, through
HTMLModule->attr_collections. You may add new attributes to a
collection or define an entirely new collection for your module's use.
Inclusions can also be cumulative.

Attribute collections allow us to get rid of so-called "global
attributes" (which actually aren't so global).

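As a rough illustration of cumulative inclusions, the sketch below
flattens the index-0 entries recursively. It is not the AttrCollections
implementation: the collection names are loosely modelled on XHTML's
Core, I18N and Common sets, the type names are placeholders, and
expandInclusions() is a hypothetical helper.

```php
<?php
// Hypothetical collections. Index 0 lists other collections to include.
$collections = array(
    'Core'   => array('class' => 'Class', 'title' => 'Text'),
    'I18N'   => array('lang' => 'LanguageCode', 'dir' => 'Direction'),
    'Common' => array(0 => array('Core', 'I18N')),
);

// Expand the inclusions at index 0 of $attr, recursing so that a
// collection which itself includes other collections is flattened too.
function expandInclusions(array $attr, array $collections, array $seen = array())
{
    $includes = isset($attr[0]) ? $attr[0] : array();
    unset($attr[0]);
    foreach ($includes as $name) {
        if (isset($seen[$name]) || !isset($collections[$name])) {
            continue; // skip cycles and unknown collections
        }
        $seen[$name] = true;
        $included = expandInclusions($collections[$name], $collections, $seen);
        // Array union: attributes already present win over included ones.
        $attr = $attr + $included;
    }
    return $attr;
}

// An element that declares one attribute of its own and includes Common.
$marquee = expandInclusions(
    array(0 => array('Common'), 'behavior' => 'Text'),
    $collections
);
```

Here the element ends up with its own behavior attribute plus class,
title, lang and dir, even though it only named the Common collection.
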
=== Content Models and ChildDef ===

An approach similar to the above-mentioned attributes and attribute
collections was applied to the ChildDef system. HTML Purifier uses
a proprietary system called ChildDef for performance and flexibility
reasons, but this does not line up very well with W3C's notion of
regexps for defining the allowed children of an element.

HTMLPurifier->elements[$element]->content_model and
HTMLPurifier->elements[$element]->content_model_type store information
about the final ChildDef that will be stored in
HTMLPurifier->elements[$element]->child (we use a different variable
because the two forms are sufficiently different).

$content_model is an abstract, string representation of the internal
state of ChildDef, while $content_model_type is a string identifier
of which ChildDef subclass to instantiate. $content_model is processed
by substituting all content set identifiers (capitalized element names)
with their contents. It is then parsed and passed into the appropriate
ChildDef class, as determined by ContentSets->getChildDef() or, for
custom child definitions not in the core, by the custom fallback
HTMLModule->getChildDef().

You'll need to use these facilities if you plan on referencing a
content set like "Inline" or "Block", and using them is recommended
even if you're not, due to their conciseness.

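The "core first, module fallback second" lookup described above could
be sketched as follows. Everything here is an assumption made for
illustration: the type names 'required', 'optional' and 'empty', the
helper names coreChildDef() and buildChildDef(), and the plain arrays
returned in place of real ChildDef objects.

```php
<?php
// Hypothetical core handler keyed by content_model_type. Returns a plain
// array describing the child definition, or null for a non-core type.
function coreChildDef($type, $contentModel)
{
    switch ($type) {
        case 'required':
        case 'optional':
            // Split the expanded model on the reserved pipe symbol.
            $elements = array_map('trim', explode('|', $contentModel));
            return array('type' => $type, 'elements' => $elements);
        case 'empty':
            return array('type' => 'empty', 'elements' => array());
    }
    return null;
}

// Try the core types first, then the module-supplied fallback callback.
function buildChildDef($type, $contentModel, $moduleFallback = null)
{
    $def = coreChildDef($type, $contentModel);
    if ($def === null && $moduleFallback !== null) {
        $def = call_user_func($moduleFallback, $type, $contentModel);
    }
    if ($def === null) {
        throw new InvalidArgumentException("Unknown content model type: $type");
    }
    return $def;
}

$child = buildChildDef('optional', 'span | b | #PCDATA');
```

A module with its own content model type would pass a callback as the
third argument; the core handler is never asked to understand it.
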
A few notes on $content_model: its structure can be as complicated
as you want, but the pipe symbol (|) is reserved for defining possible
choices, due to the content sets implementation. For example, a content
model that looks like:

    "Inline -> Block -> a"

...when the Inline content set is defined as "span | b" and the Block
content set is defined as "div | blockquote", will expand into:

    "span | b -> div | blockquote -> a"

The custom HTMLModule->getChildDef() function will then need to be able
to feed this information to ChildDef in a usable manner.

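The expansion in this example can be reproduced with a single regular
expression pass. This is a minimal sketch under the assumption that
content set identifiers are matched as whole, case-sensitive words;
expandContentModel() is a hypothetical helper, not a library function.

```php
<?php
// Content sets from the example above.
$contentSets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);

// Replace every content set identifier in $contentModel with its
// contents. Matching is case-sensitive and on word boundaries, so
// 'Block' does not match inside 'blockquote'. Substitution happens in
// one pass, so replaced text is never scanned again.
function expandContentModel($contentModel, array $contentSets)
{
    if (!$contentSets) {
        return $contentModel;
    }
    $names = array_map(
        function ($name) { return preg_quote($name, '/'); },
        array_keys($contentSets)
    );
    $pattern = '/\b(' . implode('|', $names) . ')\b/';
    return preg_replace_callback(
        $pattern,
        function ($m) use ($contentSets) { return $contentSets[$m[1]]; },
        $contentModel
    );
}

$expanded = expandContentModel('Inline -> Block -> a', $contentSets);
// $expanded is 'span | b -> div | blockquote -> a'
```

The "->" separator is left untouched by the substitution; making sense
of it is the job of whatever custom getChildDef() receives the string.
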
=== Content Sets ===

Content sets can be altered using HTMLModule->content_sets, an
associative array of content set names to content set contents. If the
content set already exists, your values are appended to it (great for,
say, registering the font tag as an inline element); otherwise, it is
created. Content sets are substituted into content_model.

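The append-or-create rule can be sketched as a merge over the modules in
load order. The module names and their contents below are made up for
illustration, and mergeContentSets() is a hypothetical helper rather
than part of HTML Purifier.

```php
<?php
// Hypothetical per-module content_sets declarations, in load order.
$modules = array(
    'Text'   => array('Inline' => 'span | b', 'Block' => 'div | blockquote'),
    'Legacy' => array('Inline' => 'font', 'Heading' => 'h1 | h2'),
);

// Merge content sets across modules: an existing set gets the new
// values appended as further choices, a new name creates a new set.
function mergeContentSets(array $modules)
{
    $merged = array();
    foreach ($modules as $sets) {
        foreach ($sets as $name => $contents) {
            if (isset($merged[$name])) {
                $merged[$name] .= ' | ' . $contents;
            } else {
                $merged[$name] = $contents;
            }
        }
    }
    return $merged;
}

$sets = mergeContentSets($modules);
// $sets['Inline'] is 'span | b | font'
```

Joining with the pipe symbol keeps the merged set usable as a list of
choices once it is substituted into a content model.
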

    vim: et sw=4 sts=4