HTML Purifier
  by Edward Z. Yang

There are a number of ad hoc HTML filtering solutions out there on the web
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser.  None of them, however,
demonstrates a thorough knowledge of either the DTD that defines HTML or the
caveats of HTML that cannot be expressed by a DTD.  Configurable filters
(such as kses or PHP's built-in strip_tags() function) have trouble
validating the contents of attributes and can be subject to security attacks
due to poor configuration.  Other filters take the naive approach of
blacklisting known threats and tags, failing to account for the introduction
of new technologies, new tags, new attributes or quirky browser behavior.
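
To make the blacklist problem concrete, here is a minimal Python sketch.  The
function naive_blacklist_filter is hypothetical and is not taken from any of
the filters named above; it only assumes a filter that strips <script>
elements with a regex and does nothing else.

```python
import re


def naive_blacklist_filter(html: str) -> str:
    """Hypothetical blacklist filter: remove <script> elements, nothing else."""
    return re.sub(
        r"<script\b[^>]*>.*?</script\s*>",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )


payloads = [
    # The one case the blacklist anticipates.
    "<script>alert(1)</script>",
    # An event-handler attribute is not on the blacklist at all.
    '<img src="x" onerror="alert(1)">',
    # Removing the inner element reassembles a new <script> tag.
    "<scr<script></script>ipt>alert(1)</script>",
]

for payload in payloads:
    print(repr(payload), "->", repr(naive_blacklist_filter(payload)))
```

Only the first payload is neutralized.  The second passes through unchanged,
and the third turns into a working <script> element once the inner match is
removed.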

HTML Purifier takes a different approach, one that doesn't use
specification-ignorant regexes or narrow blacklists.  HTML Purifier will
decompose the whole document into tokens, and rigorously process the tokens by:
removing non-whitelisted elements, transforming bad-practice tags like <font>
into <span>, properly checking the nesting of tags and their children and
validating all attributes according to their RFCs.
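
The token-based idea can be sketched in a few lines of Python.  This is an
illustration only, not HTML Purifier's code: the ALLOWED and TRANSFORMS tables
and the function names are assumptions made up for the example, and
attributes are simply dropped rather than validated.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelist; not HTML Purifier's real configuration.
ALLOWED = {"p", "b", "i", "em", "strong", "span", "a"}
# Illustrative transform table: discouraged tag -> replacement.
TRANSFORMS = {"font": "span"}


class Tokenizer(HTMLParser):
    """Collect ('start', name), ('end', name) and ('text', data) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded in this sketch.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def filter_tokens(tokens):
    """Rename transformable tags, then drop tags that are not whitelisted."""
    kept = []
    for kind, value in tokens:
        if kind == "text":
            kept.append((kind, value))
            continue
        name = TRANSFORMS.get(value, value)
        if name in ALLOWED:
            kept.append((kind, name))
    return kept


def generate(tokens):
    """Serialize tokens back into a string, escaping text content."""
    parts = []
    for kind, value in tokens:
        if kind == "start":
            parts.append(f"<{value}>")
        elif kind == "end":
            parts.append(f"</{value}>")
        else:
            parts.append(escape(value, quote=False))
    return "".join(parts)


dirty = '<font color="red">Hi</font><blink>!</blink><b>bold</b>'
print(generate(filter_tokens(tokenize(dirty))))
```

Because the decision is "keep only what is known to be safe", an unknown tag
such as <blink> disappears without anyone having to anticipate it, while its
text content survives.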

To my knowledge, there is nothing like this on the web yet.  Not even MediaWiki,
which allows an amazingly diverse mix of HTML and wikitext in its documents,
gets all the nesting quirks right.  Existing solutions hope that no JavaScript
will slip through, but either do not attempt to ensure that the resulting
output is valid XHTML or send the HTML through a draconian XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).
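
The nested <a> case shows why well-formed XML is not the same as valid XHTML:
the markup parses, yet the DTD forbids it.  Below is a small Python sketch of
one possible repair, closing the outer link before the inner one opens.  The
token format and the function name are assumptions for illustration, and HTML
Purifier itself may resolve this case differently.

```python
def close_nested_anchors(tokens):
    """Return a copy of tokens in which no <a> is opened inside another <a>.

    tokens is a list of (kind, value) pairs, kind being 'start', 'end'
    or 'text'.
    """
    out = []
    anchor_open = False
    for kind, value in tokens:
        if kind == "start" and value == "a":
            if anchor_open:
                out.append(("end", "a"))  # implicitly close the outer link
            anchor_open = True
        elif kind == "end" and value == "a":
            if not anchor_open:
                continue  # stray end tag left over from the nesting
            anchor_open = False
        out.append((kind, value))
    return out


nested = [
    ("start", "a"), ("text", "outer "),
    ("start", "a"), ("text", "inner"), ("end", "a"),
    ("text", " tail"), ("end", "a"),
]
for token in close_nested_anchors(nested):
    print(token)
```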

This document is no longer a detailed description of how HTML Purifier works,
as those descriptions have been moved to the appropriate code.  The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer.  You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals
with malformed input.

In summary (see corresponding classes for more details):

1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (e.g. <font>)
3. Make document well-formed while helpfully taking into account certain quirks,
   such as the fact that <p> tags traditionally are closed by other block-level
   elements.
4. Run through all nodes and check children for proper order (especially
   important for tables).
5. Validate attributes according to more restrictive definitions based on the
   RFCs.
6. Translate back into a string. (Generator)
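
Step 3 is the least obvious of these, so here is a rough Python sketch of it.
It is a simplification under stated assumptions, not the library's algorithm:
the BLOCK set and the (kind, value) token format are invented for the
example, and only a <p> that is the innermost open element gets closed
implicitly.

```python
# Illustrative subset of block-level elements.
BLOCK = {"p", "div", "ul", "ol", "table", "blockquote", "h1", "h2", "h3"}


def make_well_formed(tokens):
    """Balance start and end tags in a list of (kind, value) tokens."""
    out = []
    stack = []  # names of currently open elements, innermost last
    for kind, value in tokens:
        if kind == "text":
            out.append((kind, value))
        elif kind == "start":
            # A block-level start tag implicitly closes an open <p>.
            if value in BLOCK and stack and stack[-1] == "p":
                out.append(("end", stack.pop()))
            stack.append(value)
            out.append((kind, value))
        else:  # end tag
            if value not in stack:
                continue  # stray end tag: drop it
            # Close everything opened after the matching start tag.
            while stack:
                name = stack.pop()
                out.append(("end", name))
                if name == value:
                    break
    # Close whatever is still open at the end of the input.
    while stack:
        out.append(("end", stack.pop()))
    return out


# Tokens for the malformed input: <p>one<p>two<b>bold</p>
messy = [
    ("start", "p"), ("text", "one"),
    ("start", "p"), ("text", "two"),
    ("start", "b"), ("text", "bold"),
    ("end", "p"),
]
for token in make_well_formed(messy):
    print(token)
```

The second <p> closes the first one, and the </p> at the end closes the
still-open <b> before closing the paragraph, so every start tag ends up with
a matching end tag in the right order.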

HTML Purifier is best suited for documents that require a rich array of
HTML tags.  Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all), although this may
be changing in the future with the addition of levels of filtering.

    vim: et sw=4 sts=4