File: htmlpurifier-4.3.0/docs/dev-includes.txt
Lines: 606
INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION

The Problem
-----------
HTML Purifier contains a number of extra components that are not used
all of the time, only if the user explicitly specifies that we should
use them.

Some of these optional components are optionally included (Filter,
Language, Lexer, Printer), while others are included all the time
(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that
these are all developer specified: it is conceivable that certain
Tokens are not used, but this is user-dependent and should not be
trusted.

We should come up with a consistent way to handle these things and
ensure that we get the maximum performance both when there are bytecode
caches and when there are not. Unfortunately, these two goals seem
contrary to each other.

A peripheral issue is the performance of ConfigSchema, which has been
shown to take a large, constant amount of initialization time, and is
intricately linked to the issue of includes due to its pervasive use
in our plugin architecture.
Pros and Cons
-------------
We will assume that user-based extensions will be included by them.

Conditional includes:
  Pros:
    - User management is simplified; only a single directive needs to be set
    - Only necessary code is included
  Cons:
    - Doesn't play nicely with opcode caches
    - Adds complexity to standalone version
    - Optional configuration directives are not exposed without a little
      extra coaxing (not implemented yet)

Include it all:
  Pros:
    - User management is still simple
    - Plays nicely with opcode caches and standalone version
    - All configuration directives are present
  Cons:
    - Lots of (how much?) extra code is included
    - Classes that inherit from external libraries will cause compile
      errors

Build an include stub (Let's do this!):
  Pros:
    - Only necessary code is included
    - Plays nicely with opcode caches and standalone version
    - require (without once) can be used, see above
    - Could further extend as a compilation to one file
  Cons:
    - Not implemented yet
    - Requires user intervention and use of a command line script
    - Standalone script must be chained to this
    - More complex and compiled-language-like
    - Requires a whole new class of system-wide configuration directives,
      as configuration objects can be reused
    - Determining what needs to be included can be complex (see above)
    - No way of autodetecting dynamically instantiated classes
    - Might be slow
Include stubs
-------------
This solution may be "just right" for users who are heavily oriented
towards performance. However, there are a number of picky
implementation details to work out beforehand.

The number one concern is how to make the HTML Purifier files "work
out of the box", while still being able to easily get them into a form
that works with this setup. As the codebase stands right now, it would
be necessary to strip out all of the require_once calls. The only way
we could get rid of the require_once calls is to use __autoload or
use the stub for all cases (which might not be a bad idea).
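
A minimal sketch of how such a stub might be generated. The helper
build_include_stub() is hypothetical and not part of HTML Purifier, and
it assumes the caller passes files already ordered so that parent
classes come before their children:

```php
<?php
/**
 * Hypothetical helper: returns the source text of an include stub.
 * $files holds paths relative to the include path, in load order.
 */
function build_include_stub(array $files)
{
    $lines = array('<?php', '// Generated file: do not edit by hand.');
    foreach ($files as $file) {
        // Plain require (no _once) is enough, because the stub itself
        // is only meant to be loaded a single time.
        $lines[] = 'require ' . var_export($file, true) . ';';
    }
    return implode("\n", $lines) . "\n";
}

$stub = build_include_stub(array(
    'HTMLPurifier.php',
    'HTMLPurifier/Config.php',
));
echo $stub;
```

The generated text would then be written to a file by a command line
script; where that file lives is left open here.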
Aside
-----
An important thing to remember, however, is that these require_once's
are valuable data about what classes a file needs. Unfortunately,
there's no distinction between whether the file is needed all the
time, or whether it is one of our "optional" files. Thus, this data is
effectively useless.
Deprecated
----------
One of the things I'd like to do is have the code search for any classes
that are explicitly mentioned in the code. If a class isn't mentioned, I
get to assume that it is "optional," i.e. included via introspection.
The choice is either to use PHP's tokenizer or use regexps; regexps would
be faster but a tokenizer would be more correct. If this ends up being
unfeasible, adding dependency comments isn't a bad idea. (This could
even be done automatically by search/replacing require_once, although
we'd have to manually inspect the results for the optional requires.)
NOTE: This ends up not being necessary, as we're going to make the user
figure out all the extra classes they need, and only include the core
which is predetermined.
Using the autoload framework with include stubs works nicely with
introspective classes: instead of having to have require_once inside
the function, we can let autoload do the work; we simply need to
call new $class or accept the object straight from the caller. Handling
filters becomes a simple matter of ticking off configuration directives,
and if ConfigSchema spits out errors, adding the necessary includes. We
could also use the autoload framework as a fallback, in case the user
forgets to make the include, but doesn't really care about performance.
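
A rough sketch of such a fallback handler, using spl_autoload_register().
The function names and the library/ directory are assumptions made for
illustration; only the underscore-to-directory naming convention is
taken from the existing class layout:

```php
<?php
/**
 * Hypothetical mapping from a PEAR-style class name to a relative path,
 * e.g. HTMLPurifier_Filter_ExtractStyleBlocks
 *   -> HTMLPurifier/Filter/ExtractStyleBlocks.php
 */
function example_class_to_path($class)
{
    if (strncmp($class, 'HTMLPurifier', 12) !== 0) {
        return false; // not ours; leave it to other autoloaders
    }
    return str_replace('_', '/', $class) . '.php';
}

/** Hypothetical fallback autoloader built on the mapping above. */
function example_autoload($class)
{
    $path = example_class_to_path($class);
    if ($path === false) {
        return;
    }
    $full = __DIR__ . '/library/' . $path; // assumed directory layout
    if (is_file($full)) {
        require $full;
    }
}

spl_autoload_register('example_autoload');
```

With this registered, new $class only touches the filesystem for classes
that the stub did not already include.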
Insight
-------
All of this talk is merely a natural extension of what our current
standalone functionality does. However, instead of having our code
perform the includes, or attempting to inline everything that possibly
could be used, we boot the issue to the user, making them include
everything or set up the fallback autoload handler.
Configuration Schema
--------------------
A common deficiency for all of the conditional include setups (including
the dynamically built include PHP stub) is that if one of these
conditionally included files includes a configuration directive, it
is not accessible to configdoc. A stopgap solution for this problem is
to have it piggy-back off of the data in the merge-library.php script
to figure out what extra files it needs to include, but if the file also
inherits classes that don't exist, we're in big trouble.

I think it's high time we centralized the configuration documentation.
However, the type checking has been a great boon for the library, and
I'd like to keep that. The compromise is to use some other source, and
then parse it into the ConfigSchema internal format (sans all of those
nasty documentation strings which we really don't need at runtime) and
serialize that for future use.

The next question is that of format. XML is very verbose, and the
prospect of setting defaults in it gives me the willies. However, this
may be necessary. Splitting up the file into manageable chunks may
alleviate this trouble, and we may even want to create our own format
optimized for specifying configuration. It might look like (based off
the PHPT format, which is nicely compact yet unambiguous and
human-readable):

    Core.HiddenElements
    TYPE: lookup
    DEFAULT: array('script', 'style') // auto-converted during processing
    --ALIASES--
    Core.InvisibleElements, Core.StupidElements
    --DESCRIPTION--
    <p>
        Blah blah
    </p>

The first line is the directive name, the lines after that prior to the
first --HEADER-- block are single-line values, and then after that
the multiline values are there. No value is restricted to a particular
format: DEFAULT could very well be multiline if that would be easier.
This would make it insanely easy, also, to add arbitrary extra
parameters, like:

    VERSION: 3.0.0
    ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array()
    EXTERNAL: CSSTidy // this would be documented somewhere else with a URL

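
To make the format concrete, here is a minimal parser sketch. The
function name example_parse_directive() and the 'ID' key are
assumptions, and the sketch deliberately does not strip trailing
// comments or convert DEFAULT into a PHP value:

```php
<?php
/**
 * Hypothetical parser for one directive file in the proposed format.
 * Returns an array keyed by field name; the directive name is stored
 * under the (assumed) key 'ID'.
 */
function example_parse_directive($text)
{
    $lines = preg_split('/\r\n|\r|\n/', rtrim($text));
    $result = array('ID' => trim(array_shift($lines)));
    $block = null; // name of the current --HEADER-- block, if any
    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z]+(?:-[A-Z]+)*)--$/', $line, $m)) {
            // Start of a multiline value
            $block = $m[1];
            $result[$block] = '';
        } elseif ($block !== null) {
            // Inside a multiline value: keep the line verbatim
            $result[$block] .= $line . "\n";
        } elseif (preg_match('/^([A-Z]+(?:-[A-Z]+)*):\s*(.*)$/', $line, $m)) {
            // Single-line KEY: value pair
            $result[$m[1]] = $m[2];
        }
    }
    return $result;
}

$sample = <<<'EOT'
Core.HiddenElements
TYPE: lookup
DEFAULT: array('script', 'style')
--ALIASES--
Core.InvisibleElements, Core.StupidElements
--DESCRIPTION--
<p>
    Blah blah
</p>
EOT;

$directive = example_parse_directive($sample);
```

Because unknown keys are simply stored, extra parameters such as VERSION
or EXTERNAL need no parser changes in this sketch.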
The final loss would be that you wouldn't know what file the directive
was used in; with some clever regexps it should be possible to
figure out where $config->get($ns, $d); occurs. Reflective calls to
the configuration object are mitigated by the fact that getBatch is
used, so we can simply talk about that in the namespace definition page.
This might be slow, but it would only happen when we are creating
the documentation for consumption, and is sugar.
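
One possible shape for such a regexp, as a sketch only. The function
name is hypothetical, and the pattern only catches calls whose two
arguments are string literals; calls that pass variables are skipped:

```php
<?php
/**
 * Hypothetical scanner: returns "Namespace.Directive" for every
 * $config->get('Namespace', 'Directive') call with literal arguments.
 */
function example_find_directive_uses($source)
{
    $pattern = '/\$config->get\(\s*([\'"])(\w+)\1\s*,\s*([\'"])(\w+)\3\s*\)/';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    $found = array();
    foreach ($matches as $m) {
        // $m[2] is the namespace, $m[4] the directive name
        $found[] = $m[2] . '.' . $m[4];
    }
    return $found;
}
```

A tokenizer-based scan would be more robust against comments and
unusual spacing, at the cost of speed.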
We can put this in a schema/ directory, outside of HTML Purifier.
The serialized data gets treated like entities.ser.

The final thing that needs to be handled is user defined configurations.
They can be added at runtime using ConfigSchema::registerDirectory()
which globs the directory and grabs all of the directives to be
incorporated in. Then, the result is saved. We may want to take
advantage of the DefinitionCache framework, although it is not altogether
certain what configuration directives would be used to generate our key
(meta-directives!).
Further thoughts
----------------
Our master configuration schema will only need to be updated once
every new version, so it's easily versionable. User specified
schema files are far more volatile, but it's far too expensive
to check the filemtimes of all the files, so a DefinitionRev style
mechanism works better. However, we can uniquely identify the
schema based on the directories they loaded, so there's no need
for a DefinitionId until we give them full programmatic control.

These variables should be directly incorporated into ConfigSchema,
and ConfigSchema should handle serialization. Some refactoring will be
necessary for the DefinitionCache classes, as they are built with
Config in mind. If the user changes something, the cache file gets
rebuilt. If the version changes, the cache file gets rebuilt. Since
our unit tests flush the caches before we start, and the operation is
pretty fast, this will not negatively impact unit testing.
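
A sketch of how a cache key could be derived from these variables. The
function name, the $revision parameter (standing in for a DefinitionRev
style counter) and the decision to ignore registration order are all
assumptions:

```php
<?php
/**
 * Hypothetical cache key: changes whenever the library version, the
 * user-bumped revision, or the set of schema directories changes.
 */
function example_schema_cache_key($version, array $directories, $revision = 1)
{
    // Assumption: the order in which directories were registered does
    // not affect the resulting schema, so sort for a stable key.
    sort($directories);
    return md5($version . "\n" . $revision . "\n" . implode("\n", $directories));
}
```

If later directories are allowed to override earlier ones, the sort
would have to go, since order would then be significant.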
One last thing: certain configuration directives require that files
get added. They may even be specified dynamically. It is not a good
idea for the HTMLPurifier_Config object to be used directly for such
matters. Instead, the userland code should explicitly perform the
includes. We may put in something like:

    REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks

to indicate that if that class doesn't exist, and the user is
attempting to use the directive, we should fatally error out. The stub
includes the core files, and the user includes everything else. Any
reflective things like new $class would be required to tie in with the
configuration.
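
A sketch of what the REQUIRES check might look like. The function name
is hypothetical, and it throws an exception instead of raising a fatal
error purely so that it can be tried in isolation:

```php
<?php
/**
 * Hypothetical check run when a directive with a REQUIRES field is set.
 * $requiredClasses is the parsed list of class names from that field.
 */
function example_check_requires($directive, array $requiredClasses)
{
    foreach ($requiredClasses as $class) {
        // class_exists() gives any registered autoloader a chance first
        if (!class_exists($class)) {
            throw new RuntimeException(
                "Directive $directive requires class $class, " .
                "which has not been included"
            );
        }
    }
}
```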
It would work very well with rarely used configuration options, but it
wouldn't be so good for "core" parts that can be disabled. In such cases
the core include file would need to be modified, and the only way
to properly do this is to use the configuration object. Once again, our
ability to create cache keys saves the day: we can create arbitrary
stub files for arbitrary configurations and include those. They could
even be the single file affairs. The only thing we'd need to include,
then, would be HTMLPurifier_Config! Then, the configuration object would
load the library.
An aside...
-----------
One questions, however, the wisdom of letting PHP files write other PHP
files. It seems like a recipe for disaster, or at least lots of headaches
in highly secured setups, where PHP does not have the ability to write
to its root. In such cases, we could use sticky bits or tell the user
to manually generate the file.

The other troublesome bit is actually doing the calculations necessary.
For certain cases, it's simple (such as URIScheme), but for AttrDef
and HTMLModule the dependency trees are very complex in relation to
%HTML.Allowed and friends. I think that this idea should be shelved
and looked at at a later, less insane date.

An interesting dilemma presents itself when a configuration form is
offered to the user. Normally, the configuration object is not
accessible without editing PHP code; this facility changes things. The
sensible thing to do is stipulate that all classes required by the
directives you allow must be included.
Unit testing
------------
Setting up the parsing and translation into our existing format would not
be difficult to do. It might represent a good time for us to rethink our
tests for these facilities; as creative as they are, they are often hacky
and require public visibility for things that ought to be protected.
This is especially applicable for our DefinitionCache tests.
Migration
---------
Because we are not *adding* anything essentially new, it should be trivial
to write a script to take our existing data and dump it into the new
format. Well, not trivial, but fairly easy to accomplish. Primary
implementation difficulties would probably involve formatting the file
nicely.
Backwards-compatibility
-----------------------
I expect that the ConfigSchema methods should stick around for a little
bit, but display E_USER_NOTICE warnings that they are deprecated. This
will require documentation!
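
A sketch of what a deprecated method could look like. Example_ConfigSchema
and its define() signature are stand-ins for illustration, not the real
ConfigSchema API:

```php
<?php
/** Hypothetical stand-in for a schema class with a deprecated method. */
class Example_ConfigSchema
{
    public static function define($namespace, $name, $default, $type)
    {
        // Warn, but keep working, so existing callers are not broken.
        trigger_error(
            'Example_ConfigSchema::define() is deprecated; ' .
            'use schema files instead',
            E_USER_NOTICE
        );
        // ... forward to the new implementation here ...
        return "$namespace.$name";
    }
}
```

PHP 5.3 and later also provide E_USER_DEPRECATED, which may be a better
fit if older PHP versions do not need to be supported.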
New stuff
---------
VERSION: Version number in which the directive was introduced
DEPRECATED-VERSION: If the directive was deprecated, when was it
    deprecated?
DEPRECATED-USE: If the directive was deprecated, what should the user
    use now?
REQUIRES: What classes does this configuration directive require, but
    are not part of the HTML Purifier core?

vim: et sw=4 sts=4