safe_html.php
By Chris Snyder <csnyder@chxo.com>
safe_html() is a free php function that takes a conservative approach
to sanitizing user input while still allowing some markup through. It
is meant to be used on posts to message boards or comment systems,
where you want to allow html images, links and lists, but not
cross-site scripting attacks.
It blocks attempts to embed JavaScript, removes style attributes,
removes undesirable html tags, and closes any open tags or comments at
the end of the post. When in doubt, safe_html() will strip all html tags
from a post, rather than let something tricky slip through.
Full disclosure: this is a
simple hack, but I trust it on production sites. It does not try to
clean up or standardize the html itself. It does not
use the best/most-efficient regexes. It uses brute force. It is not
certified by anyone. You are welcome to audit the code and submit both
exploits and patches, which I may incorporate into future versions.
For input that should not contain html tags at all, it is much better
to use php's built-in htmlentities()
function. And whereas htmlentities() can be decoded, safe_html()'s
filtering cannot be reversed. Your mileage may vary, read and
understand the disclaimer, etc etc.
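To make the reversibility point concrete, here is a minimal sketch using only php built-ins. The sample string is made up, and strip_tags() is only a stand-in for safe_html()'s one-way filtering, not the function itself.

```php
<?php
// Hypothetical input; in a real form handler this would come from $_POST.
$input = '<b>Tom & Jerry</b>';

// Encode every special character so that no tag survives as markup.
$encoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

// The encoding is lossless: html_entity_decode() restores the original.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// strip_tags() stands in here for safe_html()'s filtering: once the
// markup is removed, it cannot be recovered from the result.
$stripped = strip_tags($input);
```

The encoded string still carries everything the user typed, while the stripped string has permanently lost the <b> tags.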
How To Use safe_html()
To use safe_html.php in your code, simply include it (once) and then
call the safe_html() function as you would htmlentities() or any other
php string processing function.
<?php
include_once('/path/to/safe_html.php');
$markup = $_POST['content'];
$safemarkup = safe_html( $markup );
?>
<h3>Your Post:</h3>
<?=$safemarkup?>
The usage above causes safe_html() to use its default set of allowed
html tags. You can explicitly set the tags allowed using an optional
second argument to safe_html(), as detailed below.
<?php
include_once('/path/to/safe_html.php');
$markup = $_POST['content'];
// define custom allowed tags array
// - balanced tags (<p></p>) are marked 1
// - unbalanced tags (<hr>) are marked 0
//
// FYI this example is the default allowedtags array
//
$allowedtags= array ( "p"=>1, "br"=>0, "a"=>1, "img"=>0,
"li"=>1, "ol"=>1, "ul"=>1,
"b"=>1, "i"=>1, "em"=>1, "strong"=>1,
"del"=>1, "ins"=>1, "u"=>1, "code"=>1, "pre"=>1,
"blockquote"=>1, "hr"=>0
);
$safemarkup = safe_html( $markup, $allowedtags );
?>
<h3>Your Post:</h3>
<?=$safemarkup?>
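If you only want to tighten the defaults rather than write a list from scratch, one possible approach is to start from the default array and remove the tags you don't want. The policy below (no images, no links) is purely an illustration, and the safe_html() call is left commented out so the snippet runs without safe_html.php being present.

```php
<?php
// The default allowedtags array, copied from the example above.
$defaulttags = array( "p"=>1, "br"=>0, "a"=>1, "img"=>0,
                      "li"=>1, "ol"=>1, "ul"=>1,
                      "b"=>1, "i"=>1, "em"=>1, "strong"=>1,
                      "del"=>1, "ins"=>1, "u"=>1, "code"=>1, "pre"=>1,
                      "blockquote"=>1, "hr"=>0
                    );

// Hypothetical stricter policy: text formatting only, so drop images and links.
// array_diff_key() compares keys only; the values in the second array are ignored.
$allowedtags = array_diff_key($defaulttags, array("img"=>0, "a"=>0));

// With safe_html.php included, the call would then be:
// $safemarkup = safe_html( $markup, $allowedtags );
```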
Theory of Operation
For general information on Cross-Site Scripting (XSS) see the Wikipedia
entry or get your hands on a copy of Pro PHP
Security, Chapter 13.
For a comprehensive list of attack vectors, see RSnake's cheatsheet at http://ha.ckers.org/xss.html.
There are three main strategies taken by safe_html.php: preventing
JavaScript injection, stripping the style attribute on html tags, and
stripping any html tags that aren't on an allowed short-list. Between
them, these cut off most of the known vectors of attack... but not all.
Attackers may still link to malicious URLs, for instance. And they are
always free to try social engineering tricks.
Preventing JavaScript Injection
If safe_html() encounters anything inside of a tag that looks like a
script call, it will assume that an attack is
being attempted, and call php's strip_tags() on the entire value being
processed. safe_html() will not filter JavaScript outside of html tags,
so that your users can still talk about code.
I think <em>javascript:foo()</em> is the best method.
Comes out fine:
I think javascript:foo() is
the best method.
But:
I think <a href="javascript:alert(document.cookie);">you should
click this</a>.
Comes out as plain text, because strip_tags() is triggered:
I think you should click this.
Simply put, semi-trusted web users have no business using any sort of
JavaScript in their messages or resources. That's something that should
be restricted to static pages and templates, managed by trusted users
out of band.
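The "detect, then strip everything" behavior described above can be sketched roughly as follows. This is not the actual safe_html() code: the function name sketch_block_script() and the regular expression are assumptions made for illustration, and the pattern covers only the most obvious cases (it does not handle entity-encoded or otherwise obfuscated input).

```php
<?php
// Hypothetical helper, not the real safe_html() code: if any tag contains
// something that looks like a script call, strip all tags from the input.
function sketch_block_script($markup)
{
    // Match "javascript:" or an on* event handler attribute, but only
    // between the "<" and ">" of a single tag.
    $pattern = '/<[^>]*(javascript\s*:|\bon\w+\s*=)[^>]*>/i';
    if (preg_match($pattern, $markup)) {
        // Something suspicious is inside a tag: fall back to plain text.
        return strip_tags($markup);
    }
    // Script-like text outside of tags is left alone.
    return $markup;
}
```

Run against the two examples above, this sketch leaves the <em> sentence untouched and reduces the javascript: link to plain text.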
Stripping Style Attributes
Unfortunately this may be a little unpopular, but here it is:
safe_html() strips all style="" attributes from html tags. That's
right, you don't get arbitrary styling, which means that the font and
size menus, colors, and indenting on some in-browser WYSIWYG content
editors won't work. It also means that (unless you specifically allow
the <font> tag) users cannot change the font size or color in the
markup they post. This is done to prevent user content from hijacking
your page.
For instance, the following html will effectively cover whatever page
it is on, and entice the user to click a malicious link:
<p style="position: absolute; top: 0px; left: 0px;
width: 98%; height: 98%;
background-color: white;">
Experiencing technical difficulties,
<a href="http://10.0.17.128/?action=login">click here to re-try</a>.
</p>
safe_html() will strip the style attribute altogether, rendering the
markup as a normal <p> paragraph, and exposing the post as an
attempt at a social engineering attack.
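A rough sketch of style stripping, assuming a simple regular-expression approach. The helper name sketch_strip_style() and the pattern are illustrative assumptions rather than the code safe_html() actually uses, and unquoted style values are not handled.

```php
<?php
// Hypothetical helper, not the real safe_html() code.
function sketch_strip_style($markup)
{
    // Inside any tag, drop style="..." or style='...' (any case, and
    // possibly spanning several lines) while keeping everything else.
    $pattern = '/(<[^>]*?)\s+style\s*=\s*("[^"]*"|\'[^\']*\')([^>]*>)/i';
    // Repeat until nothing changes, in case a tag carries more than
    // one style attribute.
    do {
        $before = $markup;
        $markup = preg_replace($pattern, '$1$3', $markup);
    } while ($markup !== $before);
    return $markup;
}
```

Applied to the page-covering example above, the opening tag would be reduced to a plain <p>, leaving the text and link in normal document flow.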
Stripping and Balancing html
In addition to the above protections, safe_html() ensures that users
can only use approved html tags, and that all open tags and comments
are closed by the end of the markup. That way some sad sack can't leave
an <a> tag unclosed and wreck your whole interface. This happens
more often than you might think.
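One way to picture the stripping and balancing step, using the same allowedtags format as the usage example (1 for balanced tags, 0 for unbalanced). The helper sketch_strip_and_balance() is hypothetical and much cruder than a real implementation: it counts opening and closing tags and appends whatever is missing, without checking nesting order or handling comments.

```php
<?php
// Hypothetical helper, not the real safe_html() code.
function sketch_strip_and_balance($markup, $allowedtags)
{
    // Build the "<p><br>..." allow-list string that strip_tags() accepts.
    $allowlist = '<' . implode('><', array_keys($allowedtags)) . '>';
    $markup = strip_tags($markup, $allowlist);

    // For each balanced tag (marked 1), append any missing closing tags.
    foreach ($allowedtags as $tag => $balanced) {
        if (!$balanced) {
            continue; // unbalanced tags such as <br> never need closing
        }
        $name = preg_quote($tag, '/');
        // \b keeps <b> from also matching <br> or <blockquote>.
        $opened = preg_match_all('/<' . $name . '\b[^>]*>/i', $markup);
        $closed = preg_match_all('/<\/' . $name . '\s*>/i', $markup);
        if ($opened > $closed) {
            $markup .= str_repeat('</' . $tag . '>', $opened - $closed);
        }
    }
    return $markup;
}
```

Because the closing tags are appended in array order, the result is not guaranteed to be correctly nested; it only ensures nothing is left open past the end of the post.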
The sanitized text is branded with the safe_html version number (which
doubles as a comment closure) so you'll know which text needs to be
refiltered the next time a bug fix is released or a new attack vector is
discovered.
Last updated 2005-09-05.