safe_html.php

By Chris Snyder <csnyder@chxo.com>

safe_html() is a free php function that takes a conservative approach to sanitizing user input while still allowing some markup through. It is meant to be used on posts to message boards or comment systems, where you want to allow html images, links and lists, but not cross-site scripting attacks.

It blocks attempts to embed JavaScript, removes style attributes, removes undesirable html tags, and closes any open tags or comments at the end of the post. When in doubt, safe_html() will strip all html tags from a post, rather than let something tricky slip through.

Full disclosure: this is a simple hack, but I trust it on production sites. It does not try to clean up or standardize the html itself. It does not use the best/most-efficient regexes. It uses brute force. It is not certified by anyone. You are welcome to audit the code and submit both exploits and patches, which I may incorporate into future versions.

For input that should not contain html tags at all, it is much better to use php's built-in htmlentities() function. And whereas htmlentities() can be decoded, safe_html()'s filtering cannot be reversed. Your mileage may vary, read and understand the disclaimer, etc etc.
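As a minimal sketch of the htmlentities() alternative for tag-free input: the sample string and variable names below are made up for illustration, and the ENT_QUOTES flag and UTF-8 charset are assumptions about your setup rather than anything safe_html.php requires.

```php
<?php
// Hypothetical input; in a real script this would come from $_POST.
$input = 'I <b>really</b> like "quotes" & ampersands';

// Escape every character that has an html entity, including both quote styles.
$escaped = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Unlike safe_html()'s filtering, this step can be undone.
$restored = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```

Here $restored ends up identical to $input, which is the reversibility that safe_html() cannot offer.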


How To Use safe_html()

To use safe_html.php in your code, simply include it (once) and then call the safe_html() function as you would htmlentities() or any other php string processing function.

<?php
  include_once('/path/to/safe_html.php');
  $markup = $_POST['content'];
  $safemarkup = safe_html( $markup );

?>
<h3>Your Post:</h3>
<?=$safemarkup?>

The usage above causes safe_html() to use its default set of allowed html tags. You can explicitly set the tags allowed using an optional second argument to safe_html(), as detailed below.

<?php
  include_once('/path/to/safe_html.php');
  $markup = $_POST['content'];

  // define custom allowed tags array
  // - balanced tags (<p></p>) are marked 1
  // - unbalanced tags (<hr>) are marked 0
  //
  // FYI this example is the default allowedtags array
  //
  $allowedtags = array ( "p"=>1, "br"=>0, "a"=>1, "img"=>0,
                         "li"=>1, "ol"=>1, "ul"=>1,
                         "b"=>1, "i"=>1, "em"=>1, "strong"=>1,
                         "del"=>1, "ins"=>1, "u"=>1, "code"=>1, "pre"=>1,
                         "blockquote"=>1, "hr"=>0
                       );

  $safemarkup = safe_html( $markup, $allowedtags );

?>
<h3>Your Post:</h3>
<?=$safemarkup?>

Theory of Operation

For general information on Cross-Site Scripting (XSS) see the Wikipedia entry or get your hands on a copy of Pro PHP Security, Chapter 13.

For a comprehensive list of attack vectors, see RSnake's cheatsheet at http://ha.ckers.org/xss.html.

There are three main strategies taken by safe_html.php: preventing JavaScript injection, stripping the style attribute on html tags, and stripping any html tags that aren't on an allowed short-list. Between them, these cut off most of the known vectors of attack... but not all. Attackers may still link to malicious URLs, for instance. And they are always free to try social engineering tricks.

Preventing JavaScript Injection

If safe_html() encounters anything inside of a tag that looks like a script call, it will assume that an attack is being attempted, and call php's strip_tags() on the entire value being processed. safe_html() will not filter JavaScript outside of html tags, so that your users can still talk about code.

I think <em>javascript:foo()</em> is the best method.

Comes out fine:

I think javascript:foo() is the best method.

But:

I think <a href="javascript:alert(document.cookie);">you should
click this</a>.

Comes out as plain text, because strip_tags() is triggered:

I think you should click this.

Simply put, semi-trusted web users have no business using any sort of JavaScript in their messages or resources. That's something that should be restricted to static pages and templates, managed by trusted users out of band.
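The idea can be sketched roughly as follows. This is not the actual safe_html() source: the function name sketch_block_scripts() and both regexes are assumptions made up for illustration, and they are deliberately naive (they would miss encoded or obfuscated variants, which the real function has to worry about).

```php
<?php
// Hypothetical helper, NOT the real safe_html() code.
function sketch_block_scripts($html)
{
    // Collect every tag (anything between < and >) in the input.
    preg_match_all('/<[^>]*>/', $html, $matches);

    foreach ($matches[0] as $tag) {
        // A "javascript:" URL or an on* event handler inside a tag
        // is treated as an attempted attack.
        if (preg_match('/javascript\s*:|\bon\w+\s*=/i', $tag)) {
            // When in doubt, fall back to plain text for the whole value.
            return strip_tags($html);
        }
    }

    // Script-like text outside of tags is left alone.
    return $html;
}
```

Run against the two examples above, the first string comes back unchanged and the second is reduced to plain text. The pattern errs on the side of false positives (a harmless query string such as ?online=1 inside a tag would also trigger it), which matches the conservative approach described here.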

Stripping Style Attributes

Unfortunately this may be a little unpopular, but here it is: safe_html() strips all style="" attributes from html tags. That's right, you don't get arbitrary styling, which means that the font and size menus, colors, and indenting on some in-browser WYSIWYG content editors won't work. It also means that (unless you specifically allow the <font> tag) users cannot change the font size or color in the markup they post. This is done to prevent user content from hijacking your page.

For instance, the following html will effectively cover whatever page it is on, and entice the user to click a malicious link:
<p style="position: absolute; top: 0px; left: 0px; 
width: 98%; height: 98%;
background-color: white;">
Experiencing technical difficulties,
<a href="http://10.0.17.128/?action=login">click here to re-try</a>.
</p>

safe_html() will strip the style attribute altogether, rendering the markup as a normal <p> paragraph, and exposing the post as an attempt at a social engineering attack.
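A rough sketch of style stripping, under stated assumptions: sketch_strip_style() is a hypothetical name, the regexes are mine rather than the ones safe_html.php uses, and the closure syntax needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper, NOT the real safe_html() code.
function sketch_strip_style($html)
{
    // Only look inside tags, so the word "style" in ordinary text is left alone.
    return preg_replace_callback(
        '/<[^>]*>/',
        function ($m) {
            // Drop style="...", style='...' and unquoted style=value.
            return preg_replace(
                '/\s+style\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $m[0]
            );
        },
        $html
    );
}
```

Applied to the page-covering paragraph above, this would leave a plain <p> tag with the text and link intact. Other attributes on the same tag are not touched.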

Stripping and Balancing html

In addition to the above protections, safe_html() ensures that users can only use approved html tags, and that all open tags and comments are closed by the end of the markup. That way some sad sack can't leave an <a> tag unclosed and wreck your whole interface. This happens more often than you might think.
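One way to picture the stripping and balancing step is the sketch below. It is an illustration only, not the safe_html() implementation: sketch_strip_and_balance() is a hypothetical name, it leans on php's strip_tags() for the whitelist, and it closes leftover tags with a simple stack. It takes an allowed tags array in the same shape as the one shown earlier (1 for balanced tags, 0 for unbalanced ones).

```php
<?php
// Hypothetical helper, NOT the real safe_html() code.
function sketch_strip_and_balance($html, $allowedtags)
{
    // Build the "<p><br><a>" style list that strip_tags() expects.
    $whitelist = '';
    foreach (array_keys($allowedtags) as $name) {
        $whitelist .= '<' . $name . '>';
    }
    $html = strip_tags($html, $whitelist);

    // Walk the remaining tags in order, tracking which balanced tags are open.
    $stack = array();
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/i', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $t) {
        $name = strtolower($t[2]);
        if (empty($allowedtags[$name])) {
            continue; // unbalanced tags such as <br> never need closing
        }
        if ($t[1] === '') {
            $stack[] = $name; // opening tag
        } else {
            // closing tag: cancel the most recent matching opener, if any
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $name) {
                    array_splice($stack, $i, 1);
                    break;
                }
            }
        }
    }

    // Close whatever is still open, innermost tag first.
    while ($stack) {
        $html .= '</' . array_pop($stack) . '>';
    }
    return $html;
}
```

With array("p"=>1, "a"=>1, "br"=>0) as the allowed tags, an unclosed <a> inside an unclosed <p> gets </a></p> appended, and a <font> tag is removed while its text is kept. The sketch does not repair tags that are closed in the wrong order.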

The sanitized text is branded with the safe_html version number (which doubles as a comment closure), so you'll know which text needs to be refiltered the next time a bug is fixed or a new attack vector is discovered.


Last updated 2005-09-05.