Safe HTML
2009-07-07 17:29:23.454946+00 by Dan Lyke
7 comments
Okay, LazyWeb: The Flutterby CMS knows some stuff about HTML and jumps through hoops to prevent users from entering malicious HTML. It's probably not perfect, but through the years I've seen exploits in various other web sites and thought "wow, I can't believe they left that in."
I'm playing with a new web site, and am wondering what the current mainstream mechanisms are for filtering user-submitted HTML to control the potential for exploits or bad HTML.
[ related topics: Content Management, Invention and Design ]
comments in ascending chronological order:
#Comment Re: made: 2009-07-07 18:39:48.680979+00 by: Dan Lyke
Found HTML::Scrubber, still interested in other solutions.
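The basic shape of it, going by the module's synopsis, is something like this (untested; the tag list and $untrusted_html are just placeholders):

    use HTML::Scrubber;

    # allow a handful of harmless tags, strip everything else
    my $scrubber = HTML::Scrubber->new( allow => [qw( p b i em strong br )] );

    # $untrusted_html stands in for whatever came out of the form
    my $clean = $scrubber->scrub($untrusted_html);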
#Comment Re: made: 2009-07-07 22:10:25.621775+00 by: spc476
I think a whitelist is the best bet, both for tags and attributes. Disallow any tags dealing with frames, iframes, forms and objects (except for IMG). Disallow the style attribute, any on* attribute and the tabindex and accesskey attributes. Definitely filter out the SCRIPT tag, and filter out any URLs that start with "javascript:". I think that will get you most of the way.
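If you go the HTML::Scrubber route Dan mentions above, that set of rules might come out looking roughly like this (a sketch, untested; the tag and attribute lists are only examples, and a positive scheme whitelist on the URLs would be stricter still):

    use HTML::Scrubber;

    my $scrubber = HTML::Scrubber->new(
        # whitelist of tags: script, iframe, frame, form and object are out
        # simply because they aren't in the list
        allow   => [qw( p br b i u em strong ul ol li blockquote pre code a img )],
        comment => 0,    # drop HTML comments
        process => 0,    # drop processing instructions
        rules   => [
            # per-tag attribute whitelists; the regexes refuse URLs that
            # start with "javascript:"
            a   => { href => qr{^(?!\s*javascript:)}i, title => 1, '*' => 0 },
            img => { src  => qr{^(?!\s*javascript:)}i, alt   => 1,
                     width => 1, height => 1, '*' => 0 },
        ],
        # every other attribute (style, on*, tabindex, accesskey, ...) is
        # denied by the default attribute rule
        default => [ 0, { '*' => 0 } ],
    );

    my $clean = $scrubber->scrub($untrusted_html);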
#Comment Re: made: 2009-07-08 01:32:27.769586+00 by: John Anderson
+1 for the whitelist of allowed stuff.
#Comment Re: made: 2009-07-08 11:46:22.518204+00 by: meuon
In PHP, I've used DOMDocument (to validate) and strip_tags() to limit such things, then some regexes on keywords.
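For the Perl-inclined, a rough analogue of that validate-then-filter idea might look like the sketch below, with HTML::TreeBuilder standing in for DOMDocument and the whitelist modules discussed above doing the strip_tags() part (untested; $untrusted_html is a placeholder):

    use HTML::TreeBuilder;

    # parse and re-serialize, which repairs and normalizes the markup,
    # much like round-tripping it through DOMDocument
    my $tree  = HTML::TreeBuilder->new_from_content($untrusted_html);
    my $clean = $tree->as_HTML;
    $tree->delete;

    # then the keyword pass
    die "suspicious input\n" if $clean =~ /<\s*script|javascript\s*:/i;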
#Comment Re: made: 2009-07-08 15:00:21.40526+00 by: Dan Lyke
Yeah, I'm generally whitelisting rather than blacklisting, although I think I'll want to allow iframes and objects for this app. My main question was which tool to do it with; I'm trying to make this new project a bit of a learning experience, to keep up with what others are playing with.
So far, I think HTML::Scrubber is the right tool, but I'm not tied to it yet.
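If the iframes and objects do go through, HTML::Scrubber's per-tag rules can at least pin their src and data attributes to http(s) URLs. Picking up the $scrubber object from the sketch above, something like this (from memory of the allow() and rules() methods, untested):

    $scrubber->allow(qw( iframe object ));
    $scrubber->rules(
        iframe => { src  => qr{^https?://}i, width => 1, height => 1, '*' => 0 },
        object => { data => qr{^https?://}i, type  => 1,
                    width => 1, height => 1, '*' => 0 },
    );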
#Comment Re: made: 2009-07-08 15:24:16.963524+00 by: John Anderson
Have you looked at HTML::StripScripts::Parser?
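If I remember the docs right, the shape of it is roughly this (a sketch, untested; $untrusted_html is a placeholder):

    use HTML::StripScripts::Parser;

    my $hss = HTML::StripScripts::Parser->new(
        {
            Context   => 'Flow',   # filter a fragment rather than a whole document
            AllowHref => 1,        # keep href attributes pointing at http/https/ftp URLs
            AllowSrc  => 0,        # drop src attributes
        },
        strict_comment => 1,       # options after the hashref go straight to HTML::Parser
        strict_names   => 1,
    );

    my $clean = $hss->filter_html($untrusted_html);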
#Comment Re: made: 2009-07-14 13:47:45.781693+00 by: other_todd
I stripped ruthlessly when I allowed direct HTML in the past, usually by specifically allowing a very small set of tags and then stripping anything else that looks like <...> at all. Users rarely type those characters as part of plain text in comment posts (and the W3C says they should be using &lt; and &gt; anyway, not that the average end user knows what those are).
I did it myself. I hate calling libraries for HTML wrangling, especially if the code is short. The threshold on "it's less work to live with the bloat and quirks of this Perl library than do it myself" is a very high one for me, as you know.
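Something along these lines covers the whole job in a dozen lines (a sketch, not anyone's production code; this variant escapes the leftovers instead of stripping them, which is arguably even safer):

    # whitelist a few bare tags; everything else ends up entity-escaped
    my %ok = map { $_ => 1 } qw( b i em strong p br blockquote );

    sub clean_comment {
        my ($html) = @_;

        # escape every ampersand and angle bracket first ...
        $html =~ s/&/&amp;/g;
        $html =~ s/</&lt;/g;
        $html =~ s/>/&gt;/g;

        # ... then let bare, attribute-free whitelisted tags back through
        $html =~ s{&lt;(/?)(\w+)&gt;}
                  { $ok{lc $2} ? "<$1\L$2\E>" : "&lt;$1$2&gt;" }ge;

        return $html;
    }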
These days I don't scrub HTML at all, because I will never again display a web form to parties who haven't been prescreened, on any of my web sites. This doesn't remove the possibility of mischief, but it greatly decreases it, and if it does happen, I know who to come after with a shotgun.
Even at my workplace, we got away from open web forms years ago. You have to log in in some way to use any form on our site. Those we still filter HTML on, though, because we have enough users with logins we don't trust (e.g. the students, who love to link enormous images or embed media in places they don't belong).