If you've ever taken a gander at a string sanitization class or library, you've probably noticed the amount of code necessary to keep the script kiddies at bay. We're talking about slow string manipulations, namely string replacements and regular expressions.
Most MVC frameworks today come with some form of sanitization or input filtering class built in. The problem with many of these libraries is that they fail to clean some of the more creative attack vectors. To combat this, some people use drop-in libraries like OWASP AntiSamy or HTML Purifier to ensure their data is getting scrubbed clean. HTML Purifier, for instance, runs a smoke test against the ha.ckers.org XSS attack list (the de facto standard for finding attack vectors to test, might I add) to ensure it's cleaning anything and everything.
So What's The Problem?
Cleaning a string is slow, to say the least. A slow load time is the bane of your site's existence. Customers and visitors hate waiting. The tools mentioned above, as well as their equivalents, accomplish the task via brute-force stripping of characters and numerous regular expressions, and they do that work on every input, whether or not it contains anything suspicious. Here are a few examples of some common XSS filters out there:
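To give a feel for the approach, here is a deliberately tiny, hypothetical sketch in Python. It is not taken from any of the tools named above, and it is nowhere near a safe filter; `RULES` and `brute_force_clean` are made-up names. The only point is the shape: a pile of patterns, each of which gets run over every string.

```python
import re

# Hypothetical rule list in the brute-force style described above.
# NOT a complete or safe XSS filter; real libraries carry far more rules.
RULES = [
    # Drop whole <script>...</script> blocks.
    (re.compile(r"<\s*script\b.*?>.*?<\s*/\s*script\s*>", re.I | re.S), ""),
    # Drop inline event handlers such as onerror=... or onclick="...".
    (re.compile(r"""\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.I), ""),
    # Drop javascript: pseudo-protocol prefixes.
    (re.compile(r"javascript\s*:", re.I), ""),
]


def brute_force_clean(value: str) -> str:
    """Run every rule over the input, even if the input is plain text."""
    for pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return value
```

Notice that `brute_force_clean("plain text")` still walks the entire rule list before handing the string back unchanged. Multiply that by dozens of rules and every form field on every request, and the cost adds up.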
To optimize these libraries, we need to step back a minute and think about what it is we're trying to accomplish. In this case, how can we go about skipping all of the string manipulations and pattern matching with no loss in security?
So What's The Solution?
There is no catch-all solution, especially if you're trying to sanitize HTML coming in from a WYSIWYG editor. There is, however, a very worthwhile tweak you can use to make your site's sanitization much, much faster under most circumstances. The simple solution is to start your sanitization with a conditional statement that checks whether the string consists entirely of known-harmless characters (a whitelist match, in other words) and returns it untouched if so. Grunt work avoided. Anything that fails the check goes through the full sanitizer as before, so nothing is lost on the security side. Such a simple trick works because 99.9% of user input is going to be alphanumeric with a few sprinkles of punctuation. Don't quote me on that.
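A minimal sketch of that early return, written in Python purely for illustration. `SAFE_INPUT`, `sanitize` and `slow_sanitize` are made-up names, the character class is one assumed whitelist rather than a recommendation, and `slow_sanitize` is a stand-in that only HTML-escapes; you'd swap in the call to your real sanitization library.

```python
import html
import re

# Assumed whitelist: ASCII letters, digits, spaces and a little punctuation.
# Deliberately absent: < > & " ' ` = / \ ( ) ; : % and all control characters.
SAFE_INPUT = re.compile(r"[A-Za-z0-9 .,!?_-]*")


def slow_sanitize(value: str) -> str:
    """Stand-in for the expensive library call; here it only escapes HTML."""
    return html.escape(value, quote=True)


def sanitize(value: str) -> str:
    # Fast path: every character is on the whitelist, so there is nothing
    # to clean and the string can be returned as-is.
    if SAFE_INPUT.fullmatch(value):
        return value
    # Slow path: anything unusual still gets the full treatment.
    return slow_sanitize(value)
```

So `sanitize("Hello, world!")` comes straight back, while `sanitize("<script>alert(1)</script>")` falls through to the slow path. One design note: the sketch uses `fullmatch` rather than a `^...$` pattern because in Python `$` also matches just before a trailing newline, which would quietly let a character through that isn't on the list.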
It's debatable which additional characters you can add to the whitelist check without overly complicating the solution and potentially introducing security holes. I'll leave it up to you as the reader to suggest alternative whitelist regexes or punch holes in the solution (like hey guy, how do you know if the character set is in UTF-8?).
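On the UTF-8 question, one option is to decode strictly before running the whitelist and to treat any decoding failure as "not clean". The Python sketch below assumes the input arrives as raw bytes; `SAFE_TEXT` and `is_plainly_safe` are hypothetical names, and the wider character class (any Unicode letter or digit via `\w`) is exactly the kind of extension that is up for debate. Whether it is acceptable depends on how the text is encoded and escaped on the way back out.

```python
import re

# Assumed wider whitelist: Unicode letters, digits and underscore (\w),
# plus spaces and a little punctuation. Lookalikes such as the fullwidth
# less-than sign are symbols, not word characters, so they do not match.
SAFE_TEXT = re.compile(r"[\w .,!?-]*")


def is_plainly_safe(raw: bytes) -> bool:
    """Return True only for valid UTF-8 made up of whitelisted characters."""
    try:
        # Strict decoding rejects invalid and overlong byte sequences.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: never take the fast path.
        return False
    return SAFE_TEXT.fullmatch(text) is not None
```

A `False` here doesn't mean the input is an attack, only that it hasn't earned the shortcut and should go through the full sanitizer (or be rejected outright if it isn't valid UTF-8 at all).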