Sanitizing user data: How and where to do it

Sep 11 2008

User data can be dangerous. Whatever the user supplies as data, especially in a web application, cannot be assumed to be safe. On the contrary, there are many malicious users who try to exploit every security vulnerability in your application. XSS, CSRF, and SQL injection attacks are familiar to most of you. (If not, go figure it out and come back fast.) To protect your application from such attacks you need to sanitize user data so that it cannot do anything harmful to your system.

[Image: "Exploits of a Mom" comic]

A big question being discussed vigorously in the web development community is:

Where to sanitize the user data? Should it be done in the input stage where the data is being entered by the user or in the output stage where the data is being displayed to the user?

The solution, in my opinion (and in the opinion of a large group of experts in this field), is to handle the data in two steps: validate it and escape it for SQL before it goes into the database, then sanitize it (filter and escape) again just before it goes to the output.

So the process essentially boils down to validation on input and escaping on output. Here are the reasons to prefer this method over escaping and sanitizing on input alone:

  1. The way data needs to be sanitized depends on the context in which the data will be used. For example, if the data is to be stored in a database, we need to escape the ' character to prevent SQL injection attacks. If the data is to be displayed in HTML output, we need to escape the < and > characters to prevent XSS attacks. At the input stage we cannot anticipate every way the data is going to be used, so it is better to sanitize the data just before output, when it is clear where the data is going.
  2. You cannot always be sure that the data in the database is sanitized. You cannot guarantee that it came from the sources you anticipated. There is a chance that the data ended up in the database through a path where you have not placed your input sanitizer. What if a user directly edited the database to add some data? What if there are loopholes in your sanitizer? What if the data was placed by an SQL injection attack against your database? All these points tell us that we need to sanitize user data where it is being used, that is, at the output stage.
  3. There may be other applications which use the data from your database. For example, an application written in COBOL may be using the data to generate reports. If the data already in the database is in the form of &lt;script&gt;&nbsp;hello&nbsp;world, the COBOL application will not be able to make sense of it. It will have to implement its own decoder to read the data, which is a very painful process. We can avoid situations like this if we do not push processed data into the database.
  4. It is always best to have pure, unaltered data in the database so that it can be easily processed by all the applications using the data. Once we sanitize the data before it is stored in the database, there is no going back. It is really hard to recover the original data supplied by the user after all this filtering and escaping. On the other hand, if we have unaltered data in the database, it is easy to escape it later as each application requires.
  5. As the points above show, sanitization at the output is needed in any case. If we encode the user data at the input as well as at the output, the data ends up doubly encoded (an & stored as &amp; is displayed as &amp;amp;) and is no longer useful. There is no need to encode twice. So it is always recommended to encode your data to the target format just before passing it to the target system.
  6. Users have reported security holes in applications like phpMyAdmin when they display database values without encoding them for HTML. The developers of phpMyAdmin expected the data in user databases to be free of malicious code, but that may not be the case. So your application needs output sanitization, especially if you are using data from outside sources. Never trust any data coming your way.
  7. Assume that you are using input sanitization. If there is a bug in the sanitizer, malicious data will creep into the database, and now you have to fix the sanitizer and also remove all the malicious data from your database. This can be a very tedious job. But if you were using an output sanitizer, you would only have to modify the code to fix the security hole.
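Points 1 and 5 are easy to see in a few lines of code. The following is only a minimal sketch in Python (the sample string is made up, and html.escape, urllib.parse.quote and shlex.quote stand in for whatever escaping functions your own platform provides): the same raw value needs a different encoding for each target, and escaping it twice corrupts it.

```python
import html
import shlex
from urllib.parse import quote

# A made-up piece of user input, stored unaltered.
raw = "Tom & Jerry <b>fans</b>"

# Point 1: each output context needs its own encoding of the same raw value.
as_html = html.escape(raw)      # for an HTML page
as_url = quote(raw, safe="")    # for a URL component
as_shell = shlex.quote(raw)     # for a POSIX shell argument

# Point 5: escaping at input AND at output encodes the value twice.
stored_escaped = html.escape(raw)              # "sanitized" on the way in
double_encoded = html.escape(stored_escaped)   # escaped again on the way out

print(as_html)
print(as_url)
print(as_shell)
print(double_encoded)
```

A browser would show the doubly encoded value as literal &amp; and &lt; text instead of the characters the user typed, which is exactly the problem described in point 5.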

So how do you do this two-step sanitization? Here is how:

  1. User data comes in
  2. Validate the data
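  3. If valid, do SQL escaping and store in the database (mysql_real_escape_string( ) in PHP; that function was removed in PHP 7, so on current versions use mysqli_real_escape_string( ) or, better, prepared statements with bound parameters).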
  4. If invalid, reject the data. Don’t try to modify the data and push it into the database. This will do more harm than good. The user will think that the data went through successfully while the data in the database will be something else. So just accept or reject the user data. Don’t try to alter it.
  5. Output: If the data is going to an HTML page, escape for HTML (htmlentities( ) in PHP). If the data is going to a Unix command line, escape for the shell (escapeshellarg( ) in PHP). If the data is going to a URL, URL-encode the data (urlencode( ) in PHP), and so on.
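The steps above can be sketched roughly as follows. This is an illustration only, written in Python with the standard sqlite3 module rather than PHP and MySQL. It also swaps SQL string escaping for bound parameters, which serve the same purpose in step 3. The comments table, the 500-character limit and the function names are all assumptions made up for the example.

```python
import html
import sqlite3

MAX_COMMENT_LEN = 500  # hypothetical limit for this example


def open_db():
    """Create an in-memory database with a hypothetical comments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
    return conn


def store_comment(conn, text):
    """Steps 2-4: validate, then store the raw text or reject it."""
    if not text.strip() or len(text) > MAX_COMMENT_LEN:
        return False  # reject; never alter the input to make it fit
    # The value travels as a bound parameter, not as part of the SQL text.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))
    conn.commit()
    return True


def render_comments(conn):
    """Step 5: escape for HTML only at the moment of output."""
    rows = conn.execute("SELECT body FROM comments ORDER BY id")
    return "".join(f"<p>{html.escape(body)}</p>\n" for (body,) in rows)


conn = open_db()
store_comment(conn, "Robert'); DROP TABLE comments;--")
store_comment(conn, "<script>alert(1)</script>")
print(render_comments(conn))
```

The database keeps exactly what the user typed, so another application can read it without decoding anything, and the HTML escaping lives in one place at the output.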

In the validation step, first check that the data is properly encoded in the encoding you expect (URL encoding, UTF-7, Unicode, US-ASCII etc.). Then check that the data uses only the permitted character set: allow only the characters which are really needed by the application. Put a limit on the length of the input data; remember that an attacker often uses long strings to craft an attack. Finally, check that the data format is correct. Phone numbers should contain only digits, email addresses should follow the email address format, and so on.
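As a rough sketch of such checks in Python: the patterns and limits below are assumptions chosen for illustration (a digits-only phone number of 7 to 15 digits, a deliberately simple email pattern, a 254-character cap), not rules every application should copy.

```python
import re

# Hypothetical rules for this example; adjust them to your application.
PHONE_RE = re.compile(r"[0-9]{7,15}")  # ASCII digits only, bounded length
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_EMAIL_LEN = 254


def decode_utf8(raw: bytes):
    """Encoding check: return text if raw is well-formed UTF-8, else None."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def valid_phone(value: str) -> bool:
    """Character set, length and format in one anchored pattern."""
    return PHONE_RE.fullmatch(value) is not None


def valid_email(value: str) -> bool:
    """Length limit first, then a simple format check."""
    return len(value) <= MAX_EMAIL_LEN and EMAIL_RE.fullmatch(value) is not None


print(valid_phone("5551234567"), valid_phone("555-1234"))
print(valid_email("user@example.com"), valid_email("user@example"))
```

Each function only answers yes or no. Nothing here tries to repair bad input, in line with the accept-or-reject rule in step 4.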

Always use the methods or frameworks provided by your language/platform to do the escaping and encoding/decoding. Most of the languages out there support these operations. Java is a partial exception: its standard library has no HTML encoding/decoding methods, so you have to rely on a well-tested third-party library or, failing that, write your own methods to handle it.

Finally, when sending the data to the web browser, remember to declare the character encoding of the web page. This can be done using the charset parameter of the Content-Type response header or using a meta tag. It is advisable to use both methods. Forgetting this step can aid some types of XSS attacks, because the browser is left to guess the encoding.
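Here is a minimal sketch of declaring the encoding in both places, written as a tiny Python WSGI application. The page content and the function names render_page and app are made up for the example.

```python
import html

CHARSET = "utf-8"


def render_page(user_text):
    """Return (headers, body_bytes) for a page that displays user_text."""
    page = (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="{CHARSET}"><title>Demo</title></head>\n'
        f"<body><p>{html.escape(user_text)}</p></body></html>\n"
    )
    body = page.encode(CHARSET)
    headers = [
        # The same charset is declared in the header and in the meta tag.
        ("Content-Type", f"text/html; charset={CHARSET}"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


def app(environ, start_response):
    """Minimal WSGI application built on render_page."""
    headers, body = render_page("<b>hi</b> & bye")
    start_response("200 OK", headers)
    return [body]
```

Any WSGI server can run app. One option for a local test is wsgiref.simple_server from the standard library.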

Other concerns

Some websites need to output user input as HTML itself, for example websites that allow HTML editing. In this case you cannot simply encode everything in your application. Remember to add proper filtering mechanisms that allow only the tags that are intended to be used. Always block potentially dangerous tags such as <script></script>.
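To give an idea of what such a filter looks like, here is a small sketch built on Python's standard html.parser module. The allowlist is hypothetical. The filter drops every attribute, discards the contents of script and style elements, and escapes all remaining text. It does not balance tags, and for a real site a maintained, well-reviewed HTML sanitizing library is a safer choice than a home-grown filter like this one.

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}  # hypothetical allowlist
DROP_CONTENT = {"script", "style"}              # discard these with their contents


class AllowlistFilter(HTMLParser):
    """Re-emit only allowlisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append(f"<{tag}>")  # all attributes are dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data))


def filter_html(text):
    """Return text with only allowlisted tags kept; everything else is escaped or dropped."""
    parser = AllowlistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(filter_html('<b onclick="x()">hi</b><script>alert(1)</script> <a href="javascript:x">link</a>'))
```

Everything the filter outputs is either a fixed tag string from the allowlist or escaped text, so markup that was not explicitly allowed cannot pass through unchanged.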
