What is input validation?
Any system or application that works with input data needs to ensure that it is valid. This applies equally to information provided directly by the user and to data received from other systems. There are many different types and levels of validation, from syntactic validation that checks the input types and lengths to semantic validation that ensures supplied values make sense in the application context. So if you’re entering an email address, syntactic validation would mean checking the syntax (i.e. the characters and structure) to ensure that it is a valid email, while semantic validation might allow (or exclude) only addresses from specific domains.
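To make the distinction concrete, here is a minimal Python sketch of the email example. The regular expression is a deliberately simplified assumption (much stricter than the full email address grammar), and the allowed domains are hypothetical placeholders for whatever business rule applies in your application.

```python
import re

# Simplified syntactic pattern (an assumption for illustration only);
# real-world email syntax is considerably more permissive than this.
EMAIL_SYNTAX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

# Hypothetical business rule: only these domains may be used.
ALLOWED_DOMAINS = {"example.com", "example.org"}


def is_syntactically_valid(email: str) -> bool:
    """Syntactic check: overall length and structure only."""
    return len(email) <= 254 and EMAIL_SYNTAX.fullmatch(email) is not None


def is_semantically_valid(email: str) -> bool:
    """Semantic check: the value must also make sense for this application."""
    if not is_syntactically_valid(email):
        return False
    domain = email.rsplit("@", 1)[1].lower()
    return domain in ALLOWED_DOMAINS
```

With these definitions, an address such as bob@other.net passes the syntactic check but fails the semantic one, because its domain is not on the list.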
In web application development, input validation is typically understood as checking the values of web form input fields. This initial client-side validation is performed directly in the browser, but you also have to check submitted values on the server side because client-side checks can be bypassed by sending requests directly to the server.
While you will often see the terms user input or user-controlled input, actually determining all the application inputs that a malicious user could control is not easy. This is why it is good security practice to treat all application inputs as untrusted by default and validate everything. The same principle also applies to data originating from theoretically trusted systems and users since attackers may abuse such trust relationships to send dangerous data via a compromised third party.
The consequences of improper input validation
When reading about web vulnerabilities on this blog, you may have noticed that many of the posts have a very similar ending: “To mitigate this vulnerability, make sure you carefully validate all user inputs.” By preventing malicious users from freely entering attack strings, you can reduce your exposure to many injection attacks, including cross-site scripting (XSS), SQL injection, and code injection (which can lead to remote code execution, or RCE). If you look at the definition of CWE-20: Improper Input Validation, you will notice that this weakness can precede many others and lead to all sorts of security headaches.
While input validation alone can never prevent all attacks, it can reduce the attack surface and minimize the impact of any attacks that do succeed. Beyond its security implications, data validation is also crucial for software performance, stability, and usability. When processing invalid or corrupt data, an application might return incorrect results, fail to load, or even crash the web server.
Missing or insufficient input validation can also degrade the user experience on other levels. For example, if a registration page fails to detect an incorrect email or phone number, the user may be unable to confirm their account. If invalid data passes validation in the browser and is only caught during server-side validation, users may experience errors or longer load times.
How to ensure proper input validation in web applications
HTML5 validation features
The HTML5 spec includes built-in form validation features that let you specify validation constraints directly in HTML. These include input field attributes such as required (to indicate a required field), type (to specify the data type), maxlength (to define a maximum length limit), and pattern (to specify a regex pattern for valid values). The spec also defines CSS pseudo-classes such as :invalid so you can easily apply different styles depending on the validation result.
Built-in form validation features in HTML5 are a great place to get started with data validation. With just a few extra attributes in standard HTML elements, you get basic data type and content validation with cross-platform support to save you a lot of work and provide a native user experience. For detailed examples, see the MDN article on client-side form validation.
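Since submitted values also have to be checked on the server side, the constraints declared in HTML need a server-side counterpart. The following Python sketch mirrors the required, maxlength, and pattern attributes for a hypothetical username field. The field name, the length limit, and the pattern are all assumptions made for illustration.

```python
import re

# Hypothetical rules for a "username" field, mirroring the HTML attributes
# required, maxlength="20", and pattern="[a-z0-9_]{3,20}".
USERNAME_MAX_LENGTH = 20
USERNAME_PATTERN = re.compile(r"[a-z0-9_]{3,20}")


def validate_username(form: dict) -> list:
    """Return a list of error messages; an empty list means the value is valid."""
    value = form.get("username")
    if not isinstance(value, str) or value == "":
        # Counterpart of the `required` attribute, plus a basic type check.
        return ["username is required"]
    errors = []
    if len(value) > USERNAME_MAX_LENGTH:
        # Counterpart of the `maxlength` attribute.
        errors.append("username is too long")
    if USERNAME_PATTERN.fullmatch(value) is None:
        # Counterpart of `pattern`, which also has to match the whole value.
        errors.append("username does not match the expected format")
    return errors
```

Using fullmatch rather than search matters here: the HTML pattern attribute is matched against the entire value, so the server-side check should be anchored in the same way.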
Blacklisting vs. whitelisting
For well-defined inputs such as numbers, dates, or postcodes, it’s much easier and safer to use a whitelist than to try to blacklist every dangerous value. That way, you can precisely specify permitted values and reject everything else. With HTML5 form validation, you get predefined whitelisting logic in the built-in data type definitions, so if you indicate that a field contains an email address, you already have email validation. If only a handful of values are expected, you can use regular expressions to explicitly whitelist them.
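As a rough illustration of whitelisting well-defined inputs, the Python sketch below accepts only a handful of expected values, a fixed postcode format, and real calendar dates. The plan names and the five-digit postcode format are assumptions chosen for the example, not rules from any particular application.

```python
import re
from datetime import date

# Hypothetical whitelist for a "plan" field with only three expected values.
PLAN_PATTERN = re.compile(r"free|pro|enterprise")

# Assumed format: exactly five ASCII digits. [0-9] is used instead of \d
# because \d also matches non-ASCII digits in Python string patterns.
POSTCODE_PATTERN = re.compile(r"[0-9]{5}")


def is_valid_plan(value: str) -> bool:
    return PLAN_PATTERN.fullmatch(value) is not None


def is_valid_postcode(value: str) -> bool:
    return POSTCODE_PATTERN.fullmatch(value) is not None


def parse_iso_date(value: str):
    """Return a date for a strict YYYY-MM-DD string, or None if it is invalid."""
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
        return None
    try:
        # Rejects impossible dates such as 2024-02-30.
        return date.fromisoformat(value)
    except ValueError:
        return None
```

Anything that does not match is rejected outright, so there is no need to anticipate what an attack string might look like.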
Whitelisting gets tricky with free-form text fields, where you need some way to allow the vast majority of available characters, potentially in many different alphabets. Unicode character categories can be useful to allow, for example, only letters and numbers in a variety of international scripts. You should also apply Unicode normalization so that equivalent character sequences are always represented the same way, and reject any input that contains invalid characters.
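One possible way to combine normalization with category-based whitelisting is sketched below using Python's standard unicodedata module. The choice of NFC as the normalization form and the exact set of permitted categories (letters, combining marks, decimal digits, and plain spaces) are assumptions; a real application would pick these to suit its own data.

```python
import unicodedata


def normalize_text(value: str) -> str:
    """Normalize to NFC so that equivalent character sequences compare equal."""
    return unicodedata.normalize("NFC", value)


def is_letters_digits_spaces(value: str) -> bool:
    """Allow only letters (L*), marks (M*), decimal digits (Nd), and spaces."""
    if value == "":
        return False
    for ch in normalize_text(value):
        category = unicodedata.category(ch)
        if ch == " " or category[0] in ("L", "M") or category == "Nd":
            continue
        # Punctuation, symbols, control characters, and so on are rejected.
        return False
    return True
```

This accepts names written in Latin, Cyrillic, or CJK scripts while still rejecting angle brackets and control characters, without having to list individual characters.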
Input validation against XSS
The problems with validating free-form text once again highlight the limitations of input validation in a security context. Despite its importance for web application security, input validation is not and never should be your primary defense against cross-site scripting (XSS). Believing that rejecting angle brackets or script tags will protect you against XSS is asking for trouble. Simply filtering inputs is not enough to prevent cross-site scripting (and, in any case, does not cover all XSS variants), which is why XSS filters have been removed from modern web browsers.
In the case of cross-site scripting and other injection attacks, your main defense is context-aware output encoding to ensure that even if malicious code makes it into the application, it will not be executed. Apart from security, context-aware encoding is also important for usability. To end with a real-life example, if an application user needs to enter <script> in a text field (perhaps because they are writing a blog post about input validation), the application should properly encode these characters and ensure that they are processed correctly and safely in this specific context.
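As a minimal sketch of what context-aware encoding can look like, the Python example below encodes the same value differently depending on whether it ends up in HTML text or in a URL query parameter. The function names and the search link are hypothetical. In practice, a templating engine with automatic escaping would usually handle this for you.

```python
import html
from urllib.parse import quote


def encode_for_html(value: str) -> str:
    """HTML text or quoted-attribute context: escape &, <, >, and quotes."""
    return html.escape(value, quote=True)


def encode_for_url_param(value: str) -> str:
    """URL query-parameter context: percent-encode all reserved characters."""
    return quote(value, safe="")


def render_search_link(term: str) -> str:
    """Build a link where the same value appears in two different contexts."""
    # The percent-encoded value contains only unreserved characters and %XX
    # sequences, so it is also safe inside the quoted href attribute.
    return '<a href="/search?q={}">{}</a>'.format(
        encode_for_url_param(term), encode_for_html(term)
    )
```

Calling render_search_link("<script>") produces a link whose visible text is the literal string <script>, so the user's input is preserved and displayed rather than rejected or executed.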
For a detailed discussion of input validation in web applications, see the OWASP Input Validation Cheat Sheet.