This blog post includes a discussion of URLs, their structure, how they can contain sensitive information and why it's so difficult to parse them without introducing vulnerabilities. We include an example of how a parsing error led to a Window Opener Protection Bypass.

Since its inception, the Uniform Resource Locator (URL) has been a fundamental part of the World Wide Web. It is easily located in your current browser's address bar.
If you were not already very familiar with URLs, it would be easy to conclude that they always start with either 'http://' or 'https://', and can't contain sensitive information. Unfortunately, this is not true.

In this article, we'll shed light on these apparently insignificant strings, to reveal that they can be bursting with information, and we'll also examine why it's so hard to parse them correctly.
Let's start with a real-world example of how a simple-looking URL can trip up a parser. A recent HackerOne submission describes a tabnabbing protection bypass caused by a URL parsing error. Phabricator, an open source suite of software development tools, contained a security bug that could be abused with a rather interesting looking URL. Phabricator checks whether links added by users point to an internal resource or to another website. Those pointing to another website are treated with special care, as Phabricator adds an additional security attribute to the link. All other links (those that point to internal resources) do not receive this attribute. A link that Phabricator interpreted as pointing to an internal resource might look like this:
/\example.com/some-file
This doesn't look like a typical URL that you'd see in a browser window. So how does this work and why did Phabricator not recognize that it leads to an external website? Let's first look at the href attribute of link tags. In order to create an internal link to your website's blog, for example, your homepage must contain HTML code like this:
<a href = "/blog/">Our Blog</a>
This is fairly common. Browsers immediately know that you want to visit the /blog/ endpoint on the same website. However, it's also possible to do something like this:
<a href = "//example.com/blog/">An external Blog</a>
This has the potential to cause confusion. Why are there two slashes, where you would expect 'http://' or 'https://'? The answer has to do with mixed content. When you serve your website over HTTPS, you don't want any HTTP links to appear there. In older web browser versions you would risk leaking sensitive data (such as cookies) over HTTP if you included an image. In the newer web browser versions however, the content would simply be blocked and your image would not be displayed. This is obviously a problem.
So, imagine you want to load the image over HTTP. Your visitors go to http://yourwebsite.com/ which works fine. But once they visit the HTTPS version of your website you run into the problem with mixed content. You'd either have to replace 'http://' with 'https://' on the server side, or use JavaScript to do so. However, browsers are able to determine by themselves whether to use 'http://' or 'https://', if you let them: you need only omit 'http:' or 'https:' from the link, and the browser reuses the scheme of the current page. What's left is the two-slash link shown above.
The problem is, when you compare it to the link to your website's blog, you will see striking similarities – no 'http://', no 'https://', and it begins with a slash.
However, it's quite common to use the above syntax. It comes as no surprise, then, that the Phabricator developers took precautions and ensured that such links were treated as external links too. However, they neglected the fact that browsers automatically convert a backslash to a forward slash if the URL begins with '/\'. This is where the vulnerability occurred and why Phabricator parsed the URL incorrectly.
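The gap between a string-based check and what a browser actually does can be sketched with the WHATWG URL class that ships with browsers and Node.js. This is a minimal illustration, not Phabricator's actual code: the base URL https://phabricator.example/ and both function names are made up for the example.

```javascript
// Hypothetical address of the page that contains the user-supplied link.
const BASE = 'https://phabricator.example/';

// Naive string check: one leading slash means "internal",
// two leading slashes mean "external" (scheme-relative).
function looksInternalNaive(href) {
  return href.startsWith('/') && !href.startsWith('//');
}

// Resolve the link the way a browser would, then compare origins.
function isInternal(href, base = BASE) {
  try {
    return new URL(href, base).origin === new URL(base).origin;
  } catch {
    return false; // anything that cannot be parsed is treated as external
  }
}

const backslashLink = '/\\example.com/some-file'; // the string /\example.com/some-file

console.log(looksInternalNaive(backslashLink));          // true
console.log(new URL(backslashLink, BASE).href);          // https://example.com/some-file
console.log(isInternal(backslashLink));                  // false
console.log(new URL('//example.com/blog/', BASE).href);  // https://example.com/blog/
```

Comparing origins after resolution means the check never has to enumerate every prefix a browser might accept.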
But what was the security attribute about? The HTML attribute that Phabricator omitted on internal links was called noreferrer. Note how it's written differently from the HTTP Referer header. This header is sent by the browser to tell the server which website contained the link that the user clicked. That means that a server can determine whether you came from https://example.com/help or https://www.netsparker.com/blog. While this has its advantages, it also comes with a risk that sensitive information can easily be leaked to the web server. Of course, passwords and session IDs don't belong in URLs, and the referrer is just one of the reasons why this is a dangerous idea. But even if they don't contain a password or a session ID, URLs can still contain information that should not be made available to the visited site.
Let's consider an example. Imagine a customer uses a helpdesk application to open a ticket that contains a link to an article on their website. Once the employee clicks the link, their browser automatically sends the URL of the helpdesk page that contained the link, possibly including the title of the particular ticket, to the customer's web server.
Of course, internal helpdesk tickets may sometimes contain titles that shouldn't be shown to customers. One way to prevent the browser from sending a Referer header is via the rel attribute containing the noreferrer value. As explained before, the spelling is different to the HTTP header. The reason for this is that Phillip Hallam-Baker, the computer scientist who made the proposal for the Referer header, spelt it wrong. Reportedly, the UNIX spellchecker at the time knew neither 'referer' nor 'referrer'. Apart from the fact that this means our browsers send one byte less per request, the spelling of this header also leads to a lot of confusion. If you add rel = "noreferer" to your link tag, it doesn't have any effect.
So, to recap, when you add rel = "noreferrer" to your link, you prevent the browser from sending a Referer header.
The HackerOne submission mentions the tabnabbing exploit a few times, which is what both the submitter and the Phabricator developer seem to be most concerned about. But what is tabnabbing and why does noreferrer prevent it? Here is how tabnabbing works.
Whenever you open a new tab by clicking a link whose HTML code looks like this, the page in the new tab keeps a reference (window.opener) to the window object of the site that opened the tab:
<a href = "https://example.com/blog" target = "_blank">Blog</a>
The newly opened page is not allowed to read the location of the site that opened the tab, whether the rel = "noreferrer" attribute is set or not. However, what it can do is change the location of the opener by using the following JavaScript code:
window.opener.location = 'https://attacker.com/phishing';
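The tabnabbing attack would happen as follows:
1. The victim clicks a link on a trusted page, and the attacker's page opens in a new tab.
2. While the victim is looking at the new tab, the attacker's page sets window.opener.location and redirects the original tab to a phishing page.
3. The victim switches back to the original tab and finds the phishing page where they expect the trusted site.
This makes a phishing attack much more effective, because the user is not expecting such behaviour and thinks they are still on the original page ('tabnabbing'). The way to thwart this attack is to use rel = "noopener", though rel = "noreferrer" has the same effect.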
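One way to avoid depending on an internal/external classification at all is to add the attribute to every link that opens a new tab. The following is a minimal sketch under that assumption; renderNewTabLink and escapeHtml are hypothetical helpers that build the markup as a string.

```javascript
// Escape the characters that are significant inside HTML text and attributes.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Every new-tab link gets both values, so a misclassified URL cannot
// end up without protection.
function renderNewTabLink(href, label) {
  return `<a href="${escapeHtml(href)}" target="_blank" rel="noopener noreferrer">${escapeHtml(label)}</a>`;
}

console.log(renderNewTabLink('https://example.com/blog', 'Blog'));
// <a href="https://example.com/blog" target="_blank" rel="noopener noreferrer">Blog</a>
```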
It's interesting how such a small parsing mistake can have such a huge impact on the security of an application. Since we have learned how easy it is to parse URLs incorrectly, let's take a look at how hard URL parsing can actually be.
A URL consists of many different parts the client must parse in order to establish a connection to the target server. In fact, URLs are just an easy way for humans to read and create links. Machines have to use a different approach.
This approach starts with the scheme. This is the part of the URL that was missing in the Phabricator case, which is what made that link so problematic. Most URLs look like http://example.com or https://example.com. However, there are many more schemes, such as ftp://, gopher:// or netdoc://. Aside from that, browsers recognise various pseudo-schemes such as javascript: or data:. It is therefore not possible to recognize an external link simply by checking whether it begins with 'https://', 'http://', '//' or '/\', even though schemes like gopher:// and netdoc:// aren't available in browsers.
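Rather than listing prefixes to reject, a check can read the scheme from the parsed URL and accept only known ones. A minimal sketch, assuming the WHATWG URL class; the allowlist contents and the name hasAllowedScheme are choices made for this example.

```javascript
// Schemes this hypothetical application is willing to link to.
const ALLOWED_SCHEMES = new Set(['http:', 'https:']);

// Resolve the link against the current page and inspect the parsed scheme.
function hasAllowedScheme(href, base) {
  try {
    return ALLOWED_SCHEMES.has(new URL(href, base).protocol);
  } catch {
    return false; // unparsable input is rejected
  }
}

const page = 'https://example.com/about/index.html';
console.log(hasAllowedScheme('/blog', page));                 // true
console.log(hasAllowedScheme('javascript:alert(1)', page));   // false
console.log(hasAllowedScheme(' JaVaScRiPt:alert(1)', page));  // false
console.log(hasAllowedScheme('gopher://example.com/', page)); // false
```

The third call shows why string matching is fragile: the parser strips leading whitespace and lowercases the scheme before it is compared.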
As you might have already observed, external links contain a double slash, either immediately after the scheme or at the beginning. There is a simple difference between an internal and an external link.
Let's assume we click one of the following internal links on this website: https://example.com/about/index.html.
| Link Tag | Resulting Absolute URL |
| --- | --- |
| <a href = "/blog">Blog</a> | https://example.com/blog |
| <a href = "company">Company</a> | https://example.com/about/company |
If there is a single slash at the beginning of the URL, the browser will simply replace the current path with the content of the href attribute and open that link. However, if you omit the slash, your browser will append the content of the href attribute to your current folder.
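The same resolution rules can be reproduced with the WHATWG URL class by passing the current page as the base. This is a small sketch; resolveLink is a name chosen for the example.

```javascript
// The page on which the links are clicked.
const currentPage = 'https://example.com/about/index.html';

// Resolve an href against the current page, as a browser does on click.
function resolveLink(href, base = currentPage) {
  return new URL(href, base).href;
}

console.log(resolveLink('/blog'));    // https://example.com/blog
console.log(resolveLink('company'));  // https://example.com/about/company
console.log(resolveLink('../blog'));  // https://example.com/blog
```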
Here is an example of an absolute URL that leads to an external website. This will open https://www.netsparker.com regardless of the URL of the page that contains the link.
| Link Tag | Resulting Absolute URL |
| --- | --- |
| <a href = "https://www.netsparker.com/">Netsparker</a> | https://www.netsparker.com/ |
For HTTP Basic authentication, or for authenticating to an FTP server using your web browser and the ftp:// scheme, it is possible to specify username/password combinations. It may look something like this:
https://username:password@example.com/
This causes a big problem in the case of URL whitelisting that is not implemented correctly. You can simply use a URL like https://example.com@attacker.com/, which will still lead to attacker.com. However, some URL parsers could interpret it as a link to https://example.com.
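The difference between a prefix comparison and a comparison on the parsed host can be sketched as follows. The allowlist and both function names are assumptions made for the example.

```javascript
// Hosts this hypothetical application trusts.
const ALLOWED_HOSTS = new Set(['example.com']);

// Broken: treats the URL as a plain string.
function isAllowedNaive(url) {
  return url.startsWith('https://example.com');
}

// Compares the host that the parser actually extracted.
function isAllowedHost(url) {
  try {
    const parsed = new URL(url);
    return parsed.protocol === 'https:' && ALLOWED_HOSTS.has(parsed.hostname);
  } catch {
    return false;
  }
}

const userinfoUrl = 'https://example.com@attacker.com/';
console.log(new URL(userinfoUrl).username);  // example.com
console.log(new URL(userinfoUrl).hostname);  // attacker.com
console.log(isAllowedNaive(userinfoUrl));    // true
console.log(isAllowedHost(userinfoUrl));     // false
```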
The host part of the URL specifies which server the URL points to. This can be a domain like example.com, or an IPv4 or IPv6 address in different formats, such as 127.0.0.1 or [0:0:0:0:0:0:0:1]. Although classic dotted IP notation is widely accepted, many clients that were developed using the C programming language also accept IP addresses in octal, decimal (a single integer) or hexadecimal format, for example:
http://0177.0.0.1/
http://2130706433/
http://0x7f.0.0.1/
All of them point to localhost. Not only can this be exploited in a client-side attack, but it can also bypass IP blacklists which are designed to protect internal services from Server Side Request Forgery attacks.
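A parser that follows the WHATWG URL standard normalises these alternative notations, which makes it easy to see that they are the same address. The sketch below only shows what that particular parser reports; other libraries may treat the same strings as host names, so the result should not be assumed to carry over.

```javascript
// Return the host in the canonical form the WHATWG parser produces.
function canonicalHost(url) {
  return new URL(url).hostname;
}

console.log(canonicalHost('http://0177.0.0.1/'));          // 127.0.0.1 (octal first octet)
console.log(canonicalHost('http://2130706433/'));          // 127.0.0.1 (single integer)
console.log(canonicalHost('http://0x7f.0.0.1/'));          // 127.0.0.1 (hexadecimal first octet)
console.log(canonicalHost('http://[0:0:0:0:0:0:0:1]/'));   // [::1]
```

Running a blacklist check on the canonical form, rather than on the raw string, closes this particular gap.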
You can specify the Port in a URL by appending it to the domain name with a leading colon like this:
http://example.com:8080/
Most of the time, the value does not have to be specified. It automatically defaults to the standard port of the respective protocol.
| Protocol | Default Port |
| --- | --- |
| https | 443 |
| http | 80 |
| ftp | 21 |
As illustrated in the example above, it's also possible to run a server on a non-standard port. This is where the port part of the URL comes into play.
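When comparing URLs it helps to know that the WHATWG URL class reports an empty port whenever the default for the scheme applies. A minimal sketch; effectivePort and the lookup table are assumptions for this example and cover only the three protocols listed above.

```javascript
// Default ports for the protocols in the table above.
const DEFAULT_PORTS = { 'https:': 443, 'http:': 80, 'ftp:': 21 };

// Return the port a client would actually connect to.
function effectivePort(url) {
  const parsed = new URL(url);
  return parsed.port !== '' ? Number(parsed.port) : DEFAULT_PORTS[parsed.protocol];
}

console.log(effectivePort('http://example.com:8080/'));       // 8080
console.log(effectivePort('https://example.com/'));           // 443
console.log(new URL('https://example.com:443/').port === ''); // true: an explicit default port is dropped
```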
The Path in a URL begins with a '/', and originally referred to the folder structure within the webroot. However, with many modern frameworks and REST style URLs, this is no longer always the case.
The path is not required in order to establish a connection. Instead, it is passed to the web server after the connection has been established and specifies which document the browser wants to retrieve.
It is possible to specify additional parameters in the path. One example is to be found in Java applications, where the JSESSIONID parameter is appended to the URL.
https://example.com/blog.jsp;JSESSIONID=b92e8649b6cf4886241a3e0825bd36a262b24933
We have already established why this is not a good idea. In IIS 6.0 and earlier, the handling of such path parameters was the root cause of file upload vulnerabilities. IIS would treat a file such as shell.jsp;img.jpg as having a jsp extension with an additional parameter called img.jpg, which could easily bypass some blacklists.
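The mismatch can be modelled in a few lines. This is a deliberately simplified sketch, not the actual IIS logic: one function stands in for an upload filter that looks at the text after the last dot, the other for a server that cuts the name at the first semicolon. Both function names are made up.

```javascript
// Upload filter: looks only at the text after the last dot.
function extensionSeenByFilter(filename) {
  return filename.slice(filename.lastIndexOf('.') + 1).toLowerCase();
}

// Simplified model of a server that discards everything from the first
// semicolon onwards before choosing a handler.
function extensionSeenByServer(filename) {
  const beforeSemicolon = filename.split(';')[0];
  return extensionSeenByFilter(beforeSemicolon);
}

const uploadName = 'shell.jsp;img.jpg';
console.log(extensionSeenByFilter(uploadName)); // jpg
console.log(extensionSeenByServer(uploadName)); // jsp
```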
The query part of a URL is where GET parameters are usually located. It begins with a question mark, and can contain key/value pairs. It might look something like this:
https://example.com/blog?action=search&author=Bob
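Reading the key/value pairs is best left to the parser as well. A short sketch using the searchParams property of the WHATWG URL class; the variable names are chosen for the example.

```javascript
const searchUrl = new URL('https://example.com/blog?action=search&author=Bob');

// The raw query string, including the leading question mark.
console.log(searchUrl.search);                       // ?action=search&author=Bob

// Individual parameters, already percent-decoded.
console.log(searchUrl.searchParams.get('action'));   // search
console.log(searchUrl.searchParams.get('author'));   // Bob
console.log(searchUrl.searchParams.get('missing'));  // null
```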
The fragment part of a URL includes everything that follows after the hash symbol. It is different from all the other URL parts, because everything that follows is not sent to the server, but is accessible by JavaScript through document.location.hash, and is also used for some browser features. Clicking a link like the one below will make the browser scroll to an HTML element with the ID 'help' on the same page, should such an element exist:
<a href = "#help">Help</a>
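The split between what goes to the server and what stays in the browser can be sketched like this. requestTarget is a hypothetical helper that rebuilds the part of the URL an HTTP client would put on the request line.

```javascript
// Path plus query: the part of the URL that is sent to the server.
function requestTarget(url) {
  const parsed = new URL(url);
  return parsed.pathname + parsed.search;
}

const helpUrl = 'https://example.com/blog?action=search#help';
console.log(requestTarget(helpUrl));   // /blog?action=search
console.log(new URL(helpUrl).hash);    // #help (available to scripts only)
```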
We have now seen why URL parsing is a highly complex topic. There are lots of different parts to consider, and different libraries may even parse the same URL in different ways.
You can find a few great tricks on how to bypass URL parsers in Michał Zalewski's book, The Tangled Web. One example from the book looks quite complicated, but the information provided above makes it easier to understand. The URL looks like this. Can you guess where this link points to?
http://example.com&gibberish=1234@167772161/
It's easy to assume that it will resolve to example.com. But as we've learned above, it is possible to add a credential part to the URL. This is achieved using an '@' symbol. One drawback of this method is that you can't use an unencoded slash character within the credential part. However, everything after 'http://' and before the '@' character can safely be ignored, since the browser will just remove it if no authentication is required. What we are left with is http://167772161/, which is actually just 10.0.0.1 written as a single decimal integer.
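Both steps can be checked in code: the WHATWG URL class shows which host the URL really targets, and a little arithmetic shows how the single number 167772161 maps onto four octets. ipv4FromInteger is a helper written for this example.

```javascript
// Split a 32-bit number into the four octets of a dotted IPv4 address.
function ipv4FromInteger(n) {
  return [n >>> 24, (n >>> 16) & 255, (n >>> 8) & 255, n & 255].join('.');
}

const puzzleUrl = new URL('http://example.com&gibberish=1234@167772161/');

console.log(puzzleUrl.hostname);          // 10.0.0.1
console.log(ipv4FromInteger(167772161));  // 10.0.0.1, since 167772161 = 10 * 256^3 + 1
```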
The following URLs all resolve to http://example.com:
If you do not closely follow the specifications for URL parsing, filter bypasses can occur. Preventing SSRF vulnerabilities in particular is harder than you might think. You may want to block access to localhost, but nothing stops an attacker from registering a domain like attacker.com and making it point to 127.0.0.1. Just blocking URLs with 127.0.0.1 as the host is therefore not sufficient.
And did you know that an address like 127.123.123.123 points to localhost too? You should always resolve the host of the external service to its IP address before checking it, and take all possible bypasses into consideration. As a rule, you should use whitelisting rather than blacklisting, because mistakes are bound to happen if you aren't aware of all the ways attackers can bypass your blacklist. If your code simply can't work without a blacklist, for example because you want to retrieve data from a user-supplied external URL, keep in mind that code like this is still vulnerable, even if you have a perfect blacklist that takes every possibility into consideration:
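// The host name is user-supplied, so the attacker controls its DNS records.
host = 'attacker.com';
// Time of check: the host name is resolved here.
ip = getIpOf(host);
if (isBlacklisted(ip) === false) {
    // Time of use: sendRequest is given the host name, not the checked IP.
    response = sendRequest(host);
    ...
} else {
    throwError('forbidden URL detected');
}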
While it may look like a secure way to prevent SSRF, it is still prone to DNS rebinding. sendRequest will most likely issue its own DNS request before it establishes a connection to the remote server. An attacker can answer the first DNS request with a harmless IP, then answer the second one with 127.0.0.1. This is also known as a Time of Check to Time of Use (TOCTOU) problem.
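One common way to close this gap is to resolve the host name once and then connect to exactly the address that was checked. The sketch below illustrates the idea under stated assumptions: resolve and send are hypothetical functions injected by the caller (standing in for a DNS lookup and an HTTP client that connects to a given IP), and isLoopbackIPv4 covers only the 127.0.0.0/8 range mentioned above, not private ranges or IPv6.

```javascript
// All of 127.0.0.0/8 is loopback, not just 127.0.0.1.
function isLoopbackIPv4(ip) {
  return /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(ip);
}

// resolve(host) and send({ ip, host }) are supplied by the caller; both are
// placeholders for whatever DNS and HTTP libraries the application uses.
async function fetchPinned(host, { resolve, send, isBlocked = isLoopbackIPv4 }) {
  const ip = await resolve(host); // the only DNS lookup
  if (isBlocked(ip)) {
    throw new Error('forbidden URL detected');
  }
  // Connect to the address that was checked. The host name is passed along
  // only for things like the Host header, never for a second lookup.
  return send({ ip, host });
}

// Demo with stubs: a resolver that changes its answer after the first query.
let lookups = 0;
const rebindingResolve = async () => (++lookups === 1 ? '203.0.113.10' : '127.0.0.1');
const describeSend = async ({ ip, host }) => `connected to ${ip} for ${host}`;

fetchPinned('attacker.com', { resolve: rebindingResolve, send: describeSend })
  .then(console.log); // connected to 203.0.113.10 for attacker.com
```

Because the request is sent to the pinned IP, a second, different DNS answer never comes into play.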
The moral of the story is that even if you parse the URL correctly, you still need to take care of pitfalls that arise with the use of its respective parts.