Security & Compliance May 11, 2018 by Alex Karpinsky

Vulnerability in HTML Design, the <script> Tag

TL;DR

This article covers one of the ways to avoid the by-design <script> HTML element vulnerability.

Long story short, unlike any other HTML tag, <script> imposes different rules for escaping its content. Proper escaping is unreasonably difficult and can even be impossible under certain circumstances.

The “escaping problem” often makes <script> a source of vulnerabilities.

Instead of going with uncertain rules, I propose using the <safescript> element which follows HTML guidelines on escaping via HTML entities.

🔖 Now, read on to check out all the details.


Introduction

There was a time when most of the web pages were manually crafted by programmers. These days, a greater part of code delivered to client browsers is either generated or processed by some robots. Those robots and building blocks they use should be reliable and secure.

We’re in 2018, and an HTML element used by every developer contains vulnerabilities at its core.

To begin with, I want to briefly describe how the HTML parser works. HTML is a hypertext markup language, and if you want to properly “speak” that language, you should follow the specs. Otherwise, the one “listening” to you just won’t understand what you’ve got to say. Now, let’s take a look at the example with HTML attributes.

https://gist.github.com/uploadcare-user/b882137b0d9976da75c243344a3a68e3

In the above example, the element tagname has the single attribute named attributename. The attribute’s name is followed by the equal sign. After the sign, there goes a value surrounded by double quotes.

This will change if we put "LLC "Horns and Hoofs"." in the value. The element will then have four attributes: attributename with "LLC " in its value and three additional ones named: Horns, and, Hoofs".". All with empty values.

https://gist.github.com/uploadcare-user/0cebdfd46fa294271cad33cf30950ce2

The HTML specification allows you to escape special symbols: make a parser read them as just characters. Regarding quotes, you can use the &quot; symbol sequence. Such sequences are called HTML entities.

https://gist.github.com/uploadcare-user/c4785a6ae8840c463d3f44b8f5f8d8b6

With that in mind, if you had the &quot; symbol sequence in your initial string and didn’t want the parser to interpret it as a quote, you could go with &amp; instead of just &, i.e., &amp;quot;.

Thus, the transformation of our input string into the output is consistent and reversible. So, we can read and write any data as attribute values without any introspection into their actual content. You follow the rules and everything works out fine and dandy. The end.
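
A minimal sketch of that round trip in JavaScript. The helper names escapeAttr and unescapeAttr are made up for illustration, and only & and the double quote are handled, which should be enough for a double-quoted attribute value.

```javascript
// Hypothetical helpers for illustration, not part of any library.
function escapeAttr(value) {
  // Escape & first so the & of the entities we add is not escaped again.
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

function unescapeAttr(escaped) {
  // Decode &amp; last so "&amp;quot;" becomes the text "&quot;", not a quote.
  return escaped.replace(/&quot;/g, '"').replace(/&amp;/g, '&');
}

const company = 'LLC "Horns and Hoofs".';
const html = '<tagname attributename="' + escapeAttr(company) + '"></tagname>';

console.log(html);
// <tagname attributename="LLC &quot;Horns and Hoofs&quot;."></tagname>
console.log(unescapeAttr(escapeAttr(company)) === company); // true
console.log(escapeAttr('&quot;')); // &amp;quot;
```

The order of the replacements is what makes the transformation reversible: neither function needs to know anything about the data it processes.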

Most of the formats we encounter work in a similar fashion: there is some syntax, a way to escape content from it, and a way to avoid the so-called “escape characters” from being parsed as special symbols. As stated above, it’s true for most of the formats, but not…


The <script> tag

<script> serves the purpose of embedding code fragments written in other languages in HTML. As of today, in 99% of cases that would be JavaScript. The embedded script starts right after the opening <script> tag and ends right before the closing one, </script>. The HTML parser does not interpret what sits between the tags; it passes the contents to a JS parser as is.

In turn, JavaScript is an independent and self-sufficient programming language. It wasn’t designed in any specific manner to be embedded in HTML. It’s got its string literals that can hold whatever content. And, as you may have already guessed, there might be a sequence of characters that stands for the closing </script> tag.

https://gist.github.com/uploadcare-user/0550a58aefe67d0b0d2059f0cae02741

What should happen here is that the variable s gets assigned some harmless string value. What actually happens is that the script where we declare s terminates at var s = "surprise!. This raises a syntax error. All further text is interpreted as plain HTML, including any injected markup. In our case, there is a new opening <script> tag that executes some malicious code.

We now have the same effect as if there was a double quote in the HTML attribute value. You might think of using HTML entities, but they wouldn’t help in this case.

https://gist.github.com/uploadcare-user/090afb423fb9a8cc0f1b526bbcf25996

Taking into consideration how the HTML parser works within <script>, your string now holds HTML entities, which means the contained data were altered.

In contrast with the quote that can be escaped from the string, the <script> tag does not provide any way to escape its content. The HTML standard itself states there should be no “</script>” symbol sequence within the <script> tag. However, the JavaScript specification does not forbid using such character sequences in string literals.

The result here is counterintuitive: after embedding a valid JS in a valid HTML on a valid basis, we get an invalid result.
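
The mismatch can be checked without a browser. In the sketch below, the payload is an assumed example modeled on the description above, parsesAsJs is a hypothetical helper, and cutting at the first </script> is only a rough model of what the HTML parser does.

```javascript
// Assumed payload: a harmless-looking string literal holding a closing tag.
const js = 'var s = "surprise!</script><script>alert(1)</script>";';

// Returns true when `source` parses as a JavaScript function body.
function parsesAsJs(source) {
  try {
    new Function(source); // compiles the code but never runs it
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Rough model of the HTML side: the script ends at the first "</script>".
const firstScript = js.slice(0, js.indexOf('</script>'));

console.log(parsesAsJs(js));          // true
console.log(firstScript);             // var s = "surprise!
console.log(parsesAsJs(firstScript)); // false
```

The JavaScript is valid on its own, yet the part that actually reaches the JS parser is not.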

That’s the HTML markup vulnerability I’m talking about; it leads to some real issues in existing applications.


Exploiting the vulnerability

No doubt, it’s hard to imagine missing the problem when you write code by hand and put </script> in it. At the bare minimum, your syntax highlighting will show the tag closing earlier than expected. Or the code just won’t run properly, and you will have to spend some time fixing it. So, that’s not where the real problem sits.

Modern app development often involves dynamically generating HTML, including <script> content. Here is a code snippet you can frequently encounter in apps using Redux with server-side rendering:

https://gist.github.com/uploadcare-user/814b100415137c2241c3892c073031e2

</script> may appear in any position within InitialState where you get data from users or other systems. JSON.stringify() won’t alter such strings on serialization: they are fully compliant with both the JSON and JavaScript specs. Thus, such strings will make their way into your page and allow an intruder to execute any JS code in the user’s browser. Here is another example:

https://gist.github.com/uploadcare-user/e3bbeb400140af113c62fc546142e2fb

In the example above, we get user id and referer written into strings. A template processor will then escape the values in line with the JS specs. And, while user id will almost certainly contain nothing but digits, an intruder might insert the </script> tag into referer.
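
To make the JSON.stringify() point concrete, here is a small sketch. The state shape, the window.__INITIAL_STATE__ name, and the payload are assumptions for illustration, not taken from the snippets above.

```javascript
// Assumed state: the referer value comes from the request and is attacker-controlled.
const initialState = {
  user: { id: 42, referer: 'https://example.com/</script><script>alert(1)</script>' },
};

const json = JSON.stringify(initialState);
const page = '<script>window.__INITIAL_STATE__ = ' + json + ';</script>';

// The serialized JSON is valid and lossless, yet it still contains the closing tag.
console.log(json.includes('</script>')); // true
console.log(JSON.parse(json).user.referer === initialState.user.referer); // true

// The page now holds three closing tags instead of the single intended one.
console.log(page.split('</script>').length - 1); // 3
```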

The fun has only just begun with the closing </script> tag. Another implication concerns the opening <script> tag when the <!-- sequence appears somewhere before it. In HTML, that usually starts a multi-line comment. Syntax highlighting won’t be of much help here either. Now, take a look at the following snippet:

https://gist.github.com/uploadcare-user/d735f4e2ca5d3344b6202b29c943601d

What an ordinary person sees here is two <script> tags and a paragraph of text. Now, what about the weird HTML parser? It sees just a single unclosed <script> tag holding everything from the second line to the very end.

I can’t say I completely understand why it works this way, but once it encounters <!--, the HTML parser starts counting the opening and closing <script> tags and does not consider the script terminated until all of the opened scripts are closed. Thus, in most cases, a script will last until the very end of the page; well, unless someone happened to inject another closing </script> using another vulnerability, top kek. If you haven’t seen that yourself, you might even think I was joking. However, I wasn’t. Here’s the DOM tree screenshot:

The worst thing here is that even though </script> in JavaScript can only be encountered in string literals, <!-- and <script> can sit somewhere in the code and have the same effect:

https://gist.github.com/uploadcare-user/f32e48fa8832b6b47d1d374517f8f7fb
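
The parser behavior described in this section can be approximated with a small state machine. This is a simplified sketch based on my reading of the script data states in the HTML Standard's tokenizer, not a full implementation; findScriptEnd and the mode names are made up for illustration. As far as I can tell, the parser tracks a single extra level rather than a real counter: <!-- switches it to an "escaped" mode, <script inside that mode switches it to a "double escaped" mode where </script does not end the element, and --> resets everything.

```javascript
// `text` is everything after the opening <script> tag. Returns the index where
// the closing tag starts, or -1 if the script runs to the end of the input.
function findScriptEnd(text) {
  const at = (i, token) => text.slice(i, i + token.length).toLowerCase() === token;
  // A tag name only counts when followed by whitespace, "/" or ">".
  const tagAt = (i, name) =>
    at(i, name) && /[\t\n\f\r \/>]/.test(text.charAt(i + name.length));

  let mode = 'normal';
  for (let i = 0; i < text.length; i++) {
    if (mode !== 'normal' && at(i, '-->')) {
      mode = 'normal';
      i += 2;
    } else if (mode === 'normal' && at(i, '<!--')) {
      mode = 'escaped';
      i += 1; // skip only "<!" so that "<!-->" still resets right away
    } else if (mode === 'escaped' && tagAt(i, '<script')) {
      mode = 'doubleEscaped';
    } else if (mode === 'doubleEscaped' && tagAt(i, '</script')) {
      mode = 'escaped'; // this closing tag is swallowed
    } else if (mode !== 'doubleEscaped' && tagAt(i, '</script')) {
      return i;
    }
  }
  return -1;
}

console.log(findScriptEnd('var a = 1;</script><p>text</p>')); // 10

// Assumed markup: a string literal with "<!--<script>", then a second script.
const page = 'var s = "<!--<script>";</script><p>text</p><script>var t = 1;</script>';
console.log(findScriptEnd(page)); // -1
```

In the second call, each </script> only undoes the <script> seen before it, so the element never ends within the input.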

Erm, are you really a specification?

Not only do the HTML specs forbid you from using valid symbol sequences in the <script> tag and provide no way to escape those within the scope of HTML, they also state the following:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape <!-- as <\!--, <script as <\script, and </script as <\/script when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions.

Now, that recommendation makes at least three naive assumptions about how we use HTML:

1. In a script you embed (which is not necessarily JavaScript), such symbol sequences can either sit strictly within script literals or can easily be avoided in the language syntax.

2. In the embedded script, you can escape the symbol sequences, and this will not alter their syntax meaning.

3. Someone embedding a script knows what that script is, understands its constructs, and can properly mutate it.

While the first two are okay with JavaScript, the third point isn’t.

Scripts are not always embedded by a skilled person. Embedding is often handled by HTML generators.

For instance, here’s an example of how a browser is unable to handle it:

https://gist.github.com/uploadcare-user/5ab45a2147ee7752804960cc86011e5d

As you can see, the serialized string was not parsed into an element equivalent to the original. Transforming a DOM-tree to HTML text is not consistent and reversible in the general case. Some DOM trees just cannot be interpreted as their source HTML.


Nobody likes problems. Avoid problems

As you have already come to understand, there is no safe way of embedding JavaScript in HTML. However, there is a way of making JavaScript safe for embedding in HTML (now, hold on for a moment and feel the difference).

Of course, the latter would bind you to be extremely cautious when writing stuff in your <script>. Especially when you insert something via a template processor.

The truth is, the chance of encountering <!-- and <script> in your source code is pretty low, even in its minified version. You probably won’t write something like that; and, if an intruder manages to inject something into your <script>, those sequences will be the least of your worries.

There still exists the problem of injecting symbols in strings. In such a case, you follow the specs: escape everything as stated. However, the problem is after you do JSON.stringify(), you won’t want to parse the output again and find all string literals to escape stuff. Also, I wouldn’t advise using third-party serialization packages that consider the problem: cases may vary, and you want to be safe at all times. Thus, I would advise escaping < with a Unicode escape sequence after serialization. Such symbols can’t be encountered anywhere in JSON but within string literals, so simply replacing symbols would be safe enough.

https://gist.github.com/uploadcare-user/ffc34c20d3e4fe0b9255285cf5cfbae9
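
The same idea as a minimal standalone sketch. The helper name serializeForScript and the sample payload are assumptions for illustration.

```javascript
// Hypothetical helper: serialize data for embedding inside a <script> element.
function serializeForScript(data) {
  // In JSON output, "<" can only occur inside string literals,
  // where the escape \u003c denotes the very same character.
  return JSON.stringify(data).replace(/</g, '\\u003c');
}

const data = { referer: '</script><script>alert(1)</script><!--' };
const safe = serializeForScript(data);

console.log(safe);
// {"referer":"\u003c/script>\u003cscript>alert(1)\u003c/script>\u003c!--"}
console.log(JSON.parse(safe).referer === data.referer); // true
```

Since no < is left in the output, none of the </script, <script, or <!-- sequences can occur, while the parsed data stays identical.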

You may want to escape < via HTML entities. This helps you get rid of the vulnerability, but your data are now spoiled. Hence, you should choose the right way of escaping for every encountered case, and that’s a hassle.

You can also escape individual strings in the same manner. Another bit of advice is about not embedding anything via a <script> tag. Store your data in places where escape transformations are predictable and reversible. Like, in other elements’ attributes. However, this approach lacks visual clarity and only works for strings: JSON has to be parsed separately.

https://gist.github.com/uploadcare-user/6543815cd0a79b9e1ca1aa925144c5b1
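
A sketch of the attribute approach for JSON data. The helper name toDataAttribute, the element id, and the data-state attribute are assumptions. In a browser, the HTML parser decodes the entities, so JSON.parse(document.getElementById('state').dataset.state) would be enough; the manual decoding below only imitates that step so the example runs outside a browser.

```javascript
// Hypothetical helper: escape text for a double-quoted HTML attribute value.
function toDataAttribute(value) {
  return value
    .replace(/&/g, '&amp;') // must go first
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;');
}

const state = { referer: '</script><script>alert("hi")</script>' };
const html =
  '<div id="state" data-state="' + toDataAttribute(JSON.stringify(state)) + '"></div>';

// Imitate the browser: take the raw attribute value and decode the entities.
const raw = /data-state="([^"]*)"/.exec(html)[1];
const decoded = raw
  .replace(/&quot;/g, '"')
  .replace(/&lt;/g, '<')
  .replace(/&amp;/g, '&'); // must go last

console.log(html.includes('</script>')); // false
console.log(JSON.parse(decoded).referer === state.referer); // true
```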

In case, despite all the efforts, you are still afraid of being hacked, you can forbid executing any scripts but those you allow explicitly. To do so, add the nonce (number used once) attribute holding some unique value and a special header that forbids executing scripts without the attribute.

Then, even if an intruder happens to inject a malicious script into your page, the script will not be executed. This is called Content-Security-Policy.
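
A minimal sketch of the nonce approach for a Node.js server. The renderWithNonce helper and the shape of its return value are assumptions; the script-src 'nonce-…' directive itself is standard Content-Security-Policy syntax.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: build a response whose inline script carries a fresh nonce.
function renderWithNonce(scriptSource) {
  // The nonce must be unpredictable and generated anew for every response.
  const nonce = crypto.randomBytes(16).toString('base64');
  return {
    headers: { 'Content-Security-Policy': "script-src 'nonce-" + nonce + "'" },
    body: '<script nonce="' + nonce + '">' + scriptSource + '</script>',
  };
}

const response = renderWithNonce('console.log("allowed");');
console.log(response.headers['Content-Security-Policy']);
console.log(response.body);
```

A <script> element injected by an intruder would not carry the current nonce, so a browser that enforces the policy should refuse to run it.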

In the end, if you want to comfortably develop web apps and not wander around minefields, you need a reliable way of embedding scripts in your HTML. I propose dumping the <script> tag entirely, as the unsafe one.


The <safescript> tag

Let’s be honest here; we could abandon embedded scripts completely. But what next? Always loading external scripts is not an option: it’s pretty convenient to have scripts and their data in a single HTML file. You then need fewer HTTP requests and server-side routes.

What I suggest is implementing a separate tag: <safescript>. All the content of <safescript> would follow the HTML specs: we get fully working HTML entities for escaping char sequences thus making any embedded script safe.

https://gist.github.com/uploadcare-user/b3b4a000b4aa78e0c91a611fa2affda9

The code within <safescript> may look a bit unusual, but that’s what will sit in your HTML. You can add a simple filter to your template processor that will insert the tag and escape every needed char sequence. Here’s what the code might look like in Django:

https://gist.github.com/uploadcare-user/9df3b01a2a337e6c82bbe86777560810
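
For comparison, a similar filter sketched in JavaScript. The toSafeScript name is hypothetical, and the set of escaped characters is an assumption about what such a filter needs; it is not a port of the Django code above.

```javascript
// Hypothetical filter: wrap a script in the proposed <safescript> element.
function toSafeScript(source) {
  const escaped = source
    .replace(/&/g, '&amp;') // first, so existing entities in the source survive
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return '<safescript>' + escaped + '</safescript>';
}

const source = 'var s = "surprise!</script><script>alert(1)</script>";';
console.log(toSafeScript(source));
// <safescript>var s = "surprise!&lt;/script&gt;&lt;script&gt;alert(1)&lt;/script&gt;";</safescript>
```

Because the element's content follows ordinary HTML rules, reading it back as text restores the original source character for character.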

There’s no need to wait until browsers support <safescript>: I made a simple polyfill that just works. Here is how you implement it:

https://gist.github.com/uploadcare-user/8bda85e8e646198913b4eb476a5a472c

Conclusion

Embedding scripts in HTML is tricky. Most of the time, you need to be very careful. And there are cases when embedding scripts in HTML should be avoided altogether; that mostly concerns dynamically generated HTML.

However, you can use the proposed <safescript> tag to embed any script. This approach lets you forget about escaping in JavaScript and avoid many vulnerabilities. It’d be cool to add <safescript> to the general HTML specification or devise some other way of handling the problem of embedding scripts in HTML.
