This article covers one of the ways to avoid the by-design
<script> HTML element vulnerability.
Long story short, unlike any other HTML tag, <script> has its own rules for escaping its content. Proper escaping is unreasonably difficult and can even be impossible under certain circumstances.
The “escaping problem” often makes <script> a source of vulnerabilities.
Instead of going with uncertain rules, I propose using the
<safescript> element which follows HTML guidelines on escaping via HTML entities.
🔖 Now, read on to check out all the details.
There was a time when most web pages were manually crafted by programmers. These days, most of the code delivered to client browsers is either generated or processed by robots. Those robots and the building blocks they use should be reliable and secure.
We’re in 2018, and an HTML element used by every developer contains vulnerabilities at its core.
To begin with, I want to briefly describe how the HTML parser works. HTML is a hypertext markup language, and if you want to properly “speak” that language, you should follow the specs. Otherwise, the one “listening” to you just won’t understand what you’ve got to say. Now, let’s take a look at an example with HTML attributes:
<tagname attributename="value"></tagname>
In the above example, the element tagname has a single attribute named attributename. The attribute’s name is followed by the equal sign. After the sign comes the value, surrounded by double quotes.
This will change if we put "LLC "Horns and Hoofs"." in the value. The element will then have four attributes: attributename with "LLC " in its value, and three additional ones, all with empty values. Their names are Horns, and, Hoofs".".
The HTML specification allows you to escape special symbols, that is, to make the parser read them as plain characters. For quotes, you can use the &quot; symbol sequence. Such sequences are called HTML entities.
With that in mind, if you had the &quot; symbol sequence in your initial string and didn’t want the parser to interpret it as a quote, you could write &amp; instead of just &, which gives &amp;quot;.
Thus, the transformation of our input string into the output is consistent and reversible. So, we can read and write any data as attribute values without any introspection into their actual content. You follow the rules and everything works out fine and dandy. The end.
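The round trip described above can be sketched in a few lines of JavaScript. The helper names escapeAttribute and unescapeAttribute are made up for this illustration, and only the two characters that matter inside a double-quoted attribute value are handled.

```javascript
// Hypothetical helpers for illustration; not part of any library.
function escapeAttribute(text) {
  // "&" goes first, so a literal "&quot;" in the input becomes "&amp;quot;".
  return text.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

function unescapeAttribute(text) {
  // Reverse order: "&amp;" is decoded last, so "&amp;quot;" yields "&quot;", not a quote.
  return text.replace(/&quot;/g, '"').replace(/&amp;/g, "&");
}

const company = 'LLC "Horns and Hoofs".';
const markup = `<tagname attributename="${escapeAttribute(company)}"></tagname>`;

console.log(markup);
// <tagname attributename="LLC &quot;Horns and Hoofs&quot;."></tagname>
console.log(unescapeAttribute(escapeAttribute(company)) === company); // true
console.log(unescapeAttribute(escapeAttribute("&quot;")) === "&quot;"); // true
```

Whatever string goes in comes back unchanged, which is exactly the property the rest of the article shows <script> to be lacking.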
Most of the formats we encounter work in a similar fashion: there is some syntax, a way to escape content from it, and a way to keep the so-called “escape characters” themselves from being parsed as special symbols. As stated above, it’s true for most of the formats, but not for <script>.
The content of a script begins right after the opening <script> tag and ends right before the closing one, </script>. The HTML parser does not interpret anything in between. It simply passes the contents to a JS parser.
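To make this concrete, here is a rough model of how the end of a script is found. The page fragment and its payload are assumptions made up for this sketch, and firstScriptBody is a hypothetical helper that ignores the <!-- special cases covered later.

```javascript
// Hypothetical page fragment; the payload is an illustrative assumption.
const page = `<script>
var s = "surprise!</script><script>alert('whoops!')</script>";
</script>`;

// Simplified model: the script ends at the first "</script" followed by
// whitespace, "/" or ">", no matter what the JS around it looks like.
function firstScriptBody(source) {
  const open = source.toLowerCase().indexOf("<script>");
  if (open === -1) return null;
  const rest = source.slice(open + "<script>".length);
  const close = rest.search(/<\/script[\s/>]/i);
  return close === -1 ? rest : rest.slice(0, close);
}

console.log(JSON.stringify(firstScriptBody(page)));
// "\nvar s = \"surprise!"
```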
What should be happening here is the variable s being assigned some harmless string value. What actually happens is that the script declaring s ends at var s = "surprise!, which raises a syntax error. Everything after that point is interpreted as plain HTML, including any injected markup. In our case, there is a new opening <script> tag that executes some malicious code.
We now have the same effect as if there was a double quote in the HTML attribute value. You might think of using HTML entities, but they wouldn’t help in this case.
Taking into consideration how the HTML parser works within
<script>, your string now holds HTML entities, which means the contained data were altered.
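A small sketch of that effect: since the text inside <script> reaches the JS engine without entity decoding, an entity-escaped literal evaluates to a different string. Evaluating the literal with new Function stands in for the browser running the script; the sample value is an assumption.

```javascript
const original = "surprise!</script>";

// Escape the way you would for ordinary HTML text.
const entityEscaped = original
  .replace(/&/g, "&amp;")
  .replace(/</g, "&lt;")
  .replace(/>/g, "&gt;");

// Script text is handed to the JS engine as-is, so the engine
// evaluates the literal with the entities still in place.
const seenByScript = new Function(`return ${JSON.stringify(entityEscaped)};`)();

console.log(seenByScript); // surprise!&lt;/script&gt;
console.log(seenByScript === original); // false
```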
Unlike a quote in an attribute value, which can be escaped, the <script> tag does not provide any way to escape its content. The HTML standard itself states there should be no “</script>” symbol sequence within the <script> element.
The result here is counterintuitive: after embedding a valid JS in a valid HTML on a valid basis, we get an invalid result.
That’s the HTML markup vulnerability I’m talking about; it leads to some real issues in existing applications.
Exploiting the vulnerability
Of course, if you write code by hand and put </script> in it, you will hardly miss the problem. At the bare minimum, your syntax highlighting will show that the tag closed earlier than expected. Or the code just won’t execute properly, and you will have to spend some time fixing it. So that’s not where the real problem lies.
Modern app development often involves dynamically generating HTML, including <script> content. Here is a code snippet you can frequently encounter in apps using Redux with server-side rendering:
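A minimal sketch of this pattern, under stated assumptions: renderPage is a hypothetical helper, window.__INITIAL_STATE__ is just a common naming convention, and the state shape (a user id and a referer) is invented for the demonstration.

```javascript
// Hypothetical server-side template; all names here are assumptions.
function renderPage(initialState) {
  return [
    "<!doctype html>",
    '<div id="root"></div>',
    `<script>window.__INITIAL_STATE__ = ${JSON.stringify(initialState)};</script>`,
  ].join("\n");
}

const state = {
  user: { id: "42" },
  // Attacker-controlled value, e.g. taken from a request header.
  referer: "https://example.com/</script><script>alert('whoops!')</script>",
};

const html = renderPage(state);

// JSON.stringify escapes quotes and backslashes but leaves "<" and "/" alone,
// so the injected closing tag reaches the page verbatim.
console.log(html.includes("</script><script>alert('whoops!')")); // true
```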
</script> may appear in any position within
InitialState where you get data from users or systems.
In the example above, we get user id and referer written into strings. A template processor will then escape the values in line with the JS specs. And, while user id will almost certainly contain nothing but digits, an intruder might insert the </script> tag into referer.
The fun has only just begun with the closing </script> tag. Another problem involves the opening <script> tag when the <!-- combination appears somewhere before it. In HTML, that usually starts a multi-line comment. Syntax highlighting won’t be of much help here either. Now, take a look at the following snippet:
What a regular person sees here is two <script> tags and a paragraph of text. Now, what about the weird HTML parser? It sees just a single unclosed <script> tag holding everything from the second line to the very end.
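The parser’s view can be approximated with a small model of the tokenizer’s script states. This is a simplified sketch, not a full tokenizer: findScriptEnd is a hypothetical helper that takes the text following an opening <script> tag, and the sample markup is an assumption made up to match the description above.

```javascript
// Simplified model of the HTML tokenizer's script-data states.
// Returns the index where the script text ends, or -1 if it never ends.
function findScriptEnd(text) {
  const isTag = (name, i) =>
    text.slice(i, i + name.length).toLowerCase() === name &&
    /[\t\n\f\r />]/.test(text.charAt(i + name.length));
  let state = "data";
  for (let i = 0; i < text.length; i++) {
    if (state === "data" && text.startsWith("<!--", i)) {
      state = "escaped";
      i += 1; // skip "<!" only, so that "<!-->" can still close immediately
    } else if (state !== "data" && text.startsWith("-->", i)) {
      state = "data";
      i += 2;
    } else if (isTag("</script", i)) {
      if (state !== "doubleEscaped") return i; // a real end of the script
      state = "escaped"; // this closing tag only undoes the inner "<script"
    } else if (state === "escaped" && isTag("<script", i)) {
      state = "doubleEscaped";
    }
  }
  return -1;
}

// Hypothetical text following the first opening <script> tag.
const afterOpeningTag = `
  var example = 'Consider this string: <!-- <script>';
</script>
<p>Just a paragraph of text.</p>
<script>
  console.log("second script");
</script>`;

console.log(findScriptEnd(afterOpeningTag)); // -1
console.log(findScriptEnd("var a = 1;</script><p>")); // 10
```

In the first call, every closing tag is consumed as script text, so the script runs to the end of the input.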
I can’t say I completely understand why it works this way, but once it encounters <!-- the HTML parser starts tracking the opening and closing <script> tags: after an opening <script>, the next closing </script> is treated as part of the script text instead of its end. Thus, in most cases, a script will last until the very end of the page; well, unless someone happened to inject another closing </script> using another vulnerability, top kek. If you haven’t seen that yourself, you might even think I was joking. However, I wasn’t. Here’s the DOM tree screenshot:
The worst thing here is that <!-- and <script> do not even have to be inside a string: they can sit somewhere in the code and have the same effect:
Erm, are you really a specification?
The HTML specs forbid you from using otherwise valid symbol sequences in the <script> tag and do not provide you with any way to escape those within the scope of HTML. On top of that, the following is stated there:
The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape <!--, <script, and </script when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions.
Now, that recommendation makes at least three naive assumptions about how we use HTML:
2. In the embedded script, you can escape the symbol sequences, and this will not alter their syntax meaning.
3. Someone embedding a script knows what that script is, understands its constructs, and can properly mutate it.
Scripts are not always embedded by a skilled person. Embedding is often handled by HTML generators.
For instance, here’s an example of how a browser is unable to handle it:
As you can see, the serialized string was not parsed into an element equivalent to the original. Transforming a DOM-tree to HTML text is not consistent and reversible in the general case. Some DOM trees just cannot be interpreted as their source HTML.
Nobody likes problems. Avoid problems
Of course, the latter would require you to be extremely cautious when writing stuff in your <script>, especially when you insert something via a template processor.
The truth is that the chance of encountering <script> in your source code is pretty low, even in its minified version. You probably won’t code something like that; and, if an intruder happens to inject something into your <script>, that will be the least of your worries.
There still exists the problem of injecting symbols into strings. In such a case, you follow the specs: escape everything as stated. However, the problem is that after calling JSON.stringify(), you won’t want to parse the output again to find all the string literals and escape stuff in them. Also, I wouldn’t advise relying on third-party serialization packages that take the problem into account: cases may vary, and you want to be safe at all times. Thus, I would advise escaping < with a Unicode escape sequence after serialization. This symbol can’t be encountered anywhere in JSON but within string literals, so simply replacing it is safe enough.
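That advice fits in a couple of lines. The sketch below assumes the value is JSON-serializable; serializeForScript is a name chosen for this illustration.

```javascript
// Serialize a JSON-compatible value so it can be placed inside <script>.
// Assumes JSON.stringify returns a string (i.e. the value is not undefined).
function serializeForScript(value) {
  // In JSON, "<" can only occur inside string literals, where \u003c means the same.
  return JSON.stringify(value).replace(/</g, "\\u003c");
}

const state = { referer: "</script><script>alert('whoops!')</script>" };
const literal = serializeForScript(state);

console.log(literal);
// {"referer":"\u003c/script>\u003cscript>alert('whoops!')\u003c/script>"}
console.log(JSON.parse(literal).referer === state.referer); // true
```

The output contains no < at all, so neither </script nor <!-- nor <script can appear in it, and the data reads back unchanged.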
You may want to escape
< via HTML entities. This helps you get rid of the vulnerability, but your data are now spoiled. Hence, you should choose the right way of escaping for every encountered case, and that’s a hassle.
You can also escape individual strings in the same manner. Another bit of advice is not to embed anything via a <script> tag at all. Store your data in places where escape transformations are predictable and reversible. Like, in other elements’ attributes. However, this lacks visual clarity and only works for strings: JSON would have to be parsed separately.
In case, despite all the efforts, you are still afraid of being hacked, you can forbid executing any scripts but those you allow explicitly. To do so, add the nonce (number used once) attribute holding some unique value to your scripts, along with a special header that forbids executing scripts without the attribute.
Then, even if an intruder happens to inject a malicious script into your page, the script will not be executed. This is called Content Security Policy (CSP).
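As a rough sketch of the nonce approach in Node.js: renderWithNonce is a hypothetical helper, and wiring the header value into an actual response is left to whatever server framework is in use.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: produces the header value and the markup for one response.
// A fresh nonce must be generated for every response.
function renderWithNonce(inlineScript) {
  const nonce = crypto.randomBytes(16).toString("base64");
  return {
    // Send as the value of the Content-Security-Policy response header.
    header: `script-src 'nonce-${nonce}'`,
    html: `<script nonce="${nonce}">${inlineScript}</script>`,
  };
}

const response = renderWithNonce("console.log('allowed');");
console.log(response.header);
console.log(response.html);
```

An injected <script> without the matching nonce value is then refused by the browser, although the surrounding markup can still be broken by the injection.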
In the end, if you want to comfortably develop web apps and not wander around minefields, you need a reliable way of embedding scripts in your HTML. I propose dumping the <script> tag entirely as unsafe.
The <safescript> tag
Let’s be honest here: we can abandon embedded scripts completely. But what next? Always loading external scripts is not an option: it’s pretty convenient to have scripts and their data in a single HTML file. You then have fewer HTTP requests and server-side routes.
What I suggest is implementing a separate tag:
<safescript>. All the content of
<safescript> would follow the HTML specs: we get fully working HTML entities for escaping char sequences thus making any embedded script safe.
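On the generating side, the escaping could be as small as the sketch below. Both function names are hypothetical, and decodeEntities only models the three entities used here, standing in for what the HTML parser does when it reads the element’s text.

```javascript
// Hypothetical generator-side helper: wrap a script in <safescript>.
function renderSafeScript(source) {
  const escaped = source
    .replace(/&/g, "&amp;") // first, so existing entities in the source survive
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<safescript>${escaped}</safescript>`;
}

// Rough stand-in for the parser decoding the element's text content.
function decodeEntities(text) {
  return text.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&amp;/g, "&");
}

const source = 'var s = "surprise!</script><script>alert(\'whoops!\')</script>";';
const markup = renderSafeScript(source);

console.log(markup.includes("</script>")); // false
const inner = markup.slice("<safescript>".length, -"</safescript>".length);
console.log(decodeEntities(inner) === source); // true
```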
The code within <safescript> may look a bit unusual, but that’s what will sit in your HTML. You can add a simple filter to your template processor that will insert the tag and escape every needed char sequence. Here’s what the code might look like in Django:
There’s no need to wait until browsers support <safescript>: I made a simple polyfill that just works. Here is how you implement it:
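One way such a polyfill could work is sketched below. This is an illustration of the general idea and not necessarily the author’s implementation: activateSafeScripts is a hypothetical name, and the document is passed in as a parameter so the function can be exercised outside a browser.

```javascript
// Replace every <safescript> element with a real <script> holding the same text.
// By the time this runs, the HTML parser has already decoded the entities,
// so textContent is the original script source.
function activateSafeScripts(doc) {
  for (const node of Array.from(doc.querySelectorAll("safescript"))) {
    const script = doc.createElement("script");
    script.textContent = node.textContent;
    // A script element inserted through the DOM is executed on insertion.
    node.parentNode.replaceChild(script, node);
  }
}

// In a browser, this could be wired up as:
// document.addEventListener("DOMContentLoaded", () => activateSafeScripts(document));
```

This sketch does not copy attributes from the original element, which a fuller version would likely need to do.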
Embedding scripts in HTML is tricky. Most of the time, you need to be very careful. And there are cases when embedding scripts in HTML should be avoided altogether, mostly when the HTML is dynamically generated.
However, you can use the proposed <safescript> element. Ideally, it would be added to the general HTML specification, or some other way of handling the problem of embedding scripts in HTML would be devised.