Hi! I'm Pavel and I interned at Dropbox over the past summer. One of my biggest projects during this internship was optimizing Python for dynamic page generation on the website. By the end of the summer, I optimized many of dropbox.com's pages to render 5 times faster. This came with a fair share of challenges though, which I'd like to write about today:
The Problem
Dropbox is a large website with lots of dynamically generated pages. The more pages that are generated from user input, the greater the risk of cross-site scripting (XSS) attacks. To prevent them, we must escape all dynamically generated text before it gets sent to the browser. Note that we also need to escape things differently in different string contexts, such as JavaScript code, HTML text, and HTML attributes, because a user-generated string that is dangerous in JavaScript may be harmless in HTML, and vice-versa.
All user-generated strings on the Dropbox website are passed through a special escaping function we've written that takes context into account. The problem is that having defined our own escaping function we need to ensure that it's as fast as possible since basically all user input is passing through this function. In this article we'll focus on one particular escaping context, HTML text.
A quick disclaimer before you accuse us of reinventing the wheel and ignoring the xml.sax.saxutils.escape and cgi.escape functions: they don't escape quite enough characters. This is especially true if you consider the possibility of confusing the browser into believing your page is UTF-7, at which point even the equals sign has to be escaped. So we have to write our own.
First steps
The original function looked like this:
WHITELIST = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def escape_engine(s, on_whitelist, on_ascii, on_unicode):
    pieces = []
    for char in s:
        if char in WHITELIST:
            pieces.append(on_whitelist(char))
        elif ord(char) < 256:
            pieces.append(on_ascii(char))
        else:
            pieces.append(on_unicode(char))
    return "".join(pieces)

def html1(s):
    """
    Escapes a string for inclusion into tag contents in HTML.
    Do *not* use for including a string into an HTML attribute; use
    escape.attr() for that.
    """
    # Ensure we deal with strings
    s = unicode(s)
    escaped = escape_engine(s,
        on_whitelist = lambda c: c,
        on_ascii = lambda c: "&#%02x;" % ord(c),
        on_unicode = lambda c: "&#%02x;" % ord(c),
    )
    return UNSAFE_bless(escaped) # Now it's escaped, so should be safe
Applying it to some test templates gives (times in seconds; the digits (obviously) aren't all significant):
: html1 14.1678471565
Not very fast. (I blame the intern who wrote it...) Of course, the code wasn't optimized for speed, but for readability. But like I said, given that this is actually a non-negligible part of our render time, it could use some optimization.
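The benchmark harness itself isn't shown in this post; a minimal stand-in using timeit, with made-up sample strings rather than our real test templates, would look something like this:

import timeit

SETUP = '''
from optimization_story import html1
samples = [u"Photos/Summer 2011", u"readme.txt", u'<script>alert("1")</script>']
'''

print ": html1", timeit.timeit("for s in samples: html1(s)", setup=SETUP, number=100000)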
Inlining
The first fact of Python optimization is that function calls are slow. html1 is awful in this regard: it calls a function per character! Worse yet, the common case is calling a function that does nothing at all! So the first step is to inline. This also lets us join the on_ascii and on_unicode cases (originally separated because we also use the above for escaping JavaScript, where we do make a distinction between Unicode and ASCII literals).
def html2(s):
    """
    Escapes a string for inclusion into tag contents in HTML.
    Do *not* use for including a string into an HTML attribute; use
    escape.attr() for that.
    """
    # Ensure we deal with strings
    s = unicode(s)
    pieces = []
    for char in s:
        if char in WHITELIST:
            pieces.append(char)
        else:
            pieces.append("&#%02x;" % ord(char))
    return UNSAFE_bless("".join(pieces)) # Now it's escaped, so should be safe
This has a pretty good 15% or so improvement:
: html2 12.8864150047
Implicit Loops
Now, the preceding code sample maybe isn't too pretty, but the improvement is nothing to sneer at. There's more we can do, though. The second fact of Python optimization is that loop overhead can also be pretty significant. This leads to our next attempt, which gets rid of the explicit loop in favor of a generator expression. Conveniently, switching to a generator expression also makes the function about as readable as the original.
def html3(s):
    """
    Escapes a string for inclusion into tag contents in HTML.
    Do *not* use for including a string into an HTML attribute; use
    escape.attr() for that.
    """
    # Ensure we deal with strings
    s = unicode(s)
    escaped = "".join(
        c if c in WHITELIST
        else "&#%02x;" % ord(c)
        for c in s)
    return UNSAFE_bless(escaped) # Now it's escaped, so should be safe
What are the timings with this new version?
: html3 13.4748219418
Hmm. The readability improvements are nice, but the speed dropped. Maybe we can get both speed and readability with some tweaks?
More Optimizations
We already picked out a neat 10% improvement and still have a wonderfully readable function. But we can go faster...
Generators?? Really?
The problem with the generator expressions used above is that Python actually constructs a generator. Let's look at the bytecode:
>>> import dis
>>> dis.dis(optimization_story.html3)
…
78 12 LOAD_CONST 1 ('')
15 LOAD_ATTR 1 (join)
79 18 LOAD_CONST 2 (<code object <genexpr> at 0x1005dddc8, file "./optimization_story.py", line 79>)
21 MAKE_FUNCTION 0
81 24 LOAD_FAST 0 (s)
27 GET_ITER
28 CALL_FUNCTION 1
31 CALL_FUNCTION 1
34 STORE_FAST 1 (escaped)
…
Luckily, we can avoid making the generator by simply using a list comprehension instead of a generator expression:
def html4(s):
    # …
    escaped = "".join([
        c if c in WHITELIST
        else "&#%02x;" % ord(c)
        for c in s])
    # …
This brings us back ahead, faster even than html2.
: html4 11.2531888485
Looks like the generator expression is actually slower than the list comprehension in this case; a likely explanation is that our sample strings are too short to reap the benefits of using a generator. As expected, the disassembly now looks a lot friendlier:
97 12 LOAD_CONST 1 ('')
15 LOAD_ATTR 1 (join)
98 18 BUILD_LIST 0
21 DUP_TOP
22 STORE_FAST 1 (_[1])
100 25 LOAD_FAST 0 (s)
28 GET_ITER
>> 29 FOR_ITER 43 (to 75)
32 STORE_FAST 2 (c)
35 LOAD_FAST 1 (_[1])
38 LOAD_FAST 2 (c)
41 LOAD_GLOBAL 2 (WHITELIST)
44 COMPARE_OP 6 (in)
47 JUMP_IF_FALSE 7 (to 57)
50 POP_TOP
51 LOAD_FAST 2 (c)
54 JUMP_FORWARD 14 (to 71)
>> 57 POP_TOP
58 LOAD_CONST 2 ('&#%02x;')
61 LOAD_GLOBAL 3 (ord)
64 LOAD_FAST 2 (c)
67 CALL_FUNCTION 1
70 BINARY_MODULO
>> 71 LIST_APPEND
72 JUMP_ABSOLUTE 29
>> 75 DELETE_FAST 1 (_[1])
78 CALL_FUNCTION 1
81 STORE_FAST 3 (escaped)
Sets?? Really?
But this is still slower than it should be. One low-hanging fruit still stands out: why are we doing a linear search on the whitelist?
WHITELIST2 = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
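html5 itself is just html4 with the membership test pointed at the set; a minimal sketch:

def html5(s):
    # …
    escaped = "".join([
        c if c in WHITELIST2
        else "&#%02x;" % ord(c)
        for c in s])
    # …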
The timings bear out the guess that sets are faster than strings:
: html5 8.6205868721
The question remains: can we go faster?
Deeper Python Optimization
If we're going to optimize this function, we might as well extract all the performance we possibly can. One peculiarity of modern Python is that it has two LOAD instructions for variables: LOAD_GLOBAL and LOAD_FAST. Due to an implementation detail, loading a local variable is faster than loading a global variable. Consequently, you want to avoid global variables in tight loops. The above disassembly pointed to two such globals: ord and WHITELIST.
def html6(s):
    # …
    lord = ord; lWHITELIST2 = WHITELIST2
    escaped = "".join([
        c if c in lWHITELIST2
        else "&#%02x;" % lord(c)
        for c in s])
    # …
We don't expect this to buy us much, but why not?
: html6 7.87281298637
In absolute terms a second isn't much of an improvement, but as a percentage it's still a significant chunk of time.
String Interpolation
String interpolation is another thing Python isn't very fast at. Let's see if removing that helps.
def html7(s):
    # …
    lstr = str; lord = ord; lWHITELIST2 = WHITELIST2
    escaped = "".join([
        c if c in lWHITELIST2
        else "&#" + lstr(lord(c)) + ";"
        for c in s])
    # …
: html7 5.58323383331
That's a huge boost from saving the string interpolation!
Every Problem Can be Solved With a Big Enough Hash Table
Any web programmer knows that performance problems can be universally solved with caching. (Heh heh, if only...) In any case, if the gain from removing string interpolation was so great, maybe we can cache the results of the character-to-escaped-form conversion and skip the string concatenation too. We'll set up a cache of characters to HTML escapes:
CACHE_HTML_ESCAPES = {}
Then we read and write our cache as necessary:
def html8(s):
    # …
    escaped = "".join([
        c if c in lWHITELIST2
        else CACHE_HTML_ESCAPES.get(c) or CACHE_HTML_ESCAPES.setdefault(c, "&#" + lstr(lord(c)) + ";")
        for c in s])
    # …
Since our web servers are long-running processes, the cache should eventually capture all of the characters people are using; Python’s dicts are then fast enough to give us a speed boost over string concatenation and =str(ord(c))=.
: html8 4.5724029541
Another big boost from avoiding what seems like a trivial computation.
Premature Optimization…
If we’re going down the setdefault branch so rarely, and since that’s the only place we’re using str and ord, maybe it’s not worth making those local variables?
def html9(s):
    # …
    lWHITELIST2 = WHITELIST2
    escaped = "".join([
        c if c in lWHITELIST2
        else CACHE_HTML_ESCAPES.get(c) or CACHE_HTML_ESCAPES.setdefault(c, "&#" + str(ord(c)) + ";")
        for c in s])
We don’t expect much of a benefit here, but maybe a percent change or so…
: html9 4.565928936
Wait, Why a Python Loop At All?
Wait, why are we using a Python loop at all? Python loops are slow… Instead, maybe we can use the C code in Python’s re module? We can use regexes to find anything that isn’t in our whitelist, and do the replacement. That way, the “do-nothing” operation on whitelisted characters becomes much cheaper.
import re
RE_FIND_NON_WHITELISTED = re.compile("([^" + "".join(WHITELIST2) + "])")
Unfortunately, there's no way to get at the matched text from a Python match object without calling a method. So we're forced to call the group method, and function calls are slow!
def htmlA(s):
    # …
    escaped = RE_FIND_NON_WHITELISTED.sub(
        lambda m: CACHE_HTML_ESCAPES.get(m.group(1))
            or CACHE_HTML_ESCAPES.setdefault(m.group(1), "&#" + str(ord(m.group(1))) + ";"),
        s)
    # …
What are our results?
: htmlA 4.54020690918
Hmm, this isn’t that great, actually. We did save a percentage point, but this is a lot less readable (at least, in my opinion) than html9, and the gains aren’t worth it.
Until, that is, we try it on whitelisted-only text:
: html9 3.84376811981
: htmlA 0.796116113663
Whoosh! html9 got a bit faster, by virtue of avoiding hitting the cache at all, but the regex-based solution *rocked*, since it could do all of the skipping of characters in C, not Python.
On the other hand, the regex-based solution is slower for punctuation-heavy text, since it has to make several function calls per escaped character instead of just one function call and a dictionary lookup. And for short strings, the overhead of starting a regex search hurts. Let's try a test on punctuation-heavy, short text snippets:
: html9 1.59476995468
: htmlA 3.44844794273
Measure Twice, Cut Once
So we have one function that’s faster for English alphanumeric text and one that’s faster for everything else. But we can’t just shaft our non-US users, and we don’t want to settle for a function that’s five times slower for the common case! So we have a few options.
The first is simply to expand our whitelist: most file names have a dot in them, and spaces, dashes, and the like are popular. At signs appear in emails. And so on. Of course, one has to be careful not to permit any XSS vector while doing so, but a conservative expansion by adding -|!,. _ to our whitelist should be safe (see the sketch below). Then again, this helps both versions equally, so it's not really a fix.
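Purely as a sketch, since the exact character set would need a security review before shipping, the expansion is a one-liner:

# Hypothetical expanded whitelist; not the set we actually shipped.
WHITELIST3 = WHITELIST2 | set("-|!,. _")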
Another fix presents itself, though: can we figure out which version of html to use quickly?
def htmlB(s):
    # …
    non_whitelisted = RE_FIND_NON_WHITELISTED.findall(s)
    if len(non_whitelisted) < .6 * len(s):
        # Mostly whitelisted characters: let the regex skip them in C.
        escaped = RE_FIND_NON_WHITELISTED.sub(
            lambda m: CACHE_HTML_ESCAPES.get(m.group(1))
                or CACHE_HTML_ESCAPES.setdefault(m.group(1), "&#" + str(ord(m.group(1))) + ";"),
            s)
    else:
        # Mostly punctuation/unicode: the cached list comprehension wins.
        lWHITELIST2 = WHITELIST2
        escaped = "".join([
            c if c in lWHITELIST2
            else CACHE_HTML_ESCAPES.get(c) or CACHE_HTML_ESCAPES.setdefault(c, "&#" + str(ord(c)) + ";")
            for c in s])
    # …
Why .6, you ask? Well, the exact constant isn’t too important (the function *works* whichever way the test goes, it’s just an optimization), but some testing showed it to be the approximate break-even point of the two methods.
With hope and trepidation, we run the tester…
: htmlB 5.16241598129
: html9 3.97228693962
: htmlA 3.95208191872
Awww…
As we hoped, the result is fast on whitelisted-only characters…
: htmlB 1.24477005005
: html9 3.41327309608
: htmlA 0.696345090866
And is passable on punctuation:
: htmlB 5.97420597076
: html9 3.61161899567
: htmlA 8.88924694061
But the overhead is pretty saddening…
Sampling
We can improve things a bit by testing just a few characters for punctuation/unicode.
def htmlC(s):
    # …
    non_whitelisted = RE_FIND_NON_WHITELISTED.findall(s[:10])
    if len(non_whitelisted) < 6:
        # …
We use =6= instead of =.6 * min(10, len(s))= because if the string is shorter than 10 characters, either alternative is going to be fast.
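Filled out (assuming the two branch bodies are the same as in htmlB), htmlC looks roughly like this:

def htmlC(s):
    # Ensure we deal with strings
    s = unicode(s)
    # Sample only the first ten characters to guess which method will win.
    non_whitelisted = RE_FIND_NON_WHITELISTED.findall(s[:10])
    if len(non_whitelisted) < 6:
        # Mostly whitelisted: let the regex do the skipping in C.
        escaped = RE_FIND_NON_WHITELISTED.sub(
            lambda m: CACHE_HTML_ESCAPES.get(m.group(1))
                or CACHE_HTML_ESCAPES.setdefault(m.group(1), "&#" + str(ord(m.group(1))) + ";"),
            s)
    else:
        # Mostly punctuation/unicode: the cached list comprehension wins.
        escaped = "".join([
            c if c in WHITELIST2
            else CACHE_HTML_ESCAPES.get(c) or CACHE_HTML_ESCAPES.setdefault(c, "&#" + str(ord(c)) + ";")
            for c in s])
    return UNSAFE_bless(escaped) # Now it's escaped, so should be safe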
This leads to a marked improvement. For alphanumeric strings:
: htmlC 0.836707115173
: html9 3.34154415131
: htmlA 0.701889276505
For unicode-heavy strings:
: htmlC 3.70150613785
: html9 2.82831597328
: htmlA 6.77485609055
This is now really looking like an option for production use. But the check still carries quite a bit of overhead: in the case where the user has just about the break-even balance of English and international characters, we're looking at a 20% premium. Can we do better?
The User is Always Right
So checking which version to use has a high overhead. What to do? Well, we've got to think about when each case comes up: how might we predict that lots of non-English-alphanumeric characters are on the way? Which users are likely to store files with international names, or use an international name themselves? (Hopefully, people who also use lots of punctuation are few…)
Luckily, Dropbox added translations a while back, so we already have the infrastructure in place to detect which locale a user is from. So a quick optimization is to switch to the html9 version (the list-comprehension-based one) for international users, and use the fast htmlA version (the regex-based one) for US users and users from countries with mostly-Latin alphabets. The code here is mostly tied to Dropbox internals, so I'm not going to show it, but I'm sure you get the point.
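Purely as an illustration (get_user_locale and the locale list below are hypothetical stand-ins, not the real Dropbox code), the dispatch amounts to something like:

# Hypothetical locale-based dispatch; helper names and locales are made up.
MOSTLY_LATIN_LOCALES = frozenset(["en_US", "en_GB", "de_DE", "fr_FR", "es_ES"])

def html(s):
    if get_user_locale() in MOSTLY_LATIN_LOCALES:  # hypothetical helper
        return htmlA(s)  # regex-based: fast when most characters are whitelisted
    return html9(s)      # list-comprehension-based: fast for international text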
This final optimization removes the check overhead while giving us most of the benefit. Success!
What We Didn’t Do
Now there are some ways to optimize this even further. The obvious approach is to rewrite the entire thing in C. That would let us squeeze out the last bits of performance, but it would also make the code much harder to maintain.
Similarly, using PyPy or a similar JIT would probably help on the inner loop of the pure-Python version, likely making it as fast as the regex-based approach.
Conclusions
The first, most basic conclusion is that the basic facts of Python optimization (inline functions, use implicit loops, move computation into C if possible) are perfectly valid. Another fact: Python dictionaries are *fast*. The WHITELIST2 set and the CACHE_HTML_ESCAPES dict both rely on Python's superb hash table implementation for their performance gains.
Other “common wisdom”, like using locals instead of globals, yields relatively little gain.
Optimizing inner loops in a high-level language requires lots of trial and error. It was absolutely amazing that moving from string interpolation to string concatenation gave such a huge boost to performance.
Finally, measurement is key. It was measurement that told us that HTML escaping was slow to begin with; and without measuring performance, we would never have guessed that string interpolation was so slow.
Thanks for reading!