Performance and Pymake
Thursday, February 26th, 2009

pymake runs correctly now. With some Mozilla patches, I can get it to recurse through most of the tree within a single process, doing parallel builds correctly, on Windows, Linux, and Mac, under Python 2.4 and 2.5.
6x as slow
Performance is bad. Here is a benchmark of a do-nothing rebuild of make tier_xpcom, which involves about 40 recursive makes:
bsmedberg $ time make tier_xpcom
real    0m1.129s

bsmedberg $ time python -O /builds/pymake/make.py tier_xpcom
real    0m7.525s
This is a bit depressing, because improved Windows performance was one of the primary reasons for doing pymake. It’s possible that the Windows numbers are better, given that launching Windows processes is so much more expensive than on Linux: but I doubt that will make up the whole 6x performance loss.
Improved 50% today!
Earlier today, the number was about 15 seconds; a series of profile-guided patches cut execution time roughly in half.
The first optimization was string-joining performance: str.join showed up near the top of profiles in both call counts and own time. Expansion objects are a core data structure in pymake: the parser turns a string such as “Hello$(COMMA), $(subst a,o,warld)!” into a list of bare strings and function calls (a variable expansion is treated as a function). An expansion is “resolved” to a string by passing it variables and a makefile context. The original, naive way of resolving an Expansion used ''.join() on the elements. This was replaced, in two phases, with a system of iterators which recursively yield strings, and an itersplit method, which splits an expansion into words without even joining it into a single string. A final optimization replaced ''.join entirely: in 99% of cases it was faster to use simple appending methods when only a few elements are being joined.
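To make the shape of this concrete, here is a minimal sketch of the iterator-based approach, with simplified element types and hypothetical method names (the real Expansion class in pymake/data.py carries more state):

class Expansion(object):
    def __init__(self, elements):
        # Simplified: each element is either a bare string or another
        # object with an iterfragments(makefile, variables) method,
        # standing in for pymake's function-call elements.
        self.elements = elements

    def iterfragments(self, makefile, variables):
        # Recursively yield string fragments without concatenating them.
        for e in self.elements:
            if isinstance(e, str):
                yield e
            else:
                for fragment in e.iterfragments(makefile, variables):
                    yield fragment

    def resolve(self, makefile, variables):
        # The original, naive approach: join everything into one string.
        return ''.join(self.iterfragments(makefile, variables))

    def itersplit(self, makefile, variables):
        # Yield whitespace-separated words without ever building the
        # full string. A fragment may end mid-word, so the trailing
        # partial word is carried into the next fragment.
        partial = ''
        for fragment in self.iterfragments(makefile, variables):
            fragment = partial + fragment
            words = fragment.split()
            if words and not fragment[-1:].isspace():
                partial = words.pop()  # word may continue in next fragment
            else:
                partial = ''
            for w in words:
                yield w
        if partial:
            yield partial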
Another optimization avoids parsing makefile syntax into expansions until it’s actually needed. In many cases, makefiles use a small set of variables many times and never read the value of others. The first candidate optimization had pymake parse variable values as they were set; the much better solution that landed later was to lazily parse a variable the first time it is read.
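A sketch of the lazy scheme, again with hypothetical names (parse_expansion stands in for pymake's real makefile-syntax parser, and the real Variables class also tracks flavor and source):

def parse_expansion(text):
    # Hypothetical stand-in for pymake's real parser entry point.
    return text  # pretend this returns an Expansion

class Variables(object):
    def __init__(self):
        # name -> [raw makefile text, cached Expansion or None]
        self._map = {}

    def set(self, name, value):
        # Setting a variable only records the unparsed text; no
        # Expansion object is built here.
        self._map[name] = [value, None]

    def get(self, name):
        entry = self._map.get(name)
        if entry is None:
            return None
        if entry[1] is None:
            # Parse into an Expansion on first read, then cache it, so
            # variables that are never read are never parsed.
            entry[1] = parse_expansion(entry[0])
        return entry[1]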
A grab-bag of other optimizations improved performance by a bit, but the last attempt increased code complexity far more than performance.
Hitting a Performance Barrier
At the moment I think pymake has hit a performance barrier and I’m not sure how to proceed. The current profile of pymake, generated with cProfile, is mostly unhelpful:
7228961 function calls (6902795 primitive calls) in 10.934 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15529    0.783    0.000    2.059    0.000 /builds/pymake/pymake/parser.py:689(parsemakesyntax)
49054/35478    0.555    0.000    2.320    0.000 /builds/pymake/pymake/util.py:24(itersplit)
      128    0.396    0.003    0.396    0.003 {posix.read}
466085/222572    0.356    0.000    2.653    0.000 /builds/pymake/pymake/data.py:192(resolve)
    51384    0.289    0.000    1.491    0.000 /builds/pymake/pymake/parserdata.py:214(execute)
    13876    0.288    0.000    1.007    0.000 /builds/pymake/pymake/data.py:684(resolvevpath)
    29268    0.280    0.000    0.280    0.000 {posix.stat}
   171027    0.266    0.000    0.471    0.000 /builds/pymake/pymake/data.py:430(match)
    25700    0.246    0.000    0.327    0.000 /builds/pymake/pymake/data.py:384(__init__)
    40350    0.223    0.000    0.223    0.000 /usr/lib64/python2.5/logging/__init__.py:1158(getEffectiveLevel)
    58982    0.213    0.000    0.329    0.000 /builds/pymake/pymake/data.py:312(set)
43854/42319    0.211    0.000    1.572    0.000 /builds/pymake/pymake/functions.py:56(resolve)
131959/117714    0.207    0.000    1.343    0.000 /builds/pymake/pymake/data.py:258(get)
     2130    0.194    0.000    2.281    0.001 /builds/pymake/pymake/data.py:542(resolveimplicitrule)
    47515    0.189    0.000    0.421    0.000 /builds/pymake/pymake/data.py:1204(gettarget)
      128    0.182    0.001    0.182    0.001 {posix.fork}
     7717    0.174    0.000    1.941    0.000 /builds/pymake/pymake/parserdata.py:117(execute)
    57298    0.173    0.000    0.255    0.000 /usr/lib64/python2.5/posixpath.py:56(join)
    73798    0.165    0.000    0.628    0.000 /builds/pymake/pymake/data.py:1103(matchesfor)
    46953    0.157    0.000    0.184    0.000 /builds/pymake/pymake/parser.py:176(iterdata)
1153401/1150418    0.156    0.000    0.158    0.000 {len}
    27900    0.156    0.000    0.163    0.000 /builds/pymake/pymake/data.py:67(__init__)
11264/168    0.148    0.000    6.120    0.036 /builds/pymake/pymake/parserdata.py:431(execute)
37008/23960    0.141    0.000    0.176    0.000 /builds/pymake/pymake/parser.py:193(itermakefilechars)
   330817    0.135    0.000    0.135    0.000 {method 'startswith' of 'str' objects}
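For anyone who wants to reproduce this, output like the above can be generated with the standard cProfile and pstats modules (file names here are just examples):

# Collect a profile:
#   python -O -m cProfile -o pymake.prof /builds/pymake/make.py tier_xpcom
# Then inspect it:
import pstats
p = pstats.Stats('pymake.prof')
p.sort_stats('time').print_stats(25)  # internal-time table like the one above
p.print_callees('parsemakesyntax')    # callee breakdown like the one below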
parsemakesyntax, the function which parses $(FOO) $(BAR) into an Expansion, is still the single most time-consuming function. But since I don’t have line-by-line heatmaps, it’s hard to know what parts of that function might be inefficient. The callee data is not much help:
/builds/pymake/pymake/parser.py:689(parsemakesyntax) ->
   ncalls  tottime  cumtime
    27666    0.017    0.017  /builds/pymake/pymake/data.py:111(__init__)
      233    0.000    0.001  /builds/pymake/pymake/data.py:116(fromstring)
    33408    0.059    0.097  /builds/pymake/pymake/data.py:145(appendstr)
    10219    0.014    0.020  /builds/pymake/pymake/data.py:156(appendfunc)
    27666    0.084    0.212  /builds/pymake/pymake/data.py:183(finish)
     1474    0.002    0.002  /builds/pymake/pymake/functions.py:24(__init__)
     1271    0.002    0.002  /builds/pymake/pymake/functions.py:32(setup)
     2765    0.005    0.007  /builds/pymake/pymake/functions.py:40(append)
     8315    0.018    0.022  /builds/pymake/pymake/functions.py:48(__init__)
      430    0.001    0.001  /builds/pymake/pymake/functions.py:70(__init__)
      203    0.001    0.002  /builds/pymake/pymake/functions.py:380(setup)
    25800    0.106    0.346  /builds/pymake/pymake/parser.py:73(getloc)
     9986    0.068    0.072  /builds/pymake/pymake/parser.py:97(findtoken)
    27239    0.035    0.072  /builds/pymake/pymake/parser.py:155(get)
    46953    0.157    0.184  /builds/pymake/pymake/parser.py:176(iterdata)
    15245    0.071    0.133  /builds/pymake/pymake/parser.py:193(itermakefilechars)
     6440    0.026    0.033  /builds/pymake/pymake/parser.py:247(itercommandchars)
    25515    0.032    0.032  /builds/pymake/pymake/parser.py:673(__init__)
    15529    0.003    0.003  {callable}
    28565    0.008    0.010  {len}
     9986    0.003    0.003  {method 'append' of 'list' objects}
        1    0.000    0.000  {method 'iterkeys' of 'dict' objects}
     9986    0.005    0.005  {method 'pop' of 'list' objects}
Yes, I know getloc is inefficient, and a commenter on one of my previous posts suggested a possible solution. But fixing that isn’t going to produce any significant improvement. To reach performance parity with GNU make, there has to be an algorithmic improvement.
Can I trust cProfile?
There are some confusing aspects to the cProfile output which make me suspect it. In particular, I suspect that generator functions are not being accounted for correctly: the primary work of the iterdata function is to call match on a compiled regular expression object, but that method doesn’t even show up in the callee list:
/builds/pymake/pymake/parser.py:176(iterdata) ->
   ncalls  tottime  cumtime
    17815    0.004    0.004  {built-in method end}
    17815    0.004    0.004  {built-in method group}
    35630    0.014    0.014  {built-in method start}
    25222    0.005    0.005  {len}
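A tiny standalone test, not pymake code, is one way to check the generator-accounting theory: all the regex work below happens inside a generator, each time it is resumed, so if cProfile misattributes that work, the match-object methods will be missing from gen’s callee list.

import cProfile
import re

WORD = re.compile(r'\w+')

def gen(s):
    # All of the real work (regex matching) happens inside this
    # generator body, across many resumptions.
    pos = 0
    while True:
        m = WORD.search(s, pos)
        if m is None:
            return
        pos = m.end()
        yield m.group(0)

def consume():
    return list(gen('the quick brown fox ' * 10000))

cProfile.run('consume()')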
In any case, it’s hard to usefully analyze the profiling output. What I want is a Shark-like hierarchical profile. Apparently, dtrace can profile Python code, but I haven’t seen any useful visualization/analysis tools for that combination: if anyone knows of something, please let me know!
Help Wanted
I’ve been head-down in pymake for a few weeks now; my review queue and several other Mozilla tasks need attention. I’d really love some help from people who know Python performance (or who have no fear). If you’re interested and want some guidance, please e-mail me or leave a comment here. We can probably construct some useful pymake performance tests that are not as scary as actually building Mozilla.