Performance and Pymake

pymake runs correctly now. With some Mozilla patches, I can get it to recurse through most of the tree within a single process, doing parallel builds correctly, on Windows, Linux, and Mac, with Python 2.4 and 2.5.

6x as slow

Performance is bad. On a benchmark of a do-nothing rebuild of make tier_xpcom, which is about 40 recursive makes:

bsmedberg $ time make tier_xpcom
real	0m1.129s

bsmedberg $ time python -O /builds/pymake/ tier_xpcom
real	0m7.525s

This is a bit depressing, because improved Windows performance was one of the primary reasons for doing pymake. It’s possible that the Windows numbers are better, given that launching Windows processes is so much more expensive than on Linux: but I doubt that will make up the whole 6x performance loss.

Improved 50% today!

Earlier today, the number was about 15 seconds. Profile-guided patches helped reduce execution time significantly!

The first optimization was string-joining performance: str.join showed up near the top of profiles both in call counts and own-time. Expansion objects are a core data structure in pymake: the parser turns a string such as “Hello$(COMMA), $(subst a,o,warld)!” into a list of bare strings and function calls (a variable expansion is treated as a function call). An expansion is “resolved” to a string by passing it variables and a makefile context. The original, naive way of resolving an Expansion used ''.join() on the elements. This was replaced, in two phases, with a system of iterators which recursively yield strings, and an itersplit method, which splits an expansion into words without even joining it into a single string. A final optimization replaced ''.join entirely: in 99% of cases it was faster to use simple appending methods when only a few elements are being joined.
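The iterator-based resolution might be sketched like this (an illustrative toy, not pymake's actual classes or method names; the real Expansion recurses into nested sub-expansions and takes variables and a makefile context as arguments):

```python
# Toy Expansion: a list of literal strings and zero-argument callables
# standing in for function/variable references. A recursive generator
# yields string fragments; itersplit produces whitespace-separated words
# without ever materializing the fully joined string.
class Expansion(object):
    def __init__(self, elements):
        self.elements = elements  # strings or callables returning strings

    def itervalues(self):
        for e in self.elements:
            if callable(e):
                yield e()  # real pymake recurses into sub-expansions here
            else:
                yield e

    def resolve(self):
        # The naive approach: join everything into one string.
        return ''.join(self.itervalues())

    def itersplit(self):
        # Split into words incrementally, carrying a partial word across
        # fragment boundaries, so no full joined string is ever built.
        pending = ''
        for chunk in self.itervalues():
            parts = (pending + chunk).split(' ')
            pending = parts.pop()
            for word in parts:
                if word:
                    yield word
        if pending:
            yield pending

e = Expansion(['Hello', lambda: ',', ' ', lambda: 'world', '!'])
print(e.resolve())          # Hello, world!
print(list(e.itersplit()))  # ['Hello,', 'world!']
```

Callers that only need to iterate over words (for example, when expanding a prerequisite list) never pay for the join at all.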

Another optimization avoids parsing makefile syntax into expansions until it’s actually needed. In many cases, makefiles will use a small set of variables many times, and will never read the value of other variables. The first candidate optimization had pymake parse variables as they were set; a much better solution later was to lazily parse variables the first time they were read.
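The lazy-parsing idea can be illustrated with a small sketch (the class and method names here are hypothetical, not pymake's API): keep the raw makefile text when a variable is set, and parse it only on first read, so variables that are set but never read cost nothing.

```python
# Hedged sketch of lazy variable parsing: raw text is stored at set-time;
# the (stand-in) parser runs only on the first get, and the result is cached.
class Variables(object):
    def __init__(self):
        self._raw = {}       # name -> unparsed makefile text
        self._parsed = {}    # name -> cached parse result
        self.parse_count = 0

    def set(self, name, value):
        self._raw[name] = value
        self._parsed.pop(name, None)  # invalidate any cached parse

    def get(self, name):
        if name not in self._parsed:
            self._parsed[name] = self._parse(self._raw[name])
        return self._parsed[name]

    def _parse(self, text):
        # Stand-in for the real makefile-syntax parser; we just count calls.
        self.parse_count += 1
        return text

v = Variables()
for i in range(100):
    v.set('V%d' % i, 'value')
v.get('V0')
v.get('V0')
print(v.parse_count)  # 1: only the one variable actually read was parsed
```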

A grab-bag of other optimizations improved performance by a bit, but the last attempt increased code complexity far more than performance.

Hitting a Performance Barrier

At the moment I think pymake has hit a performance barrier and I’m not sure how to proceed. The current profile of pymake, generated with cProfile, is mostly unhelpful:

7228961 function calls (6902795 primitive calls) in 10.934 CPU seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15529    0.783    0.000    2.059    0.000 /builds/pymake/pymake/
49054/35478    0.555    0.000    2.320    0.000 /builds/pymake/pymake/
      128    0.396    0.003    0.396    0.003 {}
466085/222572    0.356    0.000    2.653    0.000 /builds/pymake/pymake/
    51384    0.289    0.000    1.491    0.000 /builds/pymake/pymake/
    13876    0.288    0.000    1.007    0.000 /builds/pymake/pymake/
    29268    0.280    0.000    0.280    0.000 {posix.stat}
   171027    0.266    0.000    0.471    0.000 /builds/pymake/pymake/
    25700    0.246    0.000    0.327    0.000 /builds/pymake/pymake/
    40350    0.223    0.000    0.223    0.000 /usr/lib64/python2.5/logging/
    58982    0.213    0.000    0.329    0.000 /builds/pymake/pymake/
43854/42319    0.211    0.000    1.572    0.000 /builds/pymake/pymake/
131959/117714    0.207    0.000    1.343    0.000 /builds/pymake/pymake/
     2130    0.194    0.000    2.281    0.001 /builds/pymake/pymake/
    47515    0.189    0.000    0.421    0.000 /builds/pymake/pymake/
      128    0.182    0.001    0.182    0.001 {posix.fork}
     7717    0.174    0.000    1.941    0.000 /builds/pymake/pymake/
    57298    0.173    0.000    0.255    0.000 /usr/lib64/python2.5/
    73798    0.165    0.000    0.628    0.000 /builds/pymake/pymake/
    46953    0.157    0.000    0.184    0.000 /builds/pymake/pymake/
1153401/1150418    0.156    0.000    0.158    0.000 {len}
    27900    0.156    0.000    0.163    0.000 /builds/pymake/pymake/
11264/168    0.148    0.000    6.120    0.036 /builds/pymake/pymake/
37008/23960    0.141    0.000    0.176    0.000 /builds/pymake/pymake/
   330817    0.135    0.000    0.135    0.000 {method 'startswith' of 'str' objects}

parsemakesyntax, the function which parses $(FOO) $(BAR) into an Expansion, is still the single most time-consuming function. But since I don’t have line-by-line heatmaps, it’s hard to know what parts of that function might be inefficient. The callee data is not much help:
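For reference, a profile and callee table like the ones above can be produced entirely with the standard library (shown here in modern form; the same cProfile/pstats API existed in the Python 2.5 used for these numbers — `work` is just a placeholder workload):

```python
# Profile a workload, sort by internal time (tottime), and print per-function
# callee statistics, mirroring the tables shown in the post.
import cProfile
import io
import pstats

def work():
    # Placeholder for the real workload (e.g. running pymake on a makefile).
    return sum(len(str(n)) for n in range(10000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats('tottime').print_stats(10)   # top 10 by internal time
stats.print_callees('work')                   # what does work() call?
print(out.getvalue())
```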

                                                          ncalls  tottime  cumtime
/builds/pymake/pymake/  ->   27666    0.017    0.017  /builds/pymake/pymake/
                                                             233    0.000    0.001  /builds/pymake/pymake/
                                                           33408    0.059    0.097  /builds/pymake/pymake/
                                                           10219    0.014    0.020  /builds/pymake/pymake/
                                                           27666    0.084    0.212  /builds/pymake/pymake/
                                                            1474    0.002    0.002  /builds/pymake/pymake/
                                                            1271    0.002    0.002  /builds/pymake/pymake/
                                                            2765    0.005    0.007  /builds/pymake/pymake/
                                                            8315    0.018    0.022  /builds/pymake/pymake/
                                                             430    0.001    0.001  /builds/pymake/pymake/
                                                             203    0.001    0.002  /builds/pymake/pymake/
                                                           25800    0.106    0.346  /builds/pymake/pymake/
                                                            9986    0.068    0.072  /builds/pymake/pymake/
                                                           27239    0.035    0.072  /builds/pymake/pymake/
                                                           46953    0.157    0.184  /builds/pymake/pymake/
                                                           15245    0.071    0.133  /builds/pymake/pymake/
                                                            6440    0.026    0.033  /builds/pymake/pymake/
                                                           25515    0.032    0.032  /builds/pymake/pymake/
                                                           15529    0.003    0.003  {callable}
                                                           28565    0.008    0.010  {len}
                                                            9986    0.003    0.003  {method 'append' of 'list' objects}
                                                               1    0.000    0.000  {method 'iterkeys' of 'dict' objects}
                                                            9986    0.005    0.005  {method 'pop' of 'list' objects}

Yes, I know getloc is inefficient, and a commenter on one of my previous posts suggests a possible solution. But that’s not going to create any significant improvement. In order to have performance parity with GNU make there has to be an algorithmic improvement.

Can I trust cProfile?

There are some confusing aspects to the cProfile output which make me suspect it. In particular, I suspect that generator functions are not being accounted for correctly: the primary work of the iterdata function is to call match on a compiled regular expression object, but that method doesn’t even show up in the callee list:

                                                   ncalls  tottime  cumtime
/builds/pymake/pymake/  ->   17815    0.004    0.004  {built-in method end}
                                                    17815    0.004    0.004  {built-in method group}
                                                    35630    0.014    0.014  {built-in method start}
                                                    25222    0.005    0.005  {len}

In any case, it’s hard to usefully analyze the profiling output. What I want is a Shark-like hierarchical profile. Apparently, dtrace can profile Python code, but I haven’t seen any useful visualization/analysis tools for that combination: if anyone knows of something, please let me know!

Help Wanted

I’ve been head-down in pymake for a few weeks now; my review queue and several other Mozilla tasks need attention. I’d really love some help from people who know Python performance (or who have no fear). If you’re interested and want some guidance, please e-mail me or leave a comment here. We can probably construct some useful pymake performance tests that are not as scary as actually building Mozilla.

10 Responses to “Performance and Pymake”

  1. Robert O'Callahan Says:

    Maybe Python just isn’t the right tool for this job.

  2. johnjbarton Says:

    I’m puzzled by your question here Ben. Is 7 seconds a significant factor?

    Some years ago the author of the Jikes Java compiler, Phillippe Charles, did an extensive analysis of his tool. He had a choice: build a dependency graph and compile only the changed files, or compile everything. He found that compiling everything won hands down. CPUs are so fast compared to disk drives that walking the file system dominates. He ditched the dependency analysis; Jikes was renowned as a fast compiler.

    So maybe the more important difference between Windows and Linux and between your optimized python and make could be file i/o management.


  3. Vlad Vukicevic Says:

    I don’t think it’s a problem with python; I think that -make- might not be the right tool for this job.

  4. Arpad Borsos Says:

    I’ve read somewhere that SCons’ performance problems are due to variable expansion as well. I can’t find the link though.

  5. kbrosnan Says:

    Would taking the Hg approach of using C for the slow parts of Python be of benefit? It is touched upon in the video from about 15:00 to 21:00, though it is a bit hand-wavy. Of course asking them or poking their source may yield fruit.

  6. Robert Sayre Says:

    The older hotshot profiler outputs logs you can feed to kcachegrind. Google says there are patches floating around for cProfile as well. kcachegrind has that homely linux look, but I like its area-based visualizations better than shark hierarchical profiles.

  7. Sylvain Says:

    Could be interesting to give Psyco a try.

  8. Benjamin Smedberg Says:

    Sylvain/kbrosnan: both approaches would necessitate that Mozilla builders have Psyco, or compile a binary module (or suffer from much worse performance). That’s not really an option I’d like to consider at this point.

    Robert, yes, this may be the wrong tool for actually running makefiles. Though as an import tool for some other format it’s still going to be pretty helpful.

    johnjbarton, 7 seconds of make overhead just to build XPCOM is a pretty big deal, yes, if you multiply that over the entire tree. We’re trying to reduce the time it takes to do a complete do-nothing rebuild to, say, <30 seconds total. Your comparison with Jikes makes no sense, since recompiling Mozilla from scratch requires hours of CPU time.

  9. Vlad Vukicevic Says:

    Oh, one pymake bug — it doesn’t handle msys style “/c/foo” paths. I hacked around that for testing tier_xpcom, but I can’t test a full build because of that — is that an easy fix? (I get into that situation by using an absolute path to configure, needed for the WinCE build due to a bug for now.)

  10. Benjamin Smedberg Says:

    Vlad, that’s intentional (python and pymake are not MSYS programs). I have a patch which avoids leaking msys-style paths into makefiles.
