Performance and Pymake

Thursday, February 26th, 2009

pymake runs correctly now. With some Mozilla patches, I can get it to recurse through most of the tree within a single process, doing parallel builds correctly, on Windows, Linux, and Mac, python 2.4 and 2.5.

6x as slow

Performance is bad. On a benchmark of a do-nothing rebuild of make tier_xpcom, which is about 40 recursive makes:

bsmedberg $ time make tier_xpcom
real	0m1.129s

bsmedberg $ time python -O /builds/pymake/ tier_xpcom
real	0m7.525s

This is a bit depressing, because improved Windows performance was one of the primary reasons for doing pymake. It’s possible that the Windows numbers are better, given that launching Windows processes is so much more expensive than on Linux: but I doubt that will make up the whole 6x performance loss.

Improved 50% today!

Earlier today, the number was about 15 seconds. Profile-guided patches helped reduce execution time significantly!

The first optimization was string-joining performance. str.join Showed up near the top of profiles both in call counts and own-time. Expansion objects are a core data structure in pymake: the parser turns a string such as “Hello$(COMMA), $(subst a,o,warld)!” into a list of bare strings and function calls (a variable expansion is treated as a function. An expansion is “resolved” to a string by passing it variables and a makefile context. The original, naive way of resolving an Expansion used ''.join() on the elements. This was replaced, in two phases, with a system of iterators which recursively yield strings, and an itersplit method, which splits an expansion into words without even joining it into a single string. A final optimization replaced ''.join entirely: it was better, in 99% of cases, to use simple appending methods when few elements are being joined.

Another optimization avoids parsing makefile syntax into expansions until it’s actually needed. In many cases, makefiles will use a small set of variables many times, and will never read the value of other variables. The first candidate optimization had pymake parse variables as they were set; a much better solution later was to lazily parse variables the first time they were read.

A grab-bag of other optimizations improved performance by a bit, but the last attempt increased code complexity far more than performance.

Hitting a Performance Barrier

At the moment I think pymake has hit a performance barrier and I’m not sure how to proceed. The current profile of pymake, generated with cProfile, is mostly unhelpful:

7228961 function calls (6902795 primitive calls) in 10.934 CPU seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15529    0.783    0.000    2.059    0.000 /builds/pymake/pymake/
49054/35478    0.555    0.000    2.320    0.000 /builds/pymake/pymake/
      128    0.396    0.003    0.396    0.003 {}
466085/222572    0.356    0.000    2.653    0.000 /builds/pymake/pymake/
    51384    0.289    0.000    1.491    0.000 /builds/pymake/pymake/
    13876    0.288    0.000    1.007    0.000 /builds/pymake/pymake/
    29268    0.280    0.000    0.280    0.000 {posix.stat}
   171027    0.266    0.000    0.471    0.000 /builds/pymake/pymake/
    25700    0.246    0.000    0.327    0.000 /builds/pymake/pymake/
    40350    0.223    0.000    0.223    0.000 /usr/lib64/python2.5/logging/
    58982    0.213    0.000    0.329    0.000 /builds/pymake/pymake/
43854/42319    0.211    0.000    1.572    0.000 /builds/pymake/pymake/
131959/117714    0.207    0.000    1.343    0.000 /builds/pymake/pymake/
     2130    0.194    0.000    2.281    0.001 /builds/pymake/pymake/
    47515    0.189    0.000    0.421    0.000 /builds/pymake/pymake/
      128    0.182    0.001    0.182    0.001 {posix.fork}
     7717    0.174    0.000    1.941    0.000 /builds/pymake/pymake/
    57298    0.173    0.000    0.255    0.000 /usr/lib64/python2.5/
    73798    0.165    0.000    0.628    0.000 /builds/pymake/pymake/
    46953    0.157    0.000    0.184    0.000 /builds/pymake/pymake/
1153401/1150418    0.156    0.000    0.158    0.000 {len}
    27900    0.156    0.000    0.163    0.000 /builds/pymake/pymake/
11264/168    0.148    0.000    6.120    0.036 /builds/pymake/pymake/
37008/23960    0.141    0.000    0.176    0.000 /builds/pymake/pymake/
   330817    0.135    0.000    0.135    0.000 {method 'startswith' of 'str' objects}

parsemakesyntax, the function which parses $(FOO) $(BAR) into an Expansion, is still the single most time-consuming function. But since I don’t have line-by-line heatmaps, it’s hard to know what parts of that function might be inefficient. The callee data is not much help:

                                                          ncalls  tottime  cumtime
/builds/pymake/pymake/  ->   27666    0.017    0.017  /builds/pymake/pymake/
                                                             233    0.000    0.001  /builds/pymake/pymake/
                                                           33408    0.059    0.097  /builds/pymake/pymake/
                                                           10219    0.014    0.020  /builds/pymake/pymake/
                                                           27666    0.084    0.212  /builds/pymake/pymake/
                                                            1474    0.002    0.002  /builds/pymake/pymake/
                                                            1271    0.002    0.002  /builds/pymake/pymake/
                                                            2765    0.005    0.007  /builds/pymake/pymake/
                                                            8315    0.018    0.022  /builds/pymake/pymake/
                                                             430    0.001    0.001  /builds/pymake/pymake/
                                                             203    0.001    0.002  /builds/pymake/pymake/
                                                           25800    0.106    0.346  /builds/pymake/pymake/
                                                            9986    0.068    0.072  /builds/pymake/pymake/
                                                           27239    0.035    0.072  /builds/pymake/pymake/
                                                           46953    0.157    0.184  /builds/pymake/pymake/
                                                           15245    0.071    0.133  /builds/pymake/pymake/
                                                            6440    0.026    0.033  /builds/pymake/pymake/
                                                           25515    0.032    0.032  /builds/pymake/pymake/
                                                           15529    0.003    0.003  {callable}
                                                           28565    0.008    0.010  {len}
                                                            9986    0.003    0.003  {method 'append' of 'list' objects}
                                                               1    0.000    0.000  {method 'iterkeys' of 'dict' objects}
                                                            9986    0.005    0.005  {method 'pop' of 'list' objects}

Yes, I know getloc is inefficient, and a commenter on one of my previous posts suggests a possible solution. But that’s not going to create any significant improvement. In order to have performance parity with GNU make there has to be an algorithmic improvement.

Can I trust cProfile?

There are some confusing aspects to the cProfile output which make me suspect it. In particular, I suspect that generator functions are not being accounted for correctly: the primary work of the iterdata function is to call match on a compiled regular expression object, but that method doesn’t even show up in the callee list:

                                                   ncalls  tottime  cumtime
/builds/pymake/pymake/  ->   17815    0.004    0.004  {built-in method end}
                                                    17815    0.004    0.004  {built-in method group}
                                                    35630    0.014    0.014  {built-in method start}
                                                    25222    0.005    0.005  {len}

In any case, it’s hard to usefully analyze the profiling output. What I want is a Shark-like hierarchical profile. Apparently, dtrace can profile Python code, but I seen any useful visualization/analysis tools for that combination: if anyone knows of something, please let me know!

Help Wanted

I’ve been head-down in pymake for a few weeks now; my review queue and several other Mozilla tasks need attention. I’d really love some help from people who know Python performance (or who have no fear). If you’re interested and want some guidance, please e-mail me or leave a comment here. We can probably construct some useful pymake performance tests that are not as scary as actually building Mozilla.

When Partitioning Database Tables, EXPLAIN your queries

Saturday, May 12th, 2007

The past week or so I’ve been spending most of my time on the Socorro crash-reporting server software. One if the important things I’ve learned this week is that while database partitions solve some important problems, they create some equally nasty and unexpected problems.

Socorro will provide all of the reports and querying capabilities we need to analyze Firefox crashes. In order to gracefully deal with the volume of incoming crash reports from Firefox users (approx 30k reports per day), morgamic designed a database schema that would use postgres partitions to separate data into manageable and queryable pieces. This allows would allow any date-bound queries to read only the partitions of interest. And hopefully, we’re going to be able to design the reporting system so that all queries are date-bound.

The tables involved look something like this:


(primary key)



2007-05-13 00:00:01

Yahoo games!


2007-05-13 00:00:03

Youtube video


2007-05-13 00:00:02

I just keep crashing :-(

(Index on date)




(primary key)







If you don’t partition the table, getting the last report is a very fast operation:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN                                                          
 Result  (cost=1.73..1.74 rows=1 width=0)
     ->  Limit  (cost=0.00..1.73 rows=1 width=8)
           ->  Index Scan Backward using idx_reports_date on reports  (cost=0.00..1728388.93 rows=999873 width=8)
                 Filter: (date IS NOT NULL)

For the uninitiated, this means that we are doing an index scan of the index on date and returning the highest value.

However, when you partition the table, things get ugly very quickly:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
 Aggregate  (cost=186247.04..186247.05 rows=1 width=8)
   ->  Append  (cost=0.00..175344.43 rows=4361043 width=8)
         ->  Seq Scan on reports  (cost=0.00..10.20 rows=20 width=8)
         ->  Seq Scan on reports_part0 reports  (cost=0.00..40209.73 rows=999873 width=8)
         ->  Seq Scan on reports_part1 reports  (cost=0.00..40205.75 rows=1000175 width=8)
         ->  Seq Scan on reports_part2 reports  (cost=0.00..40200.93 rows=1000093 width=8)
         ->  Seq Scan on reports_part3 reports  (cost=0.00..40197.31 rows=999731 width=8)
         ->  Seq Scan on reports_part4 reports  (cost=0.00..14510.31 rows=361131 width=8)
         ->  Seq Scan on reports_part5 reports  (cost=0.00..10.20 rows=20 width=8)

The query performs a full table scan of all the partitions, which is just about the worst result possible. Even if you don’t have any constraints or knowledge about the data in the date field, the query planner should be able to optimize the query to the following:

SELECT max(maxdate)
 (SELECT max(date) as maxdate FROM reports_part0 UNION
  SELECT max(date) FROM reports_part1 UNION...

This is at most one index query per partition, which is perfectly reasonable. If you add range constraints to the date field of each partition, this query can be optimized into a loop where you query the “latest” partition first and work backwards until you find a single value that is higher than the range of all the remaining partitions.

But there are even more “gotchas” lurking in table partitioning. The query planner operates on queries before functions are called or bind parameters are substituted. This means that a SQL query which contains a constant can perform very differently than one containing a function:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < '2007-05-12 11:03' AND date > '2007-05-12 10:03' ORDER BY date DESC;
   Sort Key:
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))
               ->  Index Scan using idx_reports_part0_date on reports_part0 reports
                     Index Cond: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))

Because we have date constraints on the reports partitions, the planner is smart enough to know that only reports_part0 contains the data we’re looking for. But replace the literal dates with the equivalent functions, and the query planner has to search every partition:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < now() AND date > now() - interval '1 day' ORDER BY date DESC;
   Sort Key:
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Bitmap Heap Scan on reports_part0 reports
                     Recheck Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
                     ->  Bitmap Index Scan on idx_reports_part0_date
                           Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part1_date on reports_part1 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part2_date on reports_part2 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part3_date on reports_part3 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part4_date on reports_part4 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part5_date on reports_part5 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))

Both of these missed optimizations are extremely problematic when dealing with partitioned tables in postgresql. The first, less common issue should be easy to fix, because it doesn’t require any constraint information. The second one is not so easy, because it would require the query planner to divide its work into a “pre-function/bindparam expansion” stage, which is cacheable, and a “post-function/bindparam expansion stage”, which is not very easy to cache.

We are going to try and work around the data-binding issue by issuing the queries from Socorro using literals instead of bound variables. This is not ideal because it requires the database to completely re-plan every query that is issued.

The moral of the story is simple: if you are planning on using database partitions, be sure you EXPLAIN all the queries you’re planning, with the actual literals or bound data statements that will actually be used in production. Be prepared to significantly rework your queries if the queries perform unexpected full table scans.