When Partitioning Database Tables, EXPLAIN your queries

Saturday, May 12th, 2007

The past week or so I’ve been spending most of my time on the Socorro crash-reporting server software. One if the important things I’ve learned this week is that while database partitions solve some important problems, they create some equally nasty and unexpected problems.

Socorro will provide all of the reports and querying capabilities we need to analyze Firefox crashes. In order to gracefully deal with the volume of incoming crash reports from Firefox users (approx 30k reports per day), morgamic designed a database schema that would use postgres partitions to separate data into manageable and queryable pieces. This allows would allow any date-bound queries to read only the partitions of interest. And hopefully, we’re going to be able to design the reporting system so that all queries are date-bound.

The tables involved look something like this:

reports

id
(primary key)

date

1

2007-05-13 00:00:01

Yahoo games!

2

2007-05-13 00:00:03

Youtube video

3

2007-05-13 00:00:02

I just keep crashing :-(


(Index on date)

frames

report_id

frame_num

(primary key)

1

0

0x0

1

1

nsCOMPtr_base::~nsCOMPtr_base()


If you don’t partition the table, getting the last report is a very fast operation:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN                                                          
--------------
 Result  (cost=1.73..1.74 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..1.73 rows=1 width=8)
           ->  Index Scan Backward using idx_reports_date on reports  (cost=0.00..1728388.93 rows=999873 width=8)
                 Filter: (date IS NOT NULL)

For the uninitiated, this means that we are doing an index scan of the index on date and returning the highest value.

However, when you partition the table, things get ugly very quickly:

breakpad=> EXPLAIN SELECT max(date) FROM reports;
QUERY PLAN
--------------
 Aggregate  (cost=186247.04..186247.05 rows=1 width=8)
   ->  Append  (cost=0.00..175344.43 rows=4361043 width=8)
         ->  Seq Scan on reports  (cost=0.00..10.20 rows=20 width=8)
         ->  Seq Scan on reports_part0 reports  (cost=0.00..40209.73 rows=999873 width=8)
         ->  Seq Scan on reports_part1 reports  (cost=0.00..40205.75 rows=1000175 width=8)
         ->  Seq Scan on reports_part2 reports  (cost=0.00..40200.93 rows=1000093 width=8)
         ->  Seq Scan on reports_part3 reports  (cost=0.00..40197.31 rows=999731 width=8)
         ->  Seq Scan on reports_part4 reports  (cost=0.00..14510.31 rows=361131 width=8)
         ->  Seq Scan on reports_part5 reports  (cost=0.00..10.20 rows=20 width=8)

The query performs a full table scan of all the partitions, which is just about the worst result possible. Even if you don’t have any constraints or knowledge about the data in the date field, the query planner should be able to optimize the query to the following:

SELECT max(maxdate)
FROM
 (SELECT max(date) as maxdate FROM reports_part0 UNION
  SELECT max(date) FROM reports_part1 UNION...
 );

This is at most one index query per partition, which is perfectly reasonable. If you add range constraints to the date field of each partition, this query can be optimized into a loop where you query the “latest” partition first and work backwards until you find a single value that is higher than the range of all the remaining partitions.

But there are even more “gotchas” lurking in table partitioning. The query planner operates on queries before functions are called or bind parameters are substituted. This means that a SQL query which contains a constant can perform very differently than one containing a function:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < '2007-05-12 11:03' AND date > '2007-05-12 10:03' ORDER BY date DESC;
QUERY PLAN
--------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))
               ->  Index Scan using idx_reports_part0_date on reports_part0 reports
                     Index Cond: ((date < '2007-05-12 11:03:00'::timestamp without time zone) AND (date > '2007-05-12 10:03:00'::timestamp without time zone))

Because we have date constraints on the reports partitions, the planner is smart enough to know that only reports_part0 contains the data we’re looking for. But replace the literal dates with the equivalent functions, and the query planner has to search every partition:

breakpad=> EXPLAIN SELECT * FROM reports WHERE date < now() AND date > now() - interval '1 day' ORDER BY date DESC;
QUERY PLAN
---------------
 Sort
   Sort Key: public.reports.date
   ->  Result
         ->  Append
               ->  Seq Scan on reports
                     Filter: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Bitmap Heap Scan on reports_part0 reports
                     Recheck Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
                     ->  Bitmap Index Scan on idx_reports_part0_date
                           Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part1_date on reports_part1 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part2_date on reports_part2 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part3_date on reports_part3 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part4_date on reports_part4 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))
               ->  Index Scan using idx_reports_part5_date on reports_part5 reports
                     Index Cond: ((date < now()) AND (date > (now() - '1 day'::interval)))

Both of these missed optimizations are extremely problematic when dealing with partitioned tables in postgresql. The first, less common issue should be easy to fix, because it doesn’t require any constraint information. The second one is not so easy, because it would require the query planner to divide its work into a “pre-function/bindparam expansion” stage, which is cacheable, and a “post-function/bindparam expansion stage”, which is not very easy to cache.

We are going to try and work around the data-binding issue by issuing the queries from Socorro using literals instead of bound variables. This is not ideal because it requires the database to completely re-plan every query that is issued.

The moral of the story is simple: if you are planning on using database partitions, be sure you EXPLAIN all the queries you’re planning, with the actual literals or bound data statements that will actually be used in production. Be prepared to significantly rework your queries if the queries perform unexpected full table scans.