October 8, 2008

Space-time: cube and btree_gist use

Hello

I want to share my recent experience with geometric data processing and optimization in postgresql. The solution I'm going to describe has produced a good practical effect.

Let's assume we have a table for geo data logging geodata_log containing objects location column obj_point (point) and capture time obj_created (timestamp). Also there's a query that is used to obtain the latest object from specified square very often:
EXPLAIN ANALYZE
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
ORDER BY
obj_created
LIMIT 1;
Adding data into the table is rather intencive so we can ensure that at least one record has been added for the last 3 months. Accordingly we can supply the query with additional condition on capture time:
EXPLAIN ANALYZE
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
AND obj_created > '2008-07-04'
ORDER BY
obj_created
LIMIT 1;
It's clear that case without indexes doesn't worth our attention. So let's create an index on location and capture time:
CREATE INDEX i_geodata_log__created_point
ON geodata_log
USING gist
(obj_created, box(obj_point, obj_point));
And we've got an error:
ERROR: data type timestamp with time zone has no default operator class for access method "gist"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
Similar message we will see if we try to create btree index. The thing is that int4, timestamp and other common types are supported by btree oposite to geometric types that are supported by gist.

Let's create two indexes, first on location:
CREATE INDEX i_geodata_log__point
ON geodata_log
USING gist
(box(obj_point, obj_point));
Query plan is:
Limit  (cost=256.03..256.04 rows=1 width=570) (actual time=17.081..17.082 rows=1 loops=1)
-> Sort (cost=256.03..256.04 rows=1 width=570) (actual time=17.079..17.079 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=4.79..256.02 rows=1 width=570) (actual time=9.903..15.112 rows=829 loops=1)
Recheck Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box)
Filter: (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone)
-> Bitmap Index Scan on i_geodata_log__point (cost=0.00..4.79 rows=67 width=0) (actual time=9.231..9.231 rows=2362 loops=1)
Index Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box
Total runtime: 17.184 ms
Second one on capture time:
CREATE INDEX i_geodata_log__obj_created
ON geodata_log
USING btree
(obj_created);
Plan is:
Limit  (cost=40.44..40.45 rows=1 width=570) (actual time=15.288..15.288 rows=1 loops=1)
-> Sort (cost=40.44..40.45 rows=1 width=570) (actual time=15.285..15.285 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=36.42..40.43 rows=1 width=570) (actual time=10.909..13.297 rows=829 loops=1)
Recheck Cond: ((box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box) AND (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone))
-> BitmapAnd (cost=36.42..36.42 rows=1 width=0) (actual time=10.751..10.751 rows=0 loops=1)
-> Bitmap Index Scan on i_geodata_log__point (cost=0.00..4.79 rows=67 width=0) (actual time=9.235..9.235 rows=2362 loops=1)
Index Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box)
-> Bitmap Index Scan on i_geodata_log__obj_created (cost=0.00..31.38 rows=1483 width=0) (actual time=0.996..0.996 rows=3857 loops=1)
Index Cond: (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone)
Total runtime: 15.392 ms
Despite the index scans the situation wishes to be desired. Is there another way? It is.

Take a look at postgres contribs and pay attention to cube and btree_gist. First provides a cube type - multidimentional cube of float point numbers. Second implements gist for int2, int4, timestamp and other common types.

cube will help us if we represent our space like float triplets (latitude, longitude, seconds_since_epoch). Btree allows to create mixed indexes. Let's install both contribs and test it.

cube requires additional column, let's name it area_time:
ALTER TABLE geodata_log
ADD COLUMN area_time cube;
It has to be filled with 3D points (obj_point[0], obj_point[1], extract(epoch from obj_created)):
UPDATE geodata_log SET
area_time = cube(ARRAY[obj_point[0], obj_point[1], extract(epoch from obj_created)],
ARRAY[obj_point[0], obj_point[1], extract(epoch from obj_created)]);
Create an index on it:
CREATE INDEX i_geodata_log__area_time
ON geodata_log
USING gist
(area_time);
Change our query and look at the plan:
EXPLAIN ANALYZE
SELECT * FROM
geodata_log
WHERE
area_time <@ cube(ARRAY[50.857639595549, 30.337280273438, extract(epoch from '2008-07-04'::timestamp)],
ARRAY[55.450349029297, 38.715576171875, extract(epoch from now()::timestamp)])
ORDER BY
obj_created
LIMIT 1;

Limit (cost=265.36..265.36 rows=1 width=614) (actual time=5.312..5.313 rows=1 loops=1)
-> Sort (cost=265.36..265.52 rows=67 width=614) (actual time=5.310..5.310 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=8.83..265.02 rows=67 width=614) (actual time=0.794..3.190 rows=829 loops=1)
Recheck Cond: (area_time <@ cube('{50.857639595549,30.337280273438,1215115200}'::double precision[], ARRAY[55.450349029297::double precision, 38.715576171875::double precision, date_part('epoch'::text, (now())::timestamp without time zone)]))
-> Bitmap Index Scan on i_geodata_log__area_time (cost=0.00..8.81 rows=67 width=0) (actual time=0.682..0.682 rows=829 loops=1)
Index Cond: (area_time <@ cube('{50.857639595549,30.337280273438,1215115200}'::double precision[], ARRAY[55.450349029297::double precision, 38.715576171875::double precision, date_part('epoch'::text, (now())::timestamp without time zone)]))
Total runtime: 5.420 ms
It's better, indeed. Now try btree_gist:
CREATE INDEX i_geodata_log__created_point
ON geodata_log
USING gist
(obj_created, box(obj_point, obj_point));

EXPLAIN ANALYZE
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
AND obj_created > '2008-07-04'
ORDER BY
obj_created
LIMIT 1;

Limit (cost=8.31..8.32 rows=1 width=570) (actual time=5.092..5.093 rows=1 loops=1)
-> Sort (cost=8.31..8.32 rows=1 width=570) (actual time=5.092..5.092 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Index Scan using i_geodata_log__created_point on geodata_log (cost=0.00..8.30 rows=1 width=570) (actual time=0.076..3.408 rows=829 loops=1)
Index Cond: ((obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone) AND (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box))
Total runtime: 5.170 ms
Quite good.

Making few tests in different situations I noticed that performance is almost the same in both cases but as cube requires additional column and trigger to fill it up I choose btree_gist.

What would you choose depends on your situation.

p.s. In the nearest future I'm going to publish a post named "Divide and conquer: active and arcive data problem" It'll be interesting I promise :)

Regards,
Sergey Konoplev

No comments:

Post a Comment