I want to share my recent experience with geometric data processing and optimization in postgresql. The solution I'm going to describe has produced a good practical effect.
Let's assume we have a table for geo data logging geodata_log containing objects location column obj_point (point) and capture time obj_created (timestamp). Also there's a query that is used to obtain the latest object from specified square very often:
EXPLAIN ANALYZEAdding data into the table is rather intencive so we can ensure that at least one record has been added for the last 3 months. Accordingly we can supply the query with additional condition on capture time:
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
ORDER BY
obj_created
LIMIT 1;
EXPLAIN ANALYZEIt's clear that case without indexes doesn't worth our attention. So let's create an index on location and capture time:
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
AND obj_created > '2008-07-04'
ORDER BY
obj_created
LIMIT 1;
CREATE INDEX i_geodata_log__created_pointAnd we've got an error:
ON geodata_log
USING gist
(obj_created, box(obj_point, obj_point));
ERROR: data type timestamp with time zone has no default operator class for access method "gist"Similar message we will see if we try to create btree index. The thing is that int4, timestamp and other common types are supported by btree oposite to geometric types that are supported by gist.
HINT: You must specify an operator class for the index or define a default operator class for the data type.
Let's create two indexes, first on location:
CREATE INDEX i_geodata_log__pointQuery plan is:
ON geodata_log
USING gist
(box(obj_point, obj_point));
Limit (cost=256.03..256.04 rows=1 width=570) (actual time=17.081..17.082 rows=1 loops=1)Second one on capture time:
-> Sort (cost=256.03..256.04 rows=1 width=570) (actual time=17.079..17.079 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=4.79..256.02 rows=1 width=570) (actual time=9.903..15.112 rows=829 loops=1)
Recheck Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box)
Filter: (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone)
-> Bitmap Index Scan on i_geodata_log__point (cost=0.00..4.79 rows=67 width=0) (actual time=9.231..9.231 rows=2362 loops=1)
Index Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box
Total runtime: 17.184 ms
CREATE INDEX i_geodata_log__obj_createdPlan is:
ON geodata_log
USING btree
(obj_created);
Limit (cost=40.44..40.45 rows=1 width=570) (actual time=15.288..15.288 rows=1 loops=1)Despite the index scans the situation wishes to be desired. Is there another way? It is.
-> Sort (cost=40.44..40.45 rows=1 width=570) (actual time=15.285..15.285 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=36.42..40.43 rows=1 width=570) (actual time=10.909..13.297 rows=829 loops=1)
Recheck Cond: ((box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box) AND (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone))
-> BitmapAnd (cost=36.42..36.42 rows=1 width=0) (actual time=10.751..10.751 rows=0 loops=1)
-> Bitmap Index Scan on i_geodata_log__point (cost=0.00..4.79 rows=67 width=0) (actual time=9.235..9.235 rows=2362 loops=1)
Index Cond: (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box)
-> Bitmap Index Scan on i_geodata_log__obj_created (cost=0.00..31.38 rows=1483 width=0) (actual time=0.996..0.996 rows=3857 loops=1)
Index Cond: (obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone)
Total runtime: 15.392 ms
Take a look at postgres contribs and pay attention to cube and btree_gist. First provides a cube type - multidimentional cube of float point numbers. Second implements gist for int2, int4, timestamp and other common types.
cube will help us if we represent our space like float triplets (latitude, longitude, seconds_since_epoch). Btree allows to create mixed indexes. Let's install both contribs and test it.
cube requires additional column, let's name it area_time:
ALTER TABLE geodata_logIt has to be filled with 3D points (obj_point[0], obj_point[1], extract(epoch from obj_created)):
ADD COLUMN area_time cube;
UPDATE geodata_log SETCreate an index on it:
area_time = cube(ARRAY[obj_point[0], obj_point[1], extract(epoch from obj_created)],
ARRAY[obj_point[0], obj_point[1], extract(epoch from obj_created)]);
CREATE INDEX i_geodata_log__area_timeChange our query and look at the plan:
ON geodata_log
USING gist
(area_time);
EXPLAIN ANALYZEIt's better, indeed. Now try btree_gist:
SELECT * FROM
geodata_log
WHERE
area_time <@ cube(ARRAY[50.857639595549, 30.337280273438, extract(epoch from '2008-07-04'::timestamp)],
ARRAY[55.450349029297, 38.715576171875, extract(epoch from now()::timestamp)])
ORDER BY
obj_created
LIMIT 1;
Limit (cost=265.36..265.36 rows=1 width=614) (actual time=5.312..5.313 rows=1 loops=1)
-> Sort (cost=265.36..265.52 rows=67 width=614) (actual time=5.310..5.310 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on geodata_log (cost=8.83..265.02 rows=67 width=614) (actual time=0.794..3.190 rows=829 loops=1)
Recheck Cond: (area_time <@ cube('{50.857639595549,30.337280273438,1215115200}'::double precision[], ARRAY[55.450349029297::double precision, 38.715576171875::double precision, date_part('epoch'::text, (now())::timestamp without time zone)]))
-> Bitmap Index Scan on i_geodata_log__area_time (cost=0.00..8.81 rows=67 width=0) (actual time=0.682..0.682 rows=829 loops=1)
Index Cond: (area_time <@ cube('{50.857639595549,30.337280273438,1215115200}'::double precision[], ARRAY[55.450349029297::double precision, 38.715576171875::double precision, date_part('epoch'::text, (now())::timestamp without time zone)]))
Total runtime: 5.420 ms
CREATE INDEX i_geodata_log__created_pointQuite good.
ON geodata_log
USING gist
(obj_created, box(obj_point, obj_point));
EXPLAIN ANALYZE
SELECT * FROM
geodata_log
WHERE
box(obj_point, obj_point) <@ box(point(50.857639595549, 30.337280273438), point(55.450349029297, 38.715576171875))
AND obj_created > '2008-07-04'
ORDER BY
obj_created
LIMIT 1;
Limit (cost=8.31..8.32 rows=1 width=570) (actual time=5.092..5.093 rows=1 loops=1)
-> Sort (cost=8.31..8.32 rows=1 width=570) (actual time=5.092..5.092 rows=1 loops=1)
Sort Key: obj_created
Sort Method: top-N heapsort Memory: 17kB
-> Index Scan using i_geodata_log__created_point on geodata_log (cost=0.00..8.30 rows=1 width=570) (actual time=0.076..3.408 rows=829 loops=1)
Index Cond: ((obj_created > '2008-07-04 00:00:00+04'::timestamp with time zone) AND (box(obj_point, obj_point) <@ '(55.450349029297,38.715576171875),(50.857639595549,30.337280273438)'::box))
Total runtime: 5.170 ms
Making few tests in different situations I noticed that performance is almost the same in both cases but as cube requires additional column and trigger to fill it up I choose btree_gist.
What would you choose depends on your situation.
p.s. In the nearest future I'm going to publish a post named "Divide and conquer: active and arcive data problem" It'll be interesting I promise :)
Regards,
Sergey Konoplev
No comments:
Post a Comment