[Xastir-dev] Re: Breaking up shapefiles Re: [Xastir] ESRI shapefiles and dbfawk

Fri Dec 17 06:01:51 EST 2004

On Fri, Dec 17, 2004 at 12:54:19AM -0700, we recorded a bogon-computron collision of the <russo at bogodyn.org> flavor, containing:
> On Fri, Dec 17, 2004 at 12:42:19AM -0700, we recorded a bogon-computron collision of the <jewen at shaw.ca> flavor, containing:
> > 
> > Excellent, I will chop the file up into smaller chunks. I'm going to make
> > them the same size that the B250K Toporama files are chopped up in.
> 
> Might want to wait just a bit.

or more than a bit.

> I am currently working on a modification to the shapefile rendering that
> uses a fast RTree lookup scheme to speed up large shapefile access.  What
> it does is build a spatial index of the contents of the shapefile the first
> time it's read in.  For subsequent attempts to render the file it uses the
> spatial index to select only those shape numbers that would intersect the
> current viewport and accesses them directly.  This would eliminate the need
> to split the files up into a bazillion small ones for speed --- the slowdown
> now is that if the shapefile contains 10000 shapes and 5 of them are visible,
> the code reads all 10000 and checks to see if they're visible.
> 
> My toy code that built the index and queried it worked out just fine, now 
> I'm trying to incorporate it into Xastir.  Might be done in a few hours.

Well, it works.  Once indexed, viewing even huge shapefiles is really fast
at tight zoom levels.

Unfortunately, the RTree spatial indexing takes up a *vast* amount of memory. 
As in, when I view the state of New Mexico with all 66 of the TIGER/Line 
shapefiles (both linear and area versions) ranging from 300K to 13M each, the 
NOAA county shapefiles ranging from 8M to 16M each, and a handful of 
additional ones, the nodes of the RTree take up some 74MB of ram.  That might 
be considered a fair chunk to some people.  It's not leaked memory, either,
once everything's indexed I can zoom in, out, and around till the cows
come home and memory use never increases.  Shapefiles are only indexed the
first time xastir tries to render them (i.e it's determined that at least
some part of the file overlaps the current viewport).

And to have any value at all, the indices need to be kept around and not
deleted willy-nilly --- to generate them requires that one read the bounding 
rectangles on each shape in a shapefile the first time it's encountered, which
means a linear scan through the file.  The speed-up only happens on the second 
and subsequent renderings of the file, and only when viewing at a zoom level 
smaller than the shapefile's full extent.

[the rest for the developers]

The way I'm doing it is pretty much how I outlined on the dev list yesterday:
I keep a hash table with filename as key.  When a shapefile is encountered
I check to see if I've already done it -- if so, I pull out the RTree root
that I stored when I first found it, do a search of the tree for the
shape ID of any shapes that have bounding rectangles that intersect the current
viewport, and then loop over the list of matching shapes, displaying them
precisely as before.  If I hadn't seen it before, I generate the RTree and add 
to the hash table, then act as if I'd seen it before.

There are some tunable parameters in the RTree library that I was able to
tweak to make the tree fatter and shallower, so each node is a bit more of a
memory hog, but the tree in toto winds up taking less memory --- but the 74MB 
I mentioned above was after tweaking significantly (it had been over 100M).

It'll take some doing to improve that, and I won't commit it until I do.  I
might be able to ifdef out the changes so that it can be an experimental 
feature, but I didn't do it that way to begin with so I'll have to go back
and return code I removed to its original state.  Maybe tomorrow.

If anyone has suggestions, I'm all ears.  

One thing that occurs to me is that perhaps for shapefiles of sufficiently
small size it's pointless to do the indexing --- so it's probably a good idea
to have a choice whether to process a shapefile with a spatial index/RTree
or not.  Index the huge files, do linear scans through the small ones.
Whether that will help is questionable, but there could be some wasted
memory when the tree isn't very well populated and almost everything's a leaf
node.  Hmmmm.

-- 
Tom Russo    KM5VY     SAR502  DM64ux         http://www.swcp.com/~russo/
Tijeras, NM  QRPL#1592 K2#398  SOC#236 AHTB#1 http://www.qsl.net/~km5vy/
    "When life gives you lemons, find someone with a paper cut."