Overview
Rasteret introduces index-first geospatial retrieval, a pattern that separates metadata operations from pixel data operations to eliminate cold-start overhead when working with cloud-native GeoTIFF collections.
The Pattern
Index-first retrieval splits geospatial data access into two independent planes:
Control Plane
Queryable Parquet index containing:
- Scene metadata (datetime, cloud cover, footprint)
- COG header metadata (tile offsets, byte counts, transforms)
- User-defined columns (ML splits, labels, custom attributes)
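To make the control plane concrete, here is a minimal sketch in plain Python of what one index row might hold and how rows can be filtered before any pixel is read. The `SceneRecord` class, its field names, and the sample values are hypothetical illustrations, not Rasteret's actual schema; a real index is a Parquet file queried with Arrow or DuckDB rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SceneRecord:
    """Hypothetical index row; field names are illustrative only."""
    scene_id: str
    acquired: datetime
    cloud_cover: float            # percent
    tile_offsets: list[int]       # byte offset of each tile in the COG
    tile_byte_counts: list[int]   # compressed size of each tile
    split: str                    # user-defined column, e.g. an ML split


# Made-up sample rows standing in for a Parquet index.
index = [
    SceneRecord("scene-a", datetime(2024, 1, 5), 12.0, [1024, 5120], [4096, 4000], "train"),
    SceneRecord("scene-b", datetime(2024, 1, 9), 71.5, [2048, 6144], [4096, 3900], "train"),
    SceneRecord("scene-c", datetime(2024, 2, 2), 3.2, [1024, 4608], [3584, 4100], "val"),
]

# Metadata-only filter: no network access and no pixel reads happen here.
selected = [r for r in index if r.cloud_cover < 20 and r.split == "train"]
print([r.scene_id for r in selected])
```

The point of the sketch is that scene selection touches only metadata; the tile offsets and byte counts travel with each selected row, ready for the data plane.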
Data Plane
On-demand tile reads from original GeoTIFF/COG objects:
- Pixel data stays in source files
- Concurrent HTTP range requests
- No GDAL in the read path
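As a rough illustration of how cached tile offsets and byte counts translate into range requests, the sketch below builds the HTTP `Range` header for each tile. The helper name `tile_range_header` and the sample offsets are assumptions made for this example; it does not show Rasteret's internal code.

```python
def tile_range_header(offset: int, byte_count: int) -> dict[str, str]:
    """Build an HTTP Range header covering one tile.

    HTTP byte ranges are inclusive on both ends, so the last byte
    is offset + byte_count - 1.
    """
    if offset < 0 or byte_count <= 0:
        raise ValueError("offset must be >= 0 and byte_count must be positive")
    return {"Range": f"bytes={offset}-{offset + byte_count - 1}"}


# Made-up values, as they might be read from an index row.
offsets = [1024, 5120]
byte_counts = [4096, 4000]

headers = [tile_range_header(o, n) for o, n in zip(offsets, byte_counts)]
print(headers)
```

Because the offsets are already known, each tile needs one request and no preliminary header fetch.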
Why It’s Fast
The Cold Start Problem
Traditional workflows with GDAL/rasterio re-parse COG headers over HTTP on every cold start:
- Your colleague did it last week
- CI did it overnight
- PyTorch DataLoader workers do it every epoch
- Each process repeats millions of redundant HTTP requests
Rasteret’s Solution
Parse headers once and cache them in Parquet.
All subsequent runs: read local Parquet, zero STAC queries, zero header fetches.
Pixel reads still fetch from remote COGs, but use cached tile offsets for direct range requests.
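The parse-once idea can be sketched with the standard library alone. In this illustration a JSON file stands in for the Parquet index and `fetch_header_remote` is a hypothetical stub that simulates a remote header parse; the function names, URLs, and values are assumptions for the example, not Rasteret's API.

```python
import json
import tempfile
from pathlib import Path

calls = {"remote_fetches": 0}  # counts simulated remote header fetches


def fetch_header_remote(url: str) -> dict:
    """Stand-in for parsing a COG header over HTTP; returns made-up values."""
    calls["remote_fetches"] += 1
    return {"url": url, "tile_offsets": [1024, 5120], "tile_byte_counts": [4096, 4000]}


def load_index(urls: list[str], cache_path: Path) -> list[dict]:
    """Return cached headers if the cache exists; otherwise fetch once and persist."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    records = [fetch_header_remote(u) for u in urls]
    cache_path.write_text(json.dumps(records))
    return records


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index.json"  # JSON used here in place of Parquet
    urls = ["https://example.com/a.tif", "https://example.com/b.tif"]
    first = load_index(urls, cache)   # cold run: fetches both headers
    second = load_index(urls, cache)  # warm run: reads the local cache only

print(calls["remote_fetches"])
```

The second call performs no simulated fetches, which mirrors the claim above: once the index exists, later runs read it locally.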
Performance vs GDAL
Single-Process Benchmarks
Same AOI, same scenes, same DataLoader. TorchGeo path uses recommended GDAL settings for remote COGs:

| Scenario | rasterio/GDAL | Rasteret | Speedup |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08s | 1.14s | 8x |
| Multi-AOI, 30 scenes | 42.05s | 2.25s | 19x |
| Cross-CRS, 12 scenes | 12.47s | 0.59s | 21x |
Measured on AWS EC2 in us-west-2 (same region as data). Speedup comes from eliminating header re-parsing, not pixel I/O.
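The speedup column is the ratio of the two timings, rounded to a whole number. A quick check using only the figures from the table:

```python
# (rasterio/GDAL seconds, Rasteret seconds) taken from the table above.
timings = {
    "Single AOI, 15 scenes": (9.08, 1.14),
    "Multi-AOI, 30 scenes": (42.05, 2.25),
    "Cross-CRS, 12 scenes": (12.47, 0.59),
}

speedups = {name: gdal / rasteret for name, (gdal, rasteret) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x (rounds to {round(ratio)}x)")
```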
Where the Time Goes
GDAL/rasterio path:
Comparison to Other Tools
vs Google Earth Engine
Earth Engine: Server-side processing, pre-computed pyramids, proprietary infrastructure
Rasteret: Client-side processing, works with any COG, open-source, no vendor lock-in
Both cache metadata, but Rasteret gives you control over the index (add custom columns, share as Parquet, version with git).
vs STAC + GDAL
STAC + GDAL: Query STAC once, but GDAL re-parses IFD headers on every open()
Rasteret: Query STAC once, cache IFD headers in Parquet, never re-parse
STAC gives you discovery; Rasteret extends it with a persistent metadata cache.
vs Virtual Raster (VRT)
VRT: XML file referencing source rasters, GDAL still parses headers at read time
Rasteret: Parquet index with pre-parsed headers, no header parsing at read time
VRTs are declarative; Rasteret indexes are executable (queryable with Arrow/DuckDB).
Under the Hood
What Gets Cached
For each scene and band, Rasteret stores the COG header metadata listed under Control Plane (tile offsets, byte counts, transforms).
See /home/daytona/workspace/source/src/rasteret/fetch/cog.py for the full implementation.
Custom I/O Stack
Rasteret replaces GDAL’s I/O with:
- obstore (Rust object_store bindings) for authenticated HTTP/S3/Azure reads
- imagecodecs for JPEG/LZW/Deflate decompression
- asyncio for concurrent tile fetches
Native dtypes preserved: uint16 stays uint16 in tensors (only xarray conversion promotes to float32 for NaN support)
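The concurrent-fetch part of this stack can be sketched with asyncio alone. Here `fetch_tile` is a hypothetical stub that sleeps instead of issuing a ranged read (a real read would go through an object-store client such as obstore), and `max_concurrency` is an illustrative parameter, not a Rasteret setting.

```python
import asyncio


async def fetch_tile(offset: int, byte_count: int) -> bytes:
    """Stand-in for a ranged read; returns zero-filled bytes of the tile's length."""
    await asyncio.sleep(0.01)  # simulated network latency
    return bytes(byte_count)


async def fetch_tiles(ranges: list[tuple[int, int]], max_concurrency: int = 8) -> list[bytes]:
    """Fetch all tiles concurrently, with at most max_concurrency reads in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(offset: int, byte_count: int) -> bytes:
        async with sem:
            return await fetch_tile(offset, byte_count)

    # gather returns results in the same order as the input ranges.
    return await asyncio.gather(*(bounded(o, n) for o, n in ranges))


# Made-up (offset, byte_count) pairs, as they might come from the index.
tiles = asyncio.run(fetch_tiles([(1024, 4096), (5120, 4000), (9120, 3800)]))
print([len(t) for t in tiles])
```

Bounding concurrency with a semaphore is one common design choice for keeping many small range requests in flight without overwhelming the client or the object store.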
Key Takeaways
Metadata is Queryable
The Parquet index is an Arrow dataset: filter with SQL, DuckDB, or Arrow expressions before reading pixels
Pixels Stay Remote
No downloads required. Work with terabytes of imagery while storing only megabytes of metadata
Reproducible
Same Parquet index = same records = same results. Version control your index with git
Shareable
Export a collection as Parquet and share it with teammates; they get instant access without rebuilding
Next Steps
Collections
Learn how Collections wrap the index and provide filtering/export APIs
Dataset Catalog
Explore built-in datasets ready to use with rasteret.build()