Object-storage-first: why our Query Data Plane has no authoritative state

Most vector databases assume the index lives in RAM in a long-running cluster. That's a great fit for some workloads. It's a terrible fit for others, and the others are where RosalindDB chose to live.

The shape problem

If you're running interactive semantic search behind a consumer product at thousands of QPS, you want the index pinned in memory on warm replicas and you want a routing layer in front of them. That architecture exists, it works, and the pricing reflects it: you're paying for the cluster every minute of every day whether anyone queried it or not.

The workloads we care about don't look like that. Agent long-term memory is bursty — a session lights up, asks the database forty things in ten seconds, and goes quiet for an hour. Indie RAG over a 200k-chunk internal corpus changes weekly, not per second. Batch retrieval jobs run on a schedule, hammer the index for fifteen minutes, and then nothing. Internal-tool search has twelve users.

For that class of workload, an always-on memory-resident cluster is the wrong unit of consumption. You're renting RAM you aren't using so the rare query is fast. Scale to zero isn't a nice-to-have — it's the entire economic argument. And you can't scale to zero if your replicas hold the authoritative copy of the data, because killing them means losing it.

So we inverted the assumption.

What "object storage is the truth" means

The authoritative copy of every FAISS shard RosalindDB has ever built lives in S3-compatible object storage. Cloudflare R2, AWS S3, MinIO, or any other S3-compatible object store — RosalindDB has no preferred provider. There is also an in-memory adapter for unit tests. There is deliberately no file:// adapter — local disk as authority is a foot-gun that tempts people to run production on a single node, and that's exactly the shape we're trying to get away from.

The layout is boring on purpose:

indexes/<tenant>/<dataset>/v-<version>/shard-<id>.bin
indexes/<tenant>/<dataset>/v-<version>/shard-<id>.bin.meta.json
landing/<tenant>/<dataset>/upload-<id>/part-NNNN.parquet

Each shard is a serialised FAISS IVFFlat index next to a JSON sidecar. FAISS itself only knows about int64 ids — the sidecar is what maps those int64s back to the caller-supplied string ids and any metadata you uploaded with the vector. They travel together. A shard without its sidecar is unreadable; a sidecar without its shard is trivia.
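A minimal sketch of how a shard's two object keys and the sidecar lookup could fit together. The `ShardRef` class, the `resolve_hits` helper, and the sidecar schema (`{"ids": {"<int64>": {"id": ..., "metadata": ...}}}`) are assumptions made for illustration; only the key layout comes from the listing above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardRef:
    """Hypothetical handle for one shard; not RosalindDB's actual type."""
    tenant: str
    dataset: str
    version: str
    shard_id: str

    @property
    def index_key(self) -> str:
        # Follows the indexes/<tenant>/<dataset>/v-<version>/ layout.
        return (f"indexes/{self.tenant}/{self.dataset}/"
                f"v-{self.version}/shard-{self.shard_id}.bin")

    @property
    def sidecar_key(self) -> str:
        # The sidecar always sits next to its shard.
        return self.index_key + ".meta.json"


def resolve_hits(sidecar_json: str, faiss_ids: list[int]) -> list[dict]:
    """Map FAISS int64 ids back to caller-supplied ids and metadata.

    Assumed (hypothetical) sidecar schema:
    {"ids": {"<int64>": {"id": "<string id>", "metadata": {...}}}}
    """
    entries = json.loads(sidecar_json)["ids"]
    # FAISS pads missing results with -1, so those are skipped.
    # JSON object keys are strings, hence the str() conversion.
    return [entries[str(i)] for i in faiss_ids if i != -1]
```

Deriving the sidecar key from the index key, rather than storing it separately, is one way to make "they travel together" hard to violate by accident.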

The Query Data Plane — the process that actually answers POST /v1/query — holds none of this authoritatively. It's a Python service with a shard_catalog lookup in Postgres, an HTTP client to the object store, and an in-process cache. That's it. Kill a Query-DP replica mid-flight and you lose nothing except the warmth of its cache. Start a new one and within a few hundred milliseconds of cold reads it's serving traffic again.

This is the property we wanted, and it's the property the rest of the design is in service of: the data plane is recreatable. It is not a stateful database node. It is a cache with a query planner stapled to the front.

The cache: bounded by bytes, not entries

The cache is the part of this design that took the longest to get right, and it's the part that most directly pays for the thesis. Get it wrong and "object storage as the truth" becomes "object storage as a slow disk we hit every query."

The cache is a per-process LRU keyed by shard id. The value is the deserialised FAISS index plus the parsed sidecar — already in memory, ready to search. On a hit, query latency is whatever FAISS takes to probe nprobe cells (default 64) and rank by exact L2. On a miss, we fetch shard-<id>.bin and shard-<id>.bin.meta.json from object storage, call faiss.read_index, parse the sidecar, insert into the cache, and then run the search.

Here is the part that matters: the cache is bounded by bytes, not by entry count. The knob is RB_SHARD_CACHE_BYTES, default 512 MB.

The obvious implementation, "keep the last N shards", is wrong for this workload, and it's wrong in a way that doesn't show up until you have real datasets. Shard footprints span roughly three orders of magnitude across the datasets we see. A small one might be a thousand vectors in a few hundred kilobytes. A large one is closer to a million vectors and ~430 MB of float32 once FAISS has it deserialised. A count cap that's safe for the small case (say, 64 shards) lets the large case blow past your container's memory limit and get OOM-killed. A count cap that's safe for the large case (say, 1 shard) is uselessly small for the common case.

Bytes are the only honest unit. When a shard is loaded we measure what it cost — the serialised index size is a reasonable proxy, with a small fixed overhead for the parsed sidecar — and we account against the budget. If admitting the new shard would breach RB_SHARD_CACHE_BYTES, we evict LRU entries until it fits. The operator picks a budget that matches the container's memory limit and stops worrying about it. The cache will do whatever it has to inside that envelope.
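The admit-and-evict rule can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the production cache: the `ByteBoundedLRU` class and the `loader` callback (which stands in for the object-storage fetch plus `faiss.read_index` and returns the value with its measured byte cost) are hypothetical names, and the handling of a shard larger than the entire budget is a guess at a sensible policy.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class ByteBoundedLRU:
    """LRU cache whose capacity is a byte budget, not an entry count."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        # shard_id -> (value, cost_bytes); least recently used entry first.
        self._entries: "OrderedDict[str, Tuple[Any, int]]" = OrderedDict()

    def get_or_load(self, shard_id: str,
                    loader: Callable[[str], Tuple[Any, int]]) -> Any:
        if shard_id in self._entries:
            self._entries.move_to_end(shard_id)  # hit: mark most recent
            return self._entries[shard_id][0]
        value, cost = loader(shard_id)  # miss: fetch and deserialise
        self._admit(shard_id, value, cost)
        return value

    def _admit(self, shard_id: str, value: Any, cost: int) -> None:
        if cost > self.budget_bytes:
            # Assumption: a shard larger than the whole budget is served
            # once but never cached, so it cannot flush everything else.
            return
        # Evict least recently used entries until the new shard fits.
        while self.used_bytes + cost > self.budget_bytes:
            _, (_, freed) = self._entries.popitem(last=False)
            self.used_bytes -= freed
        self._entries[shard_id] = (value, cost)
        self.used_bytes += cost

    def __contains__(self, shard_id: str) -> bool:
        return shard_id in self._entries
```

With a 100-byte budget and three 40-byte shards, loading the third evicts whichever of the first two was touched least recently, and `used_bytes` never exceeds the budget.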

A nice side effect: this makes vertical and horizontal scaling decisions legible. If your working set won't fit in a 512 MB budget on one replica and the eviction rate is hurting your tail latency, you have exactly two levers — give each replica more memory, or add replicas and let the router spread shards. Both are configuration changes, not migrations.

The cold-cache cost, honestly

The price of having no authoritative state on the data plane is that the first query against a freshly-started replica is slower than the second. We don't try to hide it.

For a typical shard — a few tens of megabytes — the cold path is dominated by the object-storage GET. R2 and S3 both sit in the 50–200 ms range for the body of a small object from a region with reasonable network proximity, with a long tail past that during incidents. Add faiss.read_index and sidecar parsing, and a first-touch query on a cold replica costs you in the neighbourhood of 100–300 ms before FAISS does any actual searching. For a very large shard the read alone is longer.

The safety valve for the ugly cases is the ephemeral path. If the planner can see that the synchronous path is going to breach the request budget — too many shards to load, too large a working set — Query-DP enqueues a RUN_EPHEMERAL_QUERY job, returns 202 with a job_id, and the client polls GET /v1/query/status/{job_id}. A separate ephemeral_runner process churns through it. That's the escape hatch; the common case is the cache.
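The planner's choice between the two paths might look roughly like the sketch below. The `plan_query` function, the `Plan` type, and both thresholds (`max_cold_shards`, `max_cold_bytes`) are hypothetical; the text above only says the decision depends on how many shards must be loaded and how large the working set is.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Plan:
    status_code: int   # 200 = answer synchronously, 202 = ephemeral job
    cold_shards: int   # shards that would need a cold load
    cold_bytes: int    # total bytes those cold loads would fetch


def plan_query(needed: Iterable[str], cached: Set[str],
               shard_bytes: Dict[str, int],
               max_cold_shards: int, max_cold_bytes: int) -> Plan:
    """Decide between the synchronous path and the ephemeral path.

    Both limits are illustrative stand-ins for "would breach the
    request budget"; real thresholds would be tuned per deployment.
    """
    cold = [s for s in needed if s not in cached]
    cold_bytes = sum(shard_bytes[s] for s in cold)
    over_budget = (len(cold) > max_cold_shards
                   or cold_bytes > max_cold_bytes)
    return Plan(202 if over_budget else 200, len(cold), cold_bytes)
```

A query that needs one uncached 30 MB shard stays on the synchronous path under a 256 MB limit, while one that needs an uncached 430 MB shard would be handed off as a job and answered with 202.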

We think this is a fair trade for the workload class. If you need an 8 ms p50 for hot interactive search at scale, it isn't.

Prior art: this is a recognised shape

We didn't invent "object storage as the database." The pattern has been quietly winning in adjacent infrastructure for a couple of years now, which is part of why we were comfortable betting on it for a vector store.

SlateDB is an embedded LSM key-value store built directly on object storage, with writers, readers, and compaction running as independent processes: the writer never competes with reads, and there's no local disk in the authority path. WarpStream makes the same bet for Kafka: stateless agents, S3 as the log, and the argument that cross-AZ replication costs are the real economic enemy. Turbopuffer is the closest analog in our space, with a documented cold/warm split (their cold p50 on a 1M-document namespace is just under a second; warm is in the low double digits of milliseconds) that mirrors the shape we're managing.

The argument these systems share is the one we believe: cloud providers run object storage at a scale and durability that nobody else can match, and the right design treats it as the foundation rather than the backup target.

Where this is the wrong shape

Be honest about it. RosalindDB is not the right database for:

p50 in single-digit milliseconds for hot interactive search at scale — you want a memory-resident cluster with warm replicas
A billion-vector single-tenant corpus that must always be queryable in under 50 ms — the working set won't fit in any reasonable per-replica byte budget and you'll spend all day on the cold path
Workloads that never go cold — if your QPS floor is high enough that scale-to-zero is never triggered, you're paying object-storage round-trip costs for no economic gain

It is the right shape for cold and bursty workloads where always-on cluster pricing is the dominant line item, where the corpus changes on a human timescale, and where "kill the data plane and start a new one" should be a routine operation rather than an incident.

If you want to see how the rest of the system hangs off this decision — the five process roles, the ingest path, the validator — the architecture chapter walks through it end to end.
