Datasets
A dataset is a named, tenant-scoped collection of vectors with a fixed embedding dimension. Create one, stream NDJSON records into it, poll until status === "indexed", then query.
POST /v1/datasets
Create a dataset
Create an empty dataset bound to your tenant. name is 1-64 chars matching [a-z0-9_-]+ and must be unique per tenant. dimension is fixed at create time — every vector you later upload must match it exactly.
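A client can check the naming rule locally before sending the request. The following is a minimal sketch, not the server's validator: the helper names are hypothetical, and the requirement that dimension be a positive integer is an assumption rather than something stated above.

```python
import re

# Pattern built from the documented rule: 1-64 chars drawn from [a-z0-9_-].
NAME_RE = re.compile(r"[a-z0-9_-]{1,64}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name satisfies the documented length and character rules."""
    return NAME_RE.fullmatch(name) is not None


def build_create_body(name: str, dimension: int) -> dict:
    """Build the JSON body for POST /v1/datasets, rejecting obviously bad input."""
    if not is_valid_dataset_name(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    # Assumption: the server expects a positive integer dimension.
    if not isinstance(dimension, int) or dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {"name": name, "dimension": dimension}
```

Uniqueness per tenant cannot be checked client-side, so the server may still reject a name that passes this check.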
curl -s -X POST http://localhost:8080/v1/datasets \
-H "Authorization: Bearer $RB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"name":"products","dimension":768}'
Response (HTTP 201):
{
"name": "products",
"dimension": 768,
"status": "empty",
"row_count": 0,
"created_at": "2026-05-14T12:34:56Z"
}
POST /v1/datasets/{name}/vectors
Upload NDJSON vectors
Stream NDJSON records into a dataset. One JSON object per line, Content-Type: application/x-ndjson. Each record has an id (1-256 chars), a values array whose length equals the dataset's dimension, and an optional metadata object.
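The record shape described above can be pre-checked on the client before upload. This is a rough sketch under stated assumptions: the function names are hypothetical, and only the dimension-mismatch reason string mirrors the wording used in this document; the other reason strings are invented for illustration.

```python
import json
from typing import Optional


def validate_record(record: dict, dimension: int) -> Optional[str]:
    """Return a rejection reason, or None if the record looks acceptable."""
    rec_id = record.get("id")
    if not isinstance(rec_id, str) or not 1 <= len(rec_id) <= 256:
        return "id must be a string of 1-256 chars"  # illustrative wording
    values = record.get("values")
    if not isinstance(values, list):
        return "values must be an array"  # illustrative wording
    if len(values) != dimension:
        return f"dimension mismatch: got {len(values)} expected {dimension}"
    if "metadata" in record and not isinstance(record["metadata"], dict):
        return "metadata must be an object"  # illustrative wording
    return None


def to_ndjson(records: list) -> str:
    """Serialize records as NDJSON: one compact JSON object per line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
```

The output of to_ndjson can be written to a file and sent with --data-binary, which preserves the newlines that separate records.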
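Sample vectors.jsonl (values are shortened to four elements for readability; a real upload to the 768-dimension products dataset needs 768 values per record):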
{"id":"doc-1","values":[0.1,0.2,0.3,0.4],"metadata":{"title":"Atlas of birds"}}
{"id":"doc-2","values":[0.5,0.6,0.7,0.8],"metadata":{"title":"Field guide"}}
{"id":"doc-3","values":[0.9,1.0,1.1,1.2],"metadata":{"title":"Migration patterns"}}
Upload:
curl -s -X POST http://localhost:8080/v1/datasets/products/vectors \
-H "Authorization: Bearer $RB_API_KEY" \
-H "Content-Type: application/x-ndjson" \
--data-binary @vectors.jsonl
Response (HTTP 202):
{
"accepted": 3,
"rejected": 0,
"errors": [],
"job_id": "job_01H..."
}
Rejected lines are reported individually with a line number and a reason (e.g. dimension mismatch: got 767 expected 768). Accepted records are queued for validation and indexing.
This endpoint is an upsert: re-sending a record whose id already exists overwrites it — last write wins — rather than creating a duplicate. accepted counts validated NDJSON lines before dedup, so sending the same id twice counts both lines even though only one row is stored.
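The difference between the accepted count and the number of stored rows can be illustrated with a small in-memory model. This is only a sketch of the documented semantics, not the server's storage code, and simulate_upsert is a hypothetical name.

```python
def simulate_upsert(existing: dict, lines: list) -> tuple:
    """Apply already-validated records in order.

    Returns (accepted, stored), where stored maps id -> record.
    """
    stored = dict(existing)  # copy so the caller's dict is not mutated
    accepted = 0
    for record in lines:
        accepted += 1  # every validated line counts, even a repeated id
        stored[record["id"]] = record  # last write wins for a repeated id
    return accepted, stored
```

Sending doc-1 twice in one request therefore yields accepted == 2 while only one doc-1 row exists afterwards, holding the values from the later line.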
POST /v1/datasets/{name}/imports
Bulk import (large files)
The POST /v1/datasets/{name}/vectors endpoint buffers the whole body and caps it at 10 MiB — fine for small interactive upserts, too small for a real embedding dump. For large files, use the async import-job flow: the client stages the file directly into object storage via a presigned upload, then a job validates and indexes it asynchronously. The bytes never flow through the API. Both ndjson and parquet uploads are supported.
Step 1 — create the job. It returns an import_id and a presigned upload target:
curl -s -X POST http://localhost:8080/v1/datasets/products/imports \
-H "Authorization: Bearer $RB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"format":"ndjson","error_mode":"continue","max_bad_records":100}'
Response (HTTP 201):
{
"import_id": "imp_a1b2c3...",
"dataset": "products",
"status": "awaiting_upload",
"format": "ndjson",
"error_mode": "continue",
"max_bad_records": 100,
"upload": {
"method": "PUT",
"url": "https://minio.example/rosalinddb/staging/...?X-Amz-Signature=...",
"content_type": "application/octet-stream",
"max_bytes": 5368709120,
"expires_at": "2026-05-15T13:34:56Z"
},
"created_at": "2026-05-15T12:34:56Z"
}
- format: required. ndjson or parquet. A Parquet file must conform to the internal landing schema (id string, values list of float of length dimension, optional metadata struct).
- error_mode: optional, default continue. continue drops bad records and writes them to a rejected-records file; abort fails the whole job on the first bad record.
- max_bad_records: optional, default null (unlimited). With error_mode: continue, if the rejected count exceeds this value the job is marked failed.
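How error_mode and max_bad_records interact can be sketched as a small decision function. This models only the rules listed above; import_outcome is a hypothetical name, and the choice to stop counting as soon as the job fails is an assumption, since the worker's actual behavior after a failure is not documented here.

```python
from typing import Iterable, Optional, Tuple


def import_outcome(
    bad_flags: Iterable[bool],
    error_mode: str = "continue",
    max_bad_records: Optional[int] = None,
) -> Tuple[str, int, int]:
    """Model a job's final status.

    bad_flags yields True for each bad record, False for each good one.
    Returns (status, accepted, rejected).
    """
    accepted = rejected = 0
    for is_bad in bad_flags:
        if not is_bad:
            accepted += 1
            continue
        rejected += 1
        if error_mode == "abort":
            return ("failed", accepted, rejected)  # first bad record fails the job
        # continue mode: fail only once the rejected count exceeds the cap
        if max_bad_records is not None and rejected > max_bad_records:
            return ("failed", accepted, rejected)
    return ("completed", accepted, rejected)
```

With max_bad_records set to 100, a file with exactly 100 bad records still completes; the 101st bad record is the one that fails the job.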
Step 2 — upload the file directly to storage. Do a single PUT of the raw file body to upload.url, with Content-Type set to exactly upload.content_type:
curl -s -X PUT "$UPLOAD_URL" \
-H "Content-Type: application/octet-stream" \
--data-binary @vectors.ndjson
It is a presigned PUT — no multipart form, no fields. Sending any other Content-Type is rejected with 403 SignatureDoesNotMatch. The size cap upload.max_bytes is enforced after the fact (the import worker checks the staged object before validation), so check your file size against it to fail fast.
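Because the cap is only enforced after staging, a local size check before the PUT saves a wasted upload. A minimal sketch, assuming the cap is inclusive (a file of exactly max_bytes is allowed); check_upload_size is a hypothetical helper, and max_bytes would come from the upload.max_bytes field of the create-job response.

```python
import os


def check_upload_size(path: str, max_bytes: int) -> int:
    """Return the file size in bytes, or raise if it exceeds max_bytes."""
    size = os.path.getsize(path)
    # Assumption: a file of exactly max_bytes is still accepted.
    if size > max_bytes:
        raise ValueError(
            f"{path} is {size} bytes, over the upload cap of {max_bytes} bytes"
        )
    return size
```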
Step 3 — signal the upload is done:
curl -s -X POST \
http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3.../complete \
-H "Authorization: Bearer $RB_API_KEY"
This verifies the staged object is present, moves the job to validating, and returns HTTP 202 with the job object. Calling complete on a job that is not awaiting_upload returns 409 import_not_pending; calling it before the file was staged returns 400 upload_missing.
Step 4 — poll the job until it reaches a terminal state:
curl -s http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3... \
-H "Authorization: Bearer $RB_API_KEY"
Response (HTTP 200):
{
"import_id": "imp_a1b2c3...",
"dataset": "products",
"format": "ndjson",
"status": "completed",
"error_mode": "continue",
"max_bad_records": 100,
"records_processed": 10000,
"records_accepted": 9998,
"records_rejected": 2,
"percent_complete": 100,
"rejected_records_url": "https://minio.example/...&X-Amz-Signature=...",
"error_message": null,
"created_at": "2026-05-15T12:34:56Z",
"completed_at": "2026-05-15T12:36:10Z"
}
The job walks awaiting_upload → validating → indexing → completed, or lands on failed from any stage with error_message populated. percent_complete is 0 / 25 / 90 / 100 across those states. When records_rejected > 0, rejected_records_url is a presigned link to a rejected.jsonl file: one JSON object per rejected record with a line, reason, and the offending record. It stays available for at least 30 days. GET /v1/datasets/{name}/imports lists a dataset's jobs, newest first.
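The state-to-progress mapping and the rejected-records file can both be handled with a few lines of client code. This is an illustrative sketch: the helper names are hypothetical, and it assumes each object in rejected.jsonl carries its reason under a key literally named reason.

```python
import json
from collections import Counter

# Mapping taken from the documented 0 / 25 / 90 / 100 progression.
PERCENT_BY_STATUS = {
    "awaiting_upload": 0,
    "validating": 25,
    "indexing": 90,
    "completed": 100,
}
TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status: str) -> bool:
    """Return True once a job can no longer change state."""
    return status in TERMINAL_STATUSES


def count_rejection_reasons(rejected_jsonl: str) -> Counter:
    """Tally rejected records by reason from the text of rejected.jsonl."""
    reasons = Counter()
    for line in rejected_jsonl.splitlines():
        if line.strip():  # skip blank lines
            reasons[json.loads(line)["reason"]] += 1
    return reasons
```

Grouping by reason makes it quick to see whether rejections share one cause, such as an embedding model that emits the wrong dimension.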
Two-stage quota. At create time an admission check rejects the request with 429 vector_quota_exceeded if the tenant is already at or over its vector quota — before anything is staged. After validation, the settlement stage charges records_accepted against the quota; if that would cross the remaining quota the job is marked failed and nothing is indexed.
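The two checks can be modeled as a pair of functions. This sketch only restates the rules in the paragraph above; the names are hypothetical, and it assumes that a job which fills the quota exactly is allowed while one that goes past it fails.

```python
class QuotaExceeded(Exception):
    """Raised when settlement would push usage past the tenant quota."""


def admission_check(used: int, quota: int) -> bool:
    """Create-time check: admit only if the tenant is below its quota."""
    return used < quota


def settle(used: int, quota: int, records_accepted: int) -> int:
    """Post-validation charge: return the new usage or raise QuotaExceeded."""
    # Assumption: landing exactly on the quota is permitted.
    if used + records_accepted > quota:
        raise QuotaExceeded(
            f"{records_accepted} records exceed remaining quota of {quota - used}"
        )
    return used + records_accepted
```

A job can pass admission and still fail settlement: a tenant at 90 of 100 vectors is admitted, but an import that validates 11 records is marked failed.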
Status lifecycle
A dataset moves through four states as vectors land and a shard is built (empty, validating, indexing, indexed), and drops into error if validation or a build fails. Poll GET /v1/datasets/{name} until status === "indexed" before issuing queries.
empty ──► validating ──► indexing ──► indexed
              │              │
              └──────────────┴──► error
- empty: created, no vectors uploaded yet
- validating: the validator has received records and is checking shapes
- indexing: a shard is being built
- indexed: at least one shard is queryable
- error: last build or validate failed; error_message explains why
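The polling step can be sketched as a loop that stops on indexed and raises on error. The HTTP call is left to the caller: fetch_dataset is a hypothetical callable that performs GET /v1/datasets/{name} and returns the parsed JSON body, and the interval and poll limit are arbitrary defaults, not documented values.

```python
import time
from typing import Callable


class DatasetError(RuntimeError):
    """Raised when the dataset lands in the error state."""


def wait_until_indexed(
    fetch_dataset: Callable[[], dict],
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is indexed; raise on error or after max_polls."""
    for _ in range(max_polls):
        dataset = fetch_dataset()
        status = dataset["status"]
        if status == "indexed":
            return dataset
        if status == "error":
            raise DatasetError(dataset.get("error_message") or "dataset in error state")
        sleep(interval_s)  # still empty, validating, or indexing
    raise TimeoutError(f"dataset not indexed after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop testable without real delays.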
GET /v1/datasets
List datasets
List every dataset owned by your tenant.
curl -s http://localhost:8080/v1/datasets \
-H "Authorization: Bearer $RB_API_KEY"
Response (HTTP 200):
{
"datasets": [
{
"name": "products",
"dimension": 768,
"status": "indexed",
"row_count": 12500,
"created_at": "2026-05-14T12:34:56Z",
"last_indexed_at": "2026-05-14T12:40:00Z",
"error_message": null
}
]
}
DELETE /v1/datasets/{name}
Delete a dataset
Soft-delete a dataset. The catalog row is marked deleted_at; shard cleanup runs in the background. Subsequent reads return 404 dataset_not_found.
curl -s -X DELETE http://localhost:8080/v1/datasets/products \
-H "Authorization: Bearer $RB_API_KEY"
Returns HTTP 204 with no body on success.
Limits
- Max upload body: 10 MiB per POST /v1/datasets/{name}/vectors request. Larger batches return 413 payload_too_large; chunk them into multiple calls, or use bulk import.
- Upload formats: the vectors endpoint takes NDJSON only. The async bulk-import flow accepts both NDJSON and Parquet.
- Per-tenant vector quotas: enforced via 429 vector_quota_exceeded; see Rate limits & quotas.
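Chunking a large batch under the 10 MiB body cap might look like the following. This is a sketch with a hypothetical helper name; it assumes a body of exactly 10 MiB is accepted and never splits a record across two requests.

```python
from typing import Iterable, Iterator

MAX_BODY_BYTES = 10 * 1024 * 1024  # documented 10 MiB cap


def chunk_ndjson(lines: Iterable[str], max_bytes: int = MAX_BODY_BYTES) -> Iterator[bytes]:
    """Yield NDJSON request bodies of at most max_bytes each."""
    batch = []
    size = 0
    for line in lines:
        encoded = line.rstrip("\n").encode("utf-8") + b"\n"
        if len(encoded) > max_bytes:
            raise ValueError("a single record exceeds the maximum body size")
        # Flush the current batch if this record would push it over the cap.
        if batch and size + len(encoded) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(encoded)
        size += len(encoded)
    if batch:
        yield b"".join(batch)
```

Each yielded body can be sent as one POST to the vectors endpoint; for files far beyond a few chunks, the bulk-import flow is the better fit.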