Datasets

A dataset is a named, tenant-scoped collection of vectors with a fixed embedding dimension. Create one, stream NDJSON records into it, poll until status === "indexed", then query.

POST /v1/datasets

Create a dataset

Create an empty dataset bound to your tenant. name is 1-64 chars matching [a-z0-9_-]+ and must be unique per tenant. dimension is fixed at create time — every vector you later upload must match it exactly.

curl -s -X POST http://localhost:8080/v1/datasets \
  -H "Authorization: Bearer $RB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"products","dimension":768}'

Response (HTTP 201):

{
  "name": "products",
  "dimension": 768,
  "status": "empty",
  "row_count": 0,
  "created_at": "2026-05-14T12:34:56Z"
}
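The naming rule above can be checked client-side before the request is sent. The following is a minimal sketch, not part of the API: build_create_payload is a hypothetical helper, and the requirement that dimension be a positive integer is an assumption (the text only says it is fixed at create time).

```python
import json
import re

# Mirrors the documented rule: 1-64 characters drawn from [a-z0-9_-].
NAME_RE = re.compile(r"[a-z0-9_-]{1,64}")


def build_create_payload(name: str, dimension: int) -> str:
    """Return the JSON body for POST /v1/datasets, or raise ValueError."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    # Assumption: dimension must be a positive integer (bool is excluded explicitly).
    if isinstance(dimension, bool) or not isinstance(dimension, int) or dimension < 1:
        raise ValueError(f"invalid dimension: {dimension!r}")
    return json.dumps({"name": name, "dimension": dimension}, separators=(",", ":"))
```

Uniqueness per tenant cannot be checked locally; only the server can enforce it.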

POST /v1/datasets/{name}/vectors

Upload NDJSON vectors

Stream NDJSON records into a dataset. One JSON object per line, Content-Type: application/x-ndjson. Each record has an id (1-256 chars), a values array whose length equals the dataset's dimension, and an optional metadata object.

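Sample vectors.jsonl (values are shortened to 4 entries for readability; a real record must carry exactly as many values as the dataset's dimension, which is 768 for products):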

{"id":"doc-1","values":[0.1,0.2,0.3,0.4],"metadata":{"title":"Atlas of birds"}}
{"id":"doc-2","values":[0.5,0.6,0.7,0.8],"metadata":{"title":"Field guide"}}
{"id":"doc-3","values":[0.9,1.0,1.1,1.2],"metadata":{"title":"Migration patterns"}}
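Records in the shape shown above can be produced and pre-checked locally. This is a rough sketch under assumptions: to_ndjson_line is a hypothetical helper, the id length is counted in Python characters, and the local error text merely imitates the style of the server's rejection reason.

```python
import json
from typing import Mapping, Optional, Sequence


def to_ndjson_line(record_id: str, values: Sequence[float], dimension: int,
                   metadata: Optional[Mapping] = None) -> str:
    """Serialize one record as a single NDJSON line (without the newline)."""
    if not 1 <= len(record_id) <= 256:
        raise ValueError("id must be 1-256 chars")
    # The values array must match the dataset's dimension exactly.
    if len(values) != dimension:
        raise ValueError(f"dimension mismatch: got {len(values)} expected {dimension}")
    record = {"id": record_id, "values": list(values)}
    if metadata is not None:  # metadata is optional, so omit it when absent
        record["metadata"] = dict(metadata)
    return json.dumps(record, separators=(",", ":"))
```

Joining the returned lines with "\n" gives a body suitable for --data-binary.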

Upload:

curl -s -X POST http://localhost:8080/v1/datasets/products/vectors \
  -H "Authorization: Bearer $RB_API_KEY" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @vectors.jsonl

Response (HTTP 202):

{
  "accepted": 3,
  "rejected": 0,
  "errors": [],
  "job_id": "job_01H..."
}

Rejected lines are reported individually with a line number and a reason (e.g. dimension mismatch: got 767 expected 768). Accepted records are queued for validation and indexing.

This endpoint is an upsert: re-sending a record whose id already exists overwrites it — last write wins — rather than creating a duplicate. accepted counts validated NDJSON lines before dedup, so sending the same id twice counts both lines even though only one row is stored.
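The counting rule can be illustrated with a small local model. This is only a simulation of the behavior described above, not the server's implementation; simulate_upsert is a hypothetical name, and skipping blank lines is an assumption of the sketch.

```python
import json


def simulate_upsert(ndjson_text: str) -> tuple[int, dict]:
    """Return (accepted_count, stored_rows) for an NDJSON body."""
    accepted = 0
    stored: dict = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        record = json.loads(line)
        accepted += 1                   # every valid line counts, duplicates included
        stored[record["id"]] = record   # last write wins for a repeated id
    return accepted, stored
```

Sending two lines with the same id yields an accepted count of 2 but a single stored row holding the second line's values.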

POST /v1/datasets/{name}/imports

Bulk import (large files)

The POST /v1/datasets/{name}/vectors endpoint buffers the whole body and caps it at 10 MiB — fine for small interactive upserts, too small for a real embedding dump. For large files, use the async import-job flow: the client stages the file directly into object storage via a presigned upload, then a job validates and indexes it asynchronously. The bytes never flow through the API. Both ndjson and parquet uploads are supported.

Step 1 — create the job. It returns an import_id and a presigned upload target:

curl -s -X POST http://localhost:8080/v1/datasets/products/imports \
  -H "Authorization: Bearer $RB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"format":"ndjson","error_mode":"continue","max_bad_records":100}'

Response (HTTP 201):

{
  "import_id": "imp_a1b2c3...",
  "dataset": "products",
  "status": "awaiting_upload",
  "format": "ndjson",
  "error_mode": "continue",
  "max_bad_records": 100,
  "upload": {
    "method": "PUT",
    "url": "https://minio.example/rosalinddb/staging/...?X-Amz-Signature=...",
    "content_type": "application/octet-stream",
    "max_bytes": 5368709120,
    "expires_at": "2026-05-15T13:34:56Z"
  },
  "created_at": "2026-05-15T12:34:56Z"
}

format — required. ndjson or parquet. A Parquet file must conform to the internal landing schema (id string, values list of float of length dimension, optional metadata struct).
error_mode — optional, default continue. continue drops bad records and writes them to a rejected-records file; abort fails the whole job on the first bad record.
max_bad_records — optional, default null (unlimited). With error_mode: continue, if the rejected count exceeds this value the job is marked failed.
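The three parameters above can be assembled and sanity-checked before the job is created. A minimal sketch, assuming that omitting max_bad_records is equivalent to sending null and that negative limits are invalid; both helper names are hypothetical.

```python
import json
from typing import Optional


def build_import_request(fmt: str, error_mode: str = "continue",
                         max_bad_records: Optional[int] = None) -> str:
    """Return the JSON body for POST /v1/datasets/{name}/imports."""
    if fmt not in ("ndjson", "parquet"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if error_mode not in ("continue", "abort"):
        raise ValueError(f"unsupported error_mode: {error_mode!r}")
    if max_bad_records is not None and max_bad_records < 0:
        raise ValueError("max_bad_records must be >= 0")  # assumption
    body = {"format": fmt, "error_mode": error_mode}
    if max_bad_records is not None:
        body["max_bad_records"] = max_bad_records
    return json.dumps(body, separators=(",", ":"))


def exceeds_bad_record_limit(rejected: int, max_bad_records: Optional[int]) -> bool:
    """True when the rejected count is strictly above the limit (None = unlimited)."""
    return max_bad_records is not None and rejected > max_bad_records
```

With a limit of 100, a job with exactly 100 rejected records is still within bounds under this reading of "exceeds"; the 101st rejection would fail it.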

Step 2 — upload the file directly to storage. Do a single PUT of the raw file body to upload.url, with Content-Type set to exactly upload.content_type:

curl -s -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @vectors.ndjson

It is a presigned PUT: send the raw bytes, with no multipart form and no extra fields. A request with any other Content-Type is rejected with 403 SignatureDoesNotMatch. The size cap upload.max_bytes is enforced after the fact (the import worker checks the staged object before validation), so compare your file size against it before uploading to fail fast.
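The fail-fast size check can be done before any bytes are sent. The sketch below takes the upload object from the create response and a file size in bytes; upload_headers is a hypothetical helper, and treating a file of exactly max_bytes as acceptable is an assumption.

```python
def upload_headers(size_bytes: int, upload: dict) -> dict:
    """Validate the file size locally and return headers for the presigned PUT."""
    if size_bytes > upload["max_bytes"]:
        raise ValueError(
            f"file is {size_bytes} bytes, over the {upload['max_bytes']}-byte cap"
        )
    return {
        # Must match upload.content_type exactly, or the signature check fails.
        "Content-Type": upload["content_type"],
        "Content-Length": str(size_bytes),
    }
```

In practice size_bytes would come from something like os.path.getsize("vectors.ndjson"), and the returned headers would be passed to whichever HTTP client performs the PUT.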

Step 3 — signal the upload is done:

curl -s -X POST \
  http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3.../complete \
  -H "Authorization: Bearer $RB_API_KEY"

This verifies the staged object is present, moves the job to validating, and returns HTTP 202 with the job object. Calling complete on a job that is not awaiting_upload returns 409 import_not_pending; calling it before the file was staged returns 400 upload_missing.

Step 4 — poll the job until it reaches a terminal state:

curl -s http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3... \
  -H "Authorization: Bearer $RB_API_KEY"

Response (HTTP 200):

{
  "import_id": "imp_a1b2c3...",
  "dataset": "products",
  "format": "ndjson",
  "status": "completed",
  "error_mode": "continue",
  "max_bad_records": 100,
  "records_processed": 10000,
  "records_accepted": 9998,
  "records_rejected": 2,
  "percent_complete": 100,
  "rejected_records_url": "https://minio.example/...&X-Amz-Signature=...",
  "error_message": null,
  "created_at": "2026-05-15T12:34:56Z",
  "completed_at": "2026-05-15T12:36:10Z"
}

The job walks awaiting_upload → validating → indexing → completed, or lands on failed from any stage with error_message populated. percent_complete reports 0, 25, 90, and 100 for awaiting_upload, validating, indexing, and completed respectively. When records_rejected > 0, rejected_records_url is a presigned link to a rejected.jsonl file: one JSON object per rejected record, with a line, a reason, and the offending record. The file stays available for at least 30 days. GET /v1/datasets/{name}/imports lists a dataset's jobs, newest first.
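Polling for a terminal state can be sketched as a small loop. The example keeps the HTTP call out of scope: fetch_job is a caller-supplied function (an assumption) that returns the parsed job object from GET /v1/datasets/{name}/imports/{import_id}. The interval and timeout defaults are arbitrary choices, not documented values.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_import(fetch_job: Callable[[], dict], interval_s: float = 2.0,
                    timeout_s: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll until the job is completed or failed, then return the job object."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"import still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

The caller still has to inspect the result: a returned job with status failed carries the cause in error_message, and a completed job may have records_rejected > 0.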

Two-stage quota. At create time an admission check rejects the request with 429 vector_quota_exceeded if the tenant is already at or over its vector quota — before anything is staged. After validation, the settlement stage charges records_accepted against the quota; if that would cross the remaining quota the job is marked failed and nothing is indexed.
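The two checks can be modeled with simple arithmetic. This is an illustrative model of the rule described above, not the service's code; the function names are hypothetical, and treating a batch that exactly fills the remaining quota as acceptable is an assumption.

```python
def admit(used: int, quota: int) -> bool:
    """Stage 1 (create time): refuse when the tenant is at or over quota."""
    return used < quota


def settle(used: int, quota: int, records_accepted: int) -> bool:
    """Stage 2 (after validation): the job succeeds only if the accepted
    records fit in the remaining quota."""
    return records_accepted <= quota - used
```

For example, a tenant with a quota of 1,000,000 vectors and 999,000 already stored passes admission, but a job that validates 2,000 records fails settlement and indexes nothing.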

Status lifecycle

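A dataset progresses through four states as vectors land and a shard is built: empty, validating, indexing, and indexed. A failure during validating or indexing moves it to error instead. Poll GET /v1/datasets/{name} until status === "indexed" before issuing queries.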

empty  ──►  validating  ──►  indexing  ──►  indexed
                      │              │
                      └──────────────┴──►  error

empty — created, no vectors uploaded yet
validating — the validator has received records and is checking shapes
indexing — a shard is being built
indexed — at least one shard is queryable
error — last build or validate failed; error_message explains why
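A client can turn the status field into a decision about what to do next. The mapping below is one possible reading of the states listed above, offered as a sketch; next_action and its return values are hypothetical names.

```python
def next_action(dataset: dict) -> str:
    """Map a dataset object from GET /v1/datasets/{name} to a client action."""
    status = dataset["status"]
    if status == "indexed":
        return "query"    # at least one shard is queryable
    if status == "empty":
        return "upload"   # nothing to wait for until vectors are sent
    if status in ("validating", "indexing"):
        return "wait"     # keep polling
    if status == "error":
        raise RuntimeError(dataset.get("error_message") or "dataset is in error state")
    raise ValueError(f"unknown status: {status!r}")
```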

GET /v1/datasets

List datasets

List every dataset owned by your tenant.

curl -s http://localhost:8080/v1/datasets \
  -H "Authorization: Bearer $RB_API_KEY"

Response (HTTP 200):

{
  "datasets": [
    {
      "name": "products",
      "dimension": 768,
      "status": "indexed",
      "row_count": 12500,
      "created_at": "2026-05-14T12:34:56Z",
      "last_indexed_at": "2026-05-14T12:40:00Z",
      "error_message": null
    }
  ]
}
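A response in the shape shown above can be filtered for datasets that are ready to query. A minimal sketch; queryable_datasets is a hypothetical helper that takes the raw response body as a string.

```python
import json


def queryable_datasets(response_body: str) -> list[str]:
    """Return the names of datasets whose status is "indexed"."""
    payload = json.loads(response_body)
    return [d["name"] for d in payload["datasets"] if d["status"] == "indexed"]
```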

DELETE /v1/datasets/{name}

Delete a dataset

Soft-delete a dataset. The catalog row is marked deleted_at; shard cleanup runs in the background. Subsequent reads return 404 dataset_not_found.

curl -s -X DELETE http://localhost:8080/v1/datasets/products \
  -H "Authorization: Bearer $RB_API_KEY"

Returns HTTP 204 with no body on success.

Limits

Max upload body: 10 MiB per POST /v1/datasets/{name}/vectors request. Larger batches return 413 payload_too_large — chunk them into multiple calls, or use bulk import.
Upload formats: the vectors endpoint takes NDJSON only. The async bulk-import flow accepts both NDJSON and Parquet.
Per-tenant vector quotas: enforced via 429 vector_quota_exceeded — see Rate limits & quotas.
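Chunking a large NDJSON stream into several calls can be sketched as a generator that packs whole lines into batches under the body cap. The names are hypothetical, and the sketch assumes a body of exactly 10 MiB is still accepted.

```python
from typing import Iterable, Iterator

MAX_BODY_BYTES = 10 * 1024 * 1024  # 10 MiB cap on the vectors endpoint


def chunk_ndjson(lines: Iterable[str], max_bytes: int = MAX_BODY_BYTES) -> Iterator[bytes]:
    """Yield request bodies of whole NDJSON lines, each at most max_bytes long."""
    batch: list[bytes] = []
    size = 0
    for line in lines:
        encoded = line.rstrip("\n").encode("utf-8") + b"\n"
        if len(encoded) > max_bytes:
            raise ValueError("a single record exceeds the body limit")
        # Flush the current batch if adding this line would cross the cap.
        if batch and size + len(encoded) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(encoded)
        size += len(encoded)
    if batch:
        yield b"".join(batch)
```

Each yielded body can be sent as one POST /v1/datasets/{name}/vectors request. Because the endpoint is an upsert, retrying a failed batch should not create duplicates.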
