RosalindDB

An object-storage-first vector database for cold and bursty workloads. Apache 2.0.


© 2026 RosalindDB contributors. Apache License 2.0.


    Datasets

    A dataset is a named, tenant-scoped collection of vectors with a fixed embedding dimension. Create one, stream NDJSON records into it, poll until status === "indexed", then query.

    POST /v1/datasets

    Create a dataset

    Create an empty dataset bound to your tenant. name is 1-64 chars matching [a-z0-9_-]+ and must be unique per tenant. dimension is fixed at create time — every vector you later upload must match it exactly.

    curl -s -X POST http://localhost:8080/v1/datasets \
      -H "Authorization: Bearer $RB_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"name":"products","dimension":768}'

    Response (HTTP 201):

    {
      "name": "products",
      "dimension": 768,
      "status": "empty",
      "row_count": 0,
      "created_at": "2026-05-14T12:34:56Z"
    }
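    The naming rule above is easy to check client-side before sending the request. The following is a minimal sketch, not part of RosalindDB: build_create_body is a hypothetical helper, and the requirement that dimension be a positive integer is an assumption (the text only says it is fixed at create time).

```python
import json
import re

# Name rule quoted in the text: 1-64 chars matching [a-z0-9_-]+
NAME_RE = re.compile(r"[a-z0-9_-]{1,64}")


def build_create_body(name, dimension):
    """Return the JSON body for POST /v1/datasets, or raise ValueError."""
    if not isinstance(name, str) or not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    # Assumption: dimension must be a positive integer (bool is excluded).
    if isinstance(dimension, bool) or not isinstance(dimension, int) or dimension < 1:
        raise ValueError(f"invalid dimension: {dimension!r}")
    return json.dumps({"name": name, "dimension": dimension})


print(build_create_body("products", 768))
```

    The returned string can be passed as the -d payload of the curl call shown earlier. Uniqueness per tenant can only be checked by the server.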

    POST /v1/datasets/{name}/vectors

    Upload NDJSON vectors

    Stream NDJSON records into a dataset. One JSON object per line, Content-Type: application/x-ndjson. Each record has an id (1-256 chars), a values array whose length equals the dataset's dimension, and an optional metadata object.

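    Sample vectors.jsonl (values shortened to four entries for readability; each record must carry exactly as many values as the dataset's dimension, so 768 for the products dataset created above):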

    {"id":"doc-1","values":[0.1,0.2,0.3,0.4],"metadata":{"title":"Atlas of birds"}}
    {"id":"doc-2","values":[0.5,0.6,0.7,0.8],"metadata":{"title":"Field guide"}}
    {"id":"doc-3","values":[0.9,1.0,1.1,1.2],"metadata":{"title":"Migration patterns"}}
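    A file in this shape can be generated and pre-checked on the client, which catches dimension mismatches before the server rejects them. This is a sketch under stated assumptions: to_ndjson is a hypothetical helper, and the error wording only imitates the style of message quoted in this page.

```python
import json


def to_ndjson(records, dimension):
    """Serialize records to NDJSON after checking the documented shape."""
    lines = []
    for n, rec in enumerate(records, start=1):
        rec_id = rec.get("id")
        if not isinstance(rec_id, str) or not 1 <= len(rec_id) <= 256:
            raise ValueError(f"line {n}: id must be a string of 1-256 chars")
        got = len(rec.get("values", []))
        if got != dimension:
            raise ValueError(
                f"line {n}: dimension mismatch: got {got} expected {dimension}"
            )
        # metadata is optional, but must be a JSON object when present.
        if "metadata" in rec and not isinstance(rec["metadata"], dict):
            raise ValueError(f"line {n}: metadata must be an object")
        lines.append(json.dumps(rec, separators=(",", ":")))
    # One JSON object per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


sample = [
    {"id": "doc-1", "values": [0.1, 0.2, 0.3, 0.4], "metadata": {"title": "Atlas of birds"}},
    {"id": "doc-2", "values": [0.5, 0.6, 0.7, 0.8]},
]
print(to_ndjson(sample, dimension=4), end="")
```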

    Upload:

    curl -s -X POST http://localhost:8080/v1/datasets/products/vectors \
      -H "Authorization: Bearer $RB_API_KEY" \
      -H "Content-Type: application/x-ndjson" \
      --data-binary @vectors.jsonl

    Response (HTTP 202):

    {
      "accepted": 3,
      "rejected": 0,
      "errors": [],
      "job_id": "job_01H..."
    }

    Rejected lines are reported individually with a line number and a reason (e.g. dimension mismatch: got 767 expected 768). Accepted records are queued for validation and indexing.
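    To retry only the failures, each reported error can be paired with the line that caused it. The sketch below assumes each entry in errors is an object with a 1-based line and a reason key; the exact field names are not shown on this page, so treat them as placeholders and adjust to the real response.

```python
def rejected_lines(response, ndjson_lines):
    """Pair each reported error with the NDJSON line that caused it.

    Assumed (hypothetical) error shape: {"line": <1-based int>, "reason": <str>}.
    """
    paired = []
    for err in response.get("errors", []):
        line_no = err["line"]
        paired.append((line_no, err["reason"], ndjson_lines[line_no - 1]))
    return paired


# Illustrative response, shaped like the one above but with one rejection.
example_response = {
    "accepted": 2,
    "rejected": 1,
    "errors": [{"line": 2, "reason": "dimension mismatch: got 767 expected 768"}],
}
source = ['{"id":"doc-1"}', '{"id":"doc-2"}', '{"id":"doc-3"}']
for line_no, reason, raw in rejected_lines(example_response, source):
    print(f"line {line_no}: {reason}: {raw}")
```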

    This endpoint is an upsert: re-sending a record whose id already exists overwrites it — last write wins — rather than creating a duplicate. accepted counts validated NDJSON lines before dedup, so sending the same id twice counts both lines even though only one row is stored.
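    The difference between accepted and stored rows can be illustrated with a small local model. This is only a simulation of the semantics described above, with a plain dict standing in for the dataset; it says nothing about how the server implements deduplication.

```python
def simulate_upsert(store, records):
    """Apply records to `store` (a dict keyed by id) with last-write-wins.

    Returns the `accepted` count as the text defines it: validated lines,
    counted before dedup, so duplicates of an id are each counted.
    """
    accepted = 0
    for rec in records:
        store[rec["id"]] = rec  # a later record with the same id overwrites
        accepted += 1
    return accepted


store = {}
batch = [
    {"id": "doc-1", "values": [0.1, 0.2]},
    {"id": "doc-1", "values": [0.9, 0.9]},  # same id: overwrites the first
    {"id": "doc-2", "values": [0.5, 0.6]},
]
accepted = simulate_upsert(store, batch)
print(accepted, len(store))  # 3 lines accepted, 2 rows stored
```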

    POST /v1/datasets/{name}/imports

    Bulk import (large files)

    The POST /v1/datasets/{name}/vectors endpoint buffers the whole body and caps it at 10 MiB — fine for small interactive upserts, too small for a real embedding dump. For large files, use the async import-job flow: the client stages the file directly into object storage via a presigned upload, then a job validates and indexes it asynchronously. The bytes never flow through the API. Both ndjson and parquet uploads are supported.

    Step 1 — create the job. It returns an import_id and a presigned upload target:

    curl -s -X POST http://localhost:8080/v1/datasets/products/imports \
      -H "Authorization: Bearer $RB_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"format":"ndjson","error_mode":"continue","max_bad_records":100}'

    Response (HTTP 201):

    {
      "import_id": "imp_a1b2c3...",
      "dataset": "products",
      "status": "awaiting_upload",
      "format": "ndjson",
      "error_mode": "continue",
      "max_bad_records": 100,
      "upload": {
        "method": "PUT",
        "url": "https://minio.example/rosalinddb/staging/...?X-Amz-Signature=...",
        "content_type": "application/octet-stream",
        "max_bytes": 5368709120,
        "expires_at": "2026-05-15T13:34:56Z"
      },
      "created_at": "2026-05-15T12:34:56Z"
    }
    • format — required. ndjson or parquet. A Parquet file must conform to the internal landing schema (id string, values list of float of length dimension, optional metadata struct).
    • error_mode — optional, default continue. continue drops bad records and writes them to a rejected-records file; abort fails the whole job on the first bad record.
    • max_bad_records — optional, default null (unlimited). With error_mode: continue, if the rejected count exceeds this value the job is marked failed.
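    The interaction between error_mode and max_bad_records can be summarized as a small decision function. This is a sketch of the rules as stated in the list above, not the worker's actual code; import_outcome is a hypothetical name, and it ignores failures unrelated to bad records.

```python
def import_outcome(error_mode="continue", max_bad_records=None, rejected=0):
    """Return "completed" or "failed" for a given number of rejected records."""
    if error_mode not in ("continue", "abort"):
        raise ValueError(f"unknown error_mode: {error_mode!r}")
    if error_mode == "abort":
        # abort: the first bad record fails the whole job.
        return "failed" if rejected > 0 else "completed"
    # continue: fail only when the rejected count exceeds the limit;
    # None means unlimited.
    if max_bad_records is not None and rejected > max_bad_records:
        return "failed"
    return "completed"


print(import_outcome("continue", 100, rejected=2))    # completed
print(import_outcome("continue", 100, rejected=101))  # failed
print(import_outcome("abort", None, rejected=1))      # failed
```

    Note that under this reading a job with exactly max_bad_records rejections still completes, since the text says the limit must be exceeded.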

    Step 2 — upload the file directly to storage. Do a single PUT of the raw file body to upload.url, with Content-Type set to exactly upload.content_type:

    curl -s -X PUT "$UPLOAD_URL" \
      -H "Content-Type: application/octet-stream" \
      --data-binary @vectors.ndjson

    It is a presigned PUT — no multipart form, no fields. Sending any other Content-Type is rejected with 403 SignatureDoesNotMatch. The size cap upload.max_bytes is enforced after the fact (the import worker checks the staged object before validation), so check your file size against it to fail fast.
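    Because the size cap is only enforced after the upload, a client-side check saves a wasted transfer. The sketch below builds the PUT from the upload object returned in step 1 using only the Python standard library; build_upload_request is a hypothetical helper, and it holds the body in memory, which suits small files only.

```python
import urllib.request


def build_upload_request(upload, body):
    """Build (but do not send) the presigned PUT described by `upload`."""
    if len(body) > upload["max_bytes"]:
        raise ValueError(
            f"file is {len(body)} bytes, over the {upload['max_bytes']}-byte cap"
        )
    return urllib.request.Request(
        upload["url"],
        data=body,
        method=upload["method"],
        # Must match upload.content_type exactly, or the signature check fails.
        headers={"Content-Type": upload["content_type"]},
    )


# Placeholder values for illustration; a real `upload` comes from step 1.
demo_upload = {
    "method": "PUT",
    "url": "https://storage.example/staging/object?X-Amz-Signature=abc",
    "content_type": "application/octet-stream",
    "max_bytes": 5368709120,
}
req = build_upload_request(demo_upload, b'{"id":"doc-1","values":[0.1]}\n')
print(req.get_method(), req.full_url)
```

    Passing the request to urllib.request.urlopen(req) would perform the upload. For multi-gigabyte files, a streaming client is a better fit than reading the whole file into bytes.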

    Step 3 — signal the upload is done:

    curl -s -X POST \
      http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3.../complete \
      -H "Authorization: Bearer $RB_API_KEY"

    This verifies the staged object is present, moves the job to validating, and returns HTTP 202 with the job object. Calling complete on a job that is not awaiting_upload returns 409 import_not_pending; calling it before the file was staged returns 400 upload_missing.

    Step 4 — poll the job until it reaches a terminal state:

    curl -s http://localhost:8080/v1/datasets/products/imports/imp_a1b2c3... \
      -H "Authorization: Bearer $RB_API_KEY"

    Response (HTTP 200):

    {
      "import_id": "imp_a1b2c3...",
      "dataset": "products",
      "format": "ndjson",
      "status": "completed",
      "error_mode": "continue",
      "max_bad_records": 100,
      "records_processed": 10000,
      "records_accepted": 9998,
      "records_rejected": 2,
      "percent_complete": 100,
      "rejected_records_url": "https://minio.example/...&X-Amz-Signature=...",
      "error_message": null,
      "created_at": "2026-05-15T12:34:56Z",
      "completed_at": "2026-05-15T12:36:10Z"
    }

    The job walks awaiting_upload → validating → indexing → completed, or lands on failed from any stage with error_message populated. percent_complete is 0 / 25 / 90 / 100 across those states. When records_rejected > 0, rejected_records_url is a presigned link to a rejected.jsonl file — one JSON object per rejected record with a line, reason, and the offending record. It stays available for at least 30 days. GET /v1/datasets/{name}/imports lists a dataset's jobs, newest first.
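    The polling step can be wrapped in a loop that stops at either terminal state. In this sketch the HTTP call is injected as fetch_job, a caller-supplied function returning the job object as a dict, so the loop itself makes no assumptions about the client library; the interval and timeout defaults are arbitrary choices, not documented values.

```python
import time

TERMINAL = {"completed", "failed"}


def wait_for_import(fetch_job, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_job() until the job is terminal; return the final job dict."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"import still {job['status']} after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"status": "validating", "percent_complete": 25},
    {"status": "indexing", "percent_complete": 90},
    {"status": "completed", "percent_complete": 100, "records_rejected": 2},
])
final = wait_for_import(lambda: next(canned), sleep=lambda seconds: None)
print(final["status"], final["records_rejected"])
```

    A caller would typically check final["status"] and, on failed, surface error_message; on completed with records_rejected > 0, download rejected_records_url.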

    Two-stage quota. At create time an admission check rejects the request with 429 vector_quota_exceeded if the tenant is already at or over its vector quota — before anything is staged. After validation, the settlement stage charges records_accepted against the quota; if that would cross the remaining quota the job is marked failed and nothing is indexed.
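    The two checks can be modeled with simple arithmetic. The sketch below is an interpretation, not the service's code: it reads "cross the remaining quota" as records_accepted being strictly greater than the remaining headroom, so an import that exactly fills the quota succeeds. The function names are hypothetical.

```python
def admit(used, quota):
    """Admission at create time: reject when already at or over quota."""
    return used < quota


def settle(used, quota, records_accepted):
    """Settlement after validation. Returns (ok, new_used).

    Assumption: the job fails only if records_accepted exceeds the
    remaining quota; on failure nothing is charged or indexed.
    """
    remaining = quota - used
    if records_accepted > remaining:
        return False, used
    return True, used + records_accepted


quota = 1_000_000
print(admit(999_000, quota))                  # True: still under quota
print(settle(999_000, quota, 9_998))          # (False, 999000): would overshoot
print(settle(990_000, quota, 9_998))          # (True, 999998)
```

    The practical consequence is that passing admission does not guarantee the import will succeed: a tenant close to its quota can be admitted and still fail at settlement.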

    Status lifecycle

    A dataset normally progresses through four states as vectors land and a shard is built; if validation or a build fails, it moves to a fifth state, error, instead. Poll GET /v1/datasets/{name} until status === "indexed" before issuing queries.

    empty  ──►  validating  ──►  indexing  ──►  indexed
                          │              │
                          └──────────────┴──►  error
    • empty — created, no vectors uploaded yet
    • validating — the validator has received records and is checking shapes
    • indexing — a shard is being built
    • indexed — at least one shard is queryable
    • error — last build or validate failed; error_message explains why
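    The diagram can be written down as a transition table, which is handy for sanity-checking status changes seen while polling. This sketch encodes only the arrows drawn above; whether the service allows further transitions (for example, leaving indexed or error after a later upload) is not stated here, so the empty sets are an assumption.

```python
# Only the transitions shown in the diagram.
TRANSITIONS = {
    "empty": {"validating"},
    "validating": {"indexing", "error"},
    "indexing": {"indexed", "error"},
    "indexed": set(),  # assumption: no outgoing arrows are documented
    "error": set(),    # assumption: no outgoing arrows are documented
}


def is_valid_transition(old, new):
    """True if the diagram draws an arrow from `old` to `new`."""
    return new in TRANSITIONS.get(old, set())


def can_query(status):
    """Queries should wait until at least one shard is queryable."""
    return status == "indexed"


print(is_valid_transition("validating", "error"))  # True
print(is_valid_transition("empty", "indexed"))     # False
print(can_query("indexing"))                       # False
```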

    GET /v1/datasets

    List datasets

    List every dataset owned by your tenant.

    curl -s http://localhost:8080/v1/datasets \
      -H "Authorization: Bearer $RB_API_KEY"

    Response (HTTP 200):

    {
      "datasets": [
        {
          "name": "products",
          "dimension": 768,
          "status": "indexed",
          "row_count": 12500,
          "created_at": "2026-05-14T12:34:56Z",
          "last_indexed_at": "2026-05-14T12:40:00Z",
          "error_message": null
        }
      ]
    }
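    A listing like this is often used to decide which datasets are ready to query and which need attention. The following sketch works on the response shape shown above; split_by_status is a hypothetical helper, and the second dataset in the demo is made up for illustration.

```python
def split_by_status(listing):
    """Return (names ready to query, {name: error_message} for failed ones)."""
    ready, failed = [], {}
    for ds in listing["datasets"]:
        if ds["status"] == "indexed":
            ready.append(ds["name"])
        elif ds["status"] == "error":
            failed[ds["name"]] = ds.get("error_message")
    return ready, failed


demo_listing = {
    "datasets": [
        {"name": "products", "dimension": 768, "status": "indexed",
         "row_count": 12500, "error_message": None},
        # Hypothetical second dataset, not from the documentation.
        {"name": "drafts", "dimension": 768, "status": "error",
         "row_count": 0, "error_message": "example failure"},
    ]
}
print(split_by_status(demo_listing))
```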

    DELETE /v1/datasets/{name}

    Delete a dataset

    Soft-delete a dataset. The catalog row is marked with deleted_at, and shard cleanup runs in the background. Subsequent reads return 404 dataset_not_found.

    curl -s -X DELETE http://localhost:8080/v1/datasets/products \
      -H "Authorization: Bearer $RB_API_KEY"

    Returns HTTP 204 with no body on success.

    Limits

    • Max upload body: 10 MiB per POST /v1/datasets/{name}/vectors request. Larger batches return 413 payload_too_large — chunk them into multiple calls, or use bulk import.
    • Upload formats: the vectors endpoint takes NDJSON only. The async bulk-import flow accepts both NDJSON and Parquet.
    • Per-tenant vector quotas: enforced via 429 vector_quota_exceeded — see Rate limits & quotas.
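    Chunking a larger NDJSON file to stay under the 10 MiB body limit can be done by splitting at line boundaries. This is a minimal sketch, assuming the limit applies to the raw request body in bytes; chunk_ndjson is a hypothetical helper, and each yielded body would be sent as its own POST /v1/datasets/{name}/vectors request.

```python
MAX_BODY = 10 * 1024 * 1024  # 10 MiB, the documented per-request cap


def chunk_ndjson(lines, max_bytes=MAX_BODY):
    """Yield request bodies (bytes) of at most max_bytes each.

    Splits only between records, so every body is valid NDJSON.
    """
    batch, size = [], 0
    for line in lines:
        data = line.rstrip("\n").encode("utf-8") + b"\n"
        if len(data) > max_bytes:
            raise ValueError("a single record exceeds the body limit")
        if batch and size + len(data) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(data)
        size += len(data)
    if batch:
        yield b"".join(batch)


# Tiny limit so the split is visible; real calls would use the default.
bodies = list(chunk_ndjson(["aaaa", "bbbb", "cccc"], max_bytes=10))
print(bodies)  # [b'aaaa\nbbbb\n', b'cccc\n']
```

    Since the endpoint is an upsert, re-sending a chunk after a network failure is safe: records with the same id are overwritten rather than duplicated.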
