ScyllaDB Vector Search: Millisecond Vector Retrieval for Real-Time AI at Scale

Introduction

Real-time AI is now a business requirement: retrieval-augmented generation, semantic search, personalisation, fraud signals, recommendations. These workloads demand fast reads, predictable latency, and scale without disruption. ScyllaDB has long been trusted for low-latency transactional data. With its new vector search engine, ScyllaDB brings millisecond vector retrieval to the same environment, a strong step toward unified AI infrastructure.

Why ScyllaDB Built Vector Search

Teams building AI features often coupled ScyllaDB with a separate vector database for similarity search. That solved one problem but created others: extra systems, extra cost, and extra latency overhead. ScyllaDB’s goal was clear: deliver performance, accuracy, and scale while removing operational complexity. The result is an integrated vector search engine that fits naturally alongside ScyllaDB’s real-time workloads.

How the Design Works

ScyllaDB avoided embedding HNSW directly inside the database core. Instead, it built a dedicated Vector Store in Rust, placed next to each ScyllaDB replica in the same availability zone. ScyllaDB stores vectors and metadata; the Vector Store reads data through Change Data Capture (CDC), builds indexes, and serves similarity results over HTTP. Clients simply query ScyllaDB using CQL and the ANN OF clause; ScyllaDB handles the heavy lifting behind the scenes. This design keeps ingestion fast while the Vector Store processes intensive ANN queries asynchronously.

What’s Different for Operators

The database and Vector Store scale independently, so you can fine-tune hardware per role: storage-optimised nodes for SSTables, RAM-optimised nodes for the Vector Store. Traffic remains local to the availability zone, keeping network costs under control. Regular queries stay predictable, while ANN workloads scale separately.
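The query flow described above can be sketched in CQL. The syntax below follows the pattern of a vector column, a custom index, and an ANN OF query; the keyspace, table, and index names are illustrative, and exact index options may differ in the beta:

```cql
-- Illustrative schema; names and replication settings are examples.
CREATE KEYSPACE IF NOT EXISTS ai
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};

CREATE TABLE ai.items (
  id        int PRIMARY KEY,
  embedding vector<float, 3>   -- 3 dimensions, for illustration
);

-- Custom index served by the Vector Store.
CREATE CUSTOM INDEX items_ann ON ai.items (embedding)
  USING 'vector_index';

-- Nearest-neighbour query via the ANN OF clause.
SELECT id FROM ai.items
  ORDER BY embedding ANN OF [0.1, 0.2, 0.3]
  LIMIT 5;
```

From the client's point of view this is an ordinary CQL statement; the routing to the Vector Store happens behind the scenes.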
Performance Highlights

In benchmark tests, ScyllaDB’s vector engine delivered strong numbers, confirming ScyllaDB’s consistency in low-latency, high-throughput workloads.

Tuning Findings That Matter in Production

ScyllaDB’s engineers discovered that TCP delayed ACK combined with Nagle’s algorithm inflated latency. Disabling Nagle (TCP_NODELAY) dropped latencies from around 50 ms to single-digit milliseconds. They also tested thread layouts within Rust’s Tokio runtime. These tuning insights matter when squeezing maximum performance out of real-time AI systems.

Working with the Vector Type and ANN Queries

Creating a keyspace, table, and index is straightforward: you define a vector column, create a custom index, and query with ANN OF to find the nearest neighbours.

Client-Side Tip: Remove Extra Latency

When using the Java driver with Netty, enable TCP_NODELAY to avoid waiting for delayed acknowledgements.

Optional: Calling the Vector Store Directly

While the Vector Store runs internally, the call pattern is simple: an ANN request carried over plain HTTP.

In this beta, vector indexes are stored fully in memory, which is part of how the Vector Store achieves sub-10 ms responses. It also means the entire index must fit into a single node’s RAM. For example, at 100 million vectors of 768 dimensions, the raw vectors alone require approximately 307 GB; with HNSW index overhead, the real-world memory footprint is around 333 GB. Very large datasets therefore call for substantial RAM-optimised machines on each index node. It would be great to see future versions include hybrid in-memory/on-disk indexing, which would offer expanded capacity while maintaining speed.

What This Means for Enterprises

For teams building search layers for RAG, personalisation, or fraud detection, this launch is significant.
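The direct call pattern against the Vector Store can be sketched with a minimal Rust client using only the standard library. The endpoint path, port, and JSON body shape here are assumptions for illustration, not the Vector Store's documented API; note the TCP_NODELAY setting, which echoes the tuning finding above:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Build a minimal HTTP/1.1 POST carrying an ANN query as JSON.
/// The `/indexes/{name}/ann` path and body fields are illustrative
/// assumptions, not the Vector Store's documented API.
fn build_ann_request(host: &str, index: &str, vector: &[f32], limit: usize) -> String {
    let components: Vec<String> = vector.iter().map(|v| v.to_string()).collect();
    let body = format!("{{\"vector\":[{}],\"limit\":{}}}", components.join(","), limit);
    format!(
        "POST /indexes/{}/ann HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        index, host, body.len(), body
    )
}

/// Send the request over a raw TCP connection with Nagle's algorithm
/// disabled (TCP_NODELAY), avoiding the delayed-ACK interaction that
/// inflated latencies in ScyllaDB's benchmarks.
fn send_ann_request(addr: &str, request: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?; // TCP_NODELAY: flush small writes immediately
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() {
    // Build and print the request; against a live Vector Store you would
    // then call send_ann_request("vector-store.local:6080", &request).
    let request = build_ann_request("vector-store.local", "items_idx", &[0.1, 0.2, 0.3], 5);
    println!("{request}");
}
```

In production you would use an HTTP client crate with connection pooling rather than raw sockets, but the sketch shows how little is involved in the round trip itself.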
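The memory sizing above is easy to reproduce: 100 million vectors times 768 dimensions times 4 bytes per float32 component gives the raw figure, and the reported 333 GB real-world footprint implies roughly 8% of additional HNSW graph overhead on top of it. A quick sketch:

```rust
/// Raw bytes needed to hold `count` float32 vectors of `dims` dimensions,
/// before any HNSW graph overhead.
fn raw_vector_bytes(count: u64, dims: u64) -> u64 {
    count * dims * 4 // 4 bytes per f32 component
}

fn main() {
    let raw = raw_vector_bytes(100_000_000, 768);
    println!("raw vectors: {:.1} GB", raw as f64 / 1e9); // 307.2 GB
    // The article reports ~333 GB in practice once HNSW overhead is added,
    // so budget node RAM well above the raw vector size.
}
```

Actual overhead depends on the HNSW graph parameters (connectivity per node, number of layers), so treat the raw figure as a floor, not an estimate of the full footprint.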
ScyllaDB’s approach simplifies AI-driven workloads and reduces operational friction.

Where to Start: A Short Checklist

Datanised Insight

At Datanised, we view this as an important move toward unifying transactional, analytical, and AI-driven workloads under one platform. The architecture is elegant and the early metrics are impressive. However, the in-memory limitation is a real operational factor: teams must plan capacity and shard strategy carefully. Our experts are already helping enterprises evaluate ScyllaDB Vector Search, designing, tuning, and optimising configurations across ScyllaDB, Cassandra, MongoDB, PostgreSQL, and modern streaming systems. If you’re exploring AI-ready NoSQL architectures, we can help you assess, design, and optimise the right setup for your workloads.

Sources: https://www.scylladb.com/2025/10/08/building-a-low-latency-vector-search-engine/

Ricardo Gil, Writer & Blogger