Part 2: Schema Pitfalls – How Bad Data Modeling Triggers Tombstone Overload

Tombstone overload is rarely caused by bad luck — it’s almost always the result of implicit architectural decisions made during schema design.

At Datanised, we’ve worked with teams running ScyllaDB at scale (hundreds of thousands to millions of ops/sec), and we’ve consistently found that small, unintentional data modelling choices often lead to massive downstream costs in compaction, read latency, and repair performance.

  1. Artificial Sharding: Write Scalability at the Cost of Cleanup Pain

A common pattern we see is artificial sharding of write paths — usually by appending a shard_id, bucket, or similar suffix to the partition key. The goal is typically to spread write load and avoid hot partitions.

Example:

PRIMARY KEY ((user_id, writeshard))

While this does help ingestion scale out across nodes and shards, it comes at a high cost: data for the same logical entity ends up scattered across multiple partitions. To consolidate reads or keep storage lean, teams often implement normalisation jobs that:

  • Read across all shards per user,
  • Write the latest version to a “canonical” partition (e.g. writeshard = 0),
  • Delete all non-canonical rows.

This results in daily mass deletes, producing millions of tombstones per node, overwhelming compaction and repair.
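To make the pattern concrete, here is a minimal sketch of what such a schema and its cleanup job can look like (the table name user_events and its columns are hypothetical, not taken from any specific deployment):

-- Hypothetical sharded table: writes for one user are spread across N shards
CREATE TABLE user_events (
    user_id    uuid,
    writeshard int,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((user_id, writeshard), event_time)
);

-- Nightly normalisation job, expressed in CQL terms:
-- consolidate into the canonical shard (writeshard = 0) ...
INSERT INTO user_events (user_id, writeshard, event_time, payload) VALUES (?, 0, ?, ?);
-- ... then drop the non-canonical partitions. Each DELETE below writes a
-- partition-level tombstone; millions of users means millions of tombstones.
DELETE FROM user_events WHERE user_id = ? AND writeshard = 1;
DELETE FROM user_events WHERE user_id = ? AND writeshard = 2;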

  2. Upserts That Act Like Deletes

Upserts in ScyllaDB are idempotent — which is great. But if you’re upserting data by overwriting fields with null values, you’re actually triggering cell tombstones under the hood.

This pattern shows up when:

  • APIs send partial updates using null to remove values
  • Application logic explicitly writes DELETE statements before each update
  • Lifecycle rules purge fields manually

Even without range deletes, this can produce a large volume of cell tombstones that degrades read performance.
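A minimal sketch of the difference (user_profiles and its columns are illustrative): explicitly writing null lays down a cell tombstone, while simply omitting the column leaves any existing value untouched.

-- Binding null deletes the email cell, i.e. writes a cell tombstone:
INSERT INTO user_profiles (user_id, name, email) VALUES (?, ?, null);

-- Omitting the column avoids the tombstone entirely:
INSERT INTO user_profiles (user_id, name) VALUES (?, ?);

With recent drivers and protocol versions, leaving a bound parameter unset likewise avoids the tombstone, whereas binding an explicit null does not.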

  3. Unbounded Range Deletes on Wide Partitions

When teams store time-series or session data in wide partitions and use queries like:

DELETE FROM events WHERE user_id = ? AND ts > ?

they create range tombstones that cover every clustering key beyond the given value. These tombstones are particularly expensive to skip during reads, and they persist across SSTables until compaction can finally purge them.
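For context, this is the kind of wide-partition layout (names illustrative) on which that DELETE becomes a single range tombstone shadowing every row after the supplied timestamp:

CREATE TABLE events (
    user_id uuid,
    ts      timestamp,
    data    text,
    PRIMARY KEY (user_id, ts)   -- one wide partition per user, rows ordered by ts
);

-- Writes one range tombstone covering all rows with ts > ? in that partition;
-- subsequent reads of the partition must merge it against every SSTable it spans.
DELETE FROM events WHERE user_id = ? AND ts > ?;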

Common Symptoms We Encounter:

  • Read Latency Spikes: Range scans exhibit unpredictable latency, sometimes increasing 2–5x.
  • Compaction Pressure: Size-tiered compaction (STCS) struggles to purge tombstones, increasing the number of SSTables touched per read.
  • Repair Overhead: Repairs must stream tombstones, amplifying network and disk I/O.

Strategies for Managing Tombstones

1. Adopt a TTL-First, Upsert-Only Design

Avoid unnecessary deletes. Use TTL for ephemeral fields and design schemas to support write-once, read-many patterns. This reduces tombstone creation at the source.
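For example (table name and TTL values are illustrative), a table-level default TTL or a per-write TTL lets data age out without ever issuing a DELETE:

-- Table-level default: every row expires 7 days after it is written
CREATE TABLE session_cache (
    session_id uuid PRIMARY KEY,
    data       text
) WITH default_time_to_live = 604800;

-- Per-write override where a shorter lifetime is needed
INSERT INTO session_cache (session_id, data) VALUES (?, ?) USING TTL 3600;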

2. Switch to Leveled Compaction Strategy (LCS)

LCS performs better in tombstone-heavy environments by compacting overlapping key ranges across levels. This leads to faster tombstone purging and fewer SSTables per read.

TimeWindowCompactionStrategy (TWCS) is well-suited for time-series workloads, where data is written in time-bounded partitions and purged via TTL. This avoids tombstone scanning altogether, as expired SSTables can be dropped wholesale.
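Both are table-level changes; a sketch of each follows, with typical option values that should be tuned per workload (table names reused from the hypothetical examples above):

-- Leveled compaction for tombstone-heavy, read-latency-sensitive tables
ALTER TABLE user_events
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

-- Time-window compaction for TTL'd time-series tables
ALTER TABLE events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};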

3. Tune gc_grace_seconds Safely

Lowering gc_grace_seconds (e.g., from 10 days to 2) reduces the retention window for tombstones, allowing them to be purged sooner. This is safe only if all replicas are regularly and fully repaired — otherwise, you risk resurrecting deleted data. We recommend using Scylla Manager’s repair scheduler to automate this.
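A sketch of the change (172,800 seconds is 2 days); the repair cadence must fit comfortably inside the new window:

-- Only safe if every replica is fully repaired well within 2 days
ALTER TABLE user_events WITH gc_grace_seconds = 172800;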

4. Avoid Unbounded Range Deletes

Design partitions to be time-bounded. For example, time-series data should be bucketed (daily/hourly), so entire partitions can be dropped or expired via TTL, avoiding range tombstones altogether.
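A minimal sketch of a bucketed variant of the earlier events table (bucket granularity and TTL are illustrative); with daily partitions, TTL and TWCS retire whole partitions without a single DELETE:

CREATE TABLE events_by_day (
    user_id uuid,
    day     date,            -- daily bucket as part of the partition key
    ts      timestamp,
    data    text,
    PRIMARY KEY ((user_id, day), ts)
) WITH default_time_to_live = 2592000    -- 30 days
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};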

5. Monitor for Tombstone Build-Up

Use the Scylla Monitoring Stack (Grafana) to track:

  • tombstone_scanned_histogram
  • sstables_per_read_histogram
  • compactions_in_progress

Spikes in these metrics are early warning signs of schema or workload issues.

Real-World Impact (Before vs After Optimization)

This optimization was completed over a 4-week period, including schema analysis, compaction strategy migration, and performance validation, typical for production workloads of this scale.

⚠️ Watch for Tombstone-Related Failures

Excessive tombstone reads can trigger:

  • TombstoneOverwhelmingException: If a query scans more tombstones than tombstone_failure_threshold (default: 100,000).
  • Read timeouts or partial reads due to increased latency.

Monitor your thresholds (tombstone_warn_threshold, tombstone_failure_threshold) and tune queries or schema if you observe warnings in logs or dashboards.
