
Log Storage Calculator

Estimate log storage needs based on log volume, retention period, and compression ratio. Enter values for instant results with step-by-step formulas.


Formula

Total Storage = (Lines/sec x LineSize x 86400 x Retention / Compression) x (1 + IndexOverhead) x Replicas

Daily raw volume equals total lines per second multiplied by average line size multiplied by seconds per day (86,400). Multiplying by the retention period in days gives the total raw volume, which is then divided by the compression ratio. Index overhead adds a percentage for search-indexing structures, and replication multiplies the final total by the replica count for durability and query performance.
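As a sketch, the formula can be implemented in a few lines of Python. Note the unit convention the worked examples below use: decimal megabytes (1 MB = 1,000,000 bytes) but binary gigabytes (1 GB = 1,024 MB).

```python
def log_storage_gb(lines_per_sec, line_size_bytes, retention_days,
                   compression_ratio, index_overhead, replicas):
    """Total storage in GB, following the calculator's convention of
    1 MB = 1,000,000 bytes and 1 GB = 1,024 MB."""
    raw_mb_per_day = lines_per_sec * line_size_bytes * 86_400 / 1_000_000
    raw_gb_per_day = raw_mb_per_day / 1024
    compressed_gb = raw_gb_per_day * retention_days / compression_ratio
    return compressed_gb * (1 + index_overhead) * replicas
```

Plugging in Example 1's inputs (5,000 lines/sec, 250 bytes, 90 days, 5x compression, 15% overhead, 2 replicas) returns about 4,366 GB, matching the worked result within rounding.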

Worked Examples

Example 1: Mid-Size SaaS Platform

Problem: A SaaS platform with 10 servers each generating 500 log lines per second, 250 bytes average line size, 90-day retention, 5x compression ratio, 15% index overhead, and 2 replicas.

Solution:
Total lines/sec = 500 x 10 = 5,000
Raw bytes/sec = 5,000 x 250 = 1.25 MB/s
Raw per day = 1.25 MB/s x 86,400 s = 108,000 MB ≈ 105.5 GB/day
Compressed per day = 105.5 / 5 = 21.1 GB/day
Total compressed (90 days) = 21.1 x 90 = 1,899 GB
Index overhead = 1,899 x 0.15 = 285 GB
Before replicas = 1,899 + 285 = 2,184 GB
With 2 replicas = 2,184 x 2 = 4,368 GB ≈ 4.27 TB

Result: 5,000 lines/sec | 105.5 GB/day raw | 4.27 TB total with replicas | ~$436/month storage

Example 2: High-Volume Microservices Architecture

Problem: 50 microservice instances generating 2,000 log lines per second each, 400 bytes average, 30-day retention, 6x compression, 20% index overhead, 3 replicas.

Solution:
Total lines/sec = 2,000 x 50 = 100,000
Raw bytes/sec = 100,000 x 400 = 40 MB/s
Raw per day = 40 MB/s x 86,400 s = 3,456,000 MB ≈ 3,375 GB/day ≈ 3.3 TB/day
Compressed per day = 3,375 / 6 = 562.5 GB/day
Total compressed (30 days) = 562.5 x 30 = 16,875 GB
Index overhead = 16,875 x 0.20 = 3,375 GB
Before replicas = 16,875 + 3,375 = 20,250 GB
With 3 replicas = 20,250 x 3 = 60,750 GB ≈ 59.3 TB

Result: 100K lines/sec | 3.3 TB/day raw | 59.3 TB total storage | Enterprise infrastructure required

Frequently Asked Questions

How do I estimate log volume for my application?

Log volume estimation requires understanding your application's logging patterns across different components and severity levels. A typical web server generates 100-1000 log lines per second under moderate traffic, with each line averaging 200-500 bytes for structured JSON logs or 100-300 bytes for plain text logs. Application logs vary widely based on logging verbosity configuration. Debug-level logging can generate 10-50x more volume than info-level logging. To estimate accurately, enable logging at your planned level for a representative period and measure the actual output. Common sources include HTTP access logs (one line per request), application logs (variable), database query logs (one per query if enabled), and system metrics logs. Remember that log volume scales with traffic, so plan for peak traffic periods, not just average load.
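One way to measure actual output is to capture a log file over a known window and compute the two rates the formula needs. This is a minimal sketch; the function name and its arguments are illustrative, not part of any particular tool.

```python
def summarize_log_sample(path, duration_seconds):
    """Given a log file captured over `duration_seconds`, report
    lines per second and average line size in bytes (newline included)."""
    lines = 0
    total_bytes = 0
    with open(path, "rb") as f:
        for line in f:
            lines += 1
            total_bytes += len(line)
    avg = total_bytes / lines if lines else 0
    return {"lines_per_sec": lines / duration_seconds,
            "avg_line_bytes": avg}
```

For example, a file captured over a 5-second window containing ten 100-byte lines yields 2.0 lines/sec and a 100-byte average, which then plug straight into the formula above.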

What compression ratios can I expect for different log formats?

Log compression ratios vary significantly based on log format and content redundancy. Plain text logs with repetitive patterns (like Apache access logs) typically achieve 5-10x compression with gzip. Structured JSON logs compress slightly less at 4-8x because of the repeated field name overhead, though this overhead itself compresses well. Binary formats like protobuf logs are already compact and may only achieve 2-3x additional compression. Log-specific compression algorithms and columnar storage formats used by systems like ClickHouse can achieve 10-20x compression for highly structured data. The compression level setting also matters. Gzip level 6 (default) provides a good balance, while level 9 achieves marginally better ratios at significantly higher CPU cost. Zstandard (zstd) generally outperforms gzip with better ratios and faster compression speeds, making it the preferred choice for modern log aggregation systems.
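You can estimate the ratio for your own logs directly. The sketch below uses Python's standard-library zlib (the same DEFLATE algorithm gzip uses) on a synthetic access-log-style sample; real ratios depend on your actual log content, so run it against a slice of your own data.

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Raw size divided by compressed size at the given DEFLATE level."""
    return len(data) / len(zlib.compress(data, level))

# Synthetic, highly repetitive access-log-style sample (illustrative only):
sample = b"".join(
    b'10.0.0.%d - - [12/Mar/2025:10:00:%02d] "GET /api/v1/items HTTP/1.1" 200 512\n'
    % (i % 250, i % 60)
    for i in range(10_000)
)
print(f"level 6: {compression_ratio(sample):.1f}x")
print(f"level 9: {compression_ratio(sample, 9):.1f}x")
```

A repetitive sample like this compresses far better than typical production logs, which is why measuring on real data matters before committing to a capacity plan.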

What are the costs of popular log management platforms?

Log management costs vary dramatically across platforms and scale. Datadog charges $0.10 per GB ingested per month, with 15-day retention included and additional charges for longer retention. Splunk Enterprise Cloud costs $150-200 per GB ingested per day for their standard tier. Elastic Cloud pricing starts around $95 per month for basic clusters with storage-based pricing. New Relic offers a free tier up to 100 GB per month, then charges $0.30 per GB. Self-hosted ELK (Elasticsearch, Logstash, Kibana) eliminates licensing costs but requires significant infrastructure investment, typically $0.05-0.15 per GB stored in cloud infrastructure. ClickHouse-based solutions like SigNoz offer open-source alternatives at lower operational costs. At scale (over 1 TB per day), self-hosted or open-source solutions often cost 3-10x less than managed SaaS platforms but require dedicated engineering resources for maintenance and reliability.
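A rough monthly-cost comparison can be sketched from per-GB ingest rates like those quoted above. This is a simplification that ignores pricing tiers, retention add-ons, and discounts, so treat it as a first-order estimate only.

```python
def monthly_ingest_cost(gb_per_day, per_gb_rate):
    """Monthly cost at a flat per-GB-ingested rate (30-day month).
    Real pricing has tiers and retention add-ons this ignores."""
    return gb_per_day * 30 * per_gb_rate

daily_gb = 105.5  # Example 1's raw daily volume
print(f"$0.10/GB ingested: ${monthly_ingest_cost(daily_gb, 0.10):,.0f}/mo")
print(f"$0.30/GB ingested: ${monthly_ingest_cost(daily_gb, 0.30):,.0f}/mo")
print(f"$0.05/GB (self-hosted, stored): ${monthly_ingest_cost(daily_gb, 0.05):,.0f}/mo")
```

Note the self-hosted figure quoted above is per GB stored rather than per GB ingested, so compression and retention shift that comparison further in self-hosting's favor.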

How does indexing affect log storage requirements?

Indexing enables fast full-text search across log data but adds significant storage overhead. Elasticsearch, the most common log search engine, creates inverted indexes that typically add 10-30 percent to the compressed data size. The exact overhead depends on the number and cardinality of indexed fields. Fields with high cardinality (like request IDs or user IDs with millions of unique values) create larger indexes than low-cardinality fields (like log level with only 5 values). Some systems allow selective indexing where you only index fields you frequently search on, reducing overhead to 5-15 percent. Columnar storage systems like ClickHouse use different indexing strategies that are more space-efficient for structured data. Consider whether all log fields need to be searchable or if you can save storage by only indexing key fields like timestamp, log level, service name, and error codes while keeping the full message for display only.
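To see what selective indexing saves, apply the overhead percentages to a compressed volume. The numbers below reuse Example 2's 16,875 GB compressed total; the 30% (index everything) and 10% (index only key fields) overheads are illustrative values from the ranges above.

```python
def index_overhead_gb(compressed_gb, overhead_fraction):
    """Extra storage consumed by search indexes."""
    return compressed_gb * overhead_fraction

full = index_overhead_gb(16_875, 0.30)       # every field indexed
selective = index_overhead_gb(16_875, 0.10)  # timestamp, level, service only
print(f"saved: {full - selective:,.0f} GB")  # 3,375 GB before replication
```

With 3 replicas, that difference triples, so indexing policy is worth deciding before the cluster fills up.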

What log aggregation architecture should I use for my scale?

Log aggregation architecture should match your scale and reliability requirements. For small deployments (under 10 GB per day), a simple agent-to-centralized-store pattern works well using Filebeat or Fluentd shipping directly to Elasticsearch or a managed service. Medium deployments (10-100 GB per day) benefit from adding a buffering layer like Apache Kafka or Redis between agents and the storage backend, which absorbs traffic spikes and prevents data loss during storage outages. Large deployments (over 100 GB per day) require a distributed architecture with local pre-processing agents that filter and sample logs before forwarding, a Kafka cluster for reliable buffering and multi-consumer distribution, and a horizontally scaled storage backend with proper sharding. At any scale, implement backpressure mechanisms so log volume spikes do not overwhelm your infrastructure, and use sampling strategies for high-volume debug logs to control costs.

How can I reduce log storage costs without losing important data?

Several strategies reduce storage costs while preserving useful log data. Log level management is the first optimization. Ensure production environments run at info or warning level, not debug, which can reduce volume by 80 percent or more. Structured logging with consistent formats compresses better and enables field-level storage optimization. Log sampling retains only a percentage of high-volume repetitive events like successful health check responses or routine heartbeats while keeping all error and warning logs. Parsing and filtering at the agent level prevents unnecessary data from ever reaching your storage backend. Tiered storage automatically moves older logs to cheaper storage tiers, with hot-warm-cold architectures providing 5-10x cost reduction for archived data. Aggregation and rollup replace detailed per-event logs with statistical summaries for older data, maintaining trend visibility without individual event storage costs.
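A level-aware sampling filter of the kind described above can be sketched in a few lines. The record schema (a dict with a 'level' key, and these level names) is an assumption for illustration, not any particular library's format.

```python
import random

def keep(record, sample_rate=0.01):
    """Keep every warning-or-worse record; keep a random 1% of the rest."""
    if record["level"] in ("WARN", "ERROR", "FATAL"):
        return True
    return random.random() < sample_rate

# 100,000 routine health checks plus a handful of real errors:
logs = ([{"level": "INFO", "msg": "health check ok"}] * 100_000
        + [{"level": "ERROR", "msg": "db timeout"}] * 5)
kept = [r for r in logs if keep(r)]
```

Every error survives while roughly 99% of the routine volume is dropped before it reaches storage; in production this filter would run in the shipping agent, not the application.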
