Question 1

How do you calculate ETL throughput requirements?

Accepted Answer

ETL throughput is calculated by dividing the total data volume by the available processing window. For example, 100GB of data in a 4-hour window requires 100GB / (4 * 3600 seconds) = 7.1 MB/s of raw throughput (taking 1GB = 1024MB). However, transformations add significant overhead: a medium-complexity transformation (joins, aggregations, type conversions) typically doubles the effective data moved, requiring 14.2 MB/s of processing capacity. Heavy transformations involving machine learning scoring, complex lookups, or data quality checks can multiply the raw figure by 3-6x. Always size for peak throughput plus a 20-30% buffer to account for variability in data distribution and system load; for the medium-complexity example, that means roughly 17-18.5 MB/s.
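The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name, the default 2x multiplier (the medium-complexity case), and the 25% buffer (the midpoint of the 20-30% range) are assumptions chosen to match the example, not a standard API.

```python
def required_throughput_mb_s(volume_gb, window_hours,
                             transform_multiplier=2.0, buffer_fraction=0.25):
    """Return (raw, processed, sized) throughput in MB/s.

    Assumes 1 GB = 1024 MB, as in the worked example above.
    """
    raw = volume_gb * 1024 / (window_hours * 3600)   # volume / window
    processed = raw * transform_multiplier           # transformation overhead
    sized = processed * (1 + buffer_fraction)        # headroom for variability
    return raw, processed, sized


raw, processed, sized = required_throughput_mb_s(100, 4)
print(f"raw={raw:.1f} MB/s, processed={processed:.1f} MB/s, sized={sized:.1f} MB/s")
# prints: raw=7.1 MB/s, processed=14.2 MB/s, sized=17.8 MB/s
```

For a heavy transformation, pass a transform_multiplier between 3 and 6 instead of the default.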

Question 2

How many parallel workers should an ETL pipeline use?

Accepted Answer

The optimal number of parallel workers depends on data volume, transformation complexity, and available resources. As a rule of thumb, start with one worker per 25GB of data for medium-complexity transformations (4 workers for a 100GB load). Adding workers provides near-linear scaling up to the point where I/O or network becomes the bottleneck. For most setups, 4-16 workers handle typical enterprise workloads efficiently. Beyond 16 workers, coordination overhead starts to offset parallelism gains. Key constraints include source database connection limits, target system write capacity, available CPU cores (at least 1 per worker), and memory (each worker needs its own buffer space). Monitor actual resource utilization and adjust the worker count to fine-tune.
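As a rough sketch of how the rule of thumb and the constraints combine, the helper below takes the volume-based starting point and caps it by CPU cores, the source connection limit, and the 16-worker point where coordination overhead starts to bite. The function name and parameters are hypothetical, and treating 16 as a hard cap is a simplification of the guidance above.

```python
import math


def suggest_workers(volume_gb, cpu_cores, source_connection_limit,
                    gb_per_worker=25, soft_cap=16):
    """Suggest a starting worker count for medium-complexity transforms."""
    by_volume = math.ceil(volume_gb / gb_per_worker)  # one worker per 25GB
    # The tightest constraint wins; never go below one worker.
    return max(1, min(by_volume, cpu_cores, source_connection_limit, soft_cap))


print(suggest_workers(100, cpu_cores=8, source_connection_limit=10))    # prints: 4
print(suggest_workers(1000, cpu_cores=32, source_connection_limit=50))  # prints: 16
print(suggest_workers(200, cpu_cores=4, source_connection_limit=10))    # prints: 4
```

The result is only a starting point; target write capacity and per-worker memory still need to be checked separately.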

Question 3

How much memory does an ETL pipeline need?

Accepted Answer

ETL memory requirements depend on worker count, batch size, record size, and transformation complexity. Each worker needs a base allocation (typically 64-128MB for framework overhead), plus buffer memory for read/write batches, plus transformation state (hash tables for lookups, aggregation buffers, sort memory). For medium-complexity transforms, plan for 200-500MB per worker. Heavy transforms with large dimension lookups can need 1-2GB per worker. A 4-worker pipeline with medium transforms typically needs 2-4GB total; the per-worker figures alone account for 0.8-2GB of that, so the total leaves headroom beyond the workers themselves. The biggest memory consumers are hash join operations and in-memory lookup tables, so consider switching to disk-based approaches if these exceed available memory.
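The per-worker breakdown (base allocation, batch buffers, transformation state) can be sketched as follows. The function name and the sample inputs are hypothetical, and counting exactly two batch buffers per worker (one for reads, one for writes) is an assumption made for illustration; real frameworks may hold more or fewer batches in memory.

```python
def per_worker_memory_mb(base_mb, batch_rows, record_bytes, state_mb):
    """Estimate one worker's memory in MB: base + batch buffers + state."""
    # Assumption: one read buffer and one write buffer, each holding a full batch.
    buffers_mb = 2 * batch_rows * record_bytes / 2**20
    return base_mb + buffers_mb + state_mb


# Hypothetical medium-complexity worker: 128MB base, 100,000-row batches
# of 512-byte records, 150MB of lookup and aggregation state.
per_worker = per_worker_memory_mb(128, 100_000, 512, 150)
total_4_workers = 4 * per_worker
print(f"per worker: {per_worker:.0f} MB, 4 workers: {total_4_workers:.0f} MB")
# prints: per worker: 376 MB, 4 workers: 1503 MB
```

With these inputs the per-worker figure lands inside the 200-500MB range quoted for medium-complexity transforms.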

Question 4

What IOPS are needed for ETL workloads?

Accepted Answer

ETL workloads are typically dominated by sequential I/O, which is more forgiving than random I/O. Read IOPS can be estimated by dividing throughput by block size: at 7 MB/s with 64KB blocks, you need about 112 read IOPS (7 * 1024 / 64). Write IOPS are typically 60-80% of read IOPS because aggregation reduces output volume. For SSDs, these numbers are easily achievable (modern SSDs deliver 50,000+ IOPS). For HDDs, sequential throughput is the limiting factor: a single HDD delivers 100-150 MB/s sequential. Cloud storage (S3, GCS) has different characteristics: high throughput but higher latency per request, making larger batch sizes more important. For data lakes, partition your data to enable parallel file reads.
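The IOPS estimate above reduces to a small calculation. The sketch below uses a hypothetical function name and takes the 60-80% write ratio from the answer as its default range.

```python
def estimate_iops(throughput_mb_s, block_kb=64, write_ratio=(0.6, 0.8)):
    """Return (read_iops, write_iops_low, write_iops_high)."""
    read_iops = throughput_mb_s * 1024 / block_kb  # throughput / block size
    return read_iops, read_iops * write_ratio[0], read_iops * write_ratio[1]


read, write_low, write_high = estimate_iops(7)
print(f"read={read:.0f} IOPS, write={write_low:.0f}-{write_high:.0f} IOPS")
# prints: read=112 IOPS, write=67-90 IOPS
```

Larger blocks lower the IOPS needed for the same throughput, which is one reason larger batch sizes help on high-latency object storage.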

