K-mer Counter Calculator

Count k-mers in a DNA sequence and measure its complexity with our free science calculator. Uses standard scientific formulas with explanations.

Reviewed by Daniel Agrici, Founder & Lead Developer

Formula

Total k-mers = L - k + 1; Complexity = Unique k-mers / 4^k

Where L is the sequence length, k is the k-mer size, and 4^k represents all possible DNA k-mers of length k. Complexity measures the fraction of possible k-mers that actually appear in the sequence.
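The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function name `kmer_stats` and its return keys are hypothetical, and the code assumes the input contains only the letters A, C, G, and T.

```python
from collections import Counter


def kmer_stats(seq: str, k: int) -> dict:
    """Count k-mers in a DNA string and compute the complexity ratio.

    Assumes seq contains only A, C, G, T (no N or other ambiguity codes).
    """
    seq = seq.upper()
    if k < 1 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")

    total = len(seq) - k + 1  # Total k-mers = L - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    unique = len(counts)
    possible = 4 ** k  # all possible DNA k-mers of length k

    return {
        "total": total,
        "unique": unique,
        "possible": possible,
        "complexity": unique / possible,
        "counts": counts,
    }


stats = kmer_stats("ACGTACGT", 2)
print(stats["total"], stats["unique"], stats["complexity"])
```

For the 8 bp input `ACGTACGT` with k = 2, the sliding window yields 7 k-mers, of which 4 are distinct (AC, CG, GT, TA), so the complexity is 4/16.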

Worked Examples

Example 1: K-mer Analysis of a Short DNA Sequence

Problem: Given the DNA sequence ATCGATCGATCG (12 bp), count all 3-mers and determine the sequence complexity.

Solution:
Total 3-mers = 12 - 3 + 1 = 10
3-mers: ATC (x3), TCG (x3), CGA (x2), GAT (x2)
Unique 3-mers = 4
Possible 3-mers = 4^3 = 64
Complexity = 4/64 = 6.25%

Result: Total: 10 k-mers, 4 unique, 6.25% complexity (highly repetitive sequence)

Example 2: Comparing K-mer Sizes on a Repeat Region

Problem: For the sequence ATATATATATATAT (14 bp), compare 2-mer vs 3-mer counts.

Solution:
2-mer analysis: Total = 13, Unique = 2 (AT, TA), Possible = 16, Complexity = 12.5%
3-mer analysis: Total = 12, Unique = 2 (ATA, TAT), Possible = 64, Complexity = 3.13%
Larger k reveals lower complexity in this tandem repeat.

Result: 2-mers: 12.5% complexity | 3-mers: 3.13% complexity. Larger k better exposes the repetitive nature.
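The comparison in Example 2 can be reproduced for any list of k values with a short loop. The sketch below is illustrative only: `complexity_by_k` is a hypothetical helper name, and the input is assumed to contain only A, C, G, and T.

```python
def complexity_by_k(seq, k_values):
    """Return (k, total, unique, complexity) for each k in k_values.

    Assumes seq contains only A, C, G, T and every k is <= len(seq).
    """
    rows = []
    for k in k_values:
        total = len(seq) - k + 1
        unique = len({seq[i:i + k] for i in range(total)})
        rows.append((k, total, unique, unique / 4 ** k))
    return rows


for k, total, unique, complexity in complexity_by_k("ATATATATATATAT", [2, 3, 4]):
    print(f"k={k}: total={total}, unique={unique}, complexity={complexity:.2%}")
```

Because a perfect dinucleotide repeat contains only two distinct k-mers at any k of 2 or more, the numerator stays at 2 while 4^k grows, which is why the complexity falls as k increases.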

Frequently Asked Questions

How do I choose the right k-mer size for my analysis?

The optimal k-mer size depends on your application and organism complexity. Smaller k values (k=15-21) are useful for error correction and work well with low-coverage data, but may produce many false overlaps in repetitive genomes. Larger k values (k=31-127) improve specificity and resolve repeats better but require higher coverage and more memory. For bacterial genomes, k=21-31 often works well. For human genome assembly, k=51-101 is common. Many modern tools like SPAdes use multiple k-mer sizes simultaneously to balance sensitivity and specificity.

What does k-mer complexity or linguistic complexity mean?

K-mer complexity, as computed here, is the ratio of observed unique k-mers to the total number of possible k-mers (4^k for DNA). It is a simple, single-k form of what is sometimes called linguistic complexity. A complexity of 100% means every possible k-mer of that size appears at least once, which requires a sequence of at least 4^k + k - 1 bases; a shorter sequence cannot exceed (L - k + 1) / 4^k. Low complexity indicates a repetitive or biased sequence. For instance, the sequence AAAAAAA has only one 3-mer (AAA), giving a complexity of 1/64 = 1.56%. This metric helps identify low-complexity regions that may confound analyses. Masking tools such as DUST and RepeatMasker target the same kind of low-complexity and repetitive sequence, using their own related scoring methods.

How is k-mer counting used in genome size estimation?

K-mer frequency histograms from whole-genome sequencing data can estimate genome size without assembly. The principle is: Genome Size = Total k-mers / Peak k-mer coverage. You plot a histogram of k-mer frequencies, identify the main peak (representing single-copy regions), and divide the total number of k-mers by that peak depth. For example, if you have 3 billion total 21-mers and the coverage peak is at 30x, the estimated genome size is 3,000,000,000 / 30 = 100,000,000 bp, or ~100 Mb. Tools like GenomeScope and KmerGenie automate this process; GenomeScope also estimates heterozygosity and repeat content from the histogram shape.
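The division described above can be sketched as follows. This is a simplified illustration, not how GenomeScope or KmerGenie work internally. The function name, the toy histogram, and the `min_depth` cutoff (used here to ignore low-frequency k-mers that usually come from sequencing errors) are all assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer frequency histogram.

    histogram maps depth (how many times a k-mer was seen) to the number
    of distinct k-mers seen that many times. Bins below min_depth are
    ignored as likely sequencing errors; this cutoff is an assumption
    and would need tuning for real data.
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no histogram bins at or above min_depth")

    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=solid.get)

    # Total k-mers: each distinct k-mer contributes once per occurrence.
    total_kmers = sum(depth * n for depth, n in solid.items())

    return peak_depth, total_kmers / peak_depth


# Hypothetical toy histogram: an error spike at depth 1-2, a main peak
# at 30x, and a small repeat bump at 60x.
toy_histogram = {1: 500, 2: 50, 28: 100, 29: 300, 30: 1000, 31: 300, 32: 100, 60: 20}
peak, size = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}x, estimated size = {size:.0f} bp")
```

Taking the single tallest bin as the peak is the crudest possible choice; real histograms are noisy, and dedicated tools fit a statistical model to the whole curve instead.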
