Question 1

What are the average frequencies of letters in English?

Accepted Answer

For standard English (case-folded, counting letters only, so spaces, digits, and punctuation are excluded), the most frequent letters are E (12.7%), T (9.1%), A (8.2%), O (7.5%), I (7.0%), N (6.7%), S (6.3%), H (6.1%), R (6.0%), D (4.3%), L (4.0%). The next tier, C, U, M, W, F, G, and Y, each fall at roughly 2–3%. The rarest are P (1.9%), B (1.5%), V (1.0%), K (0.8%), J (0.15%), X (0.15%), Q (0.1%), and Z (0.07%). Figures like these are derived from large samples of running text; balanced corpora such as the Brown Corpus and COCA (Corpus of Contemporary American English) give very similar results. The ranking has been broadly stable for centuries: Edgar Allan Poe's story "The Gold-Bug" (1843) already relies on E being the most common letter in English, although the order he gives for the remaining letters differs from modern counts. There is some variation across genres: fiction has more pronouns and common articles, legal text has longer words, and technical or legal writing has more capitalised terms and abbreviations, which changes the counts only if case is not folded.
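
As a rough illustration of how a count like this is produced, here is a minimal Python sketch. The function name `letter_frequencies` and the sample sentence are made up for this example; the convention (case-folded, only the 26 unaccented letters a–z counted) matches the one described above, and other conventions will give different percentages.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return each letter's share of all counted letters, as a percentage."""
    # Case-fold and keep only the 26 unaccented letters a-z.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {letter: 100 * count / total for letter, count in ranked}


# Hypothetical sample; far too short to reproduce the corpus figures.
sample = "The quick brown fox jumps over the lazy dog."
freqs = letter_frequencies(sample)
for letter, pct in list(freqs.items())[:5]:
    print(f"{letter}: {pct:.1f}%")
```

On a sample this short the ranking is dominated by chance (here O outranks E), which is why the percentages above are only meaningful for long texts.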

Question 2

How does character frequency analysis break substitution ciphers?

Accepted Answer

In a simple monoalphabetic substitution cipher, each letter is consistently replaced by another (A → Q, B → F, etc.). The ciphertext therefore preserves the frequency distribution of the underlying plaintext language, just under different labels. For sufficiently long messages (a few hundred characters or more), the most frequent ciphertext character usually corresponds to E in English plaintext. The second most frequent is often T, the third often A, and so on, although the lower ranks become less reliable as the message gets shorter. By matching ciphertext frequencies to known plaintext frequencies, you can make several likely substitutions straight away. Combined with knowledge of common digraphs (TH, HE, IN, ER, AN), trigraphs (THE, AND, ING), and short-word patterns (single-letter words are I or A; "the" is the most common 3-letter word), you can typically crack a basic substitution cipher with 200+ characters of ciphertext in 15–30 minutes by hand. Modern ciphers (AES, RSA) are immune because they are designed to produce ciphertext with a uniform distribution regardless of plaintext characteristics, precisely to defeat this kind of analysis.
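
The rank-matching step can be sketched in a few lines of Python. This is only a first pass under simple assumptions: the names `first_guess_key` and `apply_key` are invented for this example, the frequency order is the English ranking given in Question 1, and the demo key is generated with a fixed random seed. A real attack would go on to correct the guess using digraphs, trigraphs, and word patterns.

```python
import random
import string
from collections import Counter

# English letters from most to least frequent (ranking from Question 1).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def first_guess_key(ciphertext: str) -> dict[str, str]:
    """Map ciphertext letters to plaintext guesses purely by frequency rank."""
    counts = Counter(
        ch for ch in ciphertext.lower() if ch in string.ascii_lowercase
    )
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


def apply_key(ciphertext: str, key: dict[str, str]) -> str:
    """Substitute every letter found in the key; leave other characters alone."""
    return "".join(key.get(ch, ch) for ch in ciphertext.lower())


# Demo: encrypt a short message with a random (seeded) substitution alphabet.
alphabet = list(string.ascii_lowercase)
shuffled = random.Random(0).sample(alphabet, len(alphabet))
secret_key = dict(zip(alphabet, shuffled))

plaintext = "meet me at the entrance to the theatre at ten to seven tonight"
ciphertext = apply_key(plaintext, secret_key)

guess = first_guess_key(ciphertext)
print(ciphertext)
print(apply_key(ciphertext, guess))  # partially wrong on a text this short
```

On a message this short only the top one or two guesses are likely to be right; the value of the first pass is that it gives you a few anchors from which the remaining letters can be worked out.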

Question 3

Why do letter frequencies differ across languages?

Accepted Answer

Different sounds, different orthographies, different word frequencies. English E is common partly because of the many -e endings and the high frequency of "the". French E is even more common (~15%, or about 17% when accented forms such as é and è are counted with it) because of the silent final -e in many feminine forms and the article "le". German E (~16–17%) reflects the many -e endings and inflectional suffixes. Spanish A (~12.5%) reflects feminine endings and common articles. Italian E (~11.8%) and A (~11.7%) are high for similar reasons. Welsh W (~7%) far exceeds English W (~2.4%) because W functions as a vowel in Welsh (cwm, cwrt). Polish W (~4.6%) and the Polish-specific characters (Ł, Ż, Ź) reflect its distinctive orthography. Chinese romanised as pinyin has a very different distribution from languages natively written in Latin script, because pinyin syllables are built from a limited set of initial-final combinations. For language identification from short samples, these distributional differences are diagnostic: even 100 characters often suffice to tell English, Spanish, and French apart.

Question 4

What are the most common mistakes in character frequency analysis?

Accepted Answer

The first is inconsistent normalisation: case-folding (treating A and a as the same letter) and the handling of punctuation, spaces, and digits all affect the results, and different studies use different conventions. Always state how the count was made. The second is using small samples (under 200 characters): noise dominates and the frequencies are unreliable, so the most frequent letter in a 50-character sample might be S or T rather than E purely by chance. The third is comparing frequencies across very different genres without controlling for vocabulary; legal text has a very different distribution from casual conversation. The fourth is forgetting that the question being asked changes the answer: "what letter appears most often in this text", "what letter appears most often per word", and "what letter is most likely to start a word" all give different answers. The fifth is applying English-derived expectations to non-English text. Even closely related languages (Spanish, Italian, Portuguese) have different distributions, so never assume that "E is always the most common" holds outside English.
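
The first and fourth mistakes are easy to see on a toy input. The sketch below uses a made-up sentence and three different counting conventions; the variable names are illustrative only. It counts the same text case-sensitively, case-folded, and by word-initial letter.

```python
from collections import Counter

# Hypothetical sample, chosen so that the conventions disagree.
text = "Alice asked Anna about an apple. Eve never even sees the eleven bees."

# Convention 1: letters only, case-sensitive ("A" and "a" counted separately).
raw = Counter(ch for ch in text if ch.isalpha())

# Convention 2: letters only, case-folded.
folded = Counter(ch.lower() for ch in text if ch.isalpha())

# A different question: which letter most often starts a word?
initials = Counter(
    word[0].lower() for word in text.split() if word[0].isalpha()
)

print("case-sensitive:", raw.most_common(3))
print("case-folded:   ", folded.most_common(3))
print("word-initial:  ", initials.most_common(3))
```

In this sample E is the most frequent letter overall, but A is the most frequent first letter of a word, and the case-sensitive count splits A and E across upper and lower case. None of the three tallies is wrong; they answer different questions, which is why the convention has to be reported with the result.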

Question 5

When should I not use this calculator?

Accepted Answer

Skip it for tiny text samples (under 100 characters): the frequency estimates are dominated by chance and too noisy to be meaningful. Don't use it to distinguish closely related languages (Spanish vs Portuguese, Norwegian vs Danish) without much larger samples and a statistical test (for example, chi-square against reference frequencies). It's the wrong tool for modern cryptanalysis; all serious encryption schemes (AES, RSA, ECC) produce ciphertexts indistinguishable from random, so frequency analysis reveals nothing about the plaintext. Avoid it for non-Latin scripts unless you have language-specific reference distributions; Chinese, Japanese, Arabic, and Tamil all have completely different character-frequency landscapes that English-based intuitions don't inform. Don't use it for n-gram analysis or higher-order statistics; bigram and trigram frequencies are more informative than single-character frequencies for many tasks (text classification, authorship attribution, compression). Finally, for serious linguistic research, use established tools (NLTK, spaCy, R's quanteda), which are built for processing large corpora and support proper statistical reporting.
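
For readers who want to try the chi-square comparison mentioned above, here is a minimal sketch in plain Python. It rests on several assumptions: the function name `chi_square_vs_english` is invented for this example, and the reference table uses the percentages from Question 1, with commonly cited approximate values filled in for the letters that answer gives only as "2–3%". Treat the table as illustrative, not authoritative.

```python
from collections import Counter

# Approximate English letter percentages (assumed reference table).
ENGLISH_PCT = {
    "e": 12.70, "t": 9.06, "a": 8.17, "o": 7.51, "i": 6.97, "n": 6.75,
    "s": 6.33, "h": 6.09, "r": 5.99, "d": 4.25, "l": 4.03, "c": 2.78,
    "u": 2.76, "m": 2.41, "w": 2.36, "f": 2.23, "g": 2.02, "y": 1.97,
    "p": 1.93, "b": 1.49, "v": 0.98, "k": 0.77, "j": 0.15, "x": 0.15,
    "q": 0.10, "z": 0.07,
}


def chi_square_vs_english(text: str) -> float:
    """Pearson chi-square statistic of the text's letter counts vs English."""
    letters = [ch for ch in text.lower() if ch in ENGLISH_PCT]
    n = len(letters)
    if n == 0:
        raise ValueError("text contains no letters a-z")
    counts = Counter(letters)
    # Normalise so the expected counts sum exactly to n despite rounding.
    total_pct = sum(ENGLISH_PCT.values())
    stat = 0.0
    for letter, pct in ENGLISH_PCT.items():
        expected = n * pct / total_pct
        stat += (counts[letter] - expected) ** 2 / expected
    return stat


english_like = "It was the best of times, it was the worst of times. " * 5
unusual = "zzzz qqqq xxxx jjjj " * 5
print(chi_square_vs_english(english_like))
print(chi_square_vs_english(unusual))
```

A lower statistic means the text is closer to the reference distribution. Turning the statistic into a p-value requires a chi-square distribution with 25 degrees of freedom (26 letters minus one), and the approximation is poor when expected counts for rare letters are very small, which is another reason short samples are unsuitable.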

Character Frequency Calculator
