Index of Coincidence
This online calculator computes the index of coincidence (IC, IOC) of a given text.
What the index of coincidence is, and how it is calculated, is explained below the calculator.
The index of coincidence
The index of coincidence is the probability that two letters selected at random from a text are the same. William F. Friedman first proposed this metric in Riverbank Publication No. 22, "The Index of Coincidence and Its Applications in Cryptography", published in 1922. In 1967, the historian David Kahn wrote:
Riverbank Publication No. 22, written in 1920, when Friedman was 28, must be regarded as the most important single publication in cryptology. It took the science into a new world. 1
From this definition, one can derive the formula for the IOC.
Let $N$ be the length of the text.
Let $n$ be the size of the alphabet.
Let $a_i$ be the i-th letter of the alphabet.
Let $F_i$ be the number of occurrences of the i-th letter in the text.
Then the probability of selecting $a_i$ twice is
$$\frac{F_i}{N} \cdot \frac{F_i - 1}{N - 1} = \frac{F_i (F_i - 1)}{N (N - 1)}$$
The total probability (which is the IOC) is the sum of these probabilities over all letters:
$$IC = \sum_{i=1}^{n} \frac{F_i (F_i - 1)}{N (N - 1)}$$
Note that the IOC is sometimes "normalized". This is usually done by multiplying the result by $n$, the alphabet's size.
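As an illustration of the formula, here is a minimal Python sketch. It is not the calculator's own code. The function name `index_of_coincidence` and its parameters are hypothetical, and it assumes a 26-letter Latin alphabet, ignoring case and every character outside A-Z.

```python
from collections import Counter


def index_of_coincidence(text: str, normalized: bool = False, alphabet_size: int = 26) -> float:
    """IC of the letters A-Z in `text`; case and all other characters are ignored."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)                      # N, the number of letters counted
    if total < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)                 # F_i for every letter that occurs
    ic = sum(f * (f - 1) for f in counts.values()) / (total * (total - 1))
    # Optional normalization: multiply by the alphabet size n (assumed 26 here).
    return ic * alphabet_size if normalized else ic


print(index_of_coincidence("AABB"))  # (2*1 + 2*1) / (4*3) = 4/12
```

Letters that never occur contribute zero to the sum, so iterating only over the observed counts gives the same result as summing over the whole alphabet.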
The calculator below parses the text and calculates the IOC using the formulas above. The section after the calculator explains why the IOC is so important.
Why is the index of coincidence so important?
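It is important because the expected index of coincidence for a given language can be calculated from that language's letter frequencies. If $p_i$ is the relative frequency of the i-th letter in the language, then its count $F_i$ in a text of length $N$ is approximately $p_i N$, which gives:
$$IC_{expected} = \sum_{i=1}^{n} \frac{p_i N (p_i N - 1)}{N (N - 1)} = \sum_{i=1}^{n} \frac{p_i (p_i N - 1)}{N - 1}$$
If $N$ is large enough, the fraction $\frac{p_i N - 1}{N - 1}$ is approximately $p_i$, which gives
$$IC_{expected} \approx \sum_{i=1}^{n} p_i^2$$
The expected index of coincidence can also be calculated for completely random text, in which all $n$ letters of the alphabet have the equal frequency $\frac{1}{n}$. It is indeed $n \cdot \frac{1}{n^2} = \frac{1}{n}$.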
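The two expressions for the expected value can be compared numerically. The sketch below is illustrative only: `expected_ic` is a hypothetical helper, and the uniform 26-letter distribution stands in for "completely random text". To get a language's expected IC, pass that language's letter frequencies instead.

```python
def expected_ic(probabilities, length=None):
    """Expected IC for letter probabilities p_i.

    With `length` (N) given, evaluates sum p_i (p_i N - 1) / (N - 1);
    without it, evaluates the large-N limit sum p_i^2.
    """
    if length is None:
        return sum(p * p for p in probabilities)
    return sum(p * (p * length - 1) / (length - 1) for p in probabilities)


uniform = [1 / 26] * 26               # assumption: 26 equally likely letters
print(expected_ic(uniform))           # the limit value, 1/26 up to rounding
print(expected_ic(uniform, 1000))     # slightly below 1/26 for a finite N
```

For finite `length` the value sits a little below the limit and approaches it as the text grows, which is why the simpler sum of squares is normally used.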
Given the expected index of coincidence, you can quickly assess a ciphertext that you suspect was produced by one of the "classical" ciphers. If its index of coincidence is high, close to the expected IC for the language, the text was probably encrypted with a transposition cipher or a simple (monoalphabetic) substitution cipher, since both leave the set of letter counts unchanged. If it is low, close to the expected IC for random text, the text was probably encrypted with a polyalphabetic cipher.
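The reasoning above depends on the IC being unchanged by monoalphabetic substitution and by transposition. The following sketch checks this on a short sample. The helpers `ic`, `caesar` and `vigenere`, the sample text and the key are all chosen here for illustration, and the cipher helpers assume input made of uppercase A-Z only.

```python
from collections import Counter


def ic(text):
    """Index of coincidence over the letters A-Z."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    return sum(f * (f - 1) for f in Counter(letters).values()) / (n * (n - 1))


def caesar(text, shift):
    """Monoalphabetic substitution: every letter is shifted by the same amount."""
    return "".join(chr((ord(c) - 65 + shift) % 26 + 65) for c in text)


def vigenere(text, key):
    """Polyalphabetic substitution: the shift depends on the letter's position."""
    return "".join(
        chr((ord(c) - 65 + ord(key[i % len(key)]) - 65) % 26 + 65)
        for i, c in enumerate(text)
    )


plain = "THEINDEXOFCOINCIDENCEISAUSEFULSTATISTIC"
# Substitution renames letters and transposition reorders them, so the
# multiset of counts, and therefore the IC, stays exactly the same.
print(ic(plain), ic(caesar(plain, 3)), ic(plain[::-1]))
# A repeating key spreads each plaintext letter over several ciphertext letters.
print(ic(vigenere(plain, "LEMON")))
```

On a text this short the polyalphabetic value is noisy. The drop toward the random-text level only becomes reliable on longer ciphertexts.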
According to Wikipedia,
The index of coincidence is useful in the analysis of natural-language plaintext and ciphertext analysis (cryptanalysis). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in the underlying plaintext can cause coincidences in the ciphertext. This technique is used to cryptanalyze the Vigenère cipher, for example. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system. Coincidence counting can help determine when two texts are written in the same language using the same alphabet. (This technique has been used to examine the purported Bible code). The causal coincidence count for such texts will be distinctly higher than the accidental coincidence count for texts in different languages or texts using different alphabets or gibberish texts. 2
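The column-based key-length test described in the quotation can be sketched as follows. This is one possible reading of the idea, not a reference implementation. The names `average_column_ic` and `rank_key_lengths` and the `max_width` limit are hypothetical, and the input is assumed to be already reduced to letters.

```python
from collections import Counter


def ic(letters):
    """Index of coincidence of a sequence of letters (0.0 if fewer than two)."""
    n = len(letters)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(letters).values()) / (n * (n - 1))


def average_column_ic(ciphertext, width):
    """Write the text row by row into `width` columns and average the column ICs."""
    columns = [ciphertext[i::width] for i in range(width)]
    return sum(ic(col) for col in columns) / width


def rank_key_lengths(ciphertext, max_width=20):
    """Candidate widths sorted from highest to lowest average column IC."""
    widths = range(1, min(max_width, len(ciphertext) // 2) + 1)
    return sorted(widths, key=lambda w: average_column_ic(ciphertext, w), reverse=True)


# Toy input with an obvious period of 3: widths 3, 6 and 9 give uniform columns.
print(rank_key_lengths("ABC" * 20, max_width=10)[:3])
```

Multiples of the true key length score about equally well, so in practice the smallest width among the top candidates is usually tried first.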
1. David Kahn, The Codebreakers, Macmillan, 1967.