If you are a language enthusiast, you have probably wondered how similar Spanish is to Portuguese, or what French and Italian have in common. Romance languages share many grammatical constructs and use their tenses in much the same way. I am sure this is a subject of intense discussion among comparative linguists at various language faculties, but I have the impression that it is mostly talked about and rarely quantified. So let’s try to answer this question with some data.
But how do we measure it?
One way to estimate the similarity between two text strings is the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The smaller the distance, the more similar the two strings are.
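To make the definition concrete, here is a minimal Python sketch of the standard dynamic-programming way to compute the distance. It is only an illustration, not necessarily the code used for this analysis, and the function name `levenshtein` and the example words are my own choices.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # current[0]: turning a[:i] into the empty string takes i deletions
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("noche", "noite"))     # 2 (Spanish vs. Portuguese "night")
```

Only two rows of the table are kept at any time, so memory use grows with the length of one string rather than with the product of both lengths.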
For this small piece of research, I collected a total of about 110 sentences from texts available in French, Italian, Spanish, Portuguese and Romanian.
Of course, the results won’t tell you much about grammar or phonetics, but they will tell you about lexical similarities.
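For the curious, the comparison could be run roughly like this. The sketch assumes the sentences are stored as aligned lists, so that sentence number i is the same sentence in every language. The tiny `corpus` below is a made-up placeholder, not the roughly 110 sentences used here, and the helper names `average_distances` and `closest` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with two rows of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def average_distances(corpus):
    """Mean Levenshtein distance over aligned sentences, for every language pair."""
    result = {}
    for x, y in combinations(corpus, 2):
        d = mean(levenshtein(s, t) for s, t in zip(corpus[x], corpus[y]))
        result[(x, y)] = result[(y, x)] = d  # the distance is symmetric
    return result


def closest(lang, distances):
    """Language with the smallest average distance to lang."""
    candidates = {other: d for (first, other), d in distances.items() if first == lang}
    return min(candidates, key=candidates.get)


# Placeholder data only: two everyday phrases per language.
corpus = {
    "fr": ["bonne nuit", "merci beaucoup"],
    "it": ["buona notte", "grazie mille"],
    "es": ["buenas noches", "muchas gracias"],
    "pt": ["boa noite", "muito obrigado"],
    "ro": ["noapte bună", "mulțumesc mult"],
}

distances = average_distances(corpus)
for lang in corpus:
    print(lang, "->", closest(lang, distances))
```

Raw distances grow with sentence length, so one option I would consider (an assumption on my side, not something measured here) is dividing each distance by the length of the longer sentence before averaging.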
Are you ready?
If we measure the average of the distances between the examined text strings, we get the following results:
If we scale the results to show which language is closest to the one in the header, it looks something like this:
What does this tell us?
The closest language to French is Italian, to Italian it is Spanish, to Spanish it is Portuguese and vice versa (what a surprise 😀), and to Romanian it is Italian.
Which Romance language is most similar to English?
According to this simple analysis, the closest language to English is Spanish, followed by Portuguese and French. But I guess we would need a larger sample to make the result statistically significant.
Does this analysis have any limitations?
Yes, of course it does. The results may differ slightly from reality for various reasons; for one, the selected texts may not be representative of all language usage. Still, it gives us a data-driven answer.