|Click to see the full size image|
About the Image
Since 2003 I've been gathering texts from the web written in indigenous and minority languages. The image above is a "family tree" of the 1000 languages I've found to date, where proximity in the tree is measured by a straightforward statistical comparison of writing systems (details below).
- When you load the full image it will be too big to fit in a browser window and you may not see anything at first – you'll need to use the horizontal and vertical scrollbars to explore different parts of the tree (most browsers will also let you zoom in and out). And because it's an SVG image, you can use your browser's search functionality (probably Ctrl+F or ⌘+F) to find different language codes, although the search behavior can be a bit unpredictable.
- Each language is colored according to its linguistic family (details here). For example, all Indo-European languages are greenish colors, with different subfamilies (Celtic, Germanic, etc.) being slightly different shades of green. I also tried to use similar colors for languages from the same geographical region even when there is no known genetic relationship among them, and so the Arawakan, Quechuan, and Tucanoan languages (all from South America) are shades of purple, while Central and North American languages are shades of blue.
- Clicking on a language opens a new tab or window with the documentation page for its ISO 639-3 language identifier, where you'll find a name for the language in English and a link to its Ethnologue page for additional information.
- What I'm calling "languages" are really "writing systems"; you'll see, for example, separate nodes for bo (Tibetan) and bo-Latn (Tibetan written in Latin script). In a small number of cases I track macrolanguages, regional variants (e.g. en, en-IE, en-ZA), and some dialects. In total, there are 919 distinct ISO 639-3 codes among the 1000 writing systems represented.
The Gory Details
Everything is based on an analysis of three-character sequences ("3-grams") in the different languages. It turns out that computing the statistics of 3-grams in a given language provides a "fingerprint" that can be used for language identification and a number of other applications. Specifically, imagine the huge-dimensional vector space V whose axes are labelled with all possible 3-grams of Unicode characters (dim V > 10^15). Given a collection of texts in a language, you can compute the frequencies of all 3-grams that appear in the collection, defining a (sparse) vector in V "representing" the language. We then define the distance between two languages to be the angle between their representative vectors in V. This can be computed by scaling the vectors to unit length and computing their dot product (which is the cosine of the angle we want).
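To make the 3-gram comparison concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code behind the image: the function names (`trigram_profile`, `cosine`, `angle`) and the toy strings are made up for this example, and it skips any text cleanup that a real crawler would need.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count every overlapping 3-character sequence in the text (a sparse vector)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    """Cosine of the angle between two sparse 3-gram count vectors."""
    if len(p) > len(q):
        p, q = q, p  # iterate over the smaller profile
    dot = sum(count * q.get(gram, 0) for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is treated as orthogonal to everything
    return dot / (norm_p * norm_q)

def angle(p, q):
    """Angle in radians between two profiles; the clamp guards against rounding."""
    return math.acos(max(-1.0, min(1.0, cosine(p, q))))

# Toy strings standing in for whole text collections (hypothetical data).
sample_a = trigram_profile("the cat sat on the mat")
sample_b = trigram_profile("the cat sat on the hat")
sample_c = trigram_profile("αβγδε αβγδε")

print(angle(sample_a, sample_b))  # small: most 3-grams are shared
print(angle(sample_a, sample_c))  # pi/2: no 3-grams in common
```

Dividing by the two norms is the same as scaling each vector to unit length first. Two texts that share no 3-grams at all, such as texts in different scripts, always come out at a right angle, which is the largest distance possible since the counts are never negative.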
Once we know the distance between each pair of languages, we can reconstruct a phylogenetic tree using any of a number of well-known algorithms. The image above was created using the so-called "neighbor-joining" algorithm (which basically builds the tree in a greedy, bottom-up way). A side-effect of the algorithm is that each edge in the tree is assigned a length, but note that the edge lengths in the rendered image have nothing to do with the computed edge lengths (indeed, it's unlikely that the tree can be rendered in a distance-preserving way in two dimensions). Another side-effect of the algorithm is that the tree is connected – by definition, all languages are within a bounded distance of each other – and so near the root of the tree you'll see various languages which use completely different scripts joined in a more-or-less random fashion (Khmer, Georgian, Tamil, Cherokee, etc.). It would be easy enough to tweak the distance function or the algorithm to render languages with different scripts as separate connected components.
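The greedy, bottom-up construction can be sketched in a few lines of Python. This is a textbook version of neighbor-joining written for readability, not the program that produced the image; the `neighbor_joining` function, the `frozenset`-keyed distance table, and the `node0`, `node1`, ... labels for internal nodes are all assumptions of this sketch (leaf names are assumed not to clash with those labels).

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree; returns edges as (node_a, node_b, length) tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)          # working copy; grows as internal nodes are added
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Greedy step: join the pair that minimizes the Q criterion.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ab = d[frozenset((a, b))]
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        edges += [(a, u, len_a), (b, u, len_b)]
        nodes = [x for x in nodes if x not in (a, b)]
        # Distance from the new node to every remaining node.
        for k in nodes:
            d[frozenset((u, k))] = (d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab) / 2
        nodes.append(u)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # connect the last two nodes
    return edges

# Toy distances between five made-up writing systems (hypothetical data).
toy = {frozenset(pair): value for pair, value in {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}.items()}

for edge in neighbor_joining(["a", "b", "c", "d", "e"], toy):
    print(edge)
```

Each pass recomputes all pairwise scores, so this naive version takes cubic time in the number of languages. With n leaves the result always has 2n - 3 edges, each carrying the computed length mentioned above.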
How many languages are out there?
Ethnologue lists 6909 living languages in the world, but how many have some presence on the web? The answer depends greatly on what kinds of documents you include. If one takes linguistic studies into account, the number might be as high as 4000 – the Open Language Archives Community (OLAC) brings together data from linguistic archives all over the world into a single, searchable interface. The OLAC coverage page shows, at present, the existence of online resources for 3930 of the 6909 Ethnologue languages, with more material coming online every day. The amazing ODIN project harvests examples of interlinear glossed text from linguistic papers, and has over 1250 languages in its database.
The 1000 languages found by my web crawler are, for the most part, what you might call "primary texts": newspapers, blog posts, Wikipedia articles, Bible translations, etc. My best guess at present is that around 1500 languages have primary texts of this kind on the web. If you know of online resources written in a language that's not listed on our status page, please let me know in the comments.
Here are a couple of closely-related (but ill-defined) questions: first, "How many of the 6909 languages have a writing system?" and second, since a great number of the texts we've found are Bible translations or other evangelical works, one might ask "How many languages have a writing system that's used regularly by members of the speaker community?" I've looked around a bit for answers to these questions but I haven't found any careful studies in the literature.
Mash it up!
First, I'd like to thank the hundreds of people who have contributed to the project over the years by providing training texts in many of the languages, correcting errors in the language identification, editing word lists, and helping separate different dialects/orthographies. You'll find many of their names on the project status page. Thanks also to Michael Cysouw who first suggested generating an image of this kind (you can find his image, created in 2005, on the main project page). Finally, thanks to my colleagues at Twitter for several helpful conversations and for their interest in the Indigenous Tweets project.