2011-12-23

1000 Languages on the Web

Click to see the full size image

About the Image

Since 2003 I've been gathering texts from the web written in indigenous and minority languages.  The image above is a "family tree" of the 1000 languages I've found to date, where proximity in the tree is measured by a straightforward statistical comparison of writing systems (details below).
  • When you load the full image it will be too big to fit in a browser window, and you may not see anything at first; you'll need to use the horizontal and vertical scrollbars to explore different parts of the tree (most browsers will also let you zoom in and out).  And because it's an SVG image, you can use your browser's search functionality (probably Ctrl+F or ⌘-F) to find different language codes, although the search behavior can be a bit weird/unpredictable.
  • Each language is colored according to its linguistic family (details here).  For example, all Indo-European languages are greenish colors, with different subfamilies (Celtic, Germanic, etc.) being slightly different shades of green.  I also tried to use similar colors for languages from the same geographical region even when there is no known genetic relationship among them, and so Arawakan, Quechuan, and Tucanoan languages (all from South America) are shades of purple, while Central and North American languages are shades of blue.
  • Clicking on a language opens a new tab or window with the documentation page for the ISO 639-3 language identifier where you'll find a name for the language in English and a link to its Ethnologue page for additional information.
  • What I'm calling "languages" are really "writing systems"; you'll see, for example, separate nodes for bo (Tibetan) and bo-Latn (Tibetan written in Latin script).  In a small number of cases I track macrolanguages, regional variants (e.g. en, en-IE, en-ZA), and some dialects.  In total, there are 919 distinct ISO 639-3 codes among the 1000 writing systems represented.
I'm using these data in collaboration with language groups all around the world to develop basic resources that help people use their language online: keyboard input methods, spell checkers, online dictionaries, and so on.  This work also underlies the Indigenous Tweets and Indigenous Blogs projects, which aim to strengthen languages through social media.  You can learn more about how indigenous and minority language communities are using the web, social media, and technology to help revitalize their languages by following us on Twitter.

The Gory Details

Everything is based on an analysis of three-character sequences ("3-grams") in the different languages. It turns out that computing the statistics of 3-grams in a given language provides a "fingerprint" that can be used for language identification and a number of other applications.  Specifically, imagine the huge-dimensional vector space V whose axes are labelled with all possible 3-grams of Unicode characters (dim V > 10^15).  Given a collection of texts in a language, you can compute the frequencies of all 3-grams that appear in the collection, defining a (sparse) vector in V "representing" the language.  We then define the distance between two languages to be the angle between their representative vectors in V.  This can be computed by scaling the vectors to unit length and computing their dot product (which is the cosine of the angle we want).
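
The computation described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the project: the function names `trigram_profile` and `angle_between` are made up for the example, and it assumes that 3-grams are taken over raw Unicode code points with no normalization or special handling of whitespace.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count every overlapping 3-character sequence in the text.

    The Counter acts as a sparse vector: only 3-grams that actually
    occur are stored. (Assumption: no case folding or normalization.)
    """
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def angle_between(p, q):
    """Angle in radians between two sparse 3-gram count vectors."""
    # Only 3-grams present in both profiles contribute to the dot product.
    dot = sum(count * q[gram] for gram, count in p.items() if gram in q)
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("a profile needs at least one 3-gram")
    # Dividing by the norms is the same as scaling both vectors to unit
    # length first; clamp to guard against floating-point drift.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    return math.acos(cosine)
```

Because counts are never negative, the angle always lies between 0 (identical 3-gram distributions) and 90 degrees (no 3-grams in common), so two texts in unrelated scripts sit at the maximum distance from each other.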

Once we know the distance between each pair of languages, we can reconstruct a phylogenetic tree using any of a number of well-known algorithms.  The image above was created using the so-called "neighbor-joining" algorithm (which basically builds the tree in a greedy, bottom-up way). A side-effect of the algorithm is that each edge in the tree is assigned a length, but note that the edge lengths in the rendered image have nothing to do with the computed edge lengths (indeed, it's unlikely that the tree can be rendered in a distance-preserving way in two dimensions).  Another side-effect of the algorithm is that the tree is connected.  By definition, all languages are within a bounded distance of each other (the angle between two frequency vectors is never more than 90 degrees), and so near the root of the tree you'll see various languages which use completely different scripts joined in a more-or-less random fashion (Khmer, Georgian, Tamil, Cherokee, etc.).  It would be easy enough to tweak the distance function or the algorithm to render languages with different scripts as separate connected components.
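
For readers who haven't seen neighbor-joining before, here is a minimal sketch of the textbook version of the algorithm in Python. It is not the script from the repository: the function name, the nested-dictionary input format, and the `node0`, `node1`, ... labels for internal nodes are all assumptions made for this example.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances.

    names: list of leaf labels.
    dist:  nested dict with dist[a][b] == dist[b][a] for all a != b.
    Returns a list of (node, neighbor, edge_length) tuples.
    """
    d = {a: dict(dist[a]) for a in names}  # working copy
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        # Total distance from each active node to all the others.
        totals = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Greedy step: pick the pair minimizing the Q criterion.
        best = None
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                q = (n - 2) * d[a][b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        node = "node%d" % next_id  # hypothetical label for the new internal node
        next_id += 1
        # Edge lengths from the joined pair to the new node.
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, node, len_a))
        edges.append((b, node, len_b))
        # Distances from the new node to every remaining node.
        d[node] = {}
        for c in active:
            if c != a and c != b:
                d[node][c] = d[c][node] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        active = [c for c in active if c != a and c != b] + [node]
    # The last two nodes are joined by a single edge.
    a, b = active
    edges.append((a, b, d[a][b]))
    return edges
```

Each pass through the loop replaces two nodes with one, so n languages always end up in a single connected tree with 2n - 3 edges, however far apart the languages are. This straightforward version takes roughly cubic time in the number of languages, which should still be manageable for 1000 of them.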

How many languages are out there?

Ethnologue lists 6909 living languages in the world, but how many have some presence on the web?  The answer depends greatly on what kinds of documents you include.  If one takes linguistic studies into account, the number might be as high as 4000 – the Open Language Archives Community (OLAC) brings together data from linguistic archives all over the world into a single, searchable interface.  The OLAC coverage page shows, at present, the existence of online resources for 3930 of the 6909 Ethnologue languages, with more material coming online every day.  The amazing ODIN project harvests examples of interlinear glossed text from linguistic papers, and has over 1250 languages in its database.

The 1000 languages found by my web crawler are, for the most part, what you might call "primary texts": newspapers, blog posts, Wikipedia articles, Bible translations, etc.  My best guess at present is that around 1500 languages have primary texts of this kind on the web.  If you know of online resources written in a language that's not listed on our status page, please let me know in the comments.

Here are a couple of closely-related (but ill-defined) questions: first, "How many of the 6909 languages have a writing system?" and second, since a great number of the texts we've found are Bible translations or other evangelical works, one might ask "How many languages have a writing system that's used regularly by members of the speaker community?"  I've looked around a bit for answers to these questions but I haven't found any careful studies in the literature.


Mash it up!

I put all of the data and scripts needed to generate the image in a GitHub repository.  I'm not an expert on data visualization, so I'm hoping others will grab the data and experiment.  One idea would be to use a more sophisticated algorithm for reconstructing the tree, such as Fitch-Margoliash. In terms of the visualization itself, it would be cool to do something that connects the tree to locations on a world map where the languages are spoken. There are also some JavaScript/HTML5 graph viewers that might provide a better browsing experience.  Or you might simply select the colors in different ways (perhaps colors for different typological features: for example, SVO, VSO, etc.).  Feel free to post additional ideas in the comments!

Thanks

First, I'd like to thank the hundreds of people who have contributed to the project over the years by providing training texts in many of the languages, correcting errors in the language identification, editing word lists, and helping separate different dialects/orthographies.  You'll find many of their names on the project status page. Thanks also to Michael Cysouw who first suggested generating an image of this kind (you can find his image, created in 2005, on the main project page). Finally, thanks to my colleagues at Twitter for several helpful conversations and for their interest in the Indigenous Tweets project.

11 comments:

  1. You're a genius, Kevin. Without a doubt.

  2. What is ga-x-slais? Is it the old convention of using slashes in place of the fada (acute accent)?

  3. That's it exactly - the old Gaelic-L convention. It's a great help to have a separate statistical model for those texts.

  4. And thank you, Peadar (but I don't think so!)

  5. Hi Kevin,
    Great blog, thank you! Here's the url for a blog in the Tutong language, not yet on your list:
    http://tutongkita.blogspot

    Tutong is spoken in Brunei (Borneo) by no doubt fewer than 10,000 speakers. The blog consists mainly of translations of news stories from English, but has some original content. The Ethnologue code is ttg, see:

    http://www.ethnologue.com/show_language.asp?code=ttg

    Best,

    Adrian

  6. Thank you Adrian!
    I added a model for Tutong, and added a page for the blog here:
    http://indigenoustweets.com/blogs/ttg/

    What is the proper name for the language in the language itself? Tutong?

  7. Here is a Twitter account that is all in Lakota (Sioux): http://twitter.com/#!/peterpaha

  8. Thank you for the pointer!

    All the ones I've found so far:
    http://indigenoustweets.com/lkt/

  9. http://www.pulaagu.com first ever website in Fulah (ff) created in 1994 under different domain names, from free subdomains to dyndns... But still dedicated to promoting this West African language spoken natively in 20 countries...
