2011-03-21

How many languages are out there?

    I added two more languages over the weekend: Inuktitut/ᐃᓄᒃᑎᑐᑦ, thanks to some prompting from Tim Pasch, and Rangi/Kɨlaangi, thanks to Oliver Stegen, who wrote what we think are the first tweets in that language.  As the site is set up now, it will only detect Inuktitut tweets written in syllabics, although if I have time I may extend it to find examples in Latin script as well.

    Two more language pages were translated over the weekend also: Chichewa, thanks to Edmond Kachale (who was kind enough to blog about us too), and Welsh, thanks to Carl Morris, Rhys Wynne, and Gareth Jones.

    Just how many languages are out there on Twitter?  This is a question I've been exploring for many years in the broader context of the web, where my Crúbadán web crawler has found documents written in almost 500 languages.   Those texts are used to train the language recognition algorithms that drive IndigenousTweets.com (I'm planning a blog post on the details of the language recognition).   I could conceivably add any of these 500 languages to IndigenousTweets, with the following restrictions:

  1. Twitter limits the number of queries I can make to their API, so I don't plan on adding any languages whose Twitter communities are more active than the top languages I have now: Haitian Creole, Basque, and Welsh.   It's unlikely I can even get everything in Creole; my friend Jean Came Poulard conjectures there may be at least half a million people tweeting in the language.
  2. My language recognition algorithms work well at the level of full documents, but things are more challenging with tweets of 140 characters or fewer, which often contain URLs, abbreviations, etc.   As a result, many languages that I'd like to include are proving difficult; for example, it is hard to distinguish the Filipino languages Cebuano, Tagalog, and Hiligaynon from one another.
  3. Finally, my guess is that there is no one using Twitter in the vast majority of the other 400+ languages, at least not yet.   I should mention that I've set up IndigenousTweets for several other languages and made a non-trivial attempt at finding tweeters, with no luck: Aymara, Bislama, Kashubian, Marshallese, Pohnpeian, Sango, and Songhay.

Please keep the suggestions for new languages coming, and if you can point me to one or two people you know are tweeting in the language, that's a big help.

12 comments:

  1. It just occurred to me that tweeting in Cherokee syllabics rather than the Roman alphabet would significantly extend the message length. Wouldn't it?

  2. Thank you for this wonderful service!

    Because language names sometimes overlap or have multiple names, is it possible to include the three-letter identifiers (ISO 639-3 codes)? They are used by Wikipedia and the Ethnologue.

    Just a suggestion, thank you.

  3. @Dennis: my understanding is that Twitter counts Unicode characters and not bytes, so in most cases you'd actually come out better using the syllabary: ᏣᎳᎩ is length 3 vs. tsalagi which is length 7. There's some info on this here.

  4. @wakablogger: thanks for the suggestion! Some growing pains at play here - the ISO 639 codes didn't seem necessary when I just had 3 or 4 languages!

  5. In obsédé mode anois! I have the 3M in Chinyanja / Chichewa already, so I can't reasonably harass Edmond Kachale.

    But I wonder if Oliver Stegen would like to provide a Rangi/Kɨlaangi version?

  6. Hi Kevin,
    Neat idea, and a great service!

    Maybe this will help push Twitter, phone manufacturers and so on to support identification of language better in mobile communications.

    I would second wakablogger's request to use ISO 639-3 codes, which cover more languages.

    Your languages page links to sites like www.ethnologue.com for individual languages. Suppose those sites wanted to link back to indigenoustweets.com for a language such as Hausa. How would they know whether to use indigenoustweets.com/ha/ or indigenoustweets.com/hau/ ? Apparently this URL requires 2-letter codes for some languages, and 3-letter codes for others. A single code set for this URL would be very helpful.

    Regards,
    Lars

  7. Thanks Lars. Good point regarding backlinks to the site - I'll find a solution, probably with redirects, so that /hau/ points to /ha/ (or vice versa). That said, the current scheme isn't that complicated: ISO 639-1 code if it exists, ISO 639-3 code if not. Wikipedia, Mozilla, and OpenOffice.org all use this scheme. It's nice for the languages that are quite used to their 2-letter code - most Irish speakers know that we're "ga" but not too many would be able to tell you our three-letter code ("gle").

  8. Congratulations - this looks like a really valuable site/service on a number of levels.

    Is it right to say there aren't any Australian Indigenous languages in your list yet?

  9. Thanks Andrew!

    That's right - no indigenous Australian languages yet, but I have some leads so I hope maybe soon. If you know of any Twitter users I should look at, please let me know!

  10. Just in case you're looking for it:
    http://en.wikipedia.org/wiki/ISO_639:a
    List of ISO 639 codes

  11. @Lars @wakablogger: I just added redirects so that you can also use the ISO 639-3 codes to link to the site - thanks for the suggestion. I'm slowly getting to all of the great ideas people have sent in - be patient with me!
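
The URL scheme discussed in comments 7 and 11 (ISO 639-1 code if it exists, ISO 639-3 code if not, with three-letter codes redirecting to the canonical URL) can be sketched roughly as below. This is only an illustration, not the site's actual code: the function names, the tiny lookup table, and the choice of an HTTP 301 status are all assumptions made for the example.

```python
# Hypothetical excerpt of an ISO 639-3 -> ISO 639-1 table.
# A real table would cover every language that has a two-letter code.
ISO_639_3_TO_1 = {
    "hau": "ha",  # Hausa
    "gle": "ga",  # Irish
    "cym": "cy",  # Welsh
}


def canonical_code(code: str) -> str:
    """Return the two-letter code if one exists, else the code unchanged."""
    return ISO_639_3_TO_1.get(code, code)


def resolve(path: str) -> tuple[int, str]:
    """Map a request path like '/hau/' to (status, location).

    301 here stands in for "redirect to the canonical URL";
    the real site may use a different mechanism.
    """
    code = path.strip("/")
    canonical = canonical_code(code)
    status = 301 if canonical != code else 200
    return status, f"/{canonical}/"


if __name__ == "__main__":
    for path in ("/hau/", "/ha/", "/gle/", "/lag/"):
        print(path, "->", resolve(path))
```

With a table like this, a site linking to /hau/ and one linking to /ha/ both end up on the same page, and a language with no two-letter code (Rangi, "lag") keeps its three-letter URL.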

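
The point in comment 3 about characters versus bytes can be checked with a few lines of Python. This is a minimal sketch: the helper names are made up for the example, and applying NFC normalization before counting is an assumption about how such a count would typically be done, not something stated in the thread.

```python
import unicodedata


def char_length(text: str) -> int:
    """Count Unicode code points after NFC normalization (assumed counting rule)."""
    return len(unicodedata.normalize("NFC", text))


def utf8_length(text: str) -> int:
    """Count the bytes the same text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for word in ("ᏣᎳᎩ", "tsalagi"):
        print(word, "characters:", char_length(word), "bytes:", utf8_length(word))
```

Under a per-character limit the syllabary spelling uses 3 of the 140 available characters where the romanization uses 7, even though the syllabary form takes more bytes in UTF-8 (9 versus 7).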