Two more language pages were also translated over the weekend: Chichewa, thanks to Edmond Kachale (who was kind enough to blog about us too), and Welsh, thanks to Carl Morris, Rhys Wynne, and Gareth Jones.
Just how many languages are out there on Twitter? This is a question I've been exploring for many years in the broader context of the web, where my Crúbadán web crawler has found documents written in almost 500 languages. Those texts are used to train the language recognition algorithms that drive IndigenousTweets.com (I'm planning a blog post on the details of the language recognition). I could conceivably add any of these 500 languages to IndigenousTweets, with the following restrictions:
- Twitter limits the number of queries I can make to their API, so I don't plan on adding any languages whose Twitter communities are more active than the top languages I have now: Haitian Creole, Basque, and Welsh. It's unlikely I can even get everything in Creole; my friend Jean Came Poulard conjectures there may be at least half a million people tweeting in the language.
- My language recognition algorithms work well at the level of full documents, but things are more challenging with tweets of 140 characters or fewer, which often contain URLs, abbreviations, etc. As a result, many languages that I'd like to include are turning out to be very challenging; for example, it is hard to distinguish the Filipino languages Cebuano, Tagalog, and Hiligaynon from one another.
- Finally, my guess is that there is no one using Twitter in the vast majority of the other 400+ languages, at least not yet. I should mention that I've set up IndigenousTweets for several other languages and made a non-trivial attempt at finding tweeters, with no luck: Aymara, Bislama, Kashubian, Marshallese, Pohnpeian, Sango, and Songhay.
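To give a feel for why short tweets are harder than full documents, here is a minimal sketch of one common approach to language identification: comparing character trigram counts against per-language profiles. This is not necessarily how IndigenousTweets does it (those details are for a later post). The function names and the tiny English and Spanish training strings are placeholders chosen for illustration; real profiles would be built from much larger text collections.

```python
import math
import re
from collections import Counter

# Tweets are full of URLs, @mentions, and #hashtags that say little
# about the language, so strip them before counting anything.
NOISE = re.compile(r"https?://\S+|[@#]\w+")


def clean(text):
    """Lowercase, drop URLs/mentions/hashtags, and collapse whitespace."""
    return " ".join(NOISE.sub(" ", text.lower()).split())


def trigram_profile(text):
    """Count overlapping character trigrams in the cleaned text."""
    padded = f" {clean(text)} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(tweet, profiles):
    """Return the best-scoring language and the full score table."""
    tweet_profile = trigram_profile(tweet)
    scores = {lang: cosine(tweet_profile, p) for lang, p in profiles.items()}
    return max(scores, key=scores.get), scores


# Placeholder training text (an assumption for this sketch only).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the weather is fine this morning",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el tiempo está bueno esta mañana",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

best, scores = identify("the weather is nice today http://t.co/abc @friend", PROFILES)
print(best, scores)
```

A 140-character tweet yields only a few dozen trigrams once the URLs and mentions are gone, so closely related languages that share most of their common trigrams can end up with nearly identical scores.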
Please keep the suggestions for new languages coming, and if you can point me to one or two people you know are tweeting in the language, that's a big help.