2011-04-08

Some Milestones

   We've reached some milestones in the last week. First, I've added 17 new languages to the site since the last update, so there are now 71 supported languages in all, more than twice the number we started with three weeks ago. Again, Michael Bauer helped with several of these, and a number of people also wrote to me after the BBC interview asking if I would support their language. Here's the full list of new languages:


   Among these are our first indigenous Australian language (Gamilaraay, with 3 speakers according to Ethnologue) and two other critically endangered languages: Ainu (~15 speakers in Japan) and Nawat (~20 speakers, all older). Thanks to Alan R. King, who provided training data for Nawat and who is responsible for the first couple of tweets in that language.

   We also have a number of new translations. The first round of translations came mostly from friends working on the Firefox localization teams. Many of the new ones come directly from members of supported language communities on Twitter: Rumantsch (Gion-Andri Cantieni, @gionandri), Setswana (Sternly Simon, @talk2ras), Kɨlaangi (Oliver Stegen, @babatabita), Occitan (Maxime Caillon, @caillonm), Kernewek/Cornish (John Gillingham, @Bodrugan), Brezhoneg/Breton (Ahmed Razoui, @duzodu), and Nawat/Pipil (Alan R. King, @alanrking). We also have a translation into Marshallese from Marco Mora, but no tweets in that language yet!

   One additional milestone. The site is generated by a program that "crawls" Twitter users, grabbing the tweets on their timelines and performing statistical language recognition on those tweets (details to come). Then, if a given user has more than a certain fraction of their tweets in the target language, that user's followers are added to a queue to be checked in the same way. In the last couple of days, the initial crawls for Basque and Welsh were completed, meaning the initial crawls for all languages, with the exception of Haitian Creole, are now complete. The number of users currently listed for each language should therefore be a good initial estimate of the total user base on Twitter. Of course, the program will continue to add new users as they are discovered by the crawler (through random search queries for words in each language) and as they are suggested via the form on each language page on IndigenousTweets.com.
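
   For readers who want to picture the crawl, here is a minimal sketch of the queue-based idea in Python. It is not the actual program behind the site: the function names (crawl, fetch_tweets, fetch_followers, is_target_language), the 0.5 threshold, and the toy data are all assumptions made for illustration, and a crude word-list check stands in for the statistical language recognition.

```python
from collections import deque

def crawl(seeds, fetch_tweets, fetch_followers, is_target_language,
          min_fraction=0.5):
    """Breadth-first crawl over users (illustrative sketch only).

    A user is accepted when more than `min_fraction` of their tweets are
    judged to be in the target language; only accepted users have their
    followers queued for checking. All names and the threshold value are
    hypothetical, not taken from the real crawler.
    """
    seeds = list(seeds)
    queue = deque(seeds)   # users waiting to be checked
    seen = set(seeds)      # users already queued, to avoid rechecking
    accepted = {}          # user -> fraction of tweets in the language

    while queue:
        user = queue.popleft()
        tweets = list(fetch_tweets(user))
        if not tweets:
            continue
        hits = sum(1 for tweet in tweets if is_target_language(tweet))
        fraction = hits / len(tweets)
        if fraction > min_fraction:
            accepted[user] = fraction
            for follower in fetch_followers(user):
                if follower not in seen:
                    seen.add(follower)
                    queue.append(follower)
    return accepted

# Toy data standing in for Twitter (entirely made up).
TIMELINES = {
    "ana": ["kaixo mundua", "egun on", "hello"],
    "ben": ["hello world", "good morning"],
    "cai": ["egun on denoi", "kaixo"],
    "dee": ["kaixo lagunak"],
}
FOLLOWERS = {"ana": ["ben", "cai"], "ben": ["dee"], "cai": ["ana"], "dee": []}
BASQUE_WORDS = {"kaixo", "egun"}

def looks_basque(tweet):
    # Crude stand-in for real statistical language identification.
    return any(word in BASQUE_WORDS for word in tweet.split())

result = crawl(
    ["ana"],
    lambda user: TIMELINES.get(user, []),
    lambda user: FOLLOWERS.get(user, []),
    looks_basque,
)
print(sorted(result))  # ['ana', 'cai']
```

   Note that "dee" is never checked in this toy run, because the only path to that user goes through "ben", who does not pass the threshold. That is one reason a follower crawl like this would need other sources of new users, such as search queries and suggestions.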

   Haitian Creole is a special case and will remain so.  As noted in an earlier post, we expect there are at least 100,000 people tweeting in Creole and it is unlikely I can keep up with all of them given the limits imposed by Twitter, but I will do my best.

   Next milestone: 100 languages!

13 comments:

  1. Congratulations Mr Scannell...Expressing deep appreciation for your work, I would also like to inform you about this Twitter account I manage, which tweets mostly in my mother tongue, Nepal bhasa (Newar language)...I hope this minority language from South Asia will also get space on your site...The script I currently use is Devanagari (the script used to write Hindi, Nepali and many other South Asian languages) as our native script, Nepal (Prachalit) Script, is currently just in the process of being encoded...I hope for your positive response and also draw your attention to another blog, written in the minority language of the indigenous Jumma people of Chittagong, Bangladesh. My Twitter account and the address of the blog are provided below:

    my twitter: @newa_issues
    And the blog of another minority blogging from Bangladesh: http://chtnewsupdate.blogspot.com/

  2. Dear Prabin, thanks - I'd like to add Newar to the site. To do language identification I would need a certain amount of text written in the language (in Devanagari script). Can you point me to a few web sites where there is some text I could use? If you can point to specific pages written primarily in Newar that helps (specific blog postings, for example).

    Are all the @newa_issues tweets in Devanagari script written in Newar, or are there some Nepali tweets too?

    I'll look into the Chittagong blog too, thanks for the pointers!

  3. Thank you Mr Scannell for such a quick response...below are some links where you can find text in Nepalbhasa (Newar language) in Devanagari script.

    http://jwajalapa.com/index.php?option=com_content&view=article&id=92:2008-10-20-23-14-30&catid=1:latest-news&Itemid=40

    http://nepalbhasa.co.cc/learn/nb/0_greetings.htm

    http://nepalmandal.com/ (Nepalbhasa news site - all articles in Nepalbhasa)

    Almost 99% of the tweets from newa_issues are in Nepalbhasa (mostly news and other issues), but some are in Nepali (also in Devanagari). So I wonder how you will be able to differentiate (if only we could use our own script on the internet).

    Anyways, I hope this will encourage more people to start tweeting in Nepalbhasa.

    PS: The name of the language is Nepalbhasa or you can write Nepalbhasa (Newar language)

  4. Here is a short text for you in Newar (नेपाल भाषा):

    http://www.smo.uhi.ac.uk/sengoidelc/donncha/tm/ilteangach/?teanga=new

    And the same thing in Nepali (नेपाली):

    http://www.smo.uhi.ac.uk/sengoidelc/donncha/tm/ilteangach/?teanga=ne

  5. Thanks Prabin (and Donncha); these files worked well and I was able to add Newar to Indigenous Tweets:

    http://indigenoustweets.com/new/

  6. That's a great project! Thanks for doing it. Will the Catalan language be added to the list? I hope so. It is not yet official in the European Union, it was banned in both Spain and France for centuries, and it is an official language only in Andorra. But I think that we are a lot of people :D

    I think this website could help you. It is about languages and is called "linguamón, casa de les llengües" (linguaworld, home of the languages)

    http://www10.gencat.cat/casa_llengues/AppJava/ca/index.jsp

  7. @Derelicte: thanks for the link. Yes there are many people tweeting in Catalan - it was at the top of the list of languages I wanted to add!

    One problem: the Twitter API limits the number of queries I can make to their database so I don't have the capacity to do it quite yet, but stay tuned, I'm trying to find a workaround.

  8. Aragones:

    @purnas

    Web: http://an.wikipedia.org/wiki/Portalada

  9. @pasapues:
    http://indigenoustweets.com/an/

    The language identification is difficult, so the percentages of tweets in Aragonese that are reported are probably incorrect (too low).

    Feel free to add users through the form on the site and they'll get added eventually (not immediately).

    Thanks!

  10. You can add the @Purhepechas Twitter user to the P'urhépecha language. We are from Michoacán, a state of México. Everyone is welcome at the indigenous page: www.Purhepecha.com

    Everything on your page is great!

    Thank you!

    :)

  11. Nicely done! Here is another link, to my blog (www.linguodiversitat.wordpress.com). Using new technologies to help save endangered languages is the right thing to do... The wrong thing is thinking there is no solution.
    I agree that Catalan is unfortunately a minorized language. It is not in a catastrophic situation compared to other languages, but it is indeed endangered. As we say in Catalan, "El català té una mala salut de ferro", which means something like "the Catalan language has cast-iron unhealthiness".
    Nevertheless, I understand how difficult it must be to include Catalan in Indigenous Tweets, given the vast number of users. It would surely beat Kreyòl: almost every Catalan speaker with Twitter writes something in Catalan... there are even some trending topics in Catalan (#bcndecideix, for example).
    Good job, and keep going with your fantastic task. PS: Don't hesitate to ask me on my blog if you need to translate something into Catalan

  12. @Taurons: thanks for writing - I'm glad to discover your blog, and I'll add a link to it from here (see "Friends of Indigenous Tweets").

    I assume you know about http://umap.cat/. The total numbers of users they report for Euskara and Welsh are about half of the totals I've discovered (I think my language identification works a bit better). So if we assume the same is true for Catalan, that would be around 40,000 users in all - very impressive! (That said, our best guess for Haitian Creole is greater than 100,000 users!)

  13. @Tatá: I have added the @Purhepechas user: http://indigenoustweets.com/tsz/. Let me know if you want to translate the page to your language. Thanks for the suggestion!
