2016-02-12

Manx-to-English machine translation

A couple of months ago I posted here about an experiment which involved linking up the Intergaelic machine translation system with Google Translate to create the first ever Scottish Gaelic to English MT system.  Now that I've released Manx Gaelic to Irish MT on Intergaelic, I thought I would try the same experiment.
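The experiment amounts to pivot translation: the output of one system becomes the input of the next. Here is a minimal sketch of that chaining, assuming each system can be wrapped as a function from string to string. The functions `manx_to_irish` and `irish_to_english` are hypothetical stand-ins that only tag their input; they are not the real Intergaelic or Google Translate interfaces.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot(first: Translator, second: Translator) -> Translator:
    """Compose two translators: source -> pivot language -> target."""
    def translate(text: str) -> str:
        # Work paragraph by paragraph so the blank lines between paragraphs survive.
        paragraphs = text.split("\n\n")
        return "\n\n".join(second(first(p)) if p.strip() else p for p in paragraphs)
    return translate


# Hypothetical stand-ins for the two real systems (NOT their actual APIs).
def manx_to_irish(text: str) -> str:
    return f"ga({text})"


def irish_to_english(text: str) -> str:
    return f"en({text})"


manx_to_english = pivot(manx_to_irish, irish_to_english)
```

Any errors made in the first step are passed on to the second, which is one reason the final English can drift from the source.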

Here is the Manx source text I used: http://www.manxradio.com/news/manx-gaelic/dong-looks-to-manx-waters-for-wind-farm/

The Irish translation via Intergaelic:

I trádáil d'fhéadfadh a bheith fiú cúig milliún punt sa bhliain do Manainn, tá Bree-Geayee Dong tar éis tabhairt ainm le comhaontú léasa le rialtas Mhanann, le measúnú limistéar den ghrinneall farraige, amach as cósta thoir an Oileáin.

Is céim thábhachtach ach luath sa tionscadal bhí déanta Dé luain, le neart comhchomhairle agus suirbhéanna comhshaoil fós le bheith déanta. Chomh maith leis sin, is éigean aire a thabhairt do na córais d'fhéadfadh feirm ghaoithe déanamh ar raaidyn-lhuingys agus meá iascaireachta.

Mar sin féin, má tá an limistéar cuí, d'fhéadfadh Dong tógáil céad muileann gaoithe idir sé agus dhá mhíle dhéag amach as an cósta thoir, ó Rhumsaa go háit ó thuaidh de Lacksey, taobh istigh de cagliagh-marrey an Oileán. Ag an am atá ann, ní eol dóibh cé chomh ard agus bheadh na muilte gaoithe.

Tá Dong tar éis ceangal é féin do déanamh roinnt shuirbhéanna den ghrinneall farraige, le fáil amach bheadh sé cuí dó suí tuirbíní.

Plean an chomhlacht le tógáil feirm ghaoithe ollmhór amach as cósta theas an Oileáin, An Líonra Ceilteach, bhí sé caite amach mar gheall ar staid an ghrinneall farraige sa limistéar sin.

Má gabhfaidh na pleananna ar aghaidh, d'fhéadfadh na muilte gaoithe a bheith ag obair go luath sna blianta fiche. Roimhe sin, beidh ar Dong margadh go rathúil ag mórán comhchomhairle le pobal Mhanann, agus le déanamh staidéir comhshaoil.

Benji Sykes, Leas-Uachtarán Dong de Pooar-Geayee an Ríocht Aontaithe, tá sé rá go bhfuil rudaí féachaint le bheith dearfach, ach tá obair fós le déanamh.

And the resulting English via Google Translate:

In trading could be worth five million pounds a year for the Isle of Man, Bree-Geayee Dong has been giving name to a lease agreement with the government of Man, the assessment of seabed area, from the east coast of the Island.
But an important step in the project was made early Monday, with plenty of consultation and environmental surveys still to be done. Also, is forced to care for the system could do a wind farm and weighing lhuingys raaidyn-fishing.
However, if the relevant area could build Dong windmill between six hundred and twelve thousand out of the east coast, from north to place Rhumsaa of Lacksey, within cagliagh-marrey the Island. At the time there, do not realize how high would the windmills.
Dong has been bound himself to do some surveys of the seabed, found out it would be appropriate for him to sit turbines.
The company plans to build a massive wind farm off the south coast of the island, The Celtic Network, he was worn out because of the state of the seabed in the area.
If the plans entail, the windmills could be at work in the early twenty years. Previously, Dong will successfully market many community consultation Man, and to do environmental studies.
Benji Sykes, vice president of Dong-Geayee Pooar the UK, it is said that things look to be positive, but there is still work to be done.

2015-12-02

Scottish Gaelic → English machine translation

(English summary: For fun, I linked up my Scottish Gaelic to Irish machine translation system with Google Translate to produce what may be the first Gàidhlig to English translator.  It's not completely terrible.)

    This year I created a bilingual dictionary and a Scottish Gaelic → Irish translation engine (both available on the site intergaelic.com, thanks to Michal Měchura).  For the fun of it, I ran a random article from BBC Alba through Intergaelic to get an Irish version, and then ran the output through Google Translate from Irish to English. I don't know whether something like this would be useful to anyone; perhaps learners of Scottish Gaelic (who have no Irish!) would benefit from it as a way to get the general gist of a news story.
    I understand that some people in Scotland have a keen interest in a Scottish Gaelic ↔ English machine translation system, and that the question is extremely controversial (see "Do minority languages need machine translation?" and "The spectre of Google Translate for Gaelic"). I don't want to stick my oar into that debate, other than to say that this little experiment demonstrates one effective approach.

Here is the source text: http://www.bbc.co.uk/naidheachdan/34973516

And the automatic translation, with no cleanup:

Council liked the Gaeltacht would have to save £ 20m next year.

They are now saying that there will be £ 40m - after the chancellor's
speech the week went.

They are saying that this is now an understanding that everything they
are doing to be under scrutiny of the cuts.

"The Council must save £ 40m over the next year. The huge amount of
money that," said former Chathraiche Resource Committee of Council,
Alasdair Macfhionnghain.
Services

"It is more than £ 20m as well as our desire was.

"We need to be viewed services. This is probably a function of cutting services.

"There are proposals to close them all Council offices Friday evening,
and will probably also the closing schools. We look at things like
that too.

"Everything must be scrutinized. We need to look at education.

"We strive to maintain services," he said.

Council Tax is to be frozen, and did not put up from 2008/09, and that
it focus on budget councils.

Gaeltacht Council could rise to the condemned, both dtogróidís, but
would spend to pay 30% tax on every £ 1m would get through doing so.

"We need to go back again to the board, and have everything come under
scrutiny," said Mgr Macfhionnghain.

"We will talk with the Gaeltacht department and department of
Scotland. We need to talk and see what we have to do," he said.

Mgr Macfhionnghain said that the Council is likely to apply for the
one over there, see people from the exploitation against the will.

"At this level, we are not looking for that," he said.

"But as I said before, we must look at all.

"And the one over there, must also be viewed as the services.

"My perspective is also a group of prevention services over the should.

"We must also viewed.

"The roads, things like that, and look after our homes.

"We must look at the costs there," he said.


2014-12-16

A linguistic atlas for the 21st century

tl;dr: I would like to create a modern linguistic atlas for Irish. If you write in Irish online, send me (kscanne at gmail dot com) the name of your home town, or the place where you acquired your Irish.

Wagner 1.0 

I like maps (see here and here) and I like the Irish language.  So it follows that the "Linguistic Atlas and Survey of Irish Dialects" (LASID) by Heinrich Wagner, a linguistic atlas of Irish published in four volumes between 1958 and 1969, is one of my favorite books.  It was based on linguistic fieldwork that Wagner and his colleagues carried out with native speakers of Irish between 1949 and 1956. I have a copy here at home, and I have spent many a fine quiet evening with it over the years.

Now I would like to create a new linguistic atlas, one that shows the language as it is today, and especially as it is used on the Internet and in social media.  Wagner et al. focused on native speakers only ("... people whose first language had been Irish only, or both Irish and English"), and for the most part on people who had lived in the same area all their lives. My aim is different: to depict the language as it is spoken by a diverse, mobile, global community.  So I would prefer to include Irish people who learned the language at school, and even learners abroad, on an equal footing with native speakers.  My working title for this project is "Wagner 2.0", and so I will refer to LASID as "Wagner 1.0" in the discussion below.

Wagner 1.5

I have already made some progress on this, using some data that are freely available on the Internet.

First, I want to compare the maps in Wagner 1.0 with the ones that will be based on data from the Internet.  To that end, I tried to find the latitude and longitude of the 91 places mentioned in Wagner 1.0 (you will find those data here; corrections are welcome) so that I could display everything together on Google Maps.

The team behind Foclóir Stairiúil na Nua-Ghaeilge (the Historical Dictionary of Modern Irish) at the Royal Irish Academy has just published a large number of older texts on its website: more than 10 million words.  These texts come with some metadata, for example the author's name, date of publication, title, category (prose or poetry), etc.  I used that metadata to search for each author in the corpus on the wonderful site ainm.ie, and when I managed to find the right person, I was able to follow a link from ainm.ie to its sister site logainm.ie, which has details of the person's place of birth, including latitude and longitude.  Phew.

Now, with all of these data, the maps from Wagner 1.0 can be greatly extended and enriched.  For example, I was interested in the use of the words "feiscint" and "feiceáil(t)" in the dialects (p. 125 in Wagner 1.0).  I searched for both words in the corpus (there are hundreds of examples), and I was able to link them to the authors who used them, and then mark the authors' birthplaces on a Google Maps map.  That is exactly what I have done here:


 

If you click the small box at the top left of the map, you will see two "layers" in the map: one with the data from Wagner 1.0 and another with the extended data from the Academy's corpus, so that it is easy to compare them.  Just untick the box next to a layer to hide that layer.
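The join behind such a map can be sketched in a few lines: corpus hits are grouped by word variant, and each hit is tagged with the birthplace coordinates of its author. This is only an illustration under stated assumptions. The function name, the input shapes, and the sample authors and coordinates are all hypothetical, and nothing here queries the corpus, ainm.ie, or logainm.ie.

```python
from collections import defaultdict


def variant_points(hits, birthplaces):
    """Group corpus hits by word variant and attach each author's birthplace.

    hits: iterable of (author, word) pairs found by a corpus search.
    birthplaces: dict mapping author -> (lat, lon); authors whose birthplace
    could not be found are simply absent from the dict.
    """
    points = defaultdict(set)
    unmatched = set()
    for author, word in hits:
        coords = birthplaces.get(author)
        if coords is None:
            unmatched.add(author)  # no known birthplace, so nothing to plot
        else:
            points[word].add((author, coords))
    return dict(points), unmatched


# Made-up example data, for illustration only.
hits = [("Author A", "feiscint"), ("Author B", "feiceáil"), ("Author C", "feiscint")]
birthplaces = {"Author A": (52.1, -10.3), "Author B": (55.0, -8.3)}
points, unmatched = variant_points(hits, birthplaces)
```

Each key of `points` would become one set of markers on the map, and `unmatched` lists the authors that still need a manual lookup.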

And here are the results when I ran the same process for "práta/fata/préata" (three words for "potato"):




This approach is not without flaws, of course.  I may be inviting criticism by calling this project "Wagner 2.0", or by putting the data from the original study on the same map!  But I understand well that an approach as simple as this one (collecting millions of words from a corpus or from the Internet) is not the same as the careful linguistic work that Wagner did in the 1950s.   As one illustration of the problems, I noticed that the word "fata" was used by some author from Rann na Feirste on the map above.  When I looked into it, it turned out to be a passage by the Connacht author Pádraic Ó Conaire in the book "Pádraic Ó Conaire agus Aistí Eile", a book by the Donegal author Seosamh Mac Grianna.  Or, you will see the word "feiscint" in the North, on the border near Strabane (An Srath Bán) in County Tyrone.  This example came from "An Béal Bocht" by Myles na gCopaleen; it was probably a kind of imitation of the Munster accent, for comic effect.  And of course Wagner was looking for the word used in particular contexts, which is something I cannot do with a simple search of a corpus that has no semantic annotation.

Clearly it would be very difficult to avoid this kind of contamination, but honestly I don't mind.   Dialect words are used for various reasons: sometimes for humor or to poke fun, and sometimes, perhaps, to ease communication or to welcome someone in their own dialect. I believe that complexity of this kind is always part of language use, and that we need to engage with it rather than ignore it.  And the use of Irish on the Internet is probably more complicated still!  Although I really like the name "Wagner 2.0", as a tribute to one of my favorite books, what is being created here is an entirely different animal!

 

Wagner 2.0

As usual, I have no grant for this project, nor much time to spend on it, so I will be relying on you, the online Irish-language community, to help me out.  If you would like to take part, send me an email (kscanne at gmail dot com) and give me the name of your home town, or the place where you acquired your Irish.  If you have a blog or a Twitter account, I would be grateful for a link to the blog and/or the name of your Twitter account.  I will do everything else!  I will not use your name or personal details at all; I only want to do one thing: link Irish words to points on a map.

Acknowledgments

I am very grateful to Mícheál Johnny Ó Meachair for many, many conversations on this topic and for his sensible suggestions, and to Michael Bauer for the maps on his beautiful site faclair.info, which inspired me to take on this crazy project!

Addendum

More maps. Merry Christmas to you all!









2014-04-29

Social media in bilingual environments: online practices of Frisian teenagers

   The following is a guest post by Lysbeth Jongbloed, researcher at the Fryske Akademy, specializing in the use of the Frisian language in social media.  We're grateful to Lysbeth for taking the time to share her research with us!



Lysbeth Jongbloed
Most of you probably know the Netherlands for its tulips, clogs, or Amsterdam. Most people in the Netherlands speak Dutch, a West Germanic language. However, in the north of the Netherlands, in the province of Fryslân, we speak a different language: Frisian. Alongside Dutch, Frisian is the second officially recognised language in the Netherlands. In Fryslân, Frisian and Dutch have equal legal status; in practice, however, Dutch is the dominant language in many domains, and in many schools education in Frisian is rather limited. It is estimated that Frisian is the mother tongue of around half of the Frisian population, roughly 350,000 people. Frisian is mainly a spoken language: while 85% of the population can speak the language, only 12% indicate that they can write it well (De Fryske Taalatlas, 2011).

Frisian Twitter conversations; map by Indigenous Tweets
Research in Fryslân

In Fryslân, the Mercator Research Centre and the Fryske Akademy carry out fundamental and applied research in the fields of the Frisian language, culture, history and society. One of the current projects studies language use on social media. The expectation is that social media offer chances for minority languages to increase their vitality.

In 2013 and early 2014 the Mercator Research Centre received financial support from the Province of Fryslân and the municipality of Leeuwarden (capital of Fryslân) to research the language use of Frisian teenagers between 14 and 18 on social media. The outcomes of this research are discussed below. Are you also studying the use of your minority language on the internet? We are interested in setting up an international network so we can compare results and initiate European-funded projects in the future. Read more about these plans at the end of this post.
#frysk was the #1 trending topic in the Netherlands for 7 hours on April 17th
WhatsApp most popular social media platform

Twenty Frisian schools for secondary general and vocational education participated in the research. As a result, over 2,000 Frisian teenagers filled in an extensive questionnaire. Almost all Frisian teenagers (98%) use social media. 95% of the teenagers use WhatsApp (a cross-platform mobile messaging app), 86% use Facebook and 76% use Twitter. Of the three, WhatsApp is used most: 47% chose the answer 'only when I am asleep, I do not check WhatsApp'.

Oral rather than written language


In general it can be concluded that Frisian is still an oral rather than a written language. For Frisian teenagers, Dutch is the dominant language in writing. On average, the more formal the medium, the less often Frisian is used. For instance, approximately half of the Frisian-speaking teenagers use Frisian for text messages and WhatsApp. On Facebook and Twitter that proportion decreases to around 30%, and in emails it is 15%. Frisian is used more in personal messages than in public or group messages.

Phonetic writing

Frisian is often written phonetically. Most teenagers are aware of that but do not mind: 'People will understand what I mean anyway.' Some think it is too much work to add all diacritics, others are not sure when to use them. Furthermore, the influence of Dutch is clearly visible in the teenagers' written language, and so is the use of dialect and abbreviations that are typical of social media. It also often happens that different languages are mixed intentionally.

Teenagers from the ‘Walden’ region use Frisian most on social media
Regional differences

In the province of Fryslân, big differences have been found regarding Frisian language use. In general, Frisian is hardly used in the big cities while it is much more common to use Frisian on social media in smaller towns and in the north-east of Fryslân.

Determining factors

The language one prefers to speak is the main factor determining one's language use on social media. Other factors affecting language choice are one's attitude towards Frisian, one’s writing skills in Frisian, and the general attitude towards Frisian at one's school.

Approximately one fifth of the Frisian-speaking teenagers never use Frisian on social media. The main reason is that they find it difficult to write Frisian, but it also has to do with their surroundings not being Frisian and with their own attitude towards Frisian.

Qualitative Twitter research

Besides mapping language use of Frisian teenagers by means of a questionnaire, I also studied tweets of 50 Frisian teenagers. The 50 teenagers for the Twitter research were selected from the participants of the second ‘Fryske Twitterdei’ (Frisian Twitter day), which was organised on April 18th 2013 by the organisation ‘Praat mar Frysk’ (Do speak Frisian). During this day people were encouraged to send Frisian tweets in combination with the hashtag Frysk. The whole day #Frysk was a trending topic in the Netherlands, and almost 10,000 tweets were sent with the hashtag Frysk. Per participant, their last 50 tweets before the Twitter day, their tweets on the Twitter day, and their first 50 tweets after the Twitter day were analysed: in total over 6,000 tweets.
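The sampling scheme described above (the last 50 tweets before the Twitter day, every tweet on the day, and the first 50 after it) can be sketched as follows. This is an illustration only: the function name and the `(day, text)` tuple format are assumptions, and it is not the code used in the study.

```python
from datetime import date

EVENT_DAY = date(2013, 4, 18)  # the second Fryske Twitterdei


def sample_window(tweets, event_day=EVENT_DAY, n=50):
    """Split one user's tweets into before / on / after an event day.

    tweets: list of (day, text) tuples, where day is a datetime.date.
    Returns the last n tweets before the day, all tweets on the day,
    and the first n tweets after the day, each in chronological order.
    """
    ordered = sorted(tweets, key=lambda t: t[0])
    before = [t for t in ordered if t[0] < event_day][-n:]
    during = [t for t in ordered if t[0] == event_day]
    after = [t for t in ordered if t[0] > event_day][:n]
    return before, during, after
```

Comparing the share of Frisian tweets across the three groups then shows both the size of the Twitter day effect and how quickly it fades.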

Share of Frisian tweets

The analysis shows that on regular days, just over 10% of the tweets were in Frisian and 65% were in Dutch. On the Frisian Twitter day, 53% were in Frisian and 29% in Dutch. Although the Twitter day has a strong upward effect on the use of Frisian in tweets, the effect is not long-lasting.

Variables of influence on language choice


Variables of influence on language choice are the type of tweet and gender. The proportion of Frisian is highest in messages addressed to a particular person. On regular days 25% of those tweets are in Frisian. On the Twitter day the proportion doubles to almost half. The use of Frisian in other types of messages rises from under 10% to over 50%. In the analysed sample, the male teenagers tweet much more in Frisian than their female counterparts.

Every Wednesday @praatmarfrysk tweets a Frisian poem. On April 16th it was a poem about the Twitter Day.
Frisian Twitter day 2014

Last week, on April 17th, the third Frisian Twitter day was organised, and again it was a big success: it was a trending topic in the Netherlands for the whole day, and for seven hours it was even the number one trending topic. Over 6 million people saw #Frysk or #frysketwitterdei on their timeline, and tweets came from over 25 countries.

Further research


The Province of Fryslân has granted a new subsidy to the Mercator Research Centre of the Fryske Akademy to carry out further research into Frisian language use on social media in 2014 and 2015. In particular, we will ask which dynamics in a multilingual society lead to the use or non-use of a minority language on social media. To answer this question, we are also looking for partners in other minority language regions with whom we can compare research outcomes. We would also like to build an expert network to initiate European-funded projects in the future. Please contact @lysbeth2_0 if you are interested in participating. For more information about the Frisian Twitter day, you can contact @praatmarfrysk.

2014-02-27

Indigenous Tweets #IMLD14 Roundup

Last Friday, February 21st, was International Mother Language Day, a celebration of linguistic diversity originally created by UNESCO in 1999.  This year, together with Rising Voices and the Living Tongues Institute, we tried to encourage people to tweet in their native language using the hashtag #imld14 (#dilm14 in Spanish).  We were thrilled with the response, and you can see some of the many tweets by searching for #imld14 on Twitter, or by checking out the Storify created by Laura Morris from Rising Voices.

For fun, I looked specifically at tweets written in any of the 157 languages we're tracking on the Indigenous Tweets site. In all, there were 491 tweets containing #imld14 or #dilm14, written in 31 of the 157 languages.  Leading the way was Gàidhlig with 158 tweets, followed by Aragonese with 74, Ojibwe/Nishnaabemwin with 45, Malagasy with 41, and Irish/Gaeilge with 28.
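A tally like this is a simple count of hashtag matches per language. Here is a minimal sketch, assuming each tweet has already been labelled with its language; the function name and the `(language, text)` input format are illustrative and do not come from the Indigenous Tweets codebase.

```python
from collections import Counter

HASHTAGS = {"#imld14", "#dilm14"}


def count_by_language(tweets):
    """Count tweets per language that contain one of the tracked hashtags.

    tweets: iterable of (language, text) pairs.
    """
    counts = Counter()
    for language, text in tweets:
        # Compare case-insensitively and ignore trailing punctuation.
        words = {w.lower().strip(".,!?") for w in text.split()}
        if words & HASHTAGS:
            counts[language] += 1
    return counts
```

Calling `count_by_language(tweets).most_common(5)` would then list the five most active languages.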

One of the primary goals of the Indigenous Tweets project is to get people to use their language every day on Twitter and other social media sites.    We hope that a few of you who did this for the first time for #imld14 will continue to tweet in your native language and encourage others in your community to do the same.

For additional inspiration, we'll close with a sampling of tweets in a few other languages.   Looking forward to an even better turnout for #imld15!!

Chichewa:
Nahuatl:

Manx Gaelic:

Lezgian:
Karuk:
Nez Perce:
North Sámi:
Māori:




2013-12-29

Mapping the Celtic Twittersphere

Over the last couple of weeks I've created maps showing the Twitter conversations taking place in the Irish, Basque, and Māori languages.  The inspiration for this came from an email conversation with Paora Mato from the University of Waikato in Aotearoa, who has co-authored (with Te Taka Keegan) an excellent analysis of the Māori Twitter community based on data from Indigenous Tweets (forthcoming).   Since people seemed to enjoy the maps I decided to do similar ones for the other Celtic languages (Welsh, Scottish Gaelic, Manx Gaelic, Cornish, and Breton) which you'll find below.

Welsh language Twitter conversations (CC-BY-SA)
These maps were all created in more-or-less the same way.  I started with the lists of people tweeting in each language from the Indigenous Tweets site – the site includes everyone tweeting in the smaller languages like Breton, Cornish, and Māori, and the top-500 most active users for Irish, Basque, Welsh, etc.

Irish language Twitter conversations (CC-BY-SA)
Next, a small percentage of Twitter users have geolocation activated for their tweets, which means that when they tweet from a mobile device, a latitude and longitude are recorded in Twitter's database along with the tweet.  These coordinates are then accessible to developers like me through the Twitter API.  For users without geolocation activated, I just collected the (self-reported) location from their Twitter profile, canonicalized the placenames, and looked up the lat/longs in a database.  For these users, I assumed that all of their tweets were sent from the resulting location.  This means, for example, that all tweets from people whose profile location is set to "Dublin", "Baile Átha Cliath", "BÁC", or variants thereof will appear to come from one particular location near the center of the city – whatever's in the database (as it happens, it's the Dublin Spire).   This isn't really a problem since I'm only interested in creating maps at the level of countries or continents.
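The fallback logic described above can be sketched as follows: prefer the coordinates attached to a tweet, and otherwise canonicalize the profile location and look it up. The `ALIASES` and `GAZETTEER` tables here are tiny hypothetical stand-ins for the hand-built placename list and the coordinate database, and the Dublin coordinates are approximate.

```python
import unicodedata

# Hypothetical alias table and gazetteer, for illustration only.
ALIASES = {"baile atha cliath": "dublin", "bac": "dublin"}
GAZETTEER = {"dublin": (53.3498, -6.2603)}  # approximate, near the Spire


def canonicalize(name):
    """Lowercase, strip accents and punctuation, then map known variants to one key."""
    decomposed = unicodedata.normalize("NFD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in plain.lower())
    key = " ".join(cleaned.split())
    return ALIASES.get(key, key)


def resolve(tweet_coords, profile_location):
    """Prefer per-tweet coordinates; otherwise fall back to the profile location."""
    if tweet_coords is not None:
        return tweet_coords
    if not profile_location:
        return None
    # Returns None when the canonical name is not in the gazetteer.
    return GAZETTEER.get(canonicalize(profile_location))
```

With these tables, `resolve(None, "BÁC")` and `resolve(None, "Baile Átha Cliath")` both land on the same point, while a free-text profile such as "American ex-pat living in Galway" returns `None` and would need manual handling.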

Scottish Gaelic Twitter conversations (CC-BY-SA)
Canonicalizing the placenames takes a bit of manual labor, for a few reasons.  First, sometimes people will give their location in their profile as something like "American ex-pat living in Galway", and the geolocation services I've tried usually fail on strings like this.  Second, many people tweeting in indigenous or minority languages give their location in their native language, and for languages like Welsh, Cornish, Māori and so on, these names are often missing from geolocation databases.  Finally, there are misspellings and other noise in people's profiles that are best handled manually.

Scottish Gaelic, Great Britain and Ireland only (CC-BY-SA)
So at this point I have good coordinates for between 50% and 60% of the users listed on the Indigenous Tweets pages.  I then gather all tweets from the database that are in the desired language and in which one user "mentions" another.  If I have coordinates for both the sender and the mentioned user, I simply draw an arc of a great circle on the map connecting the two points.  I rendered the maps using the statistical package R, which has libraries that make this sort of thing very easy (nice tutorial here, for example).
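For readers curious about what such a library does under the hood, here is a minimal sketch in Python (not the R code used for the maps) that computes intermediate points along a great circle using the standard spherical interpolation formula. The function name and the `n` parameter are illustrative choices.

```python
import math


def great_circle_points(start, end, n=50):
    """Return n + 1 (lat, lon) points in degrees along the great circle from start to end.

    The result is not well defined for exactly antipodal endpoints.
    """
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    d = 2 * math.asin(math.sqrt(min(1.0, h)))
    if d == 0:
        return [tuple(start)] * (n + 1)
    points = []
    for i in range(n + 1):
        f = i / n
        # Weights of the two endpoints at fraction f of the way along the arc.
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(lat1) * math.cos(lon1) + b * math.cos(lat2) * math.cos(lon2)
        y = a * math.cos(lat1) * math.sin(lon1) + b * math.cos(lat2) * math.sin(lon2)
        z = a * math.sin(lat1) + b * math.sin(lat2)
        points.append((math.degrees(math.atan2(z, math.hypot(x, y))),
                       math.degrees(math.atan2(y, x))))
    return points
```

Plotting the returned points as a connected line on a map projection gives the curved arcs seen in the maps.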

It's very common for a large number of conversations to take place between two specific points.   For example, there have been 5878 Welsh language tweets sent from Caerdydd that mention a user in Caernarfon, and 1519 Irish language tweets sent from An Cheathrú Rua that mention a user in Baile Átha Cliath.  In such cases, I've scaled the brightness of the arcs so that these frequent paths show up more prominently on the maps.
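One way to do this kind of scaling is sketched below. The logarithmic mapping and the `floor` parameter are assumptions made for illustration, since the exact scaling used for the maps is not specified here; a log scale simply keeps a path with thousands of tweets from washing out every other arc.

```python
import math
from collections import Counter


def arc_brightness(mentions, floor=0.1):
    """Map each (sender_place, mentioned_place) pair to a brightness in [floor, 1].

    mentions: iterable of (sender_place, mentioned_place) pairs, one per tweet.
    """
    counts = Counter(mentions)
    if not counts:
        return {}
    top = math.log1p(max(counts.values()))
    # Log scaling (an assumption): the busiest path gets brightness 1.
    return {pair: floor + (1 - floor) * math.log1p(c) / top
            for pair, c in counts.items()}
```

The resulting value could be used as the opacity (alpha) of each arc when drawing.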

Breton language Twitter conversations (CC-BY-SA)
I'm not a linguist or sociolinguist so it's not really my place to draw conclusions about linguistic geography, language vitality, or anything else from these maps. It's best to leave this to members of the language communities themselves, who will have the best understanding of the local situation.  That said, I want to address a couple of issues people raised on Twitter after I posted the Irish, Basque and Māori maps.

Cornish language Twitter conversations (CC-BY-SA)
The most striking thing about the Basque map is how compact it is geographically, especially when compared to the Irish map where we see many conversations between Ireland, North America, continental Europe and even Brazil.  In contrast, all of the Basque conversations take place within the Basque Country, roughly speaking.   And the Welsh map, which appears here for the first time, looks much more like the Basque map than the Irish one, with just a small percentage of tweets involving a user outside of Wales, most of those to and from London.  Does this mean that somehow Irish is a more "international" language than the other two, or that the Irish-speaking diaspora is more engaged with the language?  It might, but more careful research would be needed to establish this.  My guess is that the Welsh and Basque communities look more compact in part because I'm only displaying the top-500 users in each case.  Since these languages have such vibrant communities on Twitter, the bar is set extremely high to make it into the top-500 tweeters (currently, the 500th most active tweeter in Welsh has 1073 tweets in the language, for Basque the number is 1958, but for Irish it's just 176), and I expect that users with thousands of tweets in the language are more likely to live in the traditional homeland where the language is still used on a daily basis by the local community.
Manx Gaelic Twitter conversations (CC-BY-SA)

A word or two regarding the Manx map.  Of the six Celtic languages, Manx has the smallest number of users on Twitter and probably the smallest number of speakers also.   Several users have "Isle of Man", "Ellan Vannin" (or variants thereof) as their location (and no more specific location on the island).  Because of this, I normalized all locations on the island to a single lat/long, and therefore (disappointingly) the map doesn't show what I expect is actually an interesting network of communication taking place on the island; instead it just shows the conversation pathways between the island and three users off the island.

Finally, a word about privacy.   I haven't plotted locations at a granularity finer than a city or town except in cases where users have explicitly activated geolocation for their tweets.  And even in those cases, since the maps are at a pretty large scale, it's impossible to pinpoint the exact location of any particular user.  That said, not everyone will be so scrupulous with your data, and if the idea of a stranger plotting your movements on a map creeps you out (I think it should), you should deactivate geolocation on your Twitter account (under Settings, go to "Security and Privacy", and then make sure the box next to "Add location to my tweets" is unchecked).  If you don't want anyone to know where you are at all, you can also remove your location from your Twitter profile (Settings → Profile → Location).   And if you don't want sites like Indigenous Tweets to have access to your tweets at all, the easiest solution is to make your Tweets private (Settings → Profile, and tick the box next to "Protect my tweets").




2012-10-15

Facebook in your language

It's been a long time since I posted anything here.  The Indigenous Tweets project is still going strong, and the number of languages we're tracking on Twitter continues to grow - we added the 138th and 139th languages (Inari and South Saami) to the site a couple of weeks ago.  Last week, the team at Twitter was nice enough to feature Indigenous Tweets on their "Twitter Stories" site; you can read that piece here.

Since January, I've spent a lot of time working on another project aimed at encouraging indigenous language groups to use their languages in social media.  What we're trying to do is produce translations of Facebook's interface (the menus, navigation, etc.) into as many languages as possible.

You may be aware that Facebook has a nice system in place that allows volunteers to translate the site into about 100 different languages, including a number of languages that we care about here, like Irish, Cherokee, Northern Sámi, and Aymara.  This is about the same as the number of language teams currently translating Mozilla Firefox (105) and somewhat less than the number of languages the Google search interface is available in (150). 

The trouble is, neither Facebook nor Google has added any new languages to their translation systems for quite a while.  In the case of Google, this is stated explicitly in their translation FAQ: "Right now, we're unable to support more languages in GIYL".  We haven't been able to reach anyone at Facebook about this, but we've heard second-hand that they have had problems with spam translations and poor quality from some of the smaller translation teams.  Whatever the reason, there are hundreds of language groups out there actively using Facebook to communicate in their language, but who are forced to use the site in English, Spanish, etc.  This flies in the face of Facebook's stated aim to "make Facebook available in every language across the world".

To solve this problem for his own language of Secwepemctsín, the late Neskie Manuel came up with a clever solution using a technology called Greasemonkey.  His code acts as a kind of "overlay" that runs in your web browser; as you navigate pages on Facebook, they are sent across the network to you in English, but then can be translated on the fly in your browser.
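The core of such an overlay is a lookup table from English interface strings to their translations, with anything untranslated left in English. The sketch below shows only that lookup logic, in Python for readability. It is not Neskie's code: a real Greasemonkey script is JavaScript that runs in the browser and applies the lookup to the text of the page. The function name and the placeholder strings are hypothetical.

```python
def make_overlay(translations):
    """Build a function that swaps known interface strings for their translations."""
    # Normalize keys once so lookups ignore surrounding whitespace.
    table = {english.strip(): local for english, local in translations.items()}

    def translate(text):
        core = text.strip()
        if core not in table:
            return text  # untranslated strings fall through in English
        # Replace only the core so leading and trailing whitespace is preserved.
        return text.replace(core, table[core], 1)

    return translate


# Placeholder strings only; a real table would hold the community's own translations.
overlay = make_overlay({"Like": "<Like in your language>",
                        "Share": "<Share in your language>"})
```

Because unknown strings pass through unchanged, a partial translation still works: the translated messages appear in the target language and everything else stays in English.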

At one level this is just a "hack", and even Neskie viewed it as a temporary workaround: "It would be good to be able to use the official Facebook Translations App, but Secwepemctsín isn’t listed. Until then, we can use this script."  Personally, I think it's a bigger, more important idea than that.  What it means is that any language group can undertake a translation without having to wait for Facebook's approval or permission, and the same approach works in theory for Google or other popular web sites that aren't open to translation.   I've been working on open source software translations for more than ten years, and have contributed to the Irish translations of Mozilla Firefox, LibreOffice, KDE, etc.  I've strongly advocated [PDF] for an open source approach among indigenous language groups who are just starting out on software translation, because it means that the community itself can maintain control and ownership of their work, instead of having to rely on the goodwill of a big, for-profit corporation.  The trouble we're facing now, however, is that more and more of the software we use is "software as a service": Gmail instead of Mozilla Thunderbird, Google Docs instead of LibreOffice, etc., or social media sites like Twitter and Facebook.  This trend puts control of the online "linguistic landscape" firmly back in the hands of big corporations.  Neskie's approach gives us a way to maintain a measure of control over the language we choose to use online.

The response to this project has been overwhelming.  More than 60 different language groups have started translations, and we already have more than 30 that are in a usable state.  About two-thirds of these languages are endangered according to the UNESCO Atlas of the World's Languages in Danger, and in the majority of cases, I'm aware of no previous efforts to translate software into the language.

Doing a "complete" translation is quite easy.  Depending on how much terminology you have to make up, it can take as little as a couple of hours of work. I've picked out around 200 of the most common messages that appear on Facebook to be translated.  Of course this is only a small fraction of the entire site (which would be overwhelmingly large for a small language group to undertake), but by choosing these 200 messages carefully, we're able to achieve a convincing immersive experience in the target language with a minimum of effort.

There are a few technical terms needing translation (e.g. "Mobile Uploads", "email address", "Apps", "Cookies"), some site-specific jargon ("to like/unlike", "to poke someone", "status update"), and western concepts that have been difficult to render in some indigenous languages ("Privacy", "Advertising").   A useful technique for terminology creation is to see how other languages have dealt with a given concept.  To help with this, I've asked everyone who has contributed a new Facebook translation to also provide "back translations" of some of these tricky terms into English, in the hope that some of these might be helpful to new translators.   These back translations are stored on the project wiki, and we welcome additional contributions in any language.

I should also say that you don't need to translate all 200 messages if you don't want to.  For a language that is rarely, if ever, seen on the computer, I think there's great symbolic value in even a translation of just a few key words, for example "Like", "Unlike", "Comment", and "Share".

Would you like to try translating Facebook into your language?  Leave a comment below and I can send you detailed instructions!