Existing “lingua francas” – some history – an international language – Part 3

History

In the previous part, we have seen that existing languages can be fairly complex. Besides, we cannot speak all languages spoken by others. Clearly, from the beginning of language, people have needed to communicate with “others who don’t speak your own language”.

Many existing languages have served as a common tongue between different civilizations. In fact, such languages emerged in history in every region that had enough trade going on, and where communication was necessary.

In the West, during the Middle Ages, traders around the Mediterranean spoke a language called “Lingua Franca”. The term now means “common language”: a language spoken by people who otherwise wouldn’t understand one another. All over the world, groups that had strong interactions used some variations of regional languages as “lingua franca”, such as Chinese, Sanskrit, Arabic, Swahili, Latin, Quechua, and many others.

Vulgar Latin served as the “lingua franca” for the whole European continent for centuries. It is actually still in use in the scientific world, where Latin is the root for botanical and anatomical names.

After the Renaissance in Europe, there were many attempts to devise simplified languages, mostly for scientists from different countries who couldn’t communicate across Europe because they spoke different languages.

Volapük

In the late 19th century, Schleyer, a German priest, created a language called Volapük, which he hoped would become an international language. His language gained great popularity very quickly, with over a million enthusiasts, but it was short-lived.

Among Schleyer’s followers, some wanted to make changes to the language. However, he wanted to keep it “pure and unmodified” – and also didn’t want the credit for creating the language to slip from his hands. As a result, many “unofficial” variants of the language appeared. Different factions started arguing and claiming that their version was “the best”.

Power is a treacherous thing. Every promoter of their own version fought with others. Many branches of the language appeared, and this is probably one of the main causes for the destruction of this language. I insist on this part of the story because it explains what followed with Esperanto, and is still visible to this day.

Esperanto

10 years after the creation of Volapük, a Polish ophthalmologist, Zamenhof, published a book under the pseudonym “Doktoro Esperanto”, describing what would become the language “Esperanto”. He had been working on creating an international language for more than 15 years. Indeed, he believed that humanity needed a common language to be at peace. Many Esperantists (people who speak and promote Esperanto) still have this same goal in mind: a unifying language for the world.

I do admire the goal, and I absolutely share the vision. Not being able to understand one another creates division. And certainly, the “divide and rule” tactic is working partly because we all speak different languages.

Fear and rigidity

However, like many other people, I find that the Esperanto language is lacking. In fact, Zamenhof himself thought that his language could be improved. However, many Esperantists feared that modifying the original language would lead to a “Volapük schism” which would eventually destroy Esperanto. A large majority rejected Zamenhof’s own reform when he proposed improvements to the language. This fear is still felt by many Esperantists today.

Although I do understand the fear, I feel it rests on the wrong grounds. Volapük did not disappear because it was modified by others. It disappeared because Schleyer would not allow anyone to modify it into a better version, which provoked the split that eventually led to its demise. The thing is: the inevitable cannot be stopped. Trying to stop a natural evolution makes things worse, not better. From a shallow point of view, it looks like schism destroyed Volapük. However, when looking deeper, the root cause was rigidity. I fear the same might happen with Esperanto.

Granted, some reforms did take place in Esperanto. But they are very minimal and insufficient, at least from my point of view. On the other hand, I do share the view of many Esperantists that an international language shouldn’t go through a reform every month. I also agree that “better” is often the enemy of “good”. However, what is clearly broken and/or creates strong negative emotional reactions cannot be widely adopted and creates resistance. Esperanto is the oldest of the major conlangs still surviving today, and it has benefited from several waves of hype over time. Yet it never really took off, and today we are still very far from it being a common international language.

Ido

Very early on in the life of Esperanto, many people shared the view that it needed some extra touch. Branches of the language appeared despite all the efforts of Esperantists to stop anyone from modifying the language for fear of a schism. Those branches split off forcefully from the main body of Esperanto speakers, since the latter showed no tolerance. One of them is called Ido, a “revised Esperanto” (“ido” in Esperanto means “offspring”).

Frankly, I find that Ido feels like a patch. It does solve some of the problems I see in Esperanto, but it is still lacking. Quite a lot. We will see that in more detail in the next posts. It is an honest attempt at fixing many aspects of Esperanto that many people judge negatively, but I believe it is not sufficient. It is just like fixing a broken house with tape.

And indeed, the result is here: despite being around for quite a long time, Ido is at a standstill. In fact, it is far behind Esperanto in terms of number of speakers. If it were as good as it claims, I believe it would have overtaken Esperanto by now.

Glosa

In the middle of the Second World War, a scientist, Lancelot Hogben, devised the basis for an international language during his idle hours. He published a book called “Interglosa”, mostly aimed at language teachers, confident that people would pick up his language immediately. However, people at the time had other things on their minds, with the war going on. His introductory manual for the language never received any attention.

Almost 20 years later, another scientist, Ron Clark, found Hogben’s book in a second-hand book shop. He read it and immediately found it fascinating. He was soon joined by Wendy Ashby, and they worked with Hogben, who was still alive, to improve the language. However, Hogben died in 1975. Clark and Ashby founded Glosa after some further modifications of the language.

I find Glosa much better than any previous attempts. After all, Hogben benefited from many failures before him, so he definitely had an advantage. Here are some of the core features of Glosa:

  • it is fully phonetic, every single printed character is pronounced in one way, and vice-versa,
  • unlike Esperanto, words do not change: they can serve multiple purposes by the simple addition of prepositions. This feature eliminates the complex inflections which make it difficult for many people to speak and understand languages – people in general don’t think about what an adjective or an adverb is when they speak!
  • the vocabulary is limited to 1000 words,
  • roots for words are exclusively taken from Greek and Latin.

However, like Esperanto or Ido, Glosa is still not widespread. Granted, it is much younger and didn’t benefit from big waves of hype as Esperanto did.

Toki Pona

There are many other constructed languages which aim to be an international language. It would be impossible to list them all. I’ll just present a last one, which is quite fun and intriguing.

Toki Pona was created by a Canadian, Sonja Lang, whose aim wasn’t exactly to create an international language, but rather a “minimalist” language. It was a tool to help organize her thoughts. Indeed, the official Toki Pona vocabulary contains no more than 120 words. In fact, the philosophy is close to Northern minimalism, which along with her pen name led me to believe for quite some time that Sonja Lang was a Swede!

Similar to the founder of Esperanto, the inventor of Toki Pona had something in mind: doing good. While the word “Esperanto” means “the one who hopes”, “toki pona” means “the language of good”.

Besides, I particularly like some of Toki Pona’s features. For instance, the sounds have been selected so that basically all people in the world can read, understand and pronounce the language easily. This is nice! Besides, it can be written in plain ASCII without any diacritics.

Minimalism comes at a price

However, minimalism comes at a price. Although a few root words may be sufficient to express very simple things, it becomes very difficult when you need to express more complex thoughts. You need to become extremely descriptive, like a two-year-old child who doesn’t know his vocabulary yet.

For instance,

The student learns history from the teacher

becomes

The one who studies learns the communicated chronology that passed from the person whose job is to instruct others.

As a primary tool for communication, this can become very tedious. Better than nothing, for sure. Indeed, the original creator’s intention is to develop creativity and imagination.

However, if you do want to be understood by other human beings, you need a consensus. Everyone must use the same metaphors to be understood. This in turn means that you have to learn vocabulary – or in this case idioms – just like in any other language. If you don’t, you might scratch your head when someone mentions a “confident bird”. Could it possibly be one of the following?

Actually, it is the metaphor people generally use for an “eagle” in Toki Pona. But if you haven’t come across the expression yet, any one of those above could certainly qualify as a “confident bird”:

Besides, others would probably also scratch their heads when you mention a “confident bird from the Andes”, although that one could come very easily as a “condor” for someone who is familiar with Andean culture.

Conlangs are also biased

Created languages, or “conlangs”, also suffer from the biases of their creator(s). Just because someone created a language with “goodwill” doesn’t mean the language itself is good, easy or even usable for communication. We’ll examine how easy existing conlangs are to learn and to communicate in, in part 5.

Does a “lingua franca” replace local languages?

To conclude this post, I would like to address this very sensitive topic. Language is deeply connected to the culture of the people who speak it.

Languages politically forced on populations destroy dialects

This is a common fear, due to a very big misunderstanding of what a “lingua franca” should be. Since English has become the international language, many believe that an international language always tries to force itself on people and aims to replace all languages.

In many countries where a language was forcefully introduced as “the common lingo”, this has indeed been the case.

In my native France, the French Republic has spent enormous effort over the last 250 years trying to kill all regional languages, forcing people to speak French and ditch their local language or dialect. All this in the name of “unity”. And this strong will to eradicate dialects is still very much alive within the Parisian administration today – and, sadly, it has succeeded quite well. In June 2021, it passed a new law to restrict any teaching of regional languages at school. But that is a political agenda, not “good will” to “help people communicate”. Instead, it is a tool for a centralized authoritarian administration to control the population.

In the same way, some other languages did replace the local dialects. But in most cases, those were the results of military conquest. Indeed, Vulgar Latin replaced many local languages in Europe during the Middle Ages. But that was the result of the colonization of territory by the Romans. Similarly, the Incas pushed Quechua on the people they conquered. With the Spanish takeover of South America, it grew even more because the Spaniards didn’t want to have translators for every single local language. Again, it was a military invasion. And after all, the language of the conquerors, Spanish, replaced Quechua to some extent.

English also comes with the conquering American mentality: conquering through music, a huge film industry, fast food and many other aspects. It is not the language itself which endangers others, but the associated culture.

This is not what an international language is for.

Lingua francas don’t kill other languages

As I have mentioned, many “lingua francas” emerged naturally in history, and rarely ever replaced or killed other languages. The trading language of the Mediterranean called “Lingua Franca” never replaced other local languages or dialects. Chinese outside of China, Malay, Swahili, while being widely understood well beyond the borders of their native speakers, didn’t replace local dialects.

The goal needs to be precise: a “lingua franca” is an “alternative means of communication”. Not a new “unified world” thing. By the way, another name for such a language is “international auxiliary language”. Auxiliary.

I would even argue that providing a constructed language to the world as a lingua franca saves languages rather than endangering them. If you speak Cherokee today, you could just learn the easy lingua franca alongside Cherokee, rather than having to study English. Because learning English demands so much effort, some places, such as parts of Switzerland as well as Finland, are actually switching fully to English, ditching their own language. Which I personally find catastrophic. To keep the diversity, we need very easy access to the lingua franca.

Enough about existing languages. In the next part, we will focus on constructed languages: conlangs.

An international language – living languages are BAD – Part 2

In this part, we will see why any living language makes a very bad candidate to be an international language. Yet, we do need a communication tool across the globe, so let’s see how existing languages can help – or actually create more problems.

In part 1, we have seen that English is a very complex language. This complexity makes it very difficult to learn for many people on the earth. Besides, it is highly ambiguous. All these points make it a very bad “international language”. But it is not the only language with these difficulties. In fact, I state that, in general, an existing living language cannot and should not become an international language.

You might wonder why we can’t safely take an existing living language as an international language. After all, there are many alternative choices, some of them without many of the drawbacks of English, while already widely spoken across different populations: Spanish, Chinese, Hindi, Arabic, French, Swahili…

Obviously, the advantage of picking an existing language is the strong base of speakers who use it without any extra learning effort. Of course, this existing base helps the initial spread of any language, but it comes with some strings attached.

Domination

The first point is a cultural and philosophical one. There is a symbolic meaning when one language of a certain culture is forced upon the world population, as is currently the case with English. Colonizers always imposed their language on their colonies. It clearly means: “You need to make the effort of learning my language, but I certainly won’t make any effort learning yours. And that’s because I’m superior to you.”

Yup. That’s it. Racism at its worst.

So for this reason alone, any existing language as a “lingua franca” (a term I will use a lot in the future – meaning a common language) is simply a no-no. Should we still continue the discussion at this point? Maybe not. But just for the fun of it, there are other reasons why any existing “living” language is not a good candidate.

Languages are… a mess

There are many other practical reasons why living languages are totally inadequate to serve as an international language.

They have grown randomly

A language changes mostly through usage. There are many reasons why languages evolve. But they generally do when something is considered “inefficient” by the social group that uses it. And that fills the language with exceptions over time, which makes it more difficult to learn for people who were not born within that social group.

Vocabulary

New words appear when new objects or concepts appear. This can be the case with technological, scientific or philosophical advances, for instance.

Very often, languages also take “loanwords” from other languages. “Oh, this word doesn’t exist in my language, but it does in that other language. That’s fun! Let me borrow it!” In the meantime, another word from your language might have actually done the job quite well – but you didn’t think about it. Typically, in France, everyone speaks about a « week-end » while in Quebec it is a « fin de semaine ». On the other hand, what is « une job » in Quebec is « un travail » or « un emploi » in France.

English is full of loanwords from Latin and French dating from the Middle Ages, especially nouns. For instance, almost all words ending in “ion” are French words (adoption, lion, explosion, etc.), spelled exactly the same, pronounced slightly differently. On the other hand, many verbs come from Germanic and Scandinavian languages: “to run” is “rennen” in German, “to drink” is “trinken” in German, etc. And the irregular verbs come directly from Germanic languages. In German, “to sing” is “singen”, also an irregular verb which becomes “singt – sang – gesungen”. Sounds familiar, doesn’t it?

Phonetics

The way syllables and morae are pronounced plays a huge role in the evolution of languages. Whenever something is lengthy, difficult to pronounce or judged ambiguous, usage changes to correct this perceived fault.

“It is” becomes “It’s”. “Getted” becomes “Get”. “Pronounciation” becomes “Pronunciation”. “Logique” becomes “Logic”. And “a apple” becomes “an apple” because pronouncing two a’s in a row is “breaking the flow” of speech. Almost all languages go through these changes, which have some logic to them from the point of view of the social group that uses this language, and occur more often on words that are most used in daily life.

New meanings

Old words can also get new meanings. And before you know it, the vocabulary changes quite a lot. And again, it is daily life and daily usage that affect the language most. Common words are often transformed, whereas literary words change more rarely. An English “plate” is a French « assiette » (nowadays, the French word « plat » means “flat”… which makes a lot of sense, doesn’t it?) and an English “tree” is a French « arbre » (whose root can be found in the English word “arborist”, for instance). But the English “sediment” is also the French « sédiment » and “vernacular” is also « vernaculaire ». Of course, there are many counter-examples of this, but the point is that, as a general rule, what is used more often is a more likely target for changes to “simplify” the language.

This phenomenon also causes exceptions to occur within the vocabulary and expressions that are mostly used daily by everyone.

They adapted locally

Languages also adapt to local circumstances. If you’re a tribe living in a hot climate and near the sea, you have very little use for the word “snow”. There is no snow in your environment, and you probably don’t even know what it is. So you don’t need a word for it. On the other hand, you certainly need to name very precisely every species of fish and sea animals, in order to know whether you’re speaking of a predator or a prey, or whether that thing is edible or not. You probably also need very specific vocabulary for water currents, tides, waves, wind and other sea-related concepts, which can actually play a role in your survival.

However, when you live in the mountains up North, far from the sea, you don’t even know what the “horizon” looks like, you’ve got a mountain in front of you! However, you have plenty of snow, and it’s very critical and sometimes a matter of life and death that you describe precisely the type of snow that is on the ground today. You may need to describe accurately if it is icy, sticky, slippery, at risk of causing avalanches, whether it covered animal tracks, etc.

If you’re a tribe of hunters, you don’t need the same vocabulary as farmers or breeders. A sedentary vs nomadic lifestyle also brings its own range of useful words. And an industrial world has different communication needs from a rural one.

Granted, this specialization and optimization for a specific environment makes the language more practical and richer for the people who use it. However, it is not at all adapted to others. Besides, it makes the language more difficult to learn, for no real advantage outside its original environment.

They also adapt culturally

Without going into too much detail, languages also adapt depending on the people’s culture and rituals. More religious or spiritual people will invent lots of words to describe their feelings, mystical events and so on. A monotheistic religion doesn’t bring the same vocabulary as an animist one (in which every being is suddenly brought to life, even a stone).

Whether the culture gives a lot of importance to family, social ranking, and other relationships also brings a richer or poorer vocabulary. Japanese has notably different forms of expression and vocabulary for women and men. One fun fact: after WWII, many American soldiers learned the language from their Japanese girlfriends, and ended up speaking like women, which would trigger quite a laugh from the Japanese. An “elder brother” is called something different in Korean depending on whether the sibling speaking is female (he is then Oppa) or male (he is called Hyeong), while there is no such difference in European languages. Younger brothers also have a word of their own in Korean, rather than being distinguished with “elder” or “younger” as in European languages. And so on.

Ranges of sounds

Any given language has been living within a group of people over time. As a consequence, the sounds it uses have become specialized in such a way that they are very easy to distinguish within this social group. However, we are all different, and every society puts emphasis on different things. Because of that, every social group has come, generation after generation, to select different sounds as “different”. Linguists use various classifications to put every sound into a nice set of categories, such as plosive, fricative, labial, dorsal, and many others. Those categories indicate where and how the sound is produced, with which organ (we don’t only use our vocal cords to produce sounds – the tongue, throat, lips, jaw and larynx play huge roles as well), etc.

Do you imagine an international language with tongue clicks, like in Xhosa (if you watch the video, notice how he pronounces Xhosa… can you do it?)?

If you’re not an African who speaks one of those languages with clicks (and there are many, especially in the South of Africa – Zulu has some too), probably not. If you’ve ever wondered why Japanese people can pronounce “sa”, “ki”, but not “si” (they say “shi” instead), the video above may have given you a clue. Did you notice how the white guy says: “I can make the click by itself, but I can’t do it with the vowel”.

What about a language that uses tones to change the meaning of words, like Mandarin and other Asian languages? If you are not a speaker of those languages, you simply can’t distinguish the different forms of the word ma: a mother (mā), a horse (mǎ), hemp (má), a grasshopper or “to scold/abuse” (mà), a question indicator or a pause (ma), among others. Because of this, the Chinese language allows for pretty cool tongue twisters, such as this one which tells a full story using only the syllable “shi” but with different tones:

As they have evolved within a closed community, living languages are simply too different for people who were raised in a different environment. This makes every single living language difficult for anyone who doesn’t speak it as a native tongue. It’s like asking a musician to learn a computer programming language – or a programmer to learn a musical instrument. I’m not saying it’s impossible, but it’s very difficult because there is so much to learn at once with skills that are hard to acquire as an adult.

Languages are rich – too rich?

So, definitely, the environment shapes languages. Does an international language need to go to such deep extremes? Certainly not. You can afford to be more descriptive when the situation calls for it. This is not your daily tongue. It is an auxiliary one.

Consider the adjective “many” in English: it has a lot of synonyms: diverse, countless, copious, innumerable, manifold, myriad, numerous, plenty, several, various… What about “interesting”: fascinating, engaging, intriguing, thought-provoking, inspiring, titillating, exciting, absorbing, enthralling, curious, captivating, enchanting, bewitching, appealing… and many others.

Granted, most of those adjectives have a very slightly different meaning from all the others. Sometimes they are completely interchangeable. Of course, it can help us write better novels and better poetry, in order not to repeat the same word twice or to convey the exact concept we have in mind – that is, if the reader/listener knows that word… and associates the exact same nuances with it as the writer/speaker. Besides, this adds a considerable amount of vocabulary to learn, for very little gain, if you consider “communication” alone.

Languages move around

I was pointing out in the last paragraph “if the receiver understands the word the same way as the sender”. This becomes especially true when the same language moves from one location to the next.

This can end up being extremely confusing. Consider some examples:

Word / expression Britain America
I can easily jump out the window since I live on the first floor.
Let’s use a dummy to calm down the baby.
Oh, you are a chemist?
Can you check the post please?
‘Going to the bog?

Between the French spoken in Quebec and in France, we have quite a few false friends like these that can actually become extremely awkward. The same goes for Spanish spoken in South America vs Spain (think about “coger”, for instance, which is very normal in Spain but… well, don’t use it elsewhere, prefer “tomar” instead!).

An international language must be simple

If we want people to learn an extra language and be able to communicate, it has to be simple. Its vocabulary has to be limited so that we don’t need to learn tens of thousands of words to start communicating.

It also has to be as unambiguous as possible:

  • vocabulary must have a definite meaning for everyone, and it should avoid synonyms,
  • grammar must be clear and allow as few misunderstandings as possible,
  • sounds must be easy to pronounce and distinct for most people,
  • related to the previous point, it shouldn’t have homonyms: one sound, one meaning, and vice-versa.

Besides, it has to offer all the needed flexibility to get as precise as possible when it is needed. Do I really want to convey “enthralling”? We can use some metaphor for that with simple vocabulary: that is actually what dictionaries do to explain complex words. And that’s exactly what we do naturally when we struggle to find the exact word we mean to use.

However, “simple” is a relative concept. How many words should an international language have? 100? 1 000? 10 000? More? As we will see in the next parts, “too simple” can be very crippling. We need to find the correct balance between “simple but inconvenient” and “overly complex and hard to learn”.

We will see in the next part that many languages have served as “common languages”, and that new ones have also appeared, especially in recent history.

An international language – English is BAD – Part 1

For a time, two centuries ago, the French language shone on the Western world, and was spoken by most travelers and high society. During the 20th century, English gradually became the main international language. Yet, this language is incredibly difficult to learn for many people on this Earth.

Of course, we do need a language to communicate across countries and cultures. Even more so since we can now communicate instantly with other people all over the world thanks to technology.

However, language can be a huge wall between different people in terms of communication. Not being able to communicate, not understanding someone fully, not understanding at all, or worse, misunderstanding because of the language barrier, is extremely frustrating.

Is English really that difficult?

Although I was born in France, I’ve been lucky to have been exposed to the English language almost since birth, so I don’t mind speaking it. In fact, English serves me well, personally. However, not everyone is as lucky as I am.

Let’s face it: English is incredibly hard to learn, read, write and speak for a large portion of the world’s human beings. It’s not even easy for natives!

English sounds

Pronunciation of English sounds is very challenging for a large portion of non-native English speakers (ever heard a French or a Chinese person struggling to speak English and get the sounds right?). Many speakers of other languages can’t tell the difference between some English sounds, especially the vowels, for instance “sheet” and another word which I will let you guess. 💩 And another one is “peace”. Yes, you got that one right too.

Pronunciation

So, how do you expect people to pronounce things correctly when they can’t even hear the difference between the sounds?

Besides, pronunciation rules of written text are incredibly complex, to the point where if you don’t know some words, you wouldn’t know how to pronounce them. Think for instance of “thoroughly” and “through”, or the word “choir”. In this regard, the absence of strict rules about syllable emphasis makes it extremely challenging for non-natives. It’s alias but akin, it’s misnomer but mischievous. And even a similarly stressed syllable doesn’t guarantee the same pronunciation: the stressed “a” of alias (/ˈeɪ.li.əs/) is different from the one of alibi (/ˈæl.ə.baɪ/). For a learner of the language, these types of rules go on and on like this forever. You basically have to learn almost every single word.

Spelling rules

Accordingly, spelling is also a big challenge. If you don’t know a word, you’re often at a loss when it comes to writing something you’re hearing. What about “juggler” vs “jugular”, “able” and “abide”, etc.?

Do I also have to have the offence/offense of mentioning as an annex/annexe that the agonising/agonizing specter/spectre of the “z” (zed/zee) is always an unrivaled/unrivalled endeavor/endeavour when it comes to British vs American English?

And yes, it’s pronunciation, although it is pronounced.

And one little doubled consonant can make all the difference.

Incidentally, a comma also changes everything.

Let’s eat, kids.

Let’s eat kids.

Grammar in general and exceptions

English grammar is incredibly difficult, and tenses are a mess. Who hasn’t struggled with “has had been”, even natives?

There are exceptions everywhere. Verb conjugation of course. But more typically, prepositions are a headache to learn and can change the whole meaning of a sentence:

I took the statue in the garden. => it was in the garden, I’m taking it away

I took the statue into the garden. => I’m putting it in the garden

Exceptions always lurk around the corner:

The adjective for metal is metallic, but not so for iron.

Which is ironic.

Multiple meanings and homophones

A large proportion of words in the English language have multiple meanings. And I do mean multiple! Think about the very common word “date”: a day in the calendar, a romantic encounter, a fruit, or “old-fashioned” as in “dated”. As a computer scientist working in AI, I have to note that this also has a terrible effect on computerized processing of language: it is very hard to automatically and accurately translate the billions of texts written in English into other languages.

Homophones are also all over the place. An ant is not an aunt, especially at a bizarre bazaar. “Wine and dine” sounds like “Whine and dine” but doesn’t mean exactly the same thing…

Word order

Of course, word order can be challenging for speakers of languages whose grammar orders words differently.

In English: I go to England.

In Japanese: I (the subject) Japan (destination) go.

In Irish: Go (I) to Ireland.

In Turkish: Turkey (to) go (I). Or: I Turkey (destination) go.

But no. I’m not speaking about those. Because no matter how you look at things, there will be differences, that’s the way languages are. And word order does matter, it is quite normal. As a game, you can put the word “only” anywhere in the next sentence, and get very different meanings:

She told him that she liked him.

Here are the results:

  • Only she told him that she liked him: nobody else told him that
  • She only told him that she liked him: she didn’t say anything else
  • She told only him that she liked him: he’s the only one to whom she said that
  • She told him only that she liked him: that’s all she said, and it sounds like she could have said more
  • She told him that only she liked him: she claimed she was the only one who liked him
  • She told him that she only liked him: that may be awkward. He may love her but she’s pointing out that he’s just a friend for her…
  • She told him that she liked only him: she doesn’t like anyone else
  • She told him that she liked him only: same meaning as the previous one

But English has more tricks which make far less sense.

Adjectives, for instance. Think about a woman who is: beautiful, tall, thin, young, black-haired, Scottish. To speak correct English, you would have to describe her with those adjectives in this exact order, no other! She cannot be a young, tall woman. While this may come naturally to a native English speaker from experience since birth, it is a total headache for a non-native, who might very well think that she is a “Scottish black-haired young thin tall beautiful woman”. It hurts, doesn’t it?

Word stress

Unfortunately, this also happens orally by stressing one word in particular. In that case, there is no real way of showing this when writing, except maybe by using italic or bold fonts. In the following sentence, stressing a particular word radically changes the global meaning and context:

I never said she stole my money.

Indeed:

  • I never said she stole my money (stress on “I”): but someone else may have said it
  • I never said she stole my money (stress on “never”): I would never do that!
  • I never said she stole my money (stress on “said”): I just implied it, but never directly said it
  • I never said she stole my money (stress on “she”): I didn’t point fingers at her as the culprit
  • I never said she stole my money (stress on “stole”): she may have borrowed it… or found my lost wallet
  • I never said she stole my money (stress on “my”): but that she did steal someone else’s money
  • I never said she stole my money (stress on “money”): she stole something else from me

Again, this is quite normal and exists in most languages, but it makes comprehension difficult. People do expect you to pick up on the difference in meaning between every single one of those sentences. Of course, context helps a lot here.

Ambiguity

English can be highly ambiguous, and relies on context and/or “common sense” to interpret what is being said. But in an international context, you absolutely don’t want to rely on “common sense” since this can vary a lot depending on the culture. Consider:

My brother and I are getting married this summer.

What? Well, maybe not “to each other”.

What about:

The lady hit the man with an umbrella.

The only thing we can tell for sure is that it probably did hurt. But who actually had the umbrella in their hands remains unclear. “Their” in the previous sentence is actually singular.

I read the book.

Is this past or present?

English is a local language

English has many native speakers around the world. However, all those native speakers are speaking “their own version of English”. English is actually a local language. And it has its own dialects and cultural versions.

They actually have such different accents and vocabulary that some of them can’t even understand one another. Just picture a Scot and a Texan trying to communicate. That’s the challenge we inevitably face when reusing an existing language to make it an international one artificially.

Let’s not pretend English is mutually intelligible by anyone who learns it anywhere in the world.

To conclude

I think this incomplete list speaks for itself. Although it doesn’t technically speak, as it’s written text. Here is a last funny and well known example and we’ll leave the subject at that:

Why is it that writers write, but fingers don’t fing, grocers don’t groce, and hammers don’t ham? If the plural of tooth is teeth, why isn’t the plural of booth beeth? One goose, 2 geese. So, one moose, 2 meese? One index, two indices? Is cheese the plural of choose?

So yes, let’s face it, English is a terrible international language. The most telling part is that people have invented “Simple English” or “Basic English” to try and reduce the difficulty. That’s simply acknowledging that it is too difficult in the first place. For no real benefit. Well yes, it creates millions of jobs for English teachers and translators all around the world. But wouldn’t that energy be better spent if it was for other purposes than trying to fit a square into a circle?

Here’s a link for those of you who are not afraid to go down the rabbit hole.

In the next part, we’ll explore why any other living language is a bad candidate as an international language. Then, we’ll see alternative languages that have emerged as “common languages” between groups which spoke different languages but needed to communicate – and why none of them make a good international language. And then I’ll suggest something else.

 

Why you should never use the type “char” in Java

The post title may be blunt. But I think after reading this article, you will never use the type “char” in Java ever again.

The origin of type “char”

At the beginning, everything was ASCII, and every character on a computer could be encoded with 7 bits. While this is fine for most English texts and can also suit most European languages if you strip the accents, it definitely has its limitations. So the extended character table came, bringing a whole new range of characters to ASCII, including the infamous character 255, which looks like a space, but is not a space. Code pages then defined how to display any character between 128 and 255, to allow for different scripts and languages to be printed.

Then, Unicode brought this to a brand new level by encoding characters on… 16 bits. This is about the time when Java came out, in the mid-1990s. Thus, Java designers made the decision to build Strings out of characters encoded on 16 bits. A Java char has always been, and still is, encoded on 16 bits.

However, when integrating large numbers of characters, especially ideograms, the Unicode team understood 16 bits were not enough. So they added more bits and notified everyone: “starting now, we can encode a character with more than 16 bits”.

In order not to break compatibility with older programs, Java chars remained encoded on 16 bits. Instead of redefining a “char” as a single Unicode character, Java designers thought it best to keep the 16-bit encoding. They thus had to introduce new concepts from Unicode, such as “surrogate” chars, which indicate that one specific char is actually not a character on its own, but one half of a pair of chars that together encode a single character beyond the 16-bit range. The “extra things” that can be added to a character, such as accents, are a separate mechanism: each of them is a character in its own right that combines with the one before it.

Character variations

In fact, some characters can be thought of in different ways. For instance, the letter “ç” can be considered:

  • either as a full character on its own (this was the initial stance of Unicode),
  • or as the character “c” to which a cedilla “¸” is applied.

Both approaches have advantages and drawbacks. The first one is generally the one used in linguistics. Even double characters are considered “a character” in some languages, such as the double l “ll” in Spanish, which was long treated as a letter of its own, separate from the single letter “l”.

However, this approach is obviously very greedy with unique character numbers: you have to assign a number to every single possible variation of a character. For someone who is only familiar with English, this might seem like a moot point. However, Vietnamese, for instance, uses many variations of those appended “thingies”. The single letter “a” can take all of these variations: aàáâãặẳẵằắăậẩẫầấạả. And this goes for all other vowels as well as some consonants. Of course, the same goes for capital letters. And this is only Vietnamese.

The second approach has good virtues when it comes to transliterating text into ASCII, for instance, since transliterating becomes a simple matter of eliminating diacritics. And of course, when typing on a keyboard, you cannot possibly have one key assigned to every single variation of every character, so the second approach is a must.
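As a rough illustration of that second approach, here is a minimal sketch of diacritic removal using the standard java.text.Normalizer class. The class and method names (DiacriticStripper, stripDiacritics) are made up for this example. It assumes that decomposing the text (NFD form) and then dropping the combining marks is an acceptable transliteration, which is not true for every language.

```java
import java.text.Normalizer;

class DiacriticStripper {
    // Hypothetical helper: decompose each character into its base letter
    // followed by combining marks (NFD), then drop the combining marks.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "garçon", written with an escape to stay independent of file encoding
        System.out.println(stripDiacritics("gar\u00E7on"));    // prints garcon
        // "Việt Nam"
        System.out.println(stripDiacritics("Vi\u1EC7t Nam"));  // prints Viet Nam
    }
}
```

Letters that Unicode does not decompose, such as “đ” or “ø”, come out unchanged, so a real transliteration would need an extra mapping for them.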

Special cases: ideograms

When considering ideograms, there are also a small number of “radicals” (roughly 200 for Chinese). Those get combined together to form the large number of ideograms we know (tens of thousands).

Breakdown of a Chinese word into character, radical and stroke
A Chinese Word’s decomposition (credit: Nature: https://www.nature.com/articles/s41598-017-15536-w)

 

It would be feasible to represent any Chinese character as a combination of radicals and their positions. However, it is more compact to list all possible Chinese characters and assign a number to each of them, which is what was done by Unicode.

Korean Hangul

Another interesting case is Hangul, which is used to write Korean. Every character is actually a combination of letters and represents a syllable:

Hangul characters are syllables that can be broken down into individual phonemes.
Credit: https://www.youtube.com/watch?v=fHbkwKAIQLA

 

So, in some cases, it is easier to assign a number to every individual component and then combine them (which happens when typing in Korean on a keyboard). There are only 24 basic letters (14 consonants and 10 vowels). However, the number of combinations to form a syllable is very large: it amounts to more than 11 000, although only about 3 000 of them produce correct Korean syllables.
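To give an idea of how such a combination can be computed, here is a small sketch based on the arithmetic layout of the precomposed Hangul syllable block in Unicode, which starts at U+AC00. The class and method names are invented for the example. Note that the counts used here (19 initial consonants, 21 vowels, 28 finals including “no final”) are larger than the 24 basic letters because doubled consonants and compound vowels are counted separately.

```java
class HangulComposer {
    static final int SYLLABLE_BASE = 0xAC00; // first precomposed syllable
    static final int LEAD_COUNT = 19;        // initial consonants
    static final int VOWEL_COUNT = 21;       // medial vowels
    static final int TAIL_COUNT = 28;        // final consonants, index 0 = none

    // Combines the indexes of an initial consonant, a vowel and a final
    // consonant into the code point of the matching syllable.
    static int compose(int lead, int vowel, int tail) {
        return SYLLABLE_BASE + (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail;
    }

    // Total number of syllables that can be built this way.
    static int syllableCount() {
        return LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT;
    }

    public static void main(String[] args) {
        // Indexes 18, 0 and 4 stand for the letters h, a and n.
        int han = compose(18, 0, 4);
        System.out.println(Integer.toHexString(han));           // prints d55c
        System.out.println(new String(Character.toChars(han))); // the syllable "han"
        System.out.println(syllableCount());                    // prints 11172
    }
}
```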

Funny characters

People, especially on social media, use an increasing number of special characters, emojis, and other funny stuff, from 𝄞 to 🐻. Those have made it into Unicode, thus making it possible to write ʇxǝʇ uʍop ǝpısdn, 𝔤𝔬𝔱𝔥𝔦𝔠 𝔱𝔢𝔵𝔱, or even u̳n̳d̳e̳r̳l̳i̳n̳e̳d̳ ̳t̳e̳x̳t̳ without the need for formatting or special fonts (all the above are written without special fonts or images, those are standard Unicode characters). Flags of the world’s countries have even made it into the Unicode standard, each one encoded as a pair of “regional indicator” code points.

Many of the characters in this plethora, which made it into the standard late, need more than 16 bits for their encoding.

Using type “char” in Java

When using the type “char” in Java, you accept that things like diacritics or nonexistent characters will be thrown at you, because remember, a char is encoded with 16 bits. So, when doing "𝄞".toCharArray() or iterating through this String’s chars, Java will throw at you two characters that don’t exist on their own:

  • \uD834
  • \uDD1E

Neither of these is a valid character on its own: they are only meaningful as a pair.
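This is easy to check. The following sketch (class and method names are only for illustration) builds the character from its code point U+1D11E and prints what the String looks like when seen as chars and as code points.

```java
class SurrogateDemo {
    // The musical G clef, built from its code point U+1D11E.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Lists the 16-bit chars of a String in escaped hexadecimal form.
    static String describeChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(CLEF.length());                           // prints 2
        System.out.println(describeChars(CLEF));                     // the two surrogates, D834 then DD1E
        System.out.println(CLEF.codePointCount(0, CLEF.length()));   // prints 1
        System.out.println(Integer.toHexString(CLEF.codePointAt(0))); // prints 1d11e
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0))); // prints true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));  // prints true
    }
}
```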

Bottom line, when it comes to text, chars shouldn’t be used. Ever. In the end, as a Java developer, you have probably learned that, unless doing bit operations, you should never use String.getBytes(), and use chars instead. Well, with the new Unicode norms and the increasing use of characters above 0xFFFF, when it comes to Strings, using char is as bad as using byte.

Java type “char” will break your data

Consider this one:

 System.out.println("𝄞1".indexOf("1"));

What do you think this prints? 1? Nope. It prints 2.

Here is one of the consequences of this. Try out the following code:

System.out.println("𝄞1".substring(1));

This prints the following, which might have surprised you before reading this blog post:

?1

But after reading this post, this makes sense. Sort of.

Because substring() is actually checking chars and not code points, we are actually cutting the String which is encoded this way:

\uD834 \uDD1E   \u0031
\___________/   \____/
      𝄞           1

It is amazing that a technology such as Java hasn’t addressed the issue in a better way than this.
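That said, the standard library does offer a way to cut at a character boundary. Here is a minimal sketch that relies on String.offsetByCodePoints() to translate “skip one character” into the right char index. The helper name substringFromCodePoint is made up for this example.

```java
class CodePointSubstring {
    // Hypothetical helper: returns the part of s that remains after
    // skipping `count` code points (not chars).
    static String substringFromCodePoint(String s, int count) {
        int charIndex = s.offsetByCodePoints(0, count);
        return s.substring(charIndex);
    }

    public static void main(String[] args) {
        // The G clef (U+1D11E) followed by the digit 1.
        String text = new String(Character.toChars(0x1D11E)) + "1";
        System.out.println(text.substring(1).length());        // prints 2: half a clef, then "1"
        System.out.println(substringFromCodePoint(text, 1));   // prints 1
    }
}
```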

Unicode “code points”

Actually, it is a direct consequence of what was done at the Unicode level. If you tried to break down the character 𝄞 into 16-bit chunks, you wouldn’t get valid characters. But this character is correctly identified as U+1D11E. This is called a “code point”, and every entry in the Unicode character set has its own code point.

The downside is that an individual character, as a reader sees it, may be written with several different sequences of code points.

Indeed, the character “á” can be either of these:

  • the Unicode letter “á” on its own, encoded with U+00E1,
  • the Unicode combination of the letter “a” and its diacritic “◌́”, which results in the combination of U+0061 and U+0301.
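In practice, this means that two Strings that look identical may not be equal. A possible way to compare them, sketched below with an invented helper name (sameText), is to normalize both sides to the same form with java.text.Normalizer before comparing. NFC is used here, but that choice is an assumption of the example.

```java
import java.text.Normalizer;

class AccentForms {
    // Hypothetical helper: compares two Strings after bringing both
    // to the composed normalization form (NFC).
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E1"; // the accented letter as one code point
        String combined = "a\u0301";   // "a" followed by a combining acute accent
        System.out.println(precomposed.equals(combined));   // prints false
        System.out.println(sameText(precomposed, combined)); // prints true
    }
}
```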

Java code points instead of char

A code point in Java is a simple “int”, which corresponds to the Unicode value assigned to the character.

So when dealing with text, you should never use “char”, but “code points” instead. Rather than

"a String".toCharArray()

use

"a String".codePoints()

Instead of iterating on chars, iterate on code points. Whenever you want to check for upper case characters, digits or anything else, never use the char-based methods of class Character or String. Always use the code point counterparts.

Note that this code will actually fail with some Unicode characters:

for (int i = 0; i < string.length(); i++) {
    if (Character.isUpperCase(string.charAt(i))) {
        // ... do something
    }
}

This will iterate through characters that are NOT characters, but Unicode “code units” which are possibly… garbage.
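To make the difference concrete, here is a small sketch that counts upper case letters both ways. The class and method names are invented for the example, and the test character is U+10400, a capital letter of the Deseret alphabet that lies above 0xFFFF.

```java
class UpperCaseCounter {
    // Char-based version: looks at 16-bit code units, so it never
    // recognizes an upper case letter stored as a surrogate pair.
    static long countUpperByChar(String s) {
        long count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Code point version: uses Character.isUpperCase(int).
    static long countUpperByCodePoint(String s) {
        return s.codePoints().filter(Character::isUpperCase).count();
    }

    public static void main(String[] args) {
        String text = "A" + new String(Character.toChars(0x10400));
        System.out.println(countUpperByChar(text));      // prints 1
        System.out.println(countUpperByCodePoint(text)); // prints 2
    }
}
```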

Inserting data into a database

Consider a simple relational table:

Table “Charac”:
id 🔑 (primary key): int(11)
c (unique constraint): varchar(4)

Now imagine your Java program is inserting unique characters into this table. If it is based on “char”, the program will consider the two surrogate chars of a pair as two different characters since their codes are different, but the database will end up storing strange things, since a lone surrogate is not a valid Unicode character. At some point the unique constraint will kick in and crash your program, possibly after invalid data has already been pushed into the table.
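One way to avoid this, sketched below, is to extract the distinct characters as code points and turn each one back into a small String before handing it to the database layer. The class and method names are hypothetical, and the actual insertion code is deliberately left out since it depends on your setup.

```java
import java.util.List;
import java.util.stream.Collectors;

class DistinctCharacters {
    // Hypothetical helper: returns each distinct code point of the text
    // as its own String, so a surrogate pair is never split in two.
    static List<String> distinctCharacters(String text) {
        return text.codePoints()
                .distinct()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        List<String> values = distinctCharacters(clef + "1" + clef);
        System.out.println(values.size());          // prints 2
        System.out.println(values.get(0).length()); // prints 2: the full surrogate pair
        System.out.println(values.get(1));          // prints 1
    }
}
```

Characters made of several code points, such as a letter followed by a combining accent, are still split by this approach, so normalizing the text first may be needed.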

Alternative replacements

Instead of => Use
String.toCharArray() => String.codePoints() (to which you can append toArray() to get an int[])
String.charAt(pos) => String.codePointAt(pos)
String.indexOf(int/char) => String.indexOf(String)
iterating with String.length() => convert the String into an int[] of code points and iterate on those
String.substring() => make sure you don’t cut in the middle of a surrogate pair, or use an int[] of code points altogether
replace(char, char) => replace(CharSequence, CharSequence), replaceAll(String, String) and other replace methods using Strings
new String(char[]), new String(char[], offset, count), String.valueOf(char[]) => new String(int[] codePoints, int offset, int count) with code points
Character methods using type char => Character methods using int code points
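As an illustration of working with an int[] of code points from end to end, here is a minimal sketch that reverses a String. The class name is invented for the example. It only guarantees that surrogate pairs stay together: combining accents would still end up detached from their base letter.

```java
class CodePointReverse {
    // Reverses a String code point by code point, using the
    // String(int[], int, int) constructor to build the result.
    static String reverse(String s) {
        int[] codePoints = s.codePoints().toArray();
        for (int i = 0, j = codePoints.length - 1; i < j; i++, j--) {
            int tmp = codePoints[i];
            codePoints[i] = codePoints[j];
            codePoints[j] = tmp;
        }
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        String reversed = reverse(clef + "12");
        System.out.println(reversed.startsWith("21")); // prints true
        System.out.println(reversed.endsWith(clef));   // prints true
    }
}
```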

Learning words in foreign languages

Recently I was asked how many times you should hear a word in a foreign language before it really sticks into your mind.

Sometimes, hearing/reading one word one single time in the right context will imprint it into your mind forever. And sometimes, you will repeat one word 100 times and it will not stick. Spaced repetition is a powerful way to get the words to stick while reducing the number of times you are exposed to each word, but it is not magical either. With the wrong context, you may also fail with spaced repetition.

If I learned one thing from decades of studying, it is that context is everything. That’s why trying to immerse yourself in a certain context while learning a language is important. The best of all is to simply be in the country of the language you’re learning. But as it is not always possible, here are a few tricks.

When I use Anki to do spaced repetition, I listen to some music in the language I’m learning while repeating the words. This switches the brain into the mode “oh, that’s this language, okay!”, as well as cheering you up and setting up a mood. You might even want to tap your feet with the rhythm while learning words. And on all my cards, I have an image of something that is characteristic of the country/ies where that language is spoken, as well as some sentences in which the word is used – because learning a word by itself is boring, and learning it within a sentence makes it more interesting. I will make a post later to explain how I did it technically. Associating a picture with the word also helps quite a bit, especially for physical objects.

Teachers know that bored students don’t learn anything. That’s why teachers who make their classes very emotionally alive are more successful than others. There are some very serious scientific studies on this, but I’m sure you remember from your own experience that one teacher who stood out above all others because his classes were so lively, funny or exhilarating.

And of course it all depends on the language you’re learning and the language(s) you already know. The learning curve of Japanese or Arabic is obviously much steeper for a native monolingual English speaker than that of German for someone who knows Dutch and Danish.

So there is no “number of times for a word to stick”: it’s all about context!