We need an international language – it’s NOT English – Part 1

For a time, two centuries ago, the French language shone across the Western world and was spoken by most travelers and high society. During the 20th century, English gradually became the main international language. Yet this language is incredibly difficult to learn for many people on this Earth.

Of course, we do need a language to communicate across countries and cultures. Even more so since we can now communicate instantly with other people all over the world thanks to technology.

However, language can be a huge wall between people. Not being able to communicate, not understanding someone fully, not understanding at all, or worse, misunderstanding because of the language barrier, is extremely frustrating.

Is English really that difficult?

Although I was born in France, I’ve been lucky to have been exposed to the English language almost since birth, so I don’t mind speaking it. In fact, English serves me well, personally. However, not everyone is as lucky as I am.

Let’s face it: English is incredibly hard to learn, read, write and speak for a large portion of the world’s population. It’s not even easy for natives!

English sounds

Pronunciation of English sounds is very challenging for a large portion of non-native English speakers (ever heard a French or a Chinese person struggling to speak English and to get the sounds right?). Many speakers of other languages can’t tell the difference between some English sounds, especially the vowels: for instance “sheet” and another word which I will let you guess. 💩 Another one is “peace”. Yes, you got that one right too.

Pronunciation

So, how do you expect people to pronounce things correctly when they can’t even hear the difference between the sounds?

Besides, the pronunciation rules of written text are incredibly complex, to the point where, if you don’t know a word, you won’t know how to pronounce it. Think for instance of “thoroughly” and “through”, or the word “choir”. In this regard, the absence of strict rules about syllable stress makes things extremely challenging for non-natives. It’s alias but akin, it’s misnomer but mischievous. And even a similarly stressed syllable doesn’t guarantee the same pronunciation: the stressed “a” of alias (/ˈeɪ.li.əs/) is different from the one of alibi (/ˈæl.ə.baɪ/). For a learner of the language, these rules go on and on forever. You basically have to learn almost every single word.

Spelling rules

Accordingly, spelling is also a big challenge. If you don’t know a word, you’re often at a loss when it comes to writing something you’re hearing. What about “juggler” vs “jugular”, “able” and “abide”, etc.?

Do I also have to have the offence/offense of mentioning as an annex/annexe that the agonising/agonizing specter/spectre of the “z” (zed/zee) is always an unrivaled/unrivalled endeavor/endeavour when it comes to British vs American English?

And yes, the noun is spelled pronunciation, even though the verb is pronounced.

And one little doubled consonant can make all the difference.

Incidentally, a comma also changes everything.

Let’s eat, kids.

Let’s eat kids.

Grammar in general and exceptions

English grammar is incredibly difficult, and tenses are a mess. Who hasn’t struggled with “has had been”, even natives?

There are exceptions everywhere. Verb conjugation of course. But more typically, prepositions are a headache to learn and can change the whole meaning of a sentence:

I took the statue in the garden. => it was in the garden, I’m taking it away

I took the statue into the garden. => I’m putting it in the garden

Exceptions always lurk around the corner:

The adjective for metal is metallic, but not so for iron.

Which is ironic.

Multiple meanings and homophones

A large proportion of words in the English language have multiple meanings. And I do mean multiple! Think about the very common word “date”: a day in the calendar, a romantic encounter, a fruit, or “old-fashioned” as in “dated”. As a computer scientist working in AI, I have to note that this also has a terrible effect on the computerized processing of language: it is very hard to automatically translate the billions of texts written in English accurately into other languages.

Homophones are also all over the place. An ant is not an aunt, especially at a bizarre bazaar. “Wine and dine” sounds like “Whine and dine” but doesn’t mean exactly the same thing…

Word order

Of course, word order can be challenging for speakers of languages whose grammar orders words differently.

In English: I go to England.

In Japanese: I (the subject) Japan (destination) go.

In Irish: Go (I) to Ireland.

In Turkish: Turkey (to) go (I). Or: I Turkey (destination) go.

But no, I’m not speaking about those. No matter how you look at things, there will be differences; that’s the way languages are. And word order does matter, which is quite normal. As a game, you can put the word “only” anywhere in the next sentence and get very different meanings:

She told him that she liked him.

Here are the results:

  • Only she told him that she liked him: nobody else told him that
  • She only told him that she liked him: she didn’t say anything else
  • She told only him that she liked him: he’s the only one to whom she said that
  • She told him only that she liked him: that’s all she said, and it sounds like she could have said more
  • She told him that only she liked him: she claimed she was the only one who liked him
  • She told him that she only liked him: that may be awkward. He may love her but she’s pointing out that he’s just a friend for her…
  • She told him that she liked only him: she doesn’t like anyone else
  • She told him that she liked him only: same meaning as the previous one

But English has more tricks that make far less sense.

Adjectives, for instance. Think about a woman who is: beautiful, tall, thin, young, black-haired, Scottish. To speak correct English, you would have to describe her with those adjectives in this exact order, no other! She cannot be a young, tall woman. While this may come naturally to a native English speaker through lifelong exposure, it is a total headache for a non-native, who might very well think that she is a “Scottish black-haired young thin tall beautiful woman”. It hurts, doesn’t it?

Word stress

Unfortunately, the same thing also happens orally, by stressing one word in particular. In that case, there is no real way of showing this in writing, except maybe by using italics or bold. In the following sentence, stressing a particular word radically changes the global meaning and context:

I never said she stole my money.

Indeed:

  • I never said she stole my money: but someone else may have said it
  • I never said she stole my money: I would never do that!
  • I never said she stole my money: I just implied it, but never directly said it
  • I never said she stole my money: I didn’t point fingers at her as the culprit
  • I never said she stole my money: she may have borrowed it… or found my lost wallet
  • I never said she stole my money: but she did steal someone else’s money
  • I never said she stole my money: she stole something else from me

Again, this is quite normal and exists in most languages, but it makes comprehension difficult. People do expect you to pick up on the difference in meaning of every single one of those sentences. Of course, context helps a lot here.

Ambiguity

English can be highly ambiguous, and relies on context and/or “common sense” to interpret what is being said. But in an international context, you absolutely don’t want to rely on “common sense” since this can vary a lot depending on the culture. Consider:

My brother and I are getting married this summer.

What? Well, maybe not “to each other”.

What about:

The lady hit the man with an umbrella.

The only thing we can tell for sure is that it probably did hurt. But who actually had the umbrella in their hands remains unclear. “Their” in the previous sentence is actually singular.

I read the book.

Is this past or present?

English is a local language

English has many native speakers around the world. However, all those native speakers are speaking “their own version of English”. English is actually a local language. And it has its own dialects and cultural versions.

They actually have such different accents and vocabulary that some of them can’t even understand one another. Just picture a Scot and a Texan trying to communicate. That’s the challenge we inevitably face when reusing an existing language to make it an international one artificially.

Let’s not pretend English is mutually intelligible by anyone who learns it anywhere in the world.

To conclude

I think this incomplete list speaks for itself. Although it doesn’t technically speak, as it’s written text. Here is a last funny and well-known example, and we’ll leave the subject at that:

Why is it that writers write, but fingers don’t fing, grocers don’t groce, and hammers don’t ham? If the plural of tooth is teeth, why isn’t the plural of booth beeth? One goose, 2 geese. So, one moose, 2 meese? One index, two indices? Is cheese the plural of choose?

So yes, let’s face it, English is a terrible international language. The most telling part is that people have invented “Simple English” or “Basic English” to try and reduce the difficulty. That’s simply acknowledging that it is too difficult in the first place. For no real benefit. Well yes, it creates millions of jobs for English teachers and translators all around the world. But wouldn’t that energy be better spent on other purposes than trying to fit a square peg into a round hole?

Here’s a link for those of you who are not afraid to go down the rabbit hole.

In the next part (soon to be published), we’ll explore alternative languages that have emerged as “common languages” between groups which spoke different languages but needed to communicate – and why none of them make a good international language. And then I’ll suggest something else.

 

Why you should never use the type “char” in Java

The post title may be blunt. But I think after reading this article, you will never use the type “char” in Java ever again.

The origin of type “char”

In the beginning, everything was ASCII, and every character on a computer could be encoded with 7 bits. While this is fine for most English text, and can also suit most European languages if you strip the accents, it definitely has its limitations. So the extended character table came, bringing a full new range of characters to ASCII, including the infamous character 255, which looks like a space but is not a space. Code pages then defined how to display the characters between 128 and 255, so that different scripts and languages could be printed.

Then Unicode brought this to a brand new level by encoding characters on… 16 bits. This is about the time Java came out, in the mid-1990s. Thus, Java’s designers made the decision to encode Strings with characters encoded on 16 bits. Every Java char has always been, and still is, encoded with 16 bits.

However, when integrating large numbers of characters, especially ideograms, the Unicode team understood that 16 bits were not enough. So they added more bits and notified everyone: “starting now, a character can be encoded with more than 16 bits”.

In order not to break compatibility with older programs, Java chars remained encoded with 16 bits. Instead of redefining a “char” as a single Unicode character, Java’s designers thought it best to keep the 16-bit encoding. They thus had to introduce the new concepts from Unicode, such as “surrogate” chars, which indicate that one specific char is not a character on its own but one half of a pair encoding a character beyond the 16-bit range, and combining marks, such as accents, which are added on top of a base character.

Character variations

In fact, some characters can be thought of in different ways. For instance, the letter “ç” can be considered:

  • either as a full character on its own (this was the initial stance of Unicode),
  • or as the character “c” to which a cedilla “¸” is applied.

Both approaches have advantages and drawbacks. The first one is generally the one used in linguistics. Even double characters are considered “a character” in some languages, such as the double l “ll” in Spanish, which is traditionally treated as a letter in its own right, separate from the single letter “l”.

However, this approach is obviously very greedy in terms of unique character numbers: you have to assign a number to every single possible variation of a character. For someone who is only familiar with English, this might seem like a moot point. However, Vietnamese, for instance, uses many variations of those appended “thingies”. The single letter “a” can take all of these individual variations: aàáâãặẳẵằắăậẩẫầấạả. The same goes for all the other vowels, for some consonants, and of course for capital letters. And this is only Vietnamese.

The second approach has good virtues when it comes to transliterating text into ASCII, for instance, since transliterating becomes a simple matter of eliminating diacritics. And of course, when typing on a keyboard, you cannot possibly have one key assigned to every single variation of every character, so the second approach is a must.

Special cases: ideograms

When it comes to ideograms, there is also a small set of “radicals” (roughly 200 for Chinese). These are combined to form the large number of ideograms we know (tens of thousands).

Breakdown of a Chinese word into characters, radicals and strokes.
Credit: Nature, https://www.nature.com/articles/s41598-017-15536-w

 

It would be feasible to represent any Chinese character by its radicals and their positions. However, it is more compact to list all possible Chinese characters and assign a number to each of them, which is what Unicode did.

Korean Hangul

Another interesting case is Hangul, which is used to write Korean. Every character is actually a combination of letters and represents a syllable:

Hangul characters are syllables that can be broken down into individual phonemes.
Credit: https://www.youtube.com/watch?v=fHbkwKAIQLA

 

So, in some cases, it is easier to assign a number to each individual component and then combine them (which is what happens when typing Korean on a keyboard). There are only 24 basic letters (14 consonants and 10 vowels). However, the number of combinations that can form a syllable is very large: it amounts to more than 11 000, although only about 3 000 of them produce correct Korean syllables.
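
For reference, that figure comes from the syllable structure used by Unicode’s precomposed Hangul block, which allows 19 possible initial consonants, 21 vowels and 28 possible finals (including “no final”): 19 × 21 × 28 = 11 172 combinations.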

Funny characters

People, especially on social media, use an increasing number of special characters, emojis, and other funny stuff, from 𝄞 to 🐻. Those have made it into Unicode, thus making it possible to write ʇxǝʇ uʍop ǝpısdn, 𝔤𝔬𝔱𝔥𝔦𝔠 𝔱𝔢𝔵𝔱, or even u̳n̳d̳e̳r̳l̳i̳n̳e̳d̳ ̳t̳e̳x̳t̳ without the need for formatting or special fonts (all of the above is written without special fonts or images; those are standard Unicode characters). The flags of the world’s countries have even made it into the Unicode norm, each displayed as a single glyph (though encoded as a pair of regional indicator symbols).

This plethora of new characters, which made it into the standard late, often requires more than 16 bits per character for its encoding.

Using type “char” in Java

When using the type “char” in Java, you accept that things like diacritics or non-existent characters will be thrown at you, because remember, a char is encoded with 16 bits. So, when doing "𝄞".toCharArray() or iterating through this String’s chars, Java will throw at you two chars that don’t exist on their own:

  • \uD834
  • \uDD1E

Both of those chars are invalid on their own; they only exist as a pair, known as a surrogate pair.
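
To see this concretely, here is a minimal sketch (the variable name clef is mine; everything else is the standard JDK API):

String clef = "𝄞";                                           // U+1D11E MUSICAL SYMBOL G CLEF
System.out.println(clef.length());                           // prints 2: two chars...
System.out.println(clef.codePointCount(0, clef.length()));   // prints 1: ...but only one real character
for (char c : clef.toCharArray()) {
    System.out.printf("\\u%04X%n", (int) c);                 // prints the two surrogate values, D834 and DD1E
}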

Bottom line: when it comes to text, chars shouldn’t be used. Ever. As a Java developer, you have probably learned that, unless you are doing bit operations, you should never use String.getBytes() and should work with chars instead. Well, with the newer Unicode norms and the increasing use of characters above 0xFFFF, when it comes to Strings, using char is as bad as using byte.

Java type “char” will break your data

Consider this one:

 System.out.println("𝄞1".indexOf("1"));

What do you think this prints? 1? Nope. It prints 2.

Here is one of the consequences of this. Try out the following code:

System.out.println("𝄞1".substring(1));

This prints the following, which might have surprised you before reading this blog post:

?1

But after reading this post, this makes sense. Sort of.

Because substring() actually counts chars and not code points, we are cutting through the middle of the String, which is encoded this way:

\uD834 \uDD1E   \u0031
\___________/   \____/
      𝄞            1

It is amazing that a technology such as Java hasn’t addressed the issue in a better way than this.
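
If you really need to cut a String by counting characters, here is a hedged sketch of a code-point-safe way to do it (offsetByCodePoints is a standard String method; the variable names are mine):

String s = "𝄞1";
int start = s.offsetByCodePoints(0, 1);   // advance by one code point, not one char
System.out.println(s.substring(start));   // prints "1", as expected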

Unicode “code points”

Actually, this is a direct consequence of what was done at the Unicode level. If you tried to break down the character 𝄞 into 16-bit chunks, you wouldn’t get valid characters. This character is correctly encoded as U+1D11E. This is called a “code point”, and every entry in the Unicode character set has its own code point.

The downside is that an individual character may have several possible code point representations.

Indeed, the character “á” can be either of these:

  • the Unicode letter “á” on its own, encoded with U+00E1,
  • the combination of the letter “a” (U+0061) and the combining acute accent “◌́” (U+0301).
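
For completeness, the standard JDK ships java.text.Normalizer to convert between such representations; here is a small illustrative sketch (the variable names are mine):

import java.text.Normalizer;

String composed   = "\u00E1";   // á as a single code point
String decomposed = "a\u0301";  // a followed by the combining acute accent
System.out.println(composed.equals(decomposed));            // false: different code point sequences
System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                             .equals(composed));            // true: identical once normalized to NFC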

Java code points instead of char

A code point in Java is a simple “int”, which corresponds to the Unicode value assigned to the character.

So when dealing with text, you should never use “char”, but “code points” instead. Rather than

"a String".toCharArray()

use

"a String".codePoints()

Instead of iterating on chars, iterate on code points. Whenever you want to check for upper case characters, digits or anything else, never use the char-based methods of class Character or String. Always use the code point counterparts.

Note that this code will actually fail with some Unicode characters:

for (int i = 0; i < string.length(); i++) {
    if (Character.isUpperCase(string.charAt(i))) {
        // ... do something
    }
}

This will iterate through characters that are NOT characters, but Unicode “code units” which are possibly… garbage.
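
Here is a sketch of the same check done with code points instead of chars (codePointAt, isUpperCase(int) and charCount are standard methods; only the loop shape is mine):

for (int i = 0; i < string.length(); ) {
    int cp = string.codePointAt(i);        // reads the full code point, even above 0xFFFF
    if (Character.isUpperCase(cp)) {
        // ... do something
    }
    i += Character.charCount(cp);          // advance by 1 or 2 chars, depending on the code point
}

Or, more idiomatically, filter the IntStream returned by string.codePoints() with Character::isUpperCase.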

Inserting data into a database

Consider a simple relational table:

Table Charac:
  id  int(11)     🔑 primary key
  c   varchar(4)  unique constraint

Now imagine your Java program is inserting unique characters into this table. If it is based on “char”, the Java program will consider two different surrogate chars to be different, since their codes differ; but the database will store strange things at some point, since lone surrogates are not valid Unicode. The unique constraint will then kick in and crash your program, and invalid Unicode codes may end up being pushed into the table.
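
One defensive option is to reject any String containing a lone surrogate before it reaches the database. A hedged sketch (the helper name isWellFormed is mine; the Character constants and methods are standard):

// Returns false if the String contains a lone surrogate, i.e. text that is not valid Unicode
static boolean isWellFormed(String s) {
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
            return false;                  // a surrogate code unit without its matching half
        }
        i += Character.charCount(cp);
    }
    return true;
}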

Alternative replacements

  • String.toCharArray() → String.codePoints() (to which you can append toArray() to get an int[])
  • String.charAt(pos) → String.codePointAt(pos)
  • String.indexOf(int/char) → String.indexOf(String)
  • iterating with String.length() → convert the String into an int[] of code points and iterate on those
  • String.substring() → make sure you don’t cut between a surrogate pair, or work on an int[] of code points altogether
  • replace(char, char) → replaceAll(String, String) and the other replace methods taking Strings
  • new String(char[]), new String(char[], offset, count), String.valueOf(char[]) → new String(int[] codePoints, int offset, int count) with code points
  • Character methods taking a char → the Character methods taking an int code point
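
As a small usage sketch of the int[]-of-code-points approach listed above (the variable names are mine):

int[] codePoints = "a𝄞b".codePoints().toArray();             // 3 code points, although the String holds 4 chars
String back = new String(codePoints, 0, codePoints.length);  // rebuilds "a𝄞b" without breaking the surrogate pair
System.out.println(back.equals("a𝄞b"));                      // prints true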

Learning words in foreign languages

Recently I was asked how many times you should hear a word in a foreign language before it really sticks in your mind.

Sometimes, hearing/reading one word one single time in the right context will imprint it into your mind forever. And sometimes, you will repeat one word 100 times and it will not stick. Spaced repetition is a powerful way to get the words to stick while reducing the number of times you are exposed to each word, but it is not magical either. With the wrong context, you may also fail with spaced repetition.

If I learned one thing from decades of studying, it is that context is everything. That’s why trying to immerse yourself in a relevant context while learning a language is important. The best of all is simply to be in the country where the language you’re learning is spoken. But as that is not always possible, here are a few tricks.

When I use Anki for spaced repetition, I listen to some music in the language I’m learning while reviewing the words. This switches the brain into the mode “oh, that’s this language, okay!”, as well as cheering you up and setting the mood. You might even want to tap your feet to the rhythm while learning words. And on all my cards, I have an image of something characteristic of the country/ies where that language is spoken, as well as a few sentences in which the word is used – because learning a word by itself is boring, and learning it within a sentence makes it more interesting. I will make a post later to explain how I did it technically. Associating a picture with the word also helps quite a bit, especially for physical objects.

Teachers know that bored students don’t learn anything. That’s why teachers who make their classes emotionally alive are more successful than others. There are some very serious scientific studies on this, but I’m sure you can recall from your own experience that one teacher who stood above all others because their classes were so lively, funny or exhilarating.

And of course, it all depends on the language you’re learning and the language(s) you already know. The learning curve of Japanese or Arabic is obviously much steeper for a monolingual native English speaker than that of German for someone who knows Dutch and Danish.

So there is no “number of times for a word to stick”, it’s all about context!