Why you should never use the type “char” in Java

The post title may be blunt. But I think after reading this article, you will never use the type “char” in Java ever again.

The origin of type “char”

At the beginning, everything was ASCII, and every character on a computer could be encoded with 7 bits. While this is fine for most English texts and can also suit most European languages if you strip the accents, it definitely has its limitations. So the extended character table came, bringing a full new range of characters to ASCII, including the infamous character 255, which looks like a space, but is not a space. And code pages were defining how to show any character between 128 and 255, to allow for different scripts and languages to be printed.

Then, Unicode brought this to a brand new level by encoding characters on… 16 bits. This is about the time when Java came out in the mid-1990s. Thus, Java designers made the decision to encode Strings with characters encoded on 16 bits. All Java char has always been and is still encoded with 16 bits.

However, when integrating large numbers of characters, especially ideograms, the Unicode team understood 16 bits were not enough. So they added more bits and notified everyone: “starting now, we can encode a character with more than 16 bits”.

In order not to break compatibility with older programs, Java chars remained encoded with 16 bits. Instead of seeing a “char” as a single Unicode character, Java designers thought it best to keep the 16 bits encoding. They thus had to introduce the new concepts from Unicode, such as “surrogate” chars to indicate that one specific char is actually not a character, but an “extra thing”, such as an accent, which can be added to a character.

Character variations

In fact, some characters can be thought of in different ways. For instance, the letter “ç” can be considered:

  • either as a full character on its own, this was the initial stance of Unicode,
  • either as the character “c” on which a cedilla “¸” is applied.

Both approaches have advantages and drawbacks. The first one is generally the one used in linguistics. Even double characters are considered “a character” in some languages, such as the double l “ll” in Spanish which is considered as a letter on its own, separate from the single letter “l”.

However, this approach is obviously very greedy with individual character unique numbers: you have to assign a number to every single possible variation of a character. For someone who is only familiar with English, this might seem like a moot point. However, Vietnamese, for instance, uses many variations of those appended “thingies”. The single letter “a”, can follow all those individual variations: aàáâãặẳẵằắăậẩẫầấạả. And this goes for all other vowels as well as some consonants. Of course, the same goes for capital letters. And this is only Vietnamese.

The second approach has good virtues when it comes to transliterating text into ASCII, for instance, since transliterating becomes a simple matter of eliminating diacritics. And of course, when typing on a keyboard, you cannot possibly have one key assigned to every single variation of every character, so the second approach is a must.

Special cases: ideograms

When considering ideograms, there are also a small number of “radicals” (roughly 200 for Chinese). Those get combined together to form the large number of ideograms we know (tens of thousands).

Breakdown of a chinese word into character, radical and stroke
A Chinese Word’s decomposition (credit: Nature: https://www.nature.com/articles/s41598-017-15536-w)

 

It would be feasible to represent any Chinese character using a representation using radicals and their position. However, it is more compact to list all possible Chinese characters and assign a number to each of them, which is what was done by Unicode.

Korean Hangul

Another interesting case is Hangul, which is used to write Korean. Every character is actually a combination of letters and represents a syllable:

Hangul characters are syllables that can be broken down into individual phonemes.
Credit: https://www.youtube.com/watch?v=fHbkwKAIQLA

 

So, in some cases, it is easier to assign a number to every individual components and then combine them (which happens when typing in Korean on a keyboard). There are only 24 letters (14 vowels and 10 consonants). However, the number of combinations to form a syllable is very large: it amounts to more than 11 000, although only about 3 000 of them produce correct Korean syllables.

Funny characters

People, especially in social media, use an increasing number of special characters, emojis, and other funny stuff, from 𝄞 to 🐻. Those have made it into Unicode, thus making it possible to write ʇxǝʇ uʍop ǝpısdn, 𝔤𝔬𝔱𝔥𝔦𝔠 𝔱𝔢𝔵𝔱, or even u̳n̳d̳e̳r̳l̳i̳n̳e̳d̳ ̳t̳e̳x̳t̳ without the need for formatting or special fonts (all the above are written without special fonts or images, those are standard Unicode characters). Every flag of the world’s countries have even made it as a single character into the Unicode norm.

This plethora of new characters which made it late into the standard are often using more than 16 bits for their encoding.

Using type “char” in Java

When using the type “char” in Java, you accept that things like diacritics or non existent characters will be thrown at you, because remember, a char is encoded with 16 bits. So, when doing “𝄞”.toCharArray() or iterating through this String’s chars, Java will throw at you two characters that don’t exist on their own:

  • \uD834
  • \uDD1E

Both those characters are illegal, and they only exist as a pair of characters.

Bottom line, when it comes to text, chars shouldn’t be used. Ever. In the end, as a Java developer, you have probably learned that, unless doing bit operations, you should never use String.getBytes(), and use chars instead. Well, with the new Unicode norms and the increasing use of characters above 0xFFFF, when it comes to Strings, using char is as bad as using byte.

Java type “char” will break your data

Consider this one:

 System.out.println("𝄞1".indexOf("1"));

What do you think this prints? 1? Nope. It prints 2.

Here is one of the consequences of this. Try out the following code:

System.out.println("𝄞1".substring(1))

This prints the following, which might have surprised you before reading this blog post:

?1

But after reading this post, this makes sense. Sort of.

Because substring() is actually checking chars and not code points, we are actually cutting the String which is encoded this way:

\uD834 \uDD1E \u0031
\___________/ \____/
𝄞 1

It is amazing that a technology such as Java hasn’t addressed the issue in a better way than this.

Unicode “code points”

Actually, it is a direct consequence of what was done at the Unicode level. If you tried to break down the character 𝄞 into 16 bits chunks, you wouldn’t get valid characters. But this character is correctly encoded with U+1D11E. This is called a “code point”, and every entry in the Unicode character set has its own code point.

The down side is that an individual character may have several code points.

Indeed, the character “á” can be either of these:

  • the Unicode letter “á” on its own, encoded with U+00E1,
  • the Unicode combination of the letter “a” and its diacritic “◌́”, which results in the combination of U+0061 and U+0301.

Java code points instead of char

A code point in Java is a simple “int”, which corresponds to the Unicode value assigned to the character.

So when dealing with text, you should never use “char”, but “code points” instead. Rather than

“a String”.toCharArray()

use

“a String”.codePoints()

Instead of iterating on chars, iterate on code points. Whenever you want to check for upper case characters, digits or anything else, never use the char-based methods of class Character or String. Always use the code point counterparts.

Note that this code will actually fail with some Unicode characters:

for (int i = 0 ; i < string.length() ; i++)
   if (Character.isUpperCase(string.charAt(i)))
        ... do something

This will iterate through characters that are NOT characters, but Unicode “code units” which are possibly… garbage.

Inserting data into a database

Consider a simple relational table:

Charac
id 🔑 (primary key) int(11)
c (unique constraint) varchar(4)

Now imagine your java program is inserting unique characters in this table. If based on “char” the Java program will consider two different surrogate chars as different since their code are different, but the database will store strange things at some point since those are not valid Unicode codes. And the unique constraint will kick in, crashing your program, and possibly allowing wrong Unicode codes to be pushed into the table.

Alternative replacements

String.toCharArray() String.codePoints() (to which you can append toArray() to get an int[])
String.charAt(pos) String.codePointAt(pos)
String.indexOf(int/char) String.indexOf(String)
iterate with String.length() convert String into an int[] of code points and iterate on those
String.substring() Make sure you don’t cut between a surrogate pair. Or use int[] of code points altogether.
replace(char, char) replaceAll(String, String) and other replace methods using Strings

new String(char[])
new String(char[], offset, count)
String.valueOf(char[])

new String(int[] codePoints, int offset, int count) with code points
Character methods using type char Character methods using int code points

Surveiller la mortalité « toutes causes » en open source

Introduction

La transparence est essentielle pour construire une société de confiance. Étant moi-même toujours en doute avec les informations que je rencontre, quelle que soit leur source, je me suis posé beaucoup de questions sur la mortalité depuis le début de l’épidémie de Covid. J’ai d’ailleurs posté des compte-rendus détaillés (et moins détaillés) de ce que je trouvais sur la Covid, ainsi que d’autres informations ici, , et et ailleurs.

L’un des principaux problèmes qu’un citoyen lambda rencontre très vite, c’est la difficulté à évaluer des données nationales qui sont par définition des agrégats, parfois des agrégats d’agrégats… sachant que tout le monde y va de ses propres interprétations. Il me fallait donc repartir de données brutes difficilement falsifiables ou manipulables.

Quelles données ?

Par ailleurs, la comptabilisation des « morts Covid » est toujours biaisée, car comme je l’indique dans cet autre billet, on meurt le plus souvent d’une multitude de facteurs, pas seulement de la Covid (ou d’une autre maladie).

Il ne nous reste donc plus, à nous autres citoyens lambda, la mortalité « toutes causes », qui représente une réalité non biaisée : telle personne est morte à telle endroit à telle date. Or, l’INSEE fournit exactement ces données détaillées. Les données ont toujours 1 mois et demi à 2 mois de retard. Cela permet tout de même de regarder rétrospectivement ce qui s’est passé pour pouvoir anticiper ce qui va arriver. Mais surtout, cela permet de juger du degré de fiabilité de diverses sources d’informations lors des événements passés, et ainsi de se construire un indice de confiance sur telle ou telle source.

J’ai donc développé un programme pour analyser ces données et calculer des courbes de mortalité à partir de ces données brutes. J’ai également mis en place un site pour afficher les résultats, que vous pouvez consulter (le site est chez moi, il est donc possible qu’il ne soit pas toujours disponible et qu’il donne une vue erronée au moment où je fais des mises à jour des données, une fois par mois).

Petit manuel de l’utilisateur

Quelques petites remarques sur le site :

  • les données proviennent directement des données brutes, filtrées (car il y a des doublons dans les fichiers de l’INSEE),
  • le graphe correspond à une année entière, donc les données les plus à gauche correspondent au début de l’année, et celles les plus à droite à la fin de l’année,
  • les différents types d’affichage sont :
    • absolu : c’est le nombre de morts chaque jour,
    • pondéré : le nombre de morts pour 1 million d’habitants, ce qui permet de comparer d’une année sur l’autre puisque la population varie d’une année sur l’autre,
    • étalé : c’est la moyenne glissante du nombre de morts pour 1 million d’habitants sur 10 jours, cela permet d’avoir une version « lissée » des courbes,
  • on peut filtrer par département (ou par pays étranger) pour observer la mortalité dans une région donnée et ainsi cibler des événements locaux,
  • on peut également filtrer par tranches d’âges pour voir l’impact des événements en fonction des âges,
  • on peut filtrer avec un intervalle d’années et/ou de mois, afin de pouvoir comparer des années proches sans le « bruit » des autres années,
  • lorsqu’une seule année est sélectionnée (filtrage de l’année x vers la même année x), l’affichage bascule en mortalité par tranche d’âges, pour analyser un événement en particulier, par exemple.

Transparence

Dans un souci de transparence et pour faire marcher l’intelligence collective, je publie les sources de mes programmes en open source. Ainsi, chacun peut vérifier mon code, et même l’installer chez soi, le modifier, l’améliorer, et se faire sa propre opinion en toute indépendance.

Le programme java de remplissage de la base de données : https://gitlab.com/jytou/country-death-observe-java

Le site web pour en faire des courbes : https://gitlab.com/jytou/country-death-observe-php

GANN-EWN / Part 2 / Building a Hand-Crafted Player For EWN

Welcome to this second part of the GANN-EWN (Genetic Algorithms for Neural Networks applied to the game “Einstein Würfelt Nicht”) series! If you haven’t read it yet, it’s probably time to read the first part.

The game is quite simple: at the beginning of the game every player has 6 pieces labeled from 1 to 6, and every player throws a die at every turn. Depending on the result of the throw, the player may move one or two of his own pieces. The goal is to reach the opposite side of the board (diagonally) or capture all opponent’s pieces.

The rules with the dice are simple as well. If the die corresponds exactly to one of his pieces, he has to play that piece in any direction forward (the three squares: up, up left, or left), like in here where the die shows “5”:

If no piece with the same number are on the board, then one of the two pieces that are the closest to the die can be played. For instance, if the die was “2”, then the “1” and “3” can be played here:

Again, if the die is “3” then on this board the “2” or the “5” can be played:

Note that if there was no “2” on this board, then only the “5” could be played since there is no lower number than “3” on the board.

The first thing to do was to code a simple board implementing the rules of the game, which is quite straightforward. I chose Java, which is probably the programming language I’m most comfortable with, but also for another reason, which I will develop later.

To have something to compare my neural networks to, and also to test my implementation of the EWN Board and game logic, my first first step was to implement a simple (and I can call it quite “naive”) implementation of a player, based on my very limited experience of the game. Remember that I just discovered EWN, so after having a look at some games from the best players, I realized that one common strategy is to “eat up” some of your own pieces early in the game, to give momentum to your remaining pieces. Although this is a “classic” strategy, it does have one drawback: you opponent may be able to take all your remaining pieces and you lose. Nevertheless, I decided to implement this rule, along with a few others. Here is the list of rules that are hard-coded in this first player:

  • if one move is a winning move (capture the last opponent’s piece or reach the opposite side of the board), then play it,
  • if an opponent piece is close to winning (basically at a certain distance from the goal), always take it,
  • always take your own pieces if they are within a certain number (typically from 2 to 5) if you have more than a certain amount of pieces (you don’t want to take your pieces if you have too few of them, that could be suicide in some situations),
  • move forward always the most advanced piece (rushing towards the goal),
  • preferably move forward-up rather than simply forward or up.

And that’s it!

So yes, this is a very simple player, but one funny fact is that this player beats me consistently! That’s how weak I am at EWN as a player.

But as those rules can have parameters (how far must an adversary piece so that we capture it? etc.), I ran a simple computer simulation using a range of possible parameters (including totally disabling a rule). And those simulation showed me that, by far, the best parameters are:

  • capture pieces that are at most at a distance of 2 from the goal, but actually this rule is not very important,
  • take your own pieces from 2 to 5 until you have only 2 pieces left,
  • move always forward-up,
  • do advance the most advanced piece.

Of course, every single rule taken on its own may have some exceptions, but my “simple player” doesn’t go that far.

I also created a “random” player which picks moves at random, again to have something to compare to that doesn’t vary much in its crappiness. 🙂 And indeed my simple player was definitely better than the random one, at least.

I published this first player on the littlegolem site, with a different user than my regular user.

This is when I discovered that the site was actually crawling with bots playing EWN! Nice, the challenge was getting interesting! That’s when I met with a German researcher who also wrote a bot to play EWN on littlegolem. He even made part of his PhD on this project, so needless to say that my own simple player was quite ridiculous compared to his, and got a very bad beating!

This is when I realized that EWN is actually a statistics-driven game, since you make a choice depending on the probabilities of the next dice throws. You have a “2” ahead and want to give it momentum to reach the opposite side of the board faster? Capture your own “1”, “3” and possibly “4” and “5” pieces. The enemy has one piece slightly ahead but that piece has a 1/6 chance to be picked by the dice? Just ignore it. But you will capture that piece if it has a 1/2 chance of being played. As the Wikipedia page points out, one of the first approach is to calculate probability tables for each piece to reach the goal. All the dynamics of that game are based on probabilities. Well, almost. So that makes it a very good candidate for a Monte-Carlo type approach.

As a reminder, the general principle behind the Monte-Carlo approach is: “at a given turn, for each possible move play as many random games as possible and play the one that leads to victory more often than the others.”

Brutal. But very effective and the easiest “AI” to program. I guess. At least on a CPU.

So I implemented a simple Monte-Carlo (that I will abbreviate as “MC” now on) algorithm and launched it against my simple player. And sure enough, the MC player won, even if it was just playing 1000 games with random moves to decide which move to pick. And I’m not speaking about a “little better” here. The MC player won almost 70 games out of 100.

Then I thought: picking random moves is not really a likely outcome for a normal game, so the results for the MC player are somehow skewed. What if I use my simple player instead of a random player to play the games?

No surprise, using the simple player within the MC player improved the results against the simple player by quite a bit. But against the random MC player, the difference was not that big, it just improved it slightly, but not really significantly, only by 2.5 to 3% which is not bad but not as much as I would have expected. Note that running this simulation on a CPU is costing a lot of cycles. 1000 games with each roughly 15 to 20 moves. And at each turn, play 1000 games for every possible move (there are 1 to 6 possible moves). That’s a lot of moves to run! I’ve accelerated that to run on a GPU, but that will come later.

And still, the MC player lost to the German PhD’s carefully hand-crafted player. Well if a simple MC approach would do it, that’s what he would have picked in the first place!

This is it for this part. Next, we’ll start building a Neural Network with its associated Genetic Algorithm.

Part 3 is here.