Every time I have to deal with emojis in Node.js, I have to google the same things over and over again. So I decided to write this post to have a reference for myself and you, dear reader.
If you know precisely why `'😊'.length` equals 2, you can skip this post; otherwise, let's see it together.
What is an emoji?
An emoji is a pictogram, and emojis surround us in our daily life. But what is an emoji from a technical point of view?
It is a Unicode character. Unicode is a standard that defines a unique code point for every character, letter, symbol, emoji, etc. in the world. A code point is a number that uniquely identifies a character, and it is conventionally written in hexadecimal format: `U+<hex code>`.
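Node.js lets you move between characters and code points with `String.prototype.codePointAt` and `String.fromCodePoint`; a quick sketch:

```js
// Moving between characters and code points in Node.js (e.g. in the REPL):
'A'.codePointAt(0).toString(16);  // '41'    -> U+0041
'😊'.codePointAt(0).toString(16); // '1f60a' -> U+1F60A
String.fromCodePoint(0x1f60a);    // '😊'
```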
Let's see some examples:
| Character | Code points |
|---|---|
| A | U+0041 |
| 😊 | U+1F60A |
| 🤌🏼 | U+1F90C U+1F3FC |
| 👨‍👧‍👦 | U+1F468 U+200D U+1F467 U+200D U+1F466 |
As you can see, the character `A` and the 😊 emoji are represented by a single code point, while 🤌🏼 and 👨‍👧‍👦 have multiple code points, contrary to what I said in the introduction!
This is because some characters are special abstract symbols used to combine other characters to form a single one or change their properties.
In this case, we can have two types of characters:
- **Character modifiers**: a special character that modifies the character next to it. This is the case of the 🤌🏼 emoji, which is the combination of the 🤌 (U+1F90C) character and the 🏼 (U+1F3FC) modifier.
- **Combining characters**: a character that combines with other characters to form a single one or change its properties. This is the case of the 👨‍👧‍👦 emoji, which is the combination of: 👨 (U+1F468) + ZWJ (U+200D) + 👧 (U+1F467) + ZWJ (U+200D) + 👦 (U+1F466).
Let's see them in detail.
Character modifiers
The Unicode standard defines a set of abstract characters that can be placed next to another character to change its appearance or, for modifier letters, its pronunciation.
An example is changing an emoji's skin tone by adding a modifier right after the emoji itself.
The skin tones are represented by the five Fitzpatrick modifiers:
| Modifier | Code point |
|---|---|
| 🏻 | U+1F3FB |
| 🏼 | U+1F3FC |
| 🏽 | U+1F3FD |
| 🏾 | U+1F3FE |
| 🏿 | U+1F3FF |
So, if you print 🤌 (U+1F90C) followed by 🏿 (U+1F3FF), every Unicode-aware viewer will render the 🤌🏿 emoji.
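You can verify this in Node.js by concatenating the two code points yourself; a minimal sketch:

```js
const base = String.fromCodePoint(0x1f90c); // 🤌 pinched fingers
const tone = String.fromCodePoint(0x1f3ff); // 🏿 Fitzpatrick type-6 modifier

console.log(base + tone); // 🤌🏿 -> rendered as a single, darker emoji
```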
Combining characters
A combining character modifies the character immediately before it, producing what the reader perceives as a single new character.
This technique can be used to represent accents, diacritics, etc. An example is the ñ character, which is the combination of the n (U+006E) character and the combining tilde ◌̃ (U+0303), or the Japanese ぴ character, which is ひ (U+3072) followed by the combining handakuten ◌゚ (U+309A).
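A quick way to see this in Node.js (a small sketch):

```js
const composed = '\u00F1';  // ñ as a single, precomposed code point
const combined = 'n\u0303'; // n + combining tilde

console.log(combined);                               // ñ
console.log(composed === combined);                  // false -> different code points
console.log(composed === combined.normalize('NFC')); // true  -> NFC composes n + ◌̃ into ñ
```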
A special mention goes to the ZWJ (U+200D) zero-width joiner character, which is non-printable and is used to merge two or more characters into a single one. This is the case of the 👨‍👧‍👦 emoji, which is the combination of 👨 (U+1F468) + ZWJ + 👧 (U+1F467) + ZWJ + 👦 (U+1F466), as we read earlier: each character is joined to the one before it.
Note that combining emojis is not the only use of the ZWJ character: it also plays a role in cursive scripts such as Arabic and in Indic scripts, where placing U+200D between two characters influences how they are joined when rendered in a browser.
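You can build the family emoji yourself by joining the three emojis with ZWJ characters; a minimal sketch:

```js
const ZWJ = '\u200D'; // zero-width joiner

// 👨 (U+1F468) + ZWJ + 👧 (U+1F467) + ZWJ + 👦 (U+1F466)
const family = ['\u{1F468}', '\u{1F467}', '\u{1F466}'].join(ZWJ);

console.log(family); // 👨‍👧‍👦
```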
That was a long introduction to Unicode, and we have still only scratched the surface of a huge and complex standard. We did not even mention Unicode planes or the grapheme concept, but we now have enough knowledge to understand why `'😊'.length` equals 2.
Why '😊'.length is equal to 2
Before answering this question, we must understand how strings are represented in Node.js. So far, we have talked about Unicode characters as code points, but Node.js does not understand what a code point is. It only understands code units.
A code unit is the smallest unit of information that a system can manipulate. For example, in ASCII a code unit is a single byte, while in UTF-16 a code unit is a 16-bit word.
Node.js (and its underlying V8 engine) represents strings as sequences of UTF-16 code units, so one code unit is 16 bits long.
The final step is understanding how to convert a code point to a code unit.
Let's say we want to convert the 😊 (U+1F60A) emoji to its code units. To do it, we can write this low-level code:
```js
const codePoint = 'U+1F60A'; // 😊

// Convert the code point from the 'U+<hex>' format to an array of integers.
// Remember that a code point is just a number, conventionally written in hexadecimal.
const codePointAsInteger = codePoint
  .split('U+')
  .filter((c) => c)
  .map((hex) => parseInt(hex, 16));

// Convert the code point(s) to a string.
const codePointAsString = String.fromCodePoint(...codePointAsInteger);

// Convert the string to an array of code units.
const codeUnits = toCodeUnits(codePointAsString);

console.log({
  codePoint,
  codePointAsInteger,
  codePointAsString,
  stringLength: codePointAsString.length,
  codeUnits,
  codeUnitAsString: String.fromCharCode(...codeUnits),
  codeUnitsLength: codeUnits.length,
});

function toCodeUnits(string) {
  const codeUnits = [];
  for (let i = 0; i < string.length; i++) {
    // string[i] is a single code unit, so codePointAt(0) returns that
    // code unit's value (even when it is a lone surrogate).
    const char = string[i];
    codeUnits.push(char.codePointAt(0));
  }
  return codeUnits;
}
```
In the code above we are executing the following steps:

1. Convert the code point string to an integer array.
2. Convert the integer array to a string using `String.fromCodePoint`.
3. Convert the string to an array of code units.
4. Print the results.
Note that the code is not optimized, and it is not intended to be used in production.
The first step is to convert the code point string from the `U+<hex>` format to an integer array.
Note
The code point could also be declared as an integer, like this: `const codePoint = 0x1F60A;` (hexadecimal representation), or as a string with a Unicode code point escape: `const codePoint = '\u{1F60A}';`. However, I find it more interesting to show how to convert it from the `U+<hex>` notation.
The second step is to convert the integer array to a string using `String.fromCodePoint`. We did not use the `String.fromCharCode` method because it does not support Unicode code points greater than 0xFFFF (65535 in decimal).
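You can see the difference in the Node.js REPL; `String.fromCharCode` truncates the value to 16 bits, while `String.fromCodePoint` handles the full range:

```js
String.fromCodePoint(0x1f60a); // '😊'
String.fromCharCode(0x1f60a);  // '\uF60A' -> the value is truncated to 16 bits,
                               // yielding an unrelated character
```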
The third step is to convert the string to an array of code units. We did it by iterating over the string, taking one code unit at a time with `string[i]`, and calling the `String.prototype.codePointAt` method on it; on a one-code-unit slice this simply returns that code unit's value.
The last step should print the following output:
```
{
  codePoint: 'U+1F60A',
  codePointAsInteger: [128522],
  codePointAsString: '😊',
  stringLength: 2,
  codeUnits: [55357, 56842],
  codeUnitAsString: '😊',
  codeUnitsLength: 2
}
```
As we can see, the 😊 emoji is represented by two code units: 55357 and 56842.
The Node.js string length is equal to the number of its code units. So, if a string contains characters that require more than one code unit, its length will be greater than its number of characters.
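For example:

```js
'hi😊'.length; // 4 -> 2 code units for 'hi' + 2 code units for 😊
```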
Technical explanation
Previously we said that Node.js stores strings as sequences of 16-bit code units. This means that a single code unit can represent at most 65,536 different values. However, Unicode defines far more characters than that, so characters outside this range are encoded with a technique called surrogate pairs: every character that requires more than 16 bits to be represented is split into two 16-bit code units.
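For the curious, the surrogate pair for a given code point can be computed with a few bit operations; here is a minimal sketch for U+1F60A:

```js
function toSurrogatePair(codePoint) {
  // Only code points above 0xFFFF need a surrogate pair.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // high (leading) surrogate
  const low = 0xdc00 + (offset & 0x3ff); // low (trailing) surrogate
  return [high, low];
}

toSurrogatePair(0x1f60a); // [ 55357, 56842 ] -> 0xD83D, 0xDE0A
```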
Now, try to check the output for these code points:
```js
const codePoint = 'U+1F468U+200DU+1F467U+200DU+1F466'; // 👨‍👧‍👦
// or
const codePoint = 'U+1F90CU+1F3FC'; // 🤌🏼
```
You will be surprised that the string length is 8 and 4, respectively.
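You can also check it directly with string literals:

```js
'👨‍👧‍👦'.length; // 8 -> 3 emojis × 2 code units + 2 ZWJ × 1 code unit
'🤌🏼'.length;   // 4 -> 2 code points × 2 code units
```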
How to count the number of characters in a string
But how can we count the number of characters in a string, given that the `length` property counts code units?
A common solution is to use the `String.prototype.normalize` method to convert the string to a normalization form, and then use the spread operator to turn the string into an array of code points and count them:
```js
[...'n\u0303an\u0303an\u0303a'.normalize()].length; // 6 ('ñañaña' written with combining tildes)
[...'n\u0303an\u0303an\u0303a'].length; // 9
```
In this example, the ñ is represented by two code points, n (U+006E) + ◌̃ (U+0303), but the `normalize` method converts it to the single code point ñ (U+00F1).
Unfortunately, this does not work for all emojis:
```js
[...'😊'.normalize()].length; // 1
[...'🤌🏼'.normalize()].length; // It is still 2!!!
```
So we must adopt a different strategy: using `Intl.Segmenter` to split the string into grapheme segments and count the segments.
```js
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

[...segmenter.segment('🤌🏼')].length; // 1
```
Note
It is available in Node.js since version 16.0.0, and it is not supported by all browsers.
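If you need this in more than one place, you could wrap it in a small helper; a minimal sketch (the name `countGraphemes` is just my choice):

```js
function countGraphemes(string, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(string)) {
    count++;
  }
  return count;
}

countGraphemes('🤌🏼');    // 1
countGraphemes('👨‍👧‍👦'); // 1
countGraphemes('ñañaña');  // 6 (works for both composed and decomposed forms)
```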
You may choose the best solution for your use case.
Summary
In this article, we learned how to convert a Unicode code point to code units and why `String.length` is not equal to the number of characters in a string. I tried to explain the concepts in a simple and accessible way. If you want to learn more about Unicode and Node.js, I recommend reading these detailed articles:
- https://mathiasbynens.be/notes/javascript-unicode
- https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/
Other useful resources that I used to debug things are:
- https://github.com/node-unicode/node-unicode-data to get all the Unicode data in Node.js.
- https://codepoints.net/ to explore Unicode characters and code points.
- https://www.compart.com/en/unicode/ to explore Unicode planes and groups.
If you enjoyed this article, comment, share and follow me on Twitter!