High value Unicodes need special treatment

cymplecy · March 9, 2025, 1:59pm

While working on a utf-8 byte list -> Unicode converter I seem to have discovered an issue with large Unicodes

All Unicodes up to 0xffff (65535) work fine

and convert both ways

but e.g. 0x10000 (65536) doesn't

And this seems to be due to JavaScript encoding values above 0xffff as two 16-bit code units ("surrogate pairs")
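For readers who haven't met surrogate pairs, here is a minimal JavaScript sketch of the arithmetic involved. It is not the reporter from this post; the function names `toSurrogatePair` and `fromSurrogatePair` are made up for illustration.

```javascript
// Split a code point above 0xFFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Recombine a high/low surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const firstBigChar = String.fromCodePoint(0x10000);

console.log(toSurrogatePair(0x10000));        // [ 55296, 56320 ] i.e. 0xD800, 0xDC00
console.log(firstBigChar.length);             // 2 (two 16-bit code units)
console.log(firstBigChar.charCodeAt(0));      // 55296 (only the high surrogate)
console.log(firstBigChar.codePointAt(0));     // 65536 (the whole code point)

// String.fromCharCode keeps only the low 16 bits, so 0x10000 becomes U+0000.
console.log(String.fromCharCode(0x10000) === "\u0000"); // true
```

The built-in `String.fromCodePoint` and `codePointAt` do this pairing for you, while `String.fromCharCode` and `charCodeAt` work on single 16-bit units, which would explain why values up to 0xffff round-trip and 0x10000 does not.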

with a bit more searching

I came up with this to allow for it

and created a reporter

All scripts

bh · March 9, 2025, 5:57pm

That was an interesting article, although I boggled at "the English language letter (the meaning)" in the second paragraph of section 1. Letters aren't meanings. Letters don't even have meanings. Words have meanings, often. (Not always, because, for example, of metasyntactic variables, e.g., "foo.")

I'm surprised, in these days when memory is basically free, that they don't just use UTF-32 and forget about planes altogether. Although that wouldn't eliminate the need to handle combining forms when testing for equality. (And not just combining forms; glyphs with MATHEMATICAL in their names are often visually indistinguishable from the corresponding character without the MATHEMATICAL. And there's been a smiley face (U+263A U+FE0E) in Unicode from the beginning, long before they had emoji as a category, because it was in IBM's version of 8-bit ASCII. (Or maybe I mean Microsoft's version. Anyway, it was in DOS.))

Really there are several different kinds of equality testing. The article talks about the two ways to encode é in Unicode, but for some purposes, such as dictionary sorting, "café" and "cafe" should be considered equal, so that "caféx" sorts before "cafey." (Okay, here's a realistic example: "café" comes before "cafeteria" in the dictionary.) That's an issue that transcends the details of Unicode. The German "ß" is, for some purposes, equal to "ss" even though they don't look the same or have the same string length. (I think .normalize handles that one, but I wouldn't swear to it.) Similarly, there's the ongoing debate about whether lower case letters and upper case letters should be considered equal. (Hint: yes. )
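A rough JavaScript sketch of the different kinds of equality described above. The helper names `canonicallyEqual` and `looseKey` are hypothetical, and the "loose" key is only one possible approximation of dictionary-style matching, not a real collation algorithm. As far as I can tell, `normalize` leaves "ß" unchanged in every normalization form, while upper-casing turns it into "SS", so the sketch leans on case mapping for that case.

```javascript
const composed = "caf\u00E9";     // é as a single code point
const decomposed = "cafe\u0301";  // e followed by COMBINING ACUTE ACCENT

// Kind 1: code-unit equality. The two spellings of "café" differ.
console.log(composed === decomposed); // false

// Kind 2: canonical equivalence. Normalize both sides to the same form first.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}
console.log(canonicallyEqual(composed, decomposed)); // true

// Kind 3: a loose key that ignores accents and case (illustrative only).
// Decompose, drop combining marks, then upper-case and lower-case again
// so that "ß" ends up as "ss".
function looseKey(s) {
  return s.normalize("NFD").replace(/\p{M}/gu, "").toUpperCase().toLowerCase();
}
console.log(looseKey(composed));            // "cafe"
console.log(looseKey("Stra\u00DFe"));       // "strasse"
console.log("\u00DF".normalize("NFKC"));    // "ß" (normalization does not expand it)

// Sorting by the loose key puts "café" before "cafeteria".
const words = ["cafeteria", composed];
words.sort((a, b) => (looseKey(a) < looseKey(b) ? -1 : looseKey(a) > looseKey(b) ? 1 : 0));
console.log(words); // [ "café", "cafeteria" ]
```

For real sorting, a locale-aware collator (for example `Intl.Collator`, where available) is likely a better fit than a hand-rolled key like this one.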

And then there are the Delphic Unicode Consortium pronouncements. For example, they're very insistent that font styles such as italic and boldface aren't separate glyphs, except in MATHEMATICAL-land, where they are. And Elvish and Klingon characters aren't characters at all, but they have a non-Consortium semi-official position in the private use area. But emoji are characters. :~( There's a SUPERSCRIPT LATIN SMALL LETTER N (ⁿ) and digits, but not other letters, not even superscript k, which is almost as common in math as superscript n. And ligatures (ﬀ etc.), another font variation, are Unicode characters.

There's a LATIN SMALL LETTER CHI (ꭓ, U+AB53) and a GREEK SMALL LETTER CHI (χ, U+03C7). The Latin one is way far away from all the other Latin letters, even the weird ones such as ƻ (LATIN LETTER TWO WITH STROKE, U+01BB, the only digit with a stroke). Who uses chi in Latin or Latin-derived languages?

tl;dr: "looks the same" and "linguistically the same" are intersecting sets, but neither is a subset of the other.

P.S.: By the way,

:~(

specialred · March 9, 2025, 11:34pm

This is actually a pretty useful block! Too bad it (somehow) still doesn't support all characters...

...but it's still cool to see how easily you fixed the problem with high-value unicodes. I'm definitely going to be using this soon!

cymplecy · March 10, 2025, 8:09am

Oh dear
I don't think things like this (which I've found are called Variation Selectors) are going to be easily possible to deal with efficiently so I'm out of the Unicode game

dardoro · March 10, 2025, 10:19am

It's properly decoded as two codepoints.

☺, White Smiling Face, 0x263A,
followed by a VS15 (0xFE0E => 65038), a non-printable codepoint that, when used with emoji, should select the monochrome text variant, but for White Smiling Face does nothing.

There is no other way to express those 2 codepoints, as they are already Unicode'd

BTW:
The 0x1F642 Slightly Smiling Face is another codepoint.

Combined with the VS15
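To make the "two codepoints" point concrete, here is a small JavaScript sketch that lists the code points of a string. `codePointsOf` and `stripVariationSelectors` are illustrative names, not Snap! functions, and the stripping helper only covers the basic selectors VS1 to VS16 (U+FE00 to U+FE0F).

```javascript
// List the code points of a string, iterating by code point rather than
// by 16-bit code unit.
function codePointsOf(text) {
  return Array.from(text, ch => ch.codePointAt(0));
}

const textSmiley = "\u263A\uFE0E";          // White Smiling Face + VS15
const slightlySmiling = "\u{1F642}\uFE0E";  // Slightly Smiling Face + VS15

console.log(codePointsOf(textSmiley));      // [ 9786, 65038 ]   i.e. 0x263A, 0xFE0E
console.log(codePointsOf(slightlySmiling)); // [ 128578, 65038 ] i.e. 0x1F642, 0xFE0E

// Optionally drop the basic variation selectors before comparing strings.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE00-\uFE0F]/g, "");
}
console.log(codePointsOf(stripVariationSelectors(textSmiley))); // [ 9786 ]
```

Whether dropping the selector is acceptable depends on the use: it changes how some characters are meant to be displayed, so it is a choice for comparison, not for storage.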

There are commented parts of the Process.prototype.reportUnicode@threads.js to properly handle codepoints above 0xFFFF.
But JS handles "large" codepoints only in selected functions. Finding the text length, or splitting by letter, may break in Snap/JS for text containing them. So maybe an extended Unicode library should be created.
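A short JavaScript sketch of which built-ins count 16-bit code units and which iterate by code point. The helpers `codePointLength` and `letterAt` are hypothetical, and this is plain JavaScript, not the code in threads.js.

```javascript
const sample = "a\u{1F642}b"; // "a", Slightly Smiling Face, "b"

// These work on 16-bit code units, so the emoji counts twice
// and gets torn into two lone surrogates.
console.log(sample.length);            // 4
console.log(sample.split("").length);  // 4

// String iteration (spread, for...of, Array.from) works on code points.
console.log([...sample].length);       // 3

// Code-point-aware replacements for "length of text" and "letter i of text".
function codePointLength(text) {
  let count = 0;
  for (const _ch of text) count += 1;
  return count;
}

function letterAt(text, index) {       // index is 0-based here
  return [...text][index];
}

console.log(codePointLength(sample));              // 3
console.log(letterAt(sample, 1) === "\u{1F642}");  // true
```

Spreading a long string into an array on every lookup is slow, so a real library would probably cache the split or walk the string once.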

This problem usually arises when someone is trying to "compress" binary data into Unicode text for sharing/transmitting. MQTT or Web Services libraries have built-in support for binary data. For the other cases, some encoding should be used - say Base64 or Quoted-Printable.
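As a sketch of the Base64 route, assuming the global `btoa` and `atob` functions are available (they are in browsers and in recent Node.js versions), a byte list can be turned into plain ASCII text and back. The names `bytesToBase64` and `base64ToBytes` are made up for this example.

```javascript
// Encode a list of byte values (0..255) as Base64 text.
function bytesToBase64(bytes) {
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b & 0xFF); // one character per byte
  }
  return btoa(binary);
}

// Decode Base64 text back into a list of byte values.
function base64ToBytes(base64) {
  return Array.from(atob(base64), ch => ch.charCodeAt(0));
}

const hiBytes = [72, 105];                       // ASCII "Hi"
console.log(bytesToBase64(hiBytes));             // "SGk="

const smileyUtf8 = [0xF0, 0x9F, 0x99, 0x82];     // UTF-8 bytes of U+1F642
const encoded = bytesToBase64(smileyUtf8);
console.log(base64ToBytes(encoded));             // [ 240, 159, 153, 130 ]
```

The result only uses ASCII letters, digits, `+`, `/` and `=`, so it survives any text channel without touching surrogate pairs, at the cost of growing the data by about a third.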

bh · March 10, 2025, 7:31pm

Oh. So in other words it's MacOS's fault for giving me an unnecessary extra codepoint when I clicked on the smiley. :~( Thanks for the explanation!

bh · March 15, 2025, 7:50pm

Sorry for replying to myself, but I've just figured out why MacOS does that. It's to compensate for the other even more annoying way they mess up Unicode, namely to replace certain geometric glyphs (such as ︎ iirc) with colored keycap-like emoji. They use the monochrome text variant to prevent themselves doing that to the smiley.

cymplecy · May 5, 2025, 2:52pm

Just a little update - reporter now takes a list as well as text as its input
Just came in handy for me in another project

cymplecy · May 17, 2025, 7:58am

The standard unicode block has been updated in the current dev version so my modified reporter is no longer needed

cycomachead · May 20, 2025, 11:14pm

Sorry, yeah, the unicode thing was fixed last week when we were all together, but should have been fixed years ago... oh well.

To be clear, Snap! now does the mostly right thing. Multi-byte characters are correctly handled, but higher order items (like emoji with skin tones, and things which use a 'zero width joiner' character) are still split into multiple items. In native JS, there's not a good way to get "perfect" unicode representation without additional libraries.
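A sketch of the splitting described above, in plain JavaScript rather than Snap! code. It also tries `Intl.Segmenter`, a built-in that recent browsers and Node.js versions provide for grouping code points into user-perceived characters. Its availability is an assumption, so the hypothetical `graphemes` helper falls back to plain code points when it is missing, and its results depend on the Unicode data shipped with the engine.

```javascript
const thumbsUp = "\u{1F44D}\u{1F3FD}";                    // thumbs up + skin tone modifier
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man, ZWJ, woman, ZWJ, girl

// Splitting by code point handles surrogate pairs but still breaks these apart.
console.log([...thumbsUp].length); // 2
console.log([...family].length);   // 5

// Group code points into grapheme clusters where the engine supports it.
function graphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return [...text]; // fallback: code points only
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), part => part.segment);
}

console.log(graphemes(thumbsUp).length); // 1 where Intl.Segmenter is supported
console.log(graphemes(family).length);   // 1 where Intl.Segmenter is supported
```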