Blog / Strings were a mistake

All but the most junior programmers know to avoid working with dates and times at all costs, but I have come to think that the same is true for strings. They might even be worse than dates!

What's wrong with strings?

Just as with dates and time, most programmers know about some of the pitfalls associated with strings. Even if we can assume that everything is UTF-8 nowadays, working with strings is still a minefield. Starting from the basics, one or more code points form graphemes like Å, 한 and 👍🏿. “Character” is not really a useful concept. If you ask your programming language of choice to determine any string's “length”, the answer will depend on the programming language and the way the string is represented. The following is JavaScript's take:

"한".length; // > 1
"Å".length; // > 2
"👍🏿".length; // > 4
123
"한".length; // > 1
"Å".length; // > 2
"👍🏿".length; // > 4

The black thumbs up emoji is built up from the comparatively simple emoji modifier sequence 👍 + 🏿 and Å in this case is composed of a plain A and the combining character U+030A. There is also a pre-composed code point for Å that has a “length” of 1... at least in JavaScript. Rust instead thinks the length of Å is 2 and the length of Å is 3. This makes some sense if you keep all the details in your head 24/7 while possibly context-switching between multiple programming languages, but the edge cases simply never end. You can get bitten by similar-looking characters or find yourself trying to render text, only to recognize that that's actually an impossible task. Have fun rendering the first “Character” in مكتبة in a different color than the rest!

Nothing is ever simple, fixed-width CJK edition

When starting (the third rewrite of) Code.Movie, I knew that anything related to text layout requires a Knuth-level genius or reckless use of LLMs. I therefore resolved to stick with UTF-8 and fixed-width fonts, but this still left ample room for failure:

Highlighted JS code rendered badly. Closing quotes for strings overlap the last few charcters of the string, as do statement-terminating semicolons.
Pretty serious character placement bugs when Chinese, Japanese, and Korean characters get involved.

Code.Movie performs all its operations on a grid and obviously has (had) a problem correctly placing several innocent-looking tokens. Followers on social media informed me about the fact that Chinese, Japanese, and Korean characters may take up two columns in fixed-width fonts and pointed me to wcswidth. This ancient C program is ostensibly the gold standard for determining the number of columns needed for a fixed-size wide-character string, and several JavaScript ports are available. All of them either choke on emoji modifier sequences (e.g. they split 👍🏿 into 👍 and 🏿 and claim the required space is 4 columns), return 2 for single-column characters like 𝛥, or both. The string-width package does not advertise itself as related to wcwidth or wcswidth and is therefore difficult to find if your searches are laser-focused on JS of or improvements upon wcwidth and wcswidth. While this package does (as far as I can tell) a terrific job at guessing the right number of columns, it is also really slow - at least compared to the 1:1 JS ports of wcwidth and wcswidth or to just ignoring the problem. But because this is JavaScript, there are always more packages to try! fast-string-width appears to do the same job as string-width, but is indeed faster. The tradeoffs that lead to this performance improvement are documented in a closed issue in the bug tracker for... string-width.

To be clear, this rollercoaster ride only happens because strings are insanely complicated. Neither the JS ecosystem nor any of the authors of any of the aforementioned packages do anything wrong - it is simply insanely complicated to figure out how many columns a sequence of unicode code points will take up when rendered in a fixed-width font. But it gets worse!

What about emojis?

Emojis are basically unpredictable. Unicode states that emoji presentation sequences behave as though they were East Asian Wide, and they should therefore take up two columns. But how wide they visually appear depends on the font being used, and that is if the font supports the latest and most sophisticated emoji. If not, ZWJ Sequences may be rendered using multi-emoji fallbacks. “Face exhaling” may indeed render as a face exhaling (😮‍💨), in which case it should take up something close to two columns. If the emoji font in use does not know how to weld the sequence U+1F62E + U+200D + U+1F4A8 into a single image, the fallback not only conjures up a process that is similar to, but not quite, exhaling (😮💨), but it will also take up something around of four columns.

For the purposes of Code.Movie, this presents a choice between multiple bad options:

  1. Just assume that emoji will have a size close to two columns and accept the fact that they won't fit perfectly (depending on the visual size) and that fallbacks for ZWJ sequences will ruin everything
  2. Ship a gargantuan emoji font (Noto Color Emoji clocks in at around 24 MB) and adjust the visual size using size-adjust
  3. Introduce a new output option that adjusts the number of columns allocated to emoji, which will only help if it matches the visual size of whatever emoji font gets used

Since option 1 includes the possibility that everything gets ruined and option 3 would mean a large-ish amount of work on my side, Code.Movie as of the new version 0.0.19 sticks to option 2. Emojis get two columns, and size adjustments, if required, can be applied via CSS. This aligns with unicode recommendations and provides some flexibility without too much overhead

Other changes in Code.Movie 0.0.18 and 0.0.19

There are two new releases since I screwed up 0.0.18. Changes include:

  • Breaking theme object changes in preparation for the upcoming release of actual decoration support. See the changelog for details.
  • A new styling hook --line-numbers-width reflects the width of the line number column (if line numbers are enabled)
  • ECMAScript: add proper highlighting support for generator functions and associated features, add proper highlighting support for the typeof operator, the ?. operator the as keyword (in both TS and JS contexts), fix several omissions around operators and literals like Infinity
  • CSS: add proper highlighting support for @-rules
  • Document typographic edge cases

What's next?

As promised previously, the next proper feature should be code decorations (underlines, highlighted lines, gutter icons and the like). This should get released fairly soon, unless another rabbit hole opens up. If you have other requests for languages or features, open an issue or contact me via email or Mastodon.

Other posts