Unicode Width System

This page explains how the library determines whether a Unicode code point occupies zero, one, or two terminal cells. It covers the generated lookup table, the generator script, and the design tradeoffs behind the current model.

Relevant Source Files

You will find the implementation primarily in:

  • src/erbsland/cterm/impl/UnicodeWidth.hpp

  • src/erbsland/cterm/impl/UnicodeWidth.cpp

  • src/erbsland/cterm/impl/UnicodeWidthTable.cpp

  • utilities/generate_unicode_width_table.py

  • utilities/data/unicode/EastAsianWidth.txt

  • utilities/data/unicode/UnicodeData.txt

  • src/erbsland/cterm/Char.cpp

  • src/erbsland/cterm/String.cpp

Why the Library Uses Its Own Width System

Terminal layout depends on cell width, not on byte count and not on Unicode code-point count. You will notice that immediately when text contains:

  • East Asian wide characters

  • combining marks

  • zero-width format characters

  • mixed scripts within the same paragraph

The library therefore does not rely on simple ASCII assumptions when wrapping text, placing the cursor, or calculating buffer sizes.

Instead, it uses an internal width model with three explicit values:

  • 0 for zero-width code points

  • 1 for ordinary narrow characters

  • 2 for wide or full-width characters

The main reasons for implementing this internally are:

  • reproducible behavior across supported platforms

  • no dependence on platform-specific wcwidth() behavior

  • no need for runtime Unicode database parsing

  • fast lookup during rendering

Where Width Information Is Used

The width system is not an isolated utility. It directly affects higher-level behavior throughout the library:

  • Char::displayWidth() sums the width of the stored base code point and any attached combining marks.

  • String::displayWidth() sums the width of all Char values.

  • String::splitCharacters() decides whether a code point becomes its own Char, attaches to the previous character as a combining mark, or is dropped.

  • Paragraph wrapping, buffer painting, cursor movement, and text alignment all depend on these widths.

If the width table changes, a significant part of the visible terminal layout changes with it.

The Runtime Lookup Format

At runtime, the library uses CharWidth entries with these fields:

  • begin: the first code point in the range

  • end: the last code point in the range

  • width: the terminal-cell width for the whole inclusive range

charWidthTable() returns a std::span<const CharWidth> over a generated static array. The generated file currently contains 925 contiguous ranges.

The table is intentionally stored as ranges rather than one entry per code point:

  • the Unicode code space contains 1,114,112 code points (U+0000 through U+10FFFF)

  • most adjacent code points share the same width

  • a range table is much smaller while still allowing fast lookup
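The size argument can be checked with a little arithmetic. The sketch below assumes 12 bytes per CharWidth entry (two 32-bit code points plus a padded width field); that entry size is an assumption for illustration, not a figure taken from the library.

```python
import math

CODE_POINT_COUNT = 0x110000   # U+0000 through U+10FFFF
RANGE_COUNT = 925             # range count quoted in this section
ASSUMED_ENTRY_BYTES = 12      # assumption: 2 x 32-bit code point + padded width

# A dense table needs one byte per code point.
dense_bytes = CODE_POINT_COUNT

# A range table needs one entry per contiguous run of equal widths.
range_bytes = RANGE_COUNT * ASSUMED_ENTRY_BYTES

# Binary search halves the candidate set on every probe.
max_probes = math.ceil(math.log2(RANGE_COUNT))

print(dense_bytes, range_bytes, max_probes)
```

Under that assumption the range table is roughly a hundredth of the dense size, and a lookup needs about ten probes.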

How Lookup Works

consoleCharacterWidth() performs a binary search with std::ranges::lower_bound(), using the end field of each range as the search key.

The lookup logic is:

  1. Find the first range whose end is not smaller than the requested code point.

  2. Check whether the code point is also greater than or equal to that range’s begin.

  3. If yes, return that range’s width.

  4. Otherwise, return 1 as a defensive fallback.

In practice, the generated table covers every code point from U+0000 through U+10FFFF, so the fallback should never be reached. It exists to keep the lookup function safe if table generation ever changes unexpectedly.
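The lookup steps can be modeled in a few lines of Python. This is a sketch of the logic only, not the C++ implementation: bisect_left plays the role of std::ranges::lower_bound(), and the miniature table is made up for illustration (it deliberately leaves gaps so the fallback path is exercised).

```python
from bisect import bisect_left
from typing import NamedTuple


class CharWidth(NamedTuple):
    begin: int  # first code point in the range
    end: int    # last code point in the range (inclusive)
    width: int  # terminal-cell width for the whole range


# Hypothetical miniature table, sorted by range; not real generated data.
TABLE = [
    CharWidth(0x0000, 0x001F, 0),
    CharWidth(0x0020, 0x007E, 1),
    CharWidth(0x0300, 0x036F, 0),
    CharWidth(0x4E00, 0x9FFF, 2),
]
ENDS = [entry.end for entry in TABLE]  # search key: the end of each range


def console_character_width(code_point: int) -> int:
    # Step 1: first range whose end is not smaller than the code point.
    index = bisect_left(ENDS, code_point)
    # Steps 2 and 3: the range matches only if it also starts at or before it.
    if index < len(TABLE) and TABLE[index].begin <= code_point:
        return TABLE[index].width
    # Step 4: defensive fallback for gaps or values past the last range.
    return 1
```

With the real table the two fallback cases (a gap between ranges, or a value past the last range) should not occur, because the generated ranges are contiguous.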

How the Table Is Built

The table is generated by utilities/generate_unicode_width_table.py. The script does not download data at build time. Instead, it uses checked-in Unicode Character Database snapshots from utilities/data/unicode.

This is an intentional design choice:

  • builds remain reproducible

  • contributors can review data updates in version control

  • packaging the library does not depend on network access

Step 1: Start with Width 1 Everywhere

load_widths() begins with a bytearray of length 0x110000. Every entry is initialized to 1.

That means:

  • every code point from U+0000 through U+10FFFF, surrogates included, has an explicit slot

  • unassigned code points default to narrow width

  • later passes only need to overwrite exceptional cases
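A minimal sketch of that starting state, assuming nothing beyond what is described here. The function name new_width_array is hypothetical; the real script does this inside load_widths().

```python
CODE_POINT_COUNT = 0x110000  # one slot per code point, U+0000..U+10FFFF


def new_width_array() -> bytearray:
    # Every slot starts at width 1; later passes overwrite the exceptions.
    return bytearray([1]) * CODE_POINT_COUNT


widths = new_width_array()
```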

Step 2: Apply East Asian Wide and Full-Width Ranges

The script reads EastAsianWidth.txt and only reacts to the classes W and F.

For each matching entry:

  • the script parses either a single code point or a begin..end range

  • it overwrites those slots with width 2

This means the library intentionally treats:

  • East Asian wide characters as double-width

  • East Asian full-width characters as double-width

  • everything else as narrow (width 1) unless a later pass changes it to zero-width

Notably, the script does not special-case the ambiguous A class. Those code points remain at width 1.
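The pass can be sketched as follows. The function name and the sample lines are illustrative; the sample follows the "code point or range; class" layout of EastAsianWidth.txt, but spacing and trailing comments in the real file may differ, which is why the parser strips both.

```python
SAMPLE_EAST_ASIAN_WIDTH = """\
# comment lines and blank lines are skipped
0020;Na
00A1;A
1100..115F;W
3000;F
4E00..9FFF;W
"""


def apply_east_asian_width(widths: bytearray, lines) -> None:
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        field, width_class = (part.strip() for part in line.split(";")[:2])
        if width_class not in ("W", "F"):
            continue  # Na, N, H and the ambiguous class A stay at width 1
        first, _, last = field.partition("..")
        begin = int(first, 16)
        end = int(last, 16) if last else begin  # single code point or range
        widths[begin:end + 1] = b"\x02" * (end - begin + 1)


widths = bytearray([1]) * 0x110000
apply_east_asian_width(widths, SAMPLE_EAST_ASIAN_WIDTH.splitlines())
```

After this pass, U+00A1 (class A) is still width 1, while U+3000 and the two W ranges are width 2.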

Step 3: Apply Zero-Width General Categories

The script then reads UnicodeData.txt and inspects the general category of each code point.

It sets width 0 for:

  • all mark categories (Mn, Mc, and Me), that is, every category string that starts with M

  • Cc control characters

  • Cf format characters

This pass runs after the East Asian pass and overrides its assignments. That override is important because a code point classified as a mark or formatting character must remain zero-width even if EastAsianWidth.txt classifies it as W or F.
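A sketch of the category pass, under stated assumptions: the function names are hypothetical, and the sample lines are trimmed to the first three semicolon-separated fields of UnicodeData.txt (code point, name, general category). The sketch also ignores the First/Last range pairs in that file.

```python
SAMPLE_UNICODE_DATA = """\
0009;<control>;Cc
0041;LATIN CAPITAL LETTER A;Lu
0301;COMBINING ACUTE ACCENT;Mn
200D;ZERO WIDTH JOINER;Cf
3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn
"""


def is_zero_width_category(category: str) -> bool:
    # Marks (Mn, Mc, Me), control characters, and format characters.
    return category.startswith("M") or category in ("Cc", "Cf")


def apply_zero_width_categories(widths: bytearray, lines) -> None:
    for raw in lines:
        fields = raw.strip().split(";")
        if len(fields) < 3:
            continue  # not a data line
        if is_zero_width_category(fields[2]):
            widths[int(fields[0], 16)] = 0  # overrides any earlier value


widths = bytearray([1]) * 0x110000
widths[0x3099] = 2  # pretend the East Asian pass already marked this wide
apply_zero_width_categories(widths, SAMPLE_UNICODE_DATA.splitlines())
```

The slot for U+3099 shows the override: it was set to 2 before the pass and ends at 0 because its category is Mn.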

Step 4: Compress into Contiguous Ranges

After the per-code-point byte array is complete, compress_widths() scans it from 0 to 0x10FFFF and groups adjacent entries that share the same width.

Each output tuple contains:

  • the begin code point

  • the end code point

  • the width

This turns the dense byte array into a compact range representation.
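The compression is a plain run-length grouping. The following sketch shows one way to write it; it is not the script's actual compress_widths() body, and the name compress is hypothetical.

```python
def compress(widths) -> list[tuple[int, int, int]]:
    """Group adjacent equal widths into (begin, end, width) tuples."""
    ranges = []
    begin = 0
    for cp in range(1, len(widths) + 1):
        # Close the current run at the end of input or when the width changes.
        if cp == len(widths) or widths[cp] != widths[begin]:
            ranges.append((begin, cp - 1, widths[begin]))
            begin = cp
    return ranges


example = compress(bytearray([0, 0, 1, 1, 1, 2, 1]))
print(example)  # [(0, 1, 0), (2, 4, 1), (5, 5, 2), (6, 6, 1)]
```

Because every slot belongs to exactly one run, the output ranges are contiguous and sorted, which is what the binary search at runtime relies on.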

Step 5: Emit the Generated C++ Table

write_output() writes src/erbsland/cterm/impl/UnicodeWidthTable.cpp. The generated file contains:

  • a static constexpr std::array<CharWidth, N>

  • one CharWidth{begin, end, width} entry per compressed range

  • the charWidthTable() accessor returning the array as a span

Because the generator emits ordinary C++ source, the runtime code has no parser, allocator, or file-I/O dependency for width lookup.
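To illustrate the emit step, here is a sketch that renders compressed ranges as C++ source text. The array name cTable, the hex formatting, and the overall layout are assumptions; the real write_output() may format the file differently and also writes the charWidthTable() accessor.

```python
def render_table(ranges) -> str:
    """Render (begin, end, width) tuples as a C++ std::array definition."""
    lines = [
        # 'cTable' is a hypothetical name chosen for this sketch.
        f"static constexpr std::array<CharWidth, {len(ranges)}> cTable = {{"
    ]
    for begin, end, width in ranges:
        lines.append(f"    CharWidth{{0x{begin:06X}, 0x{end:06X}, {width}}},")
    lines.append("};")
    return "\n".join(lines) + "\n"


source = render_table([(0x0000, 0x001F, 0), (0x0020, 0x007E, 1)])
print(source)
```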

How This Integrates with Char and String

The width table is only useful because the higher-level text types are built around it.

String Construction

String::appendCodePoint() uses the width system when raw Unicode input is converted into terminal characters.

The rules are:

  • TAB and newline remain standalone Char values.

  • Other control characters are ignored.

  • Width-0 code points are attached to the previous Char via withCombining() if one exists.

  • A leading width-0 code point is dropped because there is no base character it could modify.

  • Width-1 and width-2 code points become their own Char.

This design means String is already normalized into renderable terminal cells before the layout code sees it.
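The rules above can be modeled in Python as a rough sketch. Several things here are assumptions: each Char is represented as a plain list of code points, Python's unicodedata module stands in for the generated width table (so results depend on the Unicode version of your Python build), and the limit of two combining marks is not modeled.

```python
import unicodedata


def stand_in_width(cp: int) -> int:
    # Stand-in for consoleCharacterWidth(), following the generator's rules.
    ch = chr(cp)
    category = unicodedata.category(ch)
    if category.startswith("M") or category in ("Cc", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def append_code_point(chars: list, cp: int) -> None:
    if cp in (0x09, 0x0A):                      # TAB and newline stay standalone
        chars.append([cp])
    elif unicodedata.category(chr(cp)) == "Cc":
        pass                                    # other control characters: ignored
    elif stand_in_width(cp) == 0:
        if chars:
            chars[-1].append(cp)                # attach to the previous character
        # else: leading zero-width code point, dropped
    else:
        chars.append([cp])                      # width 1 or 2: its own character


def split_characters(text: str) -> list:
    chars = []
    for ch in text:
        append_code_point(chars, ord(ch))
    return chars
```

For example, split_characters("e\u0301") yields one character holding both code points, while split_characters("\u0301a") yields only the character for "a".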

Char Width Calculation

Char stores up to three code points:

  • one base code point

  • up to two attached combining code points

Char::displayWidth() sums the width of those stored code points and caches the result. In the normal case, that produces:

  • 1 for an ordinary character

  • 2 for a wide character

  • 1 or 2 for a base character with attached combining marks

  • 0 only for explicitly constructed zero-width characters

Since combining marks are expected to have width 0, attaching them normally does not change the width of the base character.
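A small Python model of this behavior, for illustration only. The class below is not the library's Char; the width values are a hand-written stand-in for the generated table, and raising an error for more than two combining code points is an assumption, since the handling of that case is not described here.

```python
# Hypothetical stand-in for the width table; anything not listed is width 1.
SAMPLE_WIDTHS = {0x0301: 0, 0x0308: 0, 0x200D: 0, 0x4E2D: 2}


def width_of(cp: int) -> int:
    return SAMPLE_WIDTHS.get(cp, 1)


class Char:
    MAX_COMBINING = 2  # one base code point plus up to two followers

    def __init__(self, base: int, combining=()):
        if len(combining) > self.MAX_COMBINING:
            # Assumption: the real class may handle overflow differently.
            raise ValueError("at most two combining code points")
        self.code_points = (base, *combining)
        self._width = None  # filled in on first use

    def display_width(self) -> int:
        if self._width is None:
            self._width = sum(width_of(cp) for cp in self.code_points)
        return self._width


plain = Char(0x65)                      # "e"
accented = Char(0x65, (0x0301,))        # "e" + combining acute accent
wide = Char(0x4E2D)                     # a wide CJK ideograph
zero = Char(0x200D)                     # explicitly constructed zero-width
```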

calculateDisplayWidth()

The helper calculateDisplayWidth(std::string_view) provides a lower-level UTF-8 path. It decodes raw UTF-8 with U8Buffer and sums consoleCharacterWidth() for each decoded code point directly.

This function is useful when you need width information for raw text without first constructing a full String object.
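The same idea in Python, as a sketch: decode UTF-8 and sum a per-code-point width. Python's unicodedata module stands in for consoleCharacterWidth() here, which is an assumption; its data may come from a different Unicode version than the library's table.

```python
import unicodedata


def stand_in_width(ch: str) -> int:
    # Stand-in for consoleCharacterWidth(), following the generator's rules.
    category = unicodedata.category(ch)
    if category.startswith("M") or category in ("Cc", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def calculate_display_width(data: bytes) -> int:
    # Decode raw UTF-8, then sum the width of every decoded code point.
    return sum(stand_in_width(ch) for ch in data.decode("utf-8"))


sample = "日本語 e\u0301".encode("utf-8")
print(len(sample), len(sample.decode("utf-8")), calculate_display_width(sample))
```

The sample is 13 bytes and 6 code points long but occupies 8 cells: three wide ideographs, a space, and "e" with a zero-width combining accent.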

Design Rationales

Several aspects of the current system are deliberate simplifications.

Only Three Width Classes

The terminal model used here only distinguishes widths 0, 1, and 2. That matches the practical needs of terminal-cell layout.

The implementation does not attempt to model:

  • fractional widths

  • fonts with unusual glyph metrics

  • terminal-specific rendering quirks beyond the basic narrow-versus-wide distinction

Ambiguous Width Defaults to Narrow

Only East Asian width classes W and F become width 2. Ambiguous characters remain width 1.

This keeps behavior stable across non-CJK terminals and avoids locale-dependent layout changes. The tradeoff is that some terminals configured for East Asian ambiguous-width behavior may render a few characters wider than the library predicts.

No Runtime Terminal Probing

The library does not ask the terminal emulator which Unicode version or width rules it uses. That would make rendering behavior highly host-dependent and would complicate reproducibility.

Instead, the library chooses a fixed, reviewed width table and applies it consistently.

Zero-Width Format Characters Are Treated as Non-Rendering

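The Cf category is forced to width 0. That includes characters such as the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C). Variation selectors also end up at width 0, but through the mark rule, because their general category is Mn rather than Cf.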

The rationale is straightforward: formatting code points should not consume visible columns in terminal layout. The tradeoff is that some terminal emulators interpret certain format characters as part of emoji shaping or other grapheme-cluster behavior that this library does not model fully.

Limitations

If you work on advanced Unicode behavior, keep these current limits in mind.

No Full Grapheme-Cluster Engine

The library does not implement Unicode grapheme-cluster segmentation. It uses a simpler model based on one base character plus up to two zero-width followers.

Consequences include:

  • some complex emoji sequences are not treated as one visual unit

  • zero-width joiner sequences are not shaped as a single cluster by the layout code

  • characters requiring more than three code points cannot be stored in one Char

This is a conscious tradeoff in favor of predictable, lightweight terminal rendering.

Leading Combining Marks Are Dropped

When a width-0 code point appears at the beginning of a string, String::appendCodePoint() drops it instead of storing a detached combining mark.

That keeps the internal text model renderable in cell terms, but it also means malformed or intentionally unusual Unicode sequences are normalized aggressively.

Unassigned Code Points Default to Width 1

Because the generator starts with width 1 everywhere, currently unassigned code points are treated as narrow by default.

This is pragmatic:

  • it keeps the table complete and explicit

  • it avoids a separate “unknown width” state

  • it keeps the runtime code simple

The tradeoff is that future Unicode assignments are only reflected after updating the checked-in data files and regenerating the table.

Updating the Table

When you want to refresh the Unicode data, the intended workflow is:

  1. Replace the checked-in UCD snapshot files in utilities/data/unicode.

  2. Run utilities/generate_unicode_width_table.py.

  3. Review the generated changes in src/erbsland/cterm/impl/UnicodeWidthTable.cpp.

  4. Re-run the relevant tests and documentation checks.

Keeping the generator simple and the inputs checked in makes this update path easy to audit.