Unicode Width System

This page explains how the library determines whether a Unicode code point occupies zero, one, or two terminal cells. It covers the generated lookup table, the generator script, and the design tradeoffs behind the current model.

Relevant Source Files

You will find the implementation primarily in:

src/erbsland/cterm/impl/UnicodeWidth.hpp
src/erbsland/cterm/impl/UnicodeWidth.cpp
src/erbsland/cterm/impl/UnicodeWidthTable.cpp
utilities/generate_unicode_width_table.py
utilities/data/unicode/EastAsianWidth.txt
utilities/data/unicode/UnicodeData.txt
src/erbsland/cterm/Char.cpp
src/erbsland/cterm/String.cpp

Why the Library Uses Its Own Width System

Terminal layout depends on cell width, not on byte count and not on Unicode code-point count. You will notice that immediately when text contains:

East Asian wide characters
combining marks
zero-width format characters
mixed scripts within the same paragraph

The library therefore does not rely on simple ASCII assumptions when wrapping text, placing the cursor, or calculating buffer sizes.

Instead, it uses an internal width model with three explicit values:

0 for zero-width code points
1 for ordinary narrow characters
2 for wide or full-width characters

The main reasons for implementing this internally are:

reproducible behavior across supported platforms
no dependence on platform-specific wcwidth() behavior
no need for runtime Unicode database parsing
fast lookup during rendering

Where Width Information Is Used

The width system is not an isolated utility. It directly affects higher-level behavior throughout the library:

Char::displayWidth() sums the width of the stored base code point and any attached combining marks.
String::displayWidth() sums the width of all Char values.
String::splitCharacters() decides whether a code point becomes its own Char, attaches to the previous character as a combining mark, or is dropped.
Paragraph wrapping, buffer painting, cursor movement, and text alignment all depend on these widths.

If the width table changes, wrapping, alignment, and cursor positions can all change with it.

The Runtime Lookup Format

At runtime, the library uses CharWidth entries with these fields:

begin: the first code point in the range
end: the last code point in the range
width: the terminal-cell width for the whole inclusive range

charWidthTable() returns a std::span<const CharWidth> over a generated static array. The generated file currently contains 925 contiguous ranges.

The table is intentionally stored as ranges rather than one entry per code point:

the Unicode code space contains 1,114,112 code points (U+0000 through U+10FFFF)
most adjacent code points share the same width
a range table is much smaller while still allowing fast lookup

How Lookup Works

consoleCharacterWidth() performs a binary search with std::ranges::lower_bound(), using the end field of each range as the search key.

The lookup logic is:

Find the first range whose end is not smaller than the requested code point.
Check whether the code point is also greater than or equal to that range’s begin.
If yes, return that range’s width.
Otherwise, return 1 as a defensive fallback.

In practice, the generated table covers the full Unicode range, so the fallback should normally never be used. It exists to keep the lookup function safe if table generation ever changes unexpectedly.
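
The lookup steps above can be modeled in a few lines of Python. This is only a sketch of the logic, not the library's C++ code: the table excerpt, the names TABLE and console_character_width, and the use of bisect_left in place of std::ranges::lower_bound() are all illustrative assumptions.

```python
from bisect import bisect_left

# Hypothetical excerpt of inclusive (begin, end, width) ranges.
# The real generated table covers U+0000..U+10FFFF; this one stops early
# on purpose so that the fallback path can be exercised.
TABLE = [
    (0x0000, 0x001F, 0),  # C0 control characters
    (0x0020, 0x007E, 1),  # printable ASCII
    (0x007F, 0x009F, 0),  # DEL and C1 control characters
    (0x00A0, 0x00AC, 1),
    (0x00AD, 0x00AD, 0),  # soft hyphen (Cf)
    (0x00AE, 0x02FF, 1),
    (0x0300, 0x036F, 0),  # combining diacritical marks
]
ENDS = [end for _, end, _ in TABLE]  # search key: the end of each range


def console_character_width(code_point):
    # Find the first range whose end is not smaller than the code point.
    index = bisect_left(ENDS, code_point)
    if index < len(TABLE):
        begin, _, width = TABLE[index]
        # The code point must also lie at or after the range's begin.
        if code_point >= begin:
            return width
    return 1  # defensive fallback


print(console_character_width(0x41))    # inside the printable ASCII range
print(console_character_width(0x0301))  # inside the combining marks range
print(console_character_width(0x4E16))  # beyond the toy table: fallback
```

Searching on the end value means a single comparison against begin is enough to confirm the hit, because the ranges are sorted and do not overlap.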

How the Table Is Built

The table is generated by utilities/generate_unicode_width_table.py. The script does not download data at build time. Instead, it uses checked-in Unicode Character Database snapshots from utilities/data/unicode.

This is an intentional design choice:

builds remain reproducible
contributors can review data updates in version control
packaging the library does not depend on network access

Step 1: Start with Width 1 Everywhere

load_widths() begins with a bytearray of length 0x110000. Every entry is initialized to 1.

That means:

every code point from U+0000 through U+10FFFF, including the surrogate range, has an explicit slot
unassigned code points default to narrow width
later passes only need to overwrite exceptional cases
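
A minimal sketch of this starting state, assuming nothing beyond what is described above (the names UNICODE_CODE_POINTS and initial_widths are hypothetical and not taken from the script):

```python
UNICODE_CODE_POINTS = 0x110000  # one slot per code point, U+0000..U+10FFFF


def initial_widths():
    # Every slot starts at width 1; later passes overwrite the exceptions.
    return bytearray([1]) * UNICODE_CODE_POINTS


widths = initial_widths()
print(len(widths), widths[0x41], widths[0x10FFFF])
```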

Step 2: Apply East Asian Wide and Full-Width Ranges

The script reads EastAsianWidth.txt and only reacts to the classes W and F.

For each matching entry:

the script parses either a single code point or a begin..end range
it overwrites those slots with width 2

This means the library intentionally treats:

East Asian wide characters as double-width
East Asian full-width characters as double-width
everything else as not wide unless a later pass changes it to zero-width

Notably, the script does not special-case the ambiguous A class. Those code points remain at width 1.
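
The pass described above could look roughly like the following sketch. The function name apply_east_asian_width and the sample lines are assumptions for illustration; the sample follows the general `range;class # comment` layout of EastAsianWidth.txt but is not copied from the checked-in snapshot.

```python
def apply_east_asian_width(lines, widths):
    for raw in lines:
        # Drop the trailing comment and skip blank or comment-only lines.
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        code_range, width_class = (part.strip() for part in entry.split(";"))
        if width_class not in ("W", "F"):
            continue  # ambiguous (A) and all other classes stay at width 1
        # An entry is either a single code point or a begin..end range.
        begin_text, _, end_text = code_range.partition("..")
        begin = int(begin_text, 16)
        end = int(end_text, 16) if end_text else begin
        for code_point in range(begin, end + 1):
            widths[code_point] = 2


SAMPLE_LINES = [
    "# sample entries in the style of EastAsianWidth.txt",
    "0041..005A;Na   # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00A7;A          # SECTION SIGN",
    "1100..115F;W    # HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
    "FF01..FF03;F    # FULLWIDTH EXCLAMATION MARK..FULLWIDTH NUMBER SIGN",
]

widths = bytearray([1]) * 0x110000
apply_east_asian_width(SAMPLE_LINES, widths)
print(widths[0x1100], widths[0xFF02], widths[0x00A7])
```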

Step 3: Apply Zero-Width General Categories

The script then reads UnicodeData.txt and inspects the general category of each code point.

It sets width 0 for:

all mark categories (Mn, Mc, and Me), that is, every category string starting with M
Cc control characters
Cf format characters

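This pass runs after the East Asian Width pass and overrides its assignments. That override is important because a code point classified as a mark, control, or format character must remain zero-width even if EastAsianWidth.txt classifies it as wide. For example, U+3099 (combining katakana-hiragana voiced sound mark) is East Asian Wide but has category Mn, so it ends up with width 0.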
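
A sketch of the zero-width pass, under the assumption that each UnicodeData.txt line is handled on its own: field 0 is the code point and field 2 is the general category. The function name apply_zero_width_categories is hypothetical, and the sketch does not expand the `First`/`Last` range markers in UnicodeData.txt, which to my knowledge only cover letter, surrogate, and private-use categories.

```python
def apply_zero_width_categories(lines, widths):
    for raw in lines:
        fields = raw.strip().split(";")
        if len(fields) < 3:
            continue  # blank or malformed line
        code_point = int(fields[0], 16)
        category = fields[2]
        # Marks (Mn, Mc, Me), controls (Cc), and format characters (Cf).
        if category.startswith("M") or category in ("Cc", "Cf"):
            widths[code_point] = 0


SAMPLE_LINES = [
    "0007;<control>;Cc;0;BN;;;;;N;BELL;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
    "200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;",
    "3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn;8;NSM;;;;;N;;;;;",
]

widths = bytearray([1]) * 0x110000
widths[0x3099] = 2  # assumed result of the earlier wide/full-width pass
apply_zero_width_categories(SAMPLE_LINES, widths)
print(widths[0x0301], widths[0x200D], widths[0x3099], widths[0x41])
```

The last sample line shows the override in action: a slot that an earlier pass set to 2 is forced back to 0 because the code point is a mark.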

Step 4: Compress into Contiguous Ranges

After the per-code-point byte array is complete, compress_widths() scans it from 0 to 0x10FFFF and groups adjacent entries that share the same width.

Each output tuple contains:

the begin code point
the end code point
the width

This turns the dense byte array into a compact range representation.
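
The grouping can be sketched as a single linear scan. The function below is an illustrative stand-in named compress_widths_sketch, not the script's actual compress_widths(); it is shown on a tiny array so the result is easy to check by hand.

```python
def compress_widths_sketch(widths):
    ranges = []
    begin = 0
    for index in range(1, len(widths) + 1):
        # Close the current run at the end of the array or when the width changes.
        if index == len(widths) or widths[index] != widths[begin]:
            ranges.append((begin, index - 1, widths[begin]))
            begin = index
    return ranges


toy = bytearray([0, 0, 1, 1, 1, 2, 2, 1])
ranges = compress_widths_sketch(toy)
print(ranges)  # [(0, 1, 0), (2, 4, 1), (5, 6, 2), (7, 7, 1)]
```

Because every slot belongs to exactly one run, the resulting ranges are contiguous and cover the whole input, which is what allows the runtime lookup to search on the end value alone.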

Step 5: Emit the Generated C++ Table

write_output() writes src/erbsland/cterm/impl/UnicodeWidthTable.cpp. The generated file contains:

a static constexpr std::array<CharWidth, N>
one CharWidth{begin, end, width} entry per compressed range
the charWidthTable() accessor returning the array as a span

Because the generator emits ordinary C++ source, the runtime code has no parser, allocator, or file-I/O dependency for width lookup.
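
To make the emitted shape concrete, here is a rough sketch of how range tuples could be turned into C++ source text. Everything about the formatting is an assumption: the array name cCharWidthTable, the hexadecimal padding, and the indentation are invented for illustration and may differ from what write_output() produces.

```python
def format_entries(ranges):
    # One CharWidth{begin, end, width} initializer per compressed range.
    return [
        f"    CharWidth{{0x{begin:06X}, 0x{end:06X}, {width}}},"
        for begin, end, width in ranges
    ]


def format_table(ranges, array_name="cCharWidthTable"):
    # array_name is a hypothetical identifier, not taken from the generated file.
    header = (
        f"static constexpr std::array<CharWidth, {len(ranges)}> "
        f"{array_name} = {{"
    )
    return "\n".join([header, *format_entries(ranges), "};"])


source = format_table([(0x0000, 0x001F, 0), (0x0020, 0x007E, 1)])
print(source)
```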

How This Integrates with `Char` and `String`

The width table is only useful because the higher-level text types are built around it.

`String` Construction

String::appendCodePoint() uses the width system when raw Unicode input is converted into terminal characters.

The rules are:

TAB and newline remain standalone Char values.
Other control characters are ignored.
Remaining width-0 code points (marks and format characters) are attached to the previous Char via withCombining() if one exists.
A leading width-0 code point is dropped because there is no base character it could modify.
Width-1 and width-2 code points become their own Char.

This design means String is already normalized into renderable terminal cells before the layout code sees it.
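
The rules above can be modeled in Python as a rough sketch. It is not the library's code: toy_width approximates the width model with Python's built-in unicodedata module (whose Unicode version may differ from the checked-in snapshot), split_into_cells and cell_width are hypothetical names, each cell is represented as a plain string, and the limit of two attached code points per Char is not modeled.

```python
import unicodedata


def toy_width(ch):
    # Approximation of the width model: zero-width categories win over wide.
    category = unicodedata.category(ch)
    if category.startswith("M") or category in ("Cc", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def split_into_cells(text):
    cells = []
    for ch in text:
        if ch in ("\t", "\n"):
            cells.append(ch)  # TAB and newline stay standalone
        elif unicodedata.category(ch) == "Cc":
            continue  # other control characters are ignored
        elif toy_width(ch) == 0:
            if cells:
                cells[-1] += ch  # attach to the previous cell
            # a leading width-0 code point has no base and is dropped
        else:
            cells.append(ch)  # width 1 or 2: starts a new cell
    return cells


def cell_width(cell):
    # Sum over the base code point and its attached followers.
    return sum(toy_width(ch) for ch in cell)


cells = split_into_cells("\u0301e\u0301\x07世\tA")
print(cells)
print(cell_width(cells[0]), cell_width(cells[1]))
```

In this run the leading U+0301 is dropped, the second U+0301 attaches to "e", the BEL character disappears, and the wide character keeps its width of 2.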

`Char` Width Calculation

Char stores up to three code points:

one base code point
up to two attached combining code points

Char::displayWidth() sums the width of those stored code points and caches the result. In the normal case, that produces:

1 for an ordinary character
2 for a wide character
1 or 2 for a base character with attached combining marks
0 only for explicitly constructed zero-width characters

Since combining marks are expected to have width 0, attaching them normally does not change the width of the base character.

`calculateDisplayWidth()`

The helper calculateDisplayWidth(std::string_view) provides a lower-level UTF-8 path. It decodes raw UTF-8 with U8Buffer and sums consoleCharacterWidth() for each decoded code point directly.

This function is useful when you need width information for raw text without first constructing a full String object.
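
As a sketch of the same idea in Python: decode the bytes, then sum the per-code-point widths without building any character objects. The names toy_width and calculate_display_width_sketch are hypothetical, the width function is an approximation based on Python's unicodedata module, and how the real helper treats invalid UTF-8 is not covered here (this sketch simply raises UnicodeDecodeError).

```python
import unicodedata


def toy_width(ch):
    # Approximation of the width model: zero-width categories win over wide.
    category = unicodedata.category(ch)
    if category.startswith("M") or category in ("Cc", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def calculate_display_width_sketch(utf8_bytes):
    # Every decoded code point contributes its own width; nothing is dropped.
    return sum(toy_width(ch) for ch in utf8_bytes.decode("utf-8"))


sample = "Cafe\u0301 世界".encode("utf-8")
print(calculate_display_width_sketch(sample))  # 4 + 0 + 1 + 2 + 2 = 9
```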

Design Rationales

Several aspects of the current system are deliberate simplifications.

Only Three Width Classes

The terminal model used here only distinguishes widths 0, 1, and 2. That matches the practical needs of terminal-cell layout.

The implementation does not attempt to model:

fractional widths
fonts with unusual glyph metrics
terminal-specific rendering quirks beyond the basic narrow-versus-wide distinction

Ambiguous Width Defaults to Narrow

Only East Asian width classes W and F become width 2. Ambiguous characters remain width 1.

This keeps behavior stable across non-CJK terminals and avoids locale-dependent layout changes. The tradeoff is that some terminals configured for East Asian ambiguous-width behavior may render a few characters wider than the library predicts.

No Runtime Terminal Probing

The library does not ask the terminal emulator which Unicode version or width rules it uses. That would make rendering behavior highly host-dependent and would complicate reproducibility.

Instead, the library chooses a fixed, reviewed width table and applies it consistently.

Zero-Width Format Characters Are Treated as Non-Rendering

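The Cf category is forced to width 0. That includes characters such as the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C). Variation selectors are also width 0, but they reach it through the mark category Mn rather than Cf.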

The rationale is straightforward: formatting code points should not consume visible columns in terminal layout. The tradeoff is that some terminal emulators interpret certain format characters as part of emoji shaping or other grapheme-cluster behavior that this library does not model fully.

Limitations

If you work on advanced Unicode behavior, keep these current limits in mind.

No Full Grapheme-Cluster Engine

The library does not implement Unicode grapheme-cluster segmentation. It uses a simpler model based on one base character plus up to two zero-width followers.

Consequences include:

some complex emoji sequences are not treated as one visual unit
zero-width joiner sequences are not shaped as a single cluster by the layout code
user-perceived characters that need more than three code points cannot be stored in one Char

This is a conscious tradeoff in favor of predictable, lightweight terminal rendering.
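
A small worked example of the consequence, using an approximation of the width model built on Python's unicodedata module (toy_width is a hypothetical helper, and the result assumes a Python build whose Unicode data classifies these emoji as East Asian Wide):

```python
import unicodedata


def toy_width(ch):
    # Approximation of the width model: zero-width categories win over wide.
    category = unicodedata.category(ch)
    if category.startswith("M") or category in ("Cc", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


# MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
per_code_point = [toy_width(ch) for ch in family]
total = sum(per_code_point)
print(per_code_point, total)  # [2, 0, 2, 0, 2] 6
```

Under this model the sequence counts as three wide characters, six cells in total, while a terminal that shapes the sequence into one glyph may draw it in fewer cells.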

Leading Combining Marks Are Dropped

When a width-0 code point appears at the beginning of a string, String::appendCodePoint() drops it instead of storing a detached combining mark.

That keeps the internal text model renderable in cell terms, but it also means malformed or intentionally unusual Unicode sequences are normalized aggressively.

Unassigned Code Points Default to Width 1

Because the generator starts with width 1 everywhere, currently unassigned code points are treated as narrow by default.

This is pragmatic:

it keeps the table complete and explicit
it avoids a separate “unknown width” state
it keeps the runtime code simple

The tradeoff is that future Unicode assignments are only reflected after updating the checked-in data files and regenerating the table.

Updating the Table

When you want to refresh the Unicode data, the intended workflow is:

Replace the checked-in UCD snapshot files in utilities/data/unicode.
Run utilities/generate_unicode_width_table.py.
Review the generated changes in src/erbsland/cterm/impl/UnicodeWidthTable.cpp.
Re-run the relevant tests and documentation checks.

Keeping the generator simple and the inputs checked in makes this update path easy to audit.