Unicode Width System
This page explains how the library determines whether a Unicode code point occupies zero, one, or two terminal cells. It covers the generated lookup table, the generator script, and the design tradeoffs behind the current model.
Relevant Source Files
You will find the implementation primarily in:
src/erbsland/cterm/impl/UnicodeWidth.hpp
src/erbsland/cterm/impl/UnicodeWidth.cpp
src/erbsland/cterm/impl/UnicodeWidthTable.cpp
utilities/generate_unicode_width_table.py
utilities/data/unicode/EastAsianWidth.txt
utilities/data/unicode/UnicodeData.txt
src/erbsland/cterm/Char.cpp
src/erbsland/cterm/String.cpp
Why the Library Uses Its Own Width System
Terminal layout depends on cell width, not on byte count and not on Unicode code-point count. You will notice that immediately when text contains:
East Asian wide characters
combining marks
zero-width format characters
mixed scripts within the same paragraph
The library therefore does not rely on simple ASCII assumptions when wrapping text, placing the cursor, or calculating buffer sizes.
Instead, it uses an internal width model with three explicit values:
0 for zero-width code points
1 for ordinary narrow characters
2 for wide or full-width characters
The main reasons for implementing this internally are:
reproducible behavior across supported platforms
no dependence on platform-specific wcwidth() behavior
no need for runtime Unicode database parsing
fast lookup during rendering
Where Width Information Is Used
The width system is not an isolated utility. It directly affects higher-level behavior throughout the library:
Char::displayWidth() sums the width of the stored base code point and any attached combining marks.
String::displayWidth() sums the width of all Char values.
String::splitCharacters() decides whether a code point becomes its own Char, attaches to the previous character as a combining mark, or is dropped.
Paragraph wrapping, buffer painting, cursor movement, and text alignment all depend on these widths.
If the width table changes, a significant part of the visible terminal layout changes with it.
The Runtime Lookup Format
At runtime, the library uses CharWidth entries with these fields:
begin: the first code point in the range
end: the last code point in the range
width: the terminal-cell width for the whole inclusive range
charWidthTable() returns a std::span<const CharWidth> over a generated static array.
The generated file currently contains 925 contiguous ranges.
The table is intentionally stored as ranges rather than one entry per code point:
the Unicode code space contains 1,114,112 code points (U+0000 through U+10FFFF)
most adjacent code points share the same width
a range table is much smaller while still allowing fast lookup
How Lookup Works
consoleCharacterWidth() performs a binary search with std::ranges::lower_bound(),
using the end field of each range as the search key.
The lookup logic is:
Find the first range whose end is not smaller than the requested code point.
Check whether the code point is also greater than or equal to that range’s begin.
If yes, return that range’s width.
Otherwise, return 1 as a defensive fallback.
In practice, the generated table covers the full Unicode range, so the fallback should normally never be used. It exists to keep the lookup function safe if table generation ever changes unexpectedly.
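The lookup steps above can be sketched in a few lines. This is a Python illustration of the same lower-bound search, not the library's C++ code: the sample table, its deliberate gaps, and the names SAMPLE_TABLE, SAMPLE_ENDS, and console_character_width are assumptions made for the example.

```python
from bisect import bisect_left
from typing import NamedTuple


class CharWidth(NamedTuple):
    begin: int  # first code point in the range
    end: int    # last code point in the range (inclusive)
    width: int  # terminal-cell width for the whole range


# Hypothetical excerpt with deliberate gaps; the generated table has none.
SAMPLE_TABLE = [
    CharWidth(0x0020, 0x007E, 1),
    CharWidth(0x0300, 0x036F, 0),
    CharWidth(0x4E00, 0x9FFF, 2),
]
SAMPLE_ENDS = [entry.end for entry in SAMPLE_TABLE]


def console_character_width(code_point: int) -> int:
    # Lower bound on the end field: first range whose end >= code_point.
    index = bisect_left(SAMPLE_ENDS, code_point)
    # The candidate only matches if the code point is not below its begin.
    if index < len(SAMPLE_TABLE) and SAMPLE_TABLE[index].begin <= code_point:
        return SAMPLE_TABLE[index].width
    return 1  # defensive fallback for code points outside every range
```

With this sample table, a code point such as U+0100 falls into a gap and a code point such as U+10FFFF lies past the last range, so both take the fallback path and return 1.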
How the Table Is Built
The table is generated by utilities/generate_unicode_width_table.py.
The script does not download data at build time.
Instead, it uses checked-in Unicode Character Database snapshots from utilities/data/unicode.
This is an intentional design choice:
builds remain reproducible
contributors can review data updates in version control
packaging the library does not depend on network access
Step 1: Start with Width 1 Everywhere
load_widths() begins with a bytearray of length 0x110000.
Every entry is initialized to 1.
That means:
every Unicode code point from U+0000 through U+10FFFF has an explicit slot
unassigned code points default to narrow width
later passes only need to overwrite exceptional cases
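A minimal sketch of this starting state, assuming nothing beyond what is described above. The helper name new_width_array and the constant name are hypothetical and are not taken from the script.

```python
UNICODE_CODE_POINT_COUNT = 0x110000  # U+0000 through U+10FFFF


def new_width_array() -> bytearray:
    """Return one byte per code point, every slot preset to width 1."""
    return bytearray([1]) * UNICODE_CODE_POINT_COUNT


widths = new_width_array()
```

A bytearray keeps the dense table at roughly one megabyte and lets later passes overwrite single slots or whole slices in place.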
Step 2: Apply East Asian Wide and Full-Width Ranges
The script reads EastAsianWidth.txt and only reacts to the classes W and F.
For each matching entry:
the script parses either a single code point or a begin..end range
it overwrites those slots with width 2
This means the library intentionally treats:
East Asian wide characters as double-width
East Asian full-width characters as double-width
everything else as not wide unless a later pass changes it to zero-width
Notably, the script does not special-case the ambiguous A class.
Those code points remain at width 1.
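The pass over EastAsianWidth.txt could look roughly like the following. This is a sketch under assumptions: the function name apply_east_asian_width is hypothetical, the parser only handles the usual "code-or-range ; class # comment" layout, and the sample lines stand in for the real data file.

```python
def apply_east_asian_width(lines, widths: bytearray) -> None:
    """Set width 2 for every code point whose class is W or F."""
    for raw_line in lines:
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # blank line or comment-only line
        code_range, width_class = (f.strip() for f in line.split(";")[:2])
        if width_class not in ("W", "F"):
            continue  # A, H, N, and Na are left at their current width
        first, _, last = code_range.partition("..")
        begin = int(first, 16)
        end = int(last, 16) if last else begin
        widths[begin:end + 1] = bytes([2]) * (end - begin + 1)


SAMPLE_LINES = [
    "# EastAsianWidth sample",
    "0041..005A     ; Na # Lu    [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00A1           ; A  # Po         INVERTED EXCLAMATION MARK",
    "1100..115F     ; W  # Lo    [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
    "3000           ; F  # Zs         IDEOGRAPHIC SPACE",
]

widths = bytearray([1]) * 0x110000
apply_east_asian_width(SAMPLE_LINES, widths)
```

After this pass the ambiguous entry U+00A1 is still at width 1, which matches the narrow default for class A described above.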
Step 3: Apply Zero-Width General Categories
The script then reads UnicodeData.txt and inspects the general category of each code point.
It sets width 0 for:
all mark categories, whose category string starts with M
Cc control characters
Cf format characters
This second pass overrides earlier assignments. That override is important because a code point classified as a mark or formatting character must remain zero-width even if another dataset would otherwise suggest a visible width.
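The override can be illustrated with a short sketch. It assumes the standard semicolon-separated UnicodeData.txt layout, where field 0 is the code point and field 2 is the general category, and it treats every line as a single code point. The function name apply_zero_width_categories is hypothetical, and one slot is preset to width 2 purely to show the override.

```python
def apply_zero_width_categories(lines, widths: bytearray) -> None:
    """Force width 0 for marks (M*), control (Cc), and format (Cf) characters."""
    for raw_line in lines:
        fields = raw_line.strip().split(";")
        if len(fields) < 3:
            continue  # not a data line
        code_point = int(fields[0], 16)
        category = fields[2]
        if category.startswith("M") or category in ("Cc", "Cf"):
            widths[code_point] = 0  # overrides any earlier width


SAMPLE_LINES = [
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
    "200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;",
    "302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;",
]

widths = bytearray([1]) * 0x110000
widths[0x302A] = 2  # pretend an earlier pass marked this mark as wide
apply_zero_width_categories(SAMPLE_LINES, widths)
```

Because this pass runs last, the preset width 2 for U+302A ends up as 0, while the letter U+0041 keeps width 1.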
Step 4: Compress into Contiguous Ranges
After the per-code-point byte array is complete,
compress_widths() scans it from 0 to 0x10FFFF and groups adjacent entries
that share the same width.
Each output tuple contains:
the begin code point
the end code point
the width
This turns the dense byte array into a compact range representation.
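The compression is a plain run-length scan. The following sketch shows one way to write it; the name compress_widths_sketch is chosen to make clear that this is an illustration and not the script's actual compress_widths() body.

```python
def compress_widths_sketch(widths: bytes) -> list[tuple[int, int, int]]:
    """Group adjacent equal widths into (begin, end, width) tuples."""
    ranges = []
    begin = 0
    for code_point in range(1, len(widths) + 1):
        # Close the current run at the end of the data or when the width changes.
        if code_point == len(widths) or widths[code_point] != widths[begin]:
            ranges.append((begin, code_point - 1, widths[begin]))
            begin = code_point
    return ranges


sample = bytearray([0, 0, 1, 1, 1, 2, 1])
compressed = compress_widths_sketch(sample)
# compressed == [(0, 1, 0), (2, 4, 1), (5, 5, 2), (6, 6, 1)]
```

Because every input slot lands in exactly one run, the resulting ranges are contiguous and sorted, which is what the binary search at runtime relies on.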
Step 5: Emit the Generated C++ Table
write_output() writes src/erbsland/cterm/impl/UnicodeWidthTable.cpp.
The generated file contains:
a static constexpr std::array<CharWidth, N>
one CharWidth{begin, end, width} entry per compressed range
the charWidthTable() accessor returning the array as a span
Because the generator emits ordinary C++ source, the runtime code has no parser, allocator, or file-I/O dependency for width lookup.
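To make the output shape concrete, here is a hedged sketch of how the array body could be rendered from the compressed ranges. The function name render_table, the array name cCharWidthTable, and the hexadecimal formatting are assumptions; the real write_output() may format the file differently, and the accessor function is omitted here.

```python
def render_table(ranges) -> str:
    """Render (begin, end, width) tuples as a C++ std::array definition."""
    lines = [
        # cCharWidthTable is a hypothetical name for the generated array.
        f"static constexpr std::array<CharWidth, {len(ranges)}> cCharWidthTable = {{",
    ]
    for begin, end, width in ranges:
        lines.append(f"    CharWidth{{0x{begin:06X}, 0x{end:06X}, {width}}},")
    lines.append("};")
    return "\n".join(lines) + "\n"


cpp_source = render_table([(0x0000, 0x001F, 0), (0x0020, 0x007E, 1)])
```

The array size N in the template argument is simply the number of compressed ranges, so it stays in sync with the entries automatically.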
How This Integrates with Char and String
The width table is only useful because the higher-level text types are built around it.
String Construction
String::appendCodePoint() uses the width system when raw Unicode input is converted into terminal characters.
The rules are:
TAB and newline remain standalone Char values.
Other control characters are ignored.
Width-0 code points are attached to the previous Char via withCombining() if one exists.
A leading width-0 code point is dropped because there is no base character it could modify.
Width-1 and width-2 code points become their own Char.
This design means String is already normalized into renderable terminal cells before the layout code sees it.
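The rules above can be modeled in a few lines of Python. This is a simplified illustration, not the C++ implementation: each Char is represented as a plain list of code points, toy_width is a stand-in for the real table lookup, and the two-mark capacity of Char is not modeled.

```python
TAB, NEWLINE = 0x09, 0x0A


def toy_width(code_point: int) -> int:
    """Stand-in width function covering only a few sample ranges."""
    if 0x0300 <= code_point <= 0x036F or code_point == 0x200D:
        return 0
    if 0x4E00 <= code_point <= 0x9FFF:
        return 2
    return 1


def is_control(code_point: int) -> bool:
    """General category Cc: U+0000..U+001F and U+007F..U+009F."""
    return code_point < 0x20 or 0x7F <= code_point <= 0x9F


def append_code_point(chars: list, code_point: int) -> None:
    if code_point in (TAB, NEWLINE):
        chars.append([code_point])      # standalone character
    elif is_control(code_point):
        return                          # other control characters are ignored
    elif toy_width(code_point) == 0:
        if chars:
            chars[-1].append(code_point)  # attach to the previous character
        # otherwise: leading zero-width code point, dropped
    else:
        chars.append([code_point])      # width 1 or 2: its own character


chars = []
for ch in "\u0301e\u0301\x07\u4e2d\t":
    append_code_point(chars, ord(ch))
# chars == [[0x65, 0x0301], [0x4E2D], [0x09]]
```

The leading U+0301 and the BEL control character disappear, the second U+0301 attaches to the letter e, and the wide ideograph and the TAB each become their own entry.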
Char Width Calculation
Char stores up to three code points:
one base code point
up to two attached combining code points
Char::displayWidth() sums the width of those stored code points and caches the result.
In the normal case, that produces:
1 for an ordinary character
2 for a wide character
1 or 2 for a base character with attached combining marks
0 only for explicitly constructed zero-width characters
Since combining marks are expected to have width 0,
attaching them normally does not change the width of the base character.
calculateDisplayWidth()
The helper calculateDisplayWidth(std::string_view) provides a lower-level UTF-8 path.
It decodes raw UTF-8 with U8Buffer and sums consoleCharacterWidth() for each decoded code point directly.
This function is useful when you need width information for raw text
without first constructing a full String object.
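The idea behind this raw path is a straight sum over decoded code points. A minimal Python sketch, assuming a toy width function in place of the generated table and the built-in UTF-8 decoder in place of U8Buffer:

```python
def toy_width(code_point: int) -> int:
    """Stand-in width function covering only a few sample ranges."""
    if 0x0300 <= code_point <= 0x036F:
        return 0  # combining diacritical marks
    if 0x4E00 <= code_point <= 0x9FFF:
        return 2  # CJK unified ideographs
    return 1


def calculate_display_width(utf8_text: bytes) -> int:
    """Decode UTF-8 and sum the width of every code point."""
    return sum(toy_width(ord(ch)) for ch in utf8_text.decode("utf-8"))


narrow = calculate_display_width("abc".encode("utf-8"))
accented = calculate_display_width("e\u0301".encode("utf-8"))
wide = calculate_display_width("\u4e2d\u6587".encode("utf-8"))
```

Here narrow is 3, accented is 1 because the combining mark adds nothing, and wide is 4 because each ideograph occupies two cells.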
Design Rationales
Several aspects of the current system are deliberate simplifications.
Only Three Width Classes
The terminal model used here only distinguishes widths 0, 1, and 2.
That matches the practical needs of terminal-cell layout.
The implementation does not attempt to model:
fractional widths
fonts with unusual glyph metrics
terminal-specific rendering quirks beyond the basic narrow-versus-wide distinction
Ambiguous Width Defaults to Narrow
Only East Asian width classes W and F become width 2.
Ambiguous characters remain width 1.
This keeps behavior stable across non-CJK terminals and avoids locale-dependent layout changes. The tradeoff is that some terminals configured for East Asian ambiguous-width behavior may render a few characters wider than the library predicts.
No Runtime Terminal Probing
The library does not ask the terminal emulator which Unicode version or width rules it uses. That would make rendering behavior highly host-dependent and would complicate reproducibility.
Instead, the library chooses a fixed, reviewed width table and applies it consistently.
Zero-Width Format Characters Are Treated as Non-Rendering
The Cf category is forced to width 0.
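That includes characters such as the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C). Variation selectors are also zero-width, but they get there through the mark categories rather than through Cf.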
The rationale is straightforward: formatting code points should not consume visible columns in terminal layout. The tradeoff is that some terminal emulators interpret certain format characters as part of emoji shaping or other grapheme-cluster behavior that this library does not model fully.
Limitations
If you work on advanced Unicode behavior, keep these current limits in mind.
No Full Grapheme-Cluster Engine
The library does not implement Unicode grapheme-cluster segmentation. It uses a simpler model based on one base character plus up to two zero-width followers.
Consequences include:
some complex emoji sequences are not treated as one visual unit
zero-width joiner sequences are not shaped as a single cluster by the layout code
characters requiring more than three code points cannot be stored in one Char
This is a conscious tradeoff in favor of predictable, lightweight terminal rendering.
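A worked example shows what this means for a zero-width joiner sequence. The sketch below assumes a toy width function that treats one emoji block as wide and the joiner as zero-width; the range and names are illustrative and are not read from the generated table.

```python
ZWJ = 0x200D


def toy_width(code_point: int) -> int:
    """Stand-in width function for this example only."""
    if code_point == ZWJ:
        return 0
    if 0x1F300 <= code_point <= 0x1F64F:
        return 2  # toy assumption: treat this emoji block as wide
    return 1


# MAN, ZWJ, WOMAN, ZWJ, GIRL: one family emoji in terminals that shape it.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
per_code_point = [toy_width(ord(ch)) for ch in family]
summed_width = sum(per_code_point)
# per_code_point == [2, 0, 2, 0, 2], summed_width == 6
```

Under a per-code-point model the sequence is counted as three wide characters, six cells in total, whereas a terminal that shapes it into a single glyph may draw it in fewer cells.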
Leading Combining Marks Are Dropped
When a width-0 code point appears at the beginning of a string,
String::appendCodePoint() drops it instead of storing a detached combining mark.
That keeps the internal text model renderable in cell terms, but it also means malformed or intentionally unusual Unicode sequences are normalized aggressively.
Unassigned Code Points Default to Width 1
Because the generator starts with width 1 everywhere,
currently unassigned code points are treated as narrow by default.
This is pragmatic:
it keeps the table complete and explicit
it avoids a separate “unknown width” state
it keeps the runtime code simple
The tradeoff is that future Unicode assignments are only reflected after updating the checked-in data files and regenerating the table.
Updating the Table
When you want to refresh the Unicode data, the intended workflow is:
Replace the checked-in UCD snapshot files in utilities/data/unicode.
Run utilities/generate_unicode_width_table.py.
Review the generated changes in src/erbsland/cterm/impl/UnicodeWidthTable.cpp.
Re-run the relevant tests and documentation checks.
Keeping the generator simple and the inputs checked in makes this update path easy to audit.