..
    Copyright (c) 2026 Tobias Erbsland - Erbsland DEV. https://erbsland.dev
    SPDX-License-Identifier: Apache-2.0

.. index::
    single: implementation notes; unicode width
    single: implementation notes; character width

********************
Unicode Width System
********************

This page explains how the library determines whether a Unicode code point occupies
zero, one, or two terminal cells.
It covers the generated lookup table, the generator script, and the design tradeoffs behind the current model.

Relevant Source Files
=====================

You will find the implementation primarily in:

* :file:`src/erbsland/cterm/impl/UnicodeWidth.hpp`
* :file:`src/erbsland/cterm/impl/UnicodeWidth.cpp`
* :file:`src/erbsland/cterm/impl/UnicodeWidthTable.cpp`
* :file:`utilities/generate_unicode_width_table.py`
* :file:`utilities/data/unicode/EastAsianWidth.txt`
* :file:`utilities/data/unicode/UnicodeData.txt`
* :file:`src/erbsland/cterm/Char.cpp`
* :file:`src/erbsland/cterm/String.cpp`

Why the Library Uses Its Own Width System
=========================================

Terminal layout depends on cell width, not on byte count and not on Unicode code-point count.
You will notice that immediately when text contains:

* East Asian wide characters
* combining marks
* zero-width format characters
* mixed scripts within the same paragraph

The library therefore does not rely on simple ASCII assumptions when wrapping text,
placing the cursor, or calculating buffer sizes.

Instead, it uses an internal width model with three explicit values:

* ``0`` for zero-width code points
* ``1`` for ordinary narrow characters
* ``2`` for wide or full-width characters

The main reasons for implementing this internally are:

* reproducible behavior across supported platforms
* no dependence on platform-specific ``wcwidth()`` behavior
* no need for runtime Unicode database parsing
* fast lookup during rendering

Where Width Information Is Used
===============================

The width system is not an isolated utility.
It directly affects higher-level behavior throughout the library:

* ``Char::displayWidth()`` sums the width of the stored base code point and any attached combining marks.
* ``String::displayWidth()`` sums the width of all ``Char`` values.
* ``String::splitCharacters()`` decides whether a code point becomes its own ``Char``,
  attaches to the previous character as a combining mark, or is dropped.
* Paragraph wrapping, buffer painting, cursor movement, and text alignment all depend on these widths.

If the width table changes, a significant part of the visible terminal layout changes with it.

The Runtime Lookup Format
=========================

At runtime, the library uses ``CharWidth`` entries with these fields:

* ``begin``: the first code point in the range
* ``end``: the last code point in the range
* ``width``: the terminal-cell width for the whole inclusive range

``charWidthTable()`` returns a ``std::span<const CharWidth>`` over a generated static array.
The generated file currently contains 925 contiguous ranges.

The table is intentionally stored as ranges rather than one entry per code point:

* the Unicode code space contains 1,114,112 code points (``U+0000`` through ``U+10FFFF``)
* most adjacent code points share the same width
* a range table is much smaller while still allowing fast lookup

How Lookup Works
----------------

``consoleCharacterWidth()`` performs a binary search with ``std::ranges::lower_bound()``,
using the ``end`` field of each range as the search key.

The lookup logic is:

#. Find the first range whose ``end`` is not smaller than the requested code point.
#. Check whether the code point is also greater than or equal to that range's ``begin``.
#. If yes, return that range's width.
#. Otherwise, return ``1`` as a defensive fallback.

In practice, the generated table covers the full Unicode range,
so the fallback should normally never be used.
It exists to keep the lookup function safe if table generation ever changes unexpectedly.
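
The lookup steps above can be modeled in a few lines of Python.
This is a minimal sketch, not the library's C++ code: ``bisect_left`` stands in for ``std::ranges::lower_bound()``,
and the tiny table is made up for illustration, with a deliberate gap so that the fallback path is exercised.

.. code-block:: python

    from bisect import bisect_left
    from typing import NamedTuple


    class CharWidth(NamedTuple):
        begin: int  # first code point in the range (inclusive)
        end: int    # last code point in the range (inclusive)
        width: int  # terminal-cell width for the whole range


    # Illustrative table only; the generated table covers the full code space.
    # There is a deliberate gap between U+036F and U+4E00.
    TABLE = [
        CharWidth(0x0000, 0x001F, 0),
        CharWidth(0x0020, 0x007E, 1),
        CharWidth(0x007F, 0x009F, 0),
        CharWidth(0x00A0, 0x02FF, 1),
        CharWidth(0x0300, 0x036F, 0),
        CharWidth(0x4E00, 0x9FFF, 2),
    ]
    ENDS = [entry.end for entry in TABLE]  # search keys: the ``end`` of each range


    def console_character_width(code_point: int) -> int:
        # Step 1: first range whose ``end`` is not smaller than the code point.
        index = bisect_left(ENDS, code_point)
        # Step 2 and 3: accept the range only if it also starts at or before it.
        if index < len(TABLE) and TABLE[index].begin <= code_point:
            return TABLE[index].width
        # Step 4: defensive fallback for gaps or values past the last range.
        return 1

In this toy table, ``U+1000`` falls into the gap and ``U+10FFFF`` lies past the last range,
so both take the fallback and yield ``1``.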

How the Table Is Built
======================

The table is generated by :file:`utilities/generate_unicode_width_table.py`.
The script does not download data at build time.
Instead, it uses checked-in Unicode Character Database snapshots from :file:`utilities/data/unicode`.

This is an intentional design choice:

* builds remain reproducible
* contributors can review data updates in version control
* packaging the library does not depend on network access

Step 1: Start with Width 1 Everywhere
-------------------------------------

``load_widths()`` begins with a ``bytearray`` of length ``0x110000``.
Every entry is initialized to ``1``.

That means:

* every code point from ``U+0000`` through ``U+10FFFF``, including the surrogate range, has an explicit slot
* unassigned code points default to narrow width
* later passes only need to overwrite exceptional cases
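
As a rough illustration, the starting state could be built like this.
The helper name ``initial_widths`` is hypothetical and only mirrors the description above.

.. code-block:: python

    MAX_CODE_POINT = 0x10FFFF


    def initial_widths() -> bytearray:
        # One byte per code point, every slot preset to width 1.
        return bytearray([1]) * (MAX_CODE_POINT + 1)


    widths = initial_widths()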

Step 2: Apply East Asian Wide and Full-Width Ranges
---------------------------------------------------

The script reads :file:`EastAsianWidth.txt` and only reacts to the classes ``W`` and ``F``.

For each matching entry:

* the script parses either a single code point or a ``begin..end`` range
* it overwrites those slots with width ``2``

This means the library intentionally treats:

* East Asian wide characters as double-width
* East Asian full-width characters as double-width
* everything else as not wide unless a later pass changes it to zero-width

Notably, the script does **not** special-case the ambiguous ``A`` class.
Those code points remain at width ``1``.
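
The following sketch shows one way such a pass could look.
It is an assumption about the shape of the code, not a copy of the script:
the function name ``apply_east_asian_width`` is hypothetical,
and the sample lines are abbreviated entries in the style of :file:`EastAsianWidth.txt`.

.. code-block:: python

    SAMPLE = """\
    # Abbreviated lines in the style of EastAsianWidth.txt
    0020..007E     ; Na # ASCII printable
    00A1           ; A  # INVERTED EXCLAMATION MARK
    3000           ; F  # IDEOGRAPHIC SPACE
    4E00..9FFF     ; W  # CJK UNIFIED IDEOGRAPH-4E00..9FFF
    """


    def apply_east_asian_width(widths: bytearray, lines) -> None:
        for line in lines:
            data = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not data:
                continue
            code_range, width_class = (part.strip() for part in data.split(";")[:2])
            if width_class not in ("W", "F"):
                continue  # every other class, including ambiguous "A", stays at 1
            begin_text, _, end_text = code_range.partition("..")
            begin = int(begin_text, 16)
            end = int(end_text, 16) if end_text else begin
            widths[begin:end + 1] = bytes([2]) * (end - begin + 1)


    widths = bytearray([1]) * 0x110000
    apply_east_asian_width(widths, SAMPLE.splitlines())

After this pass, ``widths[0x4E16]`` and ``widths[0x3000]`` are ``2``,
while the ambiguous ``U+00A1`` is still ``1``.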

Step 3: Apply Zero-Width General Categories
-------------------------------------------

The script then reads :file:`UnicodeData.txt` and inspects the general category of each code point.

It sets width ``0`` for:

* all mark categories (``Mn``, ``Mc``, ``Me``), matched by a category string that starts with ``M``
* ``Cc`` control characters
* ``Cf`` format characters

This pass runs after the East Asian width pass and overrides its assignments.
That override is important because a code point classified as a mark or formatting character
must remain zero-width even if another dataset would otherwise suggest a visible width.
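
A minimal sketch of such a category pass is shown below.
The function name ``apply_zero_width_categories`` is hypothetical,
and the sample lines are trimmed to the first three fields of :file:`UnicodeData.txt`
(code point, name, general category); the real file carries more fields that this sketch does not read.
Range entries of the form ``<..., First>`` and ``<..., Last>`` are not handled here.

.. code-block:: python

    SAMPLE = """\
    0007;<control>;Cc
    0041;LATIN CAPITAL LETTER A;Lu
    0301;COMBINING ACUTE ACCENT;Mn
    200D;ZERO WIDTH JOINER;Cf
    302A;IDEOGRAPHIC LEVEL TONE MARK;Mn
    """


    def apply_zero_width_categories(widths: bytearray, lines) -> None:
        for line in lines:
            fields = line.strip().split(";")
            if len(fields) < 3:
                continue
            category = fields[2]
            if category.startswith("M") or category in ("Cc", "Cf"):
                widths[int(fields[0], 16)] = 0  # overrides any earlier width


    widths = bytearray([1]) * 0x110000
    widths[0x302A] = 2  # pretend the East Asian pass marked this mark as wide
    apply_zero_width_categories(widths, SAMPLE.splitlines())

``U+302A`` illustrates the override: it is a nonspacing mark, so it ends at width ``0`` even though it started this pass at ``2``.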

Step 4: Compress into Contiguous Ranges
---------------------------------------

After the per-code-point byte array is complete,
``compress_widths()`` scans it from ``0`` to ``0x10FFFF`` and groups adjacent entries
that share the same width.

Each output tuple contains:

* the begin code point
* the end code point
* the width

This turns the dense byte array into a compact range representation.
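
The grouping can be sketched as a single linear scan.
The body below is an illustration written for this page, not the script's actual code,
and it is shown on a tiny array instead of the full ``0x110000`` entries.

.. code-block:: python

    def compress_widths(widths: bytes) -> list[tuple[int, int, int]]:
        ranges = []
        begin = 0
        for index in range(1, len(widths) + 1):
            # Close the current run at the end of the data or when the width changes.
            if index == len(widths) or widths[index] != widths[begin]:
                ranges.append((begin, index - 1, widths[begin]))
                begin = index
        return ranges


    sample = bytearray([0, 0, 1, 1, 1, 2, 2, 0])
    ranges = compress_widths(sample)
    # ranges == [(0, 1, 0), (2, 4, 1), (5, 6, 2), (7, 7, 0)]

Because each run starts exactly where the previous one ended, the resulting ranges are contiguous and sorted,
which is what the binary search at runtime relies on.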

Step 5: Emit the Generated C++ Table
------------------------------------

``write_output()`` writes :file:`src/erbsland/cterm/impl/UnicodeWidthTable.cpp`.
The generated file contains:

* a ``static constexpr std::array<CharWidth, N>``
* one ``CharWidth{begin, end, width}`` entry per compressed range
* the ``charWidthTable()`` accessor returning the array as a span

Because the generator emits ordinary C++ source,
the runtime code has no parser, allocator, or file-I/O dependency for width lookup.
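
As a hedged illustration, emitting such a table from Python could look like the sketch below.
The function name ``render_table``, the array name ``cTable``, and the hexadecimal formatting are assumptions;
only the ``CharWidth{begin, end, width}`` entry shape and the ``std::array`` type come from the description above.

.. code-block:: python

    def render_table(ranges) -> str:
        # One C++ initializer line per compressed range.
        lines = [f"static constexpr std::array<CharWidth, {len(ranges)}> cTable = {{"]
        for begin, end, width in ranges:
            lines.append(f"    CharWidth{{0x{begin:06X}, 0x{end:06X}, {width}}},")
        lines.append("};")
        return "\n".join(lines) + "\n"


    text = render_table([(0x0000, 0x001F, 0), (0x0020, 0x007E, 1)])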

How This Integrates with ``Char`` and ``String``
================================================

The width table is only useful because the higher-level text types are built around it.

``String`` Construction
-----------------------

``String::appendCodePoint()`` uses the width system when raw Unicode input is converted into terminal characters.

The rules are:

* TAB and newline remain standalone ``Char`` values.
* Other control characters are ignored.
* Width-0 code points are attached to the previous ``Char`` via ``withCombining()`` if one exists.
* A leading width-0 code point is dropped because there is no base character it could modify.
* Width-1 and width-2 code points become their own ``Char``.

This design means ``String`` is already normalized into renderable terminal cells before the layout code sees it.
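
The rules above can be modeled in Python as follows.
This is a sketch under stated assumptions: each ``Char`` is represented as a plain list of code points,
``approx_width`` is a stand-in for the generated table built on Python's ``unicodedata`` module
(which may reflect a different Unicode version than the checked-in snapshot),
and the limit of two combining code points per ``Char`` is not modeled.

.. code-block:: python

    import unicodedata


    def approx_width(code_point: int) -> int:
        # Stand-in for the generated table, following the same three rules.
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category.startswith("M") or category in ("Cc", "Cf"):
            return 0
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            return 2
        return 1


    def append_code_point(chars: list, code_point: int) -> None:
        if code_point in (0x09, 0x0A):
            chars.append([code_point])       # TAB and newline stay standalone
        elif unicodedata.category(chr(code_point)) == "Cc":
            return                           # other control characters are ignored
        elif approx_width(code_point) == 0:
            if chars:
                chars[-1].append(code_point)  # attach to the previous character
            # a leading width-0 code point is dropped
        else:
            chars.append([code_point])       # width 1 or 2: new character


    def build_chars(text: str) -> list:
        chars: list = []
        for ch in text:
            append_code_point(chars, ord(ch))
        return chars


    result = build_chars("\u0301e\u0301\x07\t\u4e16")
    # result == [[0x65, 0x301], [0x09], [0x4E16]]

The leading ``U+0301`` and the bell character disappear, the second ``U+0301`` attaches to ``e``,
and the TAB and the wide character each become their own entry.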

``Char`` Width Calculation
--------------------------

``Char`` stores up to three code points:

* one base code point
* up to two attached combining code points

``Char::displayWidth()`` sums the width of those stored code points and caches the result.
In the normal case, that produces:

* ``1`` for an ordinary character
* ``2`` for a wide character
* ``1`` or ``2`` for a base character with attached combining marks, matching the width of the base alone
* ``0`` only for explicitly constructed zero-width characters

Since combining marks are expected to have width ``0``,
attaching them normally does not change the width of the base character.
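
The summation can be illustrated with a small Python model.
The ``Char`` class here is a hypothetical stand-in, not the library's C++ type,
``approx_width`` approximates the generated table with Python's ``unicodedata`` module,
and the caching mentioned above is left out for brevity.

.. code-block:: python

    import unicodedata
    from dataclasses import dataclass


    def approx_width(code_point: int) -> int:
        # Stand-in for the generated table, following the same three rules.
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category.startswith("M") or category in ("Cc", "Cf"):
            return 0
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            return 2
        return 1


    @dataclass(frozen=True)
    class Char:
        base: int
        combining: tuple = ()  # up to two attached code points in the library

        def display_width(self) -> int:
            # Sum over the base and every attached code point.
            return sum(approx_width(cp) for cp in (self.base, *self.combining))


    plain = Char(0x41)                  # "A"
    wide = Char(0x4E16)                 # CJK ideograph
    accented = Char(0x65, (0x0301,))    # "e" with combining acute accent
    joiner = Char(0x200D)               # explicitly constructed zero-width character

In this model the four values report widths of ``1``, ``2``, ``1``, and ``0`` respectively.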

``calculateDisplayWidth()``
---------------------------

The helper ``calculateDisplayWidth(std::string_view)`` provides a lower-level UTF-8 path.
It decodes raw UTF-8 with ``U8Buffer`` and sums ``consoleCharacterWidth()`` for each decoded code point directly.

This function is useful when you need width information for raw text
without first constructing a full ``String`` object.
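
A rough Python equivalent of this direct path is sketched below.
It is not the library's implementation: Python's strict UTF-8 decoder replaces ``U8Buffer``,
so invalid input raises an exception here, which may differ from what the C++ code does,
and ``approx_width`` approximates the generated table with the ``unicodedata`` module.

.. code-block:: python

    import unicodedata


    def approx_width(code_point: int) -> int:
        # Stand-in for the generated table, following the same three rules.
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category.startswith("M") or category in ("Cc", "Cf"):
            return 0
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            return 2
        return 1


    def calculate_display_width(data: bytes) -> int:
        # Decode, then sum the width of every code point without grouping.
        return sum(approx_width(ord(ch)) for ch in data.decode("utf-8"))


    mixed = calculate_display_width("Hi \u4e16\u754c".encode("utf-8"))
    accented = calculate_display_width("e\u0301".encode("utf-8"))

Here ``mixed`` is ``7`` (three narrow characters plus two wide ones) and ``accented`` is ``1``.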

Design Rationales
=================

Several aspects of the current system are deliberate simplifications.

Only Three Width Classes
------------------------

The terminal model used here only distinguishes widths ``0``, ``1``, and ``2``.
That matches the practical needs of terminal-cell layout.

The implementation does not attempt to model:

* fractional widths
* fonts with unusual glyph metrics
* terminal-specific rendering quirks beyond the basic narrow-versus-wide distinction

Ambiguous Width Defaults to Narrow
----------------------------------

Only East Asian width classes ``W`` and ``F`` become width ``2``.
Ambiguous characters remain width ``1``.

This keeps behavior stable across non-CJK terminals and avoids locale-dependent layout changes.
The tradeoff is that some terminals configured for East Asian ambiguous-width behavior may render a few characters
wider than the library predicts.

No Runtime Terminal Probing
---------------------------

The library does not ask the terminal emulator which Unicode version or width rules it uses.
That would make rendering behavior highly host-dependent and would complicate reproducibility.

Instead, the library chooses a fixed, reviewed width table and applies it consistently.

Zero-Width Format Characters Are Treated as Non-Rendering
---------------------------------------------------------

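The ``Cf`` category is forced to width ``0``.
That includes characters such as the zero-width joiner (``U+200D``) and the zero-width non-joiner (``U+200C``).
Variation selectors are also width ``0``, but they belong to the mark category ``Mn`` rather than ``Cf``.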

The rationale is straightforward: formatting code points should not consume visible columns in terminal layout.
The tradeoff is that some terminal emulators interpret certain format characters as part of emoji shaping
or other grapheme-cluster behavior that this library does not model fully.

Limitations
===========

If you work on advanced Unicode behavior, keep these current limits in mind.

No Full Grapheme-Cluster Engine
-------------------------------

The library does not implement Unicode grapheme-cluster segmentation.
It uses a simpler model based on one base character plus up to two zero-width followers.

Consequences include:

* some complex emoji sequences are not treated as one visual unit
* zero-width joiner sequences are not shaped as a single cluster by the layout code
* characters requiring more than three code points cannot be stored in one ``Char``

This is a conscious tradeoff in favor of predictable, lightweight terminal rendering.
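
The effect on a zero-width joiner sequence can be seen with a per-code-point sum.
The sketch below approximates the generated table with Python's ``unicodedata`` module,
so it is an illustration of the model rather than output from the library itself.

.. code-block:: python

    import unicodedata


    def approx_width(code_point: int) -> int:
        # Stand-in for the generated table, following the same three rules.
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category.startswith("M") or category in ("Cc", "Cf"):
            return 0
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            return 2
        return 1


    # MAN, ZWJ, WOMAN, ZWJ, GIRL: a sequence often shown as one family emoji.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
    per_code_point = [approx_width(ord(ch)) for ch in family]
    total = sum(per_code_point)
    # per_code_point == [2, 0, 2, 0, 2], total == 6

Under this model the sequence is counted as three wide characters,
whereas a terminal that shapes it into a single glyph would typically use fewer cells.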

Leading Combining Marks Are Dropped
-----------------------------------

When a width-0 code point appears at the beginning of a string,
``String::appendCodePoint()`` drops it instead of storing a detached combining mark.

That keeps the internal text model renderable in cell terms,
but it also means malformed or intentionally unusual Unicode sequences are normalized aggressively.

Unassigned Code Points Default to Width 1
-----------------------------------------

Because the generator starts with width ``1`` everywhere,
currently unassigned code points are treated as narrow by default.

This is pragmatic:

* it keeps the table complete and explicit
* it avoids a separate "unknown width" state
* it keeps the runtime code simple

The tradeoff is that future Unicode assignments are only reflected after updating the checked-in data files
and regenerating the table.

Updating the Table
==================

When you want to refresh the Unicode data,
the intended workflow is:

#. Replace the checked-in UCD snapshot files in :file:`utilities/data/unicode`.
#. Run :file:`utilities/generate_unicode_width_table.py`.
#. Review the generated changes in :file:`src/erbsland/cterm/impl/UnicodeWidthTable.cpp`.
#. Re-run the relevant tests and documentation checks.

Keeping the generator simple and the inputs checked in makes this update path easy to audit.