utf8proc
C library for processing UTF-8 Unicode data
utf8proc_property_struct Struct Reference

#include <utf8proc.h>

Data Fields

utf8proc_propval_t category
 
utf8proc_propval_t combining_class
 
utf8proc_propval_t bidi_class
 
utf8proc_propval_t decomp_type
 
utf8proc_uint16_t decomp_seqindex
 
utf8proc_uint16_t casefold_seqindex
 
utf8proc_uint16_t uppercase_seqindex
 
utf8proc_uint16_t lowercase_seqindex
 
utf8proc_uint16_t titlecase_seqindex
 
utf8proc_uint16_t comb_index:10
 
utf8proc_uint16_t comb_length:5
 
utf8proc_uint16_t comb_issecond:1
 
unsigned bidi_mirrored:1
 
unsigned comp_exclusion:1
 
unsigned ignorable:1
 
unsigned control_boundary:1
 
unsigned charwidth:2
 
unsigned ambiguous_width:1
 
unsigned pad:1
 
unsigned boundclass:6
 
unsigned indic_conjunct_break:2
 

Detailed Description

Struct containing information about a codepoint.

Field Documentation

◆ ambiguous_width

unsigned utf8proc_property_struct::ambiguous_width

East Asian width class A

◆ bidi_class

utf8proc_propval_t utf8proc_property_struct::bidi_class

Bidirectional class.

See also
utf8proc_bidi_class_t.

◆ boundclass

unsigned utf8proc_property_struct::boundclass

Boundary class, used for grapheme cluster segmentation.

See also
utf8proc_boundclass_t.

◆ category

utf8proc_propval_t utf8proc_property_struct::category

Unicode category.

See also
utf8proc_category_t.

◆ charwidth

unsigned utf8proc_property_struct::charwidth

The width of the codepoint.

◆ comb_index

utf8proc_uint16_t utf8proc_property_struct::comb_index

Character combining table.

The character combining table is formally indexed by two characters: the first and second character that might form a combining pair. The table entry then contains the combined character. Most character pairs cannot be combined. There are about 1,000 characters that can be the first character in a combining pair, and for most of them there are only a handful of possible second characters.

The combining table is stored as a sparse matrix in the CSR (compressed sparse row) format. That is, it is stored as two arrays, utf8proc_uint32_t utf8proc_combinations_second[] and utf8proc_uint32_t utf8proc_combinations_combined[]. These contain the second combining character and the combined character of every combining pair.

  • comb_index: Index into the combining table if this character is the first character in a combining pair, else 0x3ff
  • comb_length: Number of table entries for this first character
  • comb_issecond: As an optimization, we also record whether this character is the second combining character in any pair. If not, we can skip the table lookup.

A table lookup starts from a given character pair. It first checks whether the first character is stored in the table (that is, whether its comb_index is not 0x3ff) and whether the second character appears in the table (by looking at comb_issecond). If both checks pass, the comb_length table entries starting at comb_index are checked sequentially for a match.
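
The lookup described above can be sketched on a toy table. This is only an illustration under stated assumptions, not the library's implementation: the names toy_property, toy_second, toy_combined, toy_get_property and toy_combine are hypothetical, the table holds just three pairs, and plain unsigned bit-fields stand in for the real field types.

```c
#include <assert.h>
#include <stdint.h>

#define COMB_NONE 0x3ff  /* largest 10-bit value: "not a first character" */
#define TOY_TABLE_LEN 3

/* Hypothetical cut-down stand-in for the three comb_* fields. */
typedef struct {
    unsigned comb_index    : 10;
    unsigned comb_length   : 5;
    unsigned comb_issecond : 1;
} toy_property;

/* CSR-style storage: one row per first character, rows laid end to end.
 * Row for 'A' is entries 0..1, row for 'e' is entry 2. */
static const uint32_t toy_second[TOY_TABLE_LEN]   = { 0x0300, 0x0301, 0x0301 };
static const uint32_t toy_combined[TOY_TABLE_LEN] = { 0x00C0, 0x00C1, 0x00E9 };

/* Toy property lookup covering only the code points used above. */
static toy_property toy_get_property(uint32_t cp)
{
    toy_property p = { COMB_NONE, 0, 0 };
    switch (cp) {
    case 0x0041: p.comb_index = 0; p.comb_length = 2; break; /* 'A' */
    case 0x0065: p.comb_index = 2; p.comb_length = 1; break; /* 'e' */
    case 0x0300:                                             /* grave */
    case 0x0301: p.comb_issecond = 1; break;                 /* acute */
    default: break;
    }
    return p;
}

/* Returns the combined code point, or -1 if the pair does not combine. */
static int32_t toy_combine(uint32_t first, uint32_t second)
{
    toy_property p1 = toy_get_property(first);
    toy_property p2 = toy_get_property(second);

    /* Cheap rejections: first is not in the table, or second never
     * appears as a second character, so no scan is needed. */
    if (p1.comb_index == COMB_NONE || !p2.comb_issecond)
        return -1;

    /* Scan the comb_length entries of this first character's row. */
    for (unsigned i = 0; i < p1.comb_length; i++) {
        unsigned k = p1.comb_index + i;
        assert(k < TOY_TABLE_LEN);
        if (toy_second[k] == second)
            return (int32_t)toy_combined[k];
    }
    return -1;
}
```

With these toy tables, toy_combine(0x0041, 0x0301) yields 0x00C1, while toy_combine(0x0042, 0x0301) and toy_combine(0x0065, 0x0300) both yield -1: the first is rejected by the 0x3ff check, the second by an unsuccessful scan of a one-entry row.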

◆ decomp_type

utf8proc_propval_t utf8proc_property_struct::decomp_type

Decomposition type.

See also
utf8proc_decomp_type_t.

◆ ignorable

unsigned utf8proc_property_struct::ignorable

Can this codepoint be ignored?

Used by utf8proc_decompose_char() when UTF8PROC_IGNORE is passed as an option.


The documentation for this struct was generated from the following file: