utf8proc
C library for processing UTF-8 Unicode data
#include <utf8proc.h>
Data Fields | |
utf8proc_propval_t | category |
utf8proc_propval_t | combining_class |
utf8proc_propval_t | bidi_class |
utf8proc_propval_t | decomp_type |
utf8proc_uint16_t | decomp_seqindex |
utf8proc_uint16_t | casefold_seqindex |
utf8proc_uint16_t | uppercase_seqindex |
utf8proc_uint16_t | lowercase_seqindex |
utf8proc_uint16_t | titlecase_seqindex |
utf8proc_uint16_t | comb_index:10 |
utf8proc_uint16_t | comb_length:5 |
utf8proc_uint16_t | comb_issecond:1 |
unsigned | bidi_mirrored:1 |
unsigned | comp_exclusion:1 |
unsigned | ignorable:1 |
unsigned | control_boundary:1 |
unsigned | charwidth:2 |
unsigned | ambiguous_width:1 |
unsigned | pad:1 |
unsigned | boundclass:6 |
unsigned | indic_conjunct_break:2 |
Struct containing information about a codepoint.
unsigned utf8proc_property_struct::ambiguous_width |
East Asian width class A
utf8proc_propval_t utf8proc_property_struct::bidi_class |
Bidirectional class.
unsigned utf8proc_property_struct::boundclass |
Boundclass.
utf8proc_propval_t utf8proc_property_struct::category |
Unicode category.
unsigned utf8proc_property_struct::charwidth |
The width of the codepoint.
utf8proc_uint16_t utf8proc_property_struct::comb_index |
Character combining table.
The character combining table is formally indexed by two characters: the first and second character that might form a combining pair. The table entry then contains the combined character. Most character pairs cannot be combined. There are about 1,000 characters that can be the first character in a combining pair, and for most of them there are only a handful of possible second characters.
The combining table is stored as a sparse matrix in the CSR (compressed sparse row) format. That is, it is stored as two arrays, utf8proc_uint32_t utf8proc_combinations_second[] and utf8proc_uint32_t utf8proc_combinations_combined[]. These contain the second combining character and the combined character of every combining pair.
comb_index: Index into the combining table if this character is the first character in a combining pair, else 0x3ff.
comb_length: Number of table entries for this first character.
comb_issecond: As an optimization, we also record whether this character is the second combining character in any pair. If not, we can skip the table lookup.
A table lookup starts from a given character pair. It first checks whether the first character is stored in the table (its comb_index is not 0x3ff) and whether the second character is stored in the table (its comb_issecond is set). If so, the comb_length table entries will be checked sequentially for a match.
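The lookup described above can be sketched in C as follows. This is a minimal illustration, not the library's code: the toy_* names, the trimmed-down property record, and the five-entry tables are all hypothetical, and the tables hold only a tiny excerpt of the real combining data.

```c
#include <stdint.h>

/* Sentinel from the text: comb_index is 0x3ff when the character
   never appears as the first character of a combining pair. */
#define TOY_NO_COMB_INDEX 0x3ff

/* Hypothetical record holding only the three fields used by the lookup. */
typedef struct {
    unsigned comb_index    : 10;
    unsigned comb_length   : 5;
    unsigned comb_issecond : 1;
} toy_property_t;

/* Toy CSR storage: each first character owns a contiguous row of entries.
   Row for 'A' is [0,2), row for 'E' is [2,4), row for 'O' is [4,5). */
static const uint32_t toy_second[]   = {0x0300, 0x0301, 0x0300, 0x0301, 0x030B};
static const uint32_t toy_combined[] = {0x00C0, 0x00C1, 0x00C8, 0x00C9, 0x0150};

/* Hypothetical stand-in for a per-codepoint property lookup. */
static toy_property_t toy_get_property(int32_t cp) {
    switch (cp) {
    case 0x0041: return (toy_property_t){0, 2, 0};  /* A */
    case 0x0045: return (toy_property_t){2, 2, 0};  /* E */
    case 0x004F: return (toy_property_t){4, 1, 0};  /* O */
    case 0x0300:                                    /* combining grave */
    case 0x0301:                                    /* combining acute */
    case 0x030B:                                    /* combining double acute */
        return (toy_property_t){TOY_NO_COMB_INDEX, 0, 1};
    default:
        return (toy_property_t){TOY_NO_COMB_INDEX, 0, 0};
    }
}

/* Returns the combined codepoint, or -1 if the pair does not combine. */
int32_t toy_compose(int32_t first, int32_t second) {
    toy_property_t p1 = toy_get_property(first);
    toy_property_t p2 = toy_get_property(second);

    /* Cheap rejections: first has no row, or second is never a second character. */
    if (p1.comb_index == TOY_NO_COMB_INDEX || !p2.comb_issecond)
        return -1;

    /* Scan the comb_length entries of the first character's row. */
    for (unsigned i = 0; i < p1.comb_length; i++) {
        unsigned idx = p1.comb_index + i;
        if (toy_second[idx] == (uint32_t)second)
            return (int32_t)toy_combined[idx];
    }
    return -1;
}
```

With these toy tables, toy_compose(0x0041, 0x0300) yields 0x00C0, while toy_compose(0x0041, 0x030B) passes both quick checks, scans the row for 'A', and returns -1 because no entry matches. Rows are short, so the sequential scan stays cheap.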
utf8proc_propval_t utf8proc_property_struct::decomp_type |
unsigned utf8proc_property_struct::ignorable |
Can this codepoint be ignored?
Used by utf8proc_decompose_char() when UTF8PROC_IGNORE is passed as an option.