mirror of
https://https.git.savannah.gnu.org/git/gnulib.git
synced 2026-04-28 06:33:36 +00:00
1125 lines
28 KiB
Plaintext
@node Strings and Characters
@chapter Strings and Characters

@c Copyright (C) 2009--2026 Free Software Foundation, Inc.

@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3 or
@c any later version published by the Free Software Foundation; with no
@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
@c copy of the license is at <https://www.gnu.org/licenses/fdl-1.3.en.html>.

@c Written by Bruno Haible.

This chapter describes the APIs for strings and characters, provided by Gnulib.

@menu
* Strings::
* Characters::
@end menu

@node Strings
@section Strings

Several representations exist for strings in the memory of a running
C program.

@menu
* C strings::
* Iterating through strings::
* Strings with NUL characters::
* String Functions in C Locale::
* Comparison of string APIs::
@end menu

@node C strings
@subsection The C string representation

The classical representation of a string in C is a sequence of
characters, where each character takes up one or more bytes, followed by
a terminating NUL byte. This representation is used for strings that
are passed by the operating system (in the @code{argv} argument of
@code{main}, for example) and for strings that are passed to the
operating system (in system calls such as @code{open}). The C type to
hold such strings is @samp{char *} or, in places where the string shall
not be modified, @samp{const char *}. There are many C library
functions, standardized by ISO C and POSIX, that assume this
representation of strings.

A @emph{character encoding}, or @emph{encoding} for short, describes
how the elements of a character set are represented as a sequence of
bytes. For example, in the @code{ASCII} encoding, the UNDERSCORE
character is represented by a single byte, with value 0x5F. As another
example, the COPYRIGHT SIGN character is represented:
@itemize
@item
in the @code{ISO-8859-1} encoding, by the single byte 0xA9,
@item
in the @code{UTF-8} encoding, by the two bytes 0xC2 0xA9,
@item
in the @code{GB18030} encoding, by the four bytes 0x81 0x30 0x84 0x38.
@end itemize

@noindent
Note: The @samp{char} type may be signed or unsigned, depending on the
platform. When we talk about the ``byte 0xA9'' we actually mean the
@code{char} object whose value is @code{(char) 0xA9}; we omit the cast
to @code{char} in this documentation, for brevity.

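The note about the signedness of @code{char} matters as soon as code compares a byte with a numeric value. On a platform where @code{char} is signed and 8 bits wide, @code{(char) 0xA9} has the value -87, so a plain comparison with @code{0xA9} is false. The following is a minimal sketch of a portable comparison; the helper name @code{byte_has_value} is hypothetical and not part of Gnulib.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if the byte C has the numeric value VALUE (0..255).
   Converting to unsigned char first makes the comparison independent
   of whether plain char is signed on the platform.  */
static bool
byte_has_value (char c, unsigned int value)
{
  assert (value <= 0xFF);
  return (unsigned char) c == value;
}
```

With this helper, @code{byte_has_value (s[0], 0xA9)} gives the same answer whether @code{char} is signed or unsigned.
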
In POSIX, the character encoding is determined by the locale. The
locale is an attribute of the environment that the user can choose.

Depending on the encoding, every character is in general represented by
one or more bytes (up to 4 bytes in practice, but
use @code{MB_LEN_MAX} instead of the number 4 in code).
@cindex unibyte locale
@cindex multibyte locale
When every character is represented by only 1 byte, we speak of a
``unibyte locale'', otherwise of a ``multibyte locale''.

It is important to realize that the majority of Unix installations
nowadays use UTF-8 as locale encoding; therefore, the
majority of users are using multibyte locales.

Four important facts to remember are:

@cartouche
@emph{A @samp{char} is a byte, not a character.}
@end cartouche

As a consequence:
@itemize @bullet
@item
The @posixheader{ctype.h} API, which was designed only with unibyte
encodings in mind, is useless nowadays for general text processing; it
does not work in multibyte locales.
@item
The @posixfunc{strlen} function does not return the number of characters
in a string. Nor does it return the number of screen columns occupied
by a string after it is output. It merely returns the number of
@emph{bytes} occupied by a string.
@item
Truncating a string, for example, with @posixfunc{strncpy}, can have the
effect of truncating it in the middle of a multibyte character. Such
a string will, when output, have a garbled character at its end, often
represented by a hollow box.
@end itemize

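The consequences listed above can be observed on a small example. The following sketch assumes that the string is valid UTF-8, which is a simplifying assumption made only for this illustration; the Gnulib functions discussed later in this section work with the locale's encoding instead. The helper names @code{utf8_count_chars} and @code{utf8_truncation_point} are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if B is a UTF-8 continuation byte (binary 10xxxxxx).  */
static bool
utf8_is_continuation (char b)
{
  return ((unsigned char) b & 0xC0) == 0x80;
}

/* Count the characters in the valid UTF-8 string S by counting the
   bytes that start a character.  */
static size_t
utf8_count_chars (const char *s)
{
  assert (s != NULL);
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (!utf8_is_continuation (*s))
      n++;
  return n;
}

/* Return the largest length <= MAX_BYTES at which the valid UTF-8
   string S can be cut without splitting a character.  */
static size_t
utf8_truncation_point (const char *s, size_t max_bytes)
{
  assert (s != NULL);
  size_t len = strlen (s);
  if (len <= max_bytes)
    return len;
  size_t cut = max_bytes;
  while (cut > 0 && utf8_is_continuation (s[cut]))
    cut--;
  return cut;
}
```

For the string consisting of the letter a followed by COPYRIGHT SIGN (bytes 0x61 0xC2 0xA9), @code{strlen} reports 3 bytes while the string has 2 characters, and a cut at 2 bytes would split the second character, so the sketch moves the cut back to 1 byte.
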
@cartouche
@emph{Multibyte does not imply UTF-8 encoding.}
@end cartouche

While UTF-8 is the most common multibyte encoding, GB18030 is also a
supported locale encoding on GNU systems (mostly because it is a Chinese
government standard, last revised in 2022).

@cartouche
@emph{Searching for a character in a string is not the same as searching
for a byte in the string.}
@end cartouche

Take the above example of COPYRIGHT SIGN in the @code{GB18030} encoding:
A byte search will find the bytes @code{'0'} and @code{'8'} in this
string. But a search for the @emph{character} ``0'' or ``8'' in the string
``@copyright{}'' must, of course, report ``not found''.

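The false match described above can be reproduced without any locale setup, because @code{strchr} only looks at bytes. The following sketch stores the GB18030 bytes of COPYRIGHT SIGN in an array and reports where a byte search finds a digit; the names @code{copyright_gb18030} and @code{byte_search_offset} are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The COPYRIGHT SIGN character encoded in GB18030, as a C string.  */
static const char copyright_gb18030[] = "\x81\x30\x84\x38";

/* Return the byte offset at which strchr finds the byte C in S,
   or -1 if the byte does not occur.  */
static ptrdiff_t
byte_search_offset (const char *s, char c)
{
  assert (s != NULL);
  const char *p = strchr (s, c);
  return p != NULL ? p - s : -1;
}
```

Here the byte search reports a match for @code{'0'} at offset 1 and for @code{'8'} at offset 3, although the string contains a single character that is neither of these digits.
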
As a consequence:
@itemize @bullet
@item
@posixfunc{strchr} and @posixfunc{strrchr} do not work with multibyte
strings if the locale encoding is GB18030 and the character to be
searched is a digit.
@item
@posixfunc{strstr} does not work with multibyte strings if the locale
encoding is different from UTF-8.
@item
@posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn} cannot work
correctly in multibyte locales: they assume the second argument is a
list of single-byte characters. Even in this simple case, they do not
work with multibyte strings if the locale encoding is GB18030 and one of
the characters to be searched is a digit.
@item
@posixfunc{strsep} and @posixfunc{strtok_r} do not work with multibyte
strings unless all of the delimiter characters are ASCII characters
< 0x30.
@item
The @posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and
@posixfunc{strcasestr} functions do not work with multibyte strings.
@end itemize

Workarounds can be found in Gnulib, in the form of @code{mbs*} API
functions:
@itemize @bullet
@item
Gnulib has functions @func{mbslen} and @func{mbswidth} that can be used
instead of @posixfunc{strlen} when the number of characters or the
number of screen columns of a string is requested.
@item
Gnulib has functions @func{mbschr} and @func{mbsrchr} that are like
@posixfunc{strchr} and @posixfunc{strrchr}, but work in multibyte
locales.
@item
Gnulib has a function @func{mbsstr} that is like @posixfunc{strstr}, but
works in multibyte locales.
@item
Gnulib has functions @func{mbscspn}, @func{mbspbrk}, @func{mbsspn} that
are like @posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn},
but work in multibyte locales.
@item
Gnulib has functions @func{mbssep} and @func{mbstok_r} that are like
@posixfunc{strsep} and @posixfunc{strtok_r} but work in multibyte
locales.
@item
Gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
@func{mbspcasecmp}, and @func{mbscasestr} that are like
@posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and
@posixfunc{strcasestr}, but work in multibyte locales. Still, the
function @code{ulc_casecmp} is preferable to these functions.
@end itemize

@cartouche
@emph{A C string can contain encoding errors.}
@end cartouche

Not every NUL-terminated byte sequence represents a valid multibyte
string. Byte sequences can contain encoding errors, that is, bytes or
byte sequences that are invalid and do not represent characters.

String functions like @code{mbscasecmp} and @code{strcoll} whose
behavior depends on encoding have unspecified behavior on strings
containing encoding errors, unless the behavior is specifically
documented. If an application needs a particular behavior on these
strings it can iterate through them itself, as described in the next
subsection.

@node Iterating through strings
@subsection Iterating through strings

For complex string processing, string functions may not be enough, and
you need to iterate through a string while processing each (possibly
multibyte) character or encoding error in turn. Gnulib has several
modules for iterating forward through a string in this way. Backward
iteration, that is, from the string's end to start, is not provided,
as it is too hairy in general.

@mindex mbiter
@mindex mbiterf
@mindex mbuiter
@mindex mbuiterf
@mindex mcel
@itemize
@item
The @code{mbiter} module iterates through a string whose length
is already known. The string can contain NULs and encoding errors.
@item
The @code{mbiterf} module is like @code{mbiter}
except it is more complex and typically faster.
@item
The @code{mbuiter} module iterates through a C string whose length
is not known a priori. The string can contain encoding errors and is
terminated by the first NUL.
@item
The @code{mbuiterf} module is like @code{mbuiter}
except it is more complex and typically faster.
@item
The @code{mcel} module is simpler than @code{mbiter} and @code{mbuiter}
and can be faster than even @code{mbiterf} and @code{mbuiterf}.
It can iterate through either strings whose length is known, or
C strings, or strings terminated by other ASCII characters < 0x30.
@item
The @code{mcel-prefer} module is like @code{mcel} except that it
causes some other modules to be based on @code{mcel} instead of
on the @code{mbiter} family.
@end itemize

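To make the idea of forward iteration concrete, here is a minimal sketch that uses only the standard function @code{mbrtowc}, not the Gnulib modules listed above, whose interfaces are not shown in this chapter. It walks through a C string in the current locale's encoding and counts characters and encoding errors, treating each offending byte as one encoding error. The names @code{scan_result} and @code{scan_string} are hypothetical.

```c
#include <assert.h>
#include <locale.h>   /* setlocale, which the caller uses to pick the locale */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Result of scanning a C string: the number of characters and the
   number of bytes that are encoding errors.  */
struct scan_result
{
  size_t characters;
  size_t error_bytes;
};

/* Iterate forward through the C string S in the current locale's
   encoding.  Each byte that does not start a valid character is
   counted as one encoding error.  */
static struct scan_result
scan_string (const char *s)
{
  assert (s != NULL);
  struct scan_result r = { 0, 0 };
  size_t remaining = strlen (s);
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: skip one byte and
             restart from the initial conversion state.  */
          r.error_bytes++;
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else
        {
          if (n == 0)
            n = 1;  /* defensive: no NUL occurs within strlen (s) bytes */
          r.characters++;
        }
      s += n;
      remaining -= n;
    }
  return r;
}
```

A program would typically call @code{setlocale (LC_ALL, "")} once at startup before using such a loop, so that the user's locale encoding is in effect.
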
The choice of modules depends on the application's needs. The
@code{mbiter} module family is more suitable for applications that
treat some sequences of two or more bytes as a single encoding error,
and for applications that need to support obsolescent encodings on
non-GNU platforms, such as CP864, EBCDIC, Johab, and Shift JIS.
In this module family, @code{mbuiter} and @code{mbuiterf} are more
suitable than @code{mbiter} and @code{mbiterf} when arguments are C strings,
lengths are not already known, and it is highly likely that only the
first few multibyte characters need to be inspected.

The @code{mcel} module is simpler and can be faster than the
@code{mbiter} family, and is more suitable for applications that do
not need the @code{mbiter} family's special features.

@mindex mcel-prefer
The @code{mcel-prefer} module is like @code{mcel} except that it also
causes some other modules, such as @code{mbscasecmp}, to use
@code{mcel} rather than the @code{mbiter} family. This can be simpler
and faster. However, it does not support the obsolescent encodings,
and it may behave differently on data containing encoding errors where
behavior is unspecified or undefined, because in @code{mcel} each
encoding error is a single byte whereas in the @code{mbiter} family a
single encoding error can contain two or more bytes.

If a package uses @code{mcel-prefer}, it may also want to give
@command{gnulib-tool} one or more of the options
@option{--avoid=mbiter}, @option{--avoid=mbiterf},
@option{--avoid=mbuiter} and @option{--avoid=mbuiterf},
to avoid packaging modules that are not needed.

@node Strings with NUL characters
@subsection Strings with NUL characters

The GNU Coding Standards, section
@ifinfo
@ref{Semantics,,Writing Robust Programs,standards},
@end ifinfo
@ifnotinfo
@url{https://www.gnu.org/prep/standards/html_node/Semantics.html},
@end ifnotinfo
specifies:
@cartouche
Utilities reading files should not drop NUL characters, or any other
nonprinting characters.
@end cartouche

When it is a requirement to store NUL characters in strings, a variant
of C strings is needed. Gnulib offers a ``string descriptor'' type
for this purpose. See @ref{Handling strings with NUL characters}.

All remarks regarding encodings and multibyte characters in the
preceding subsections apply to string descriptors as well.

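The reason a different representation is needed can be sketched in a few lines of plain C. The type below pairs a pointer with an explicit length, so that NUL bytes are ordinary content. It is a hypothetical illustration only: the name @code{byte_span} and its functions are assumptions of this sketch and are not Gnulib's string descriptor API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A hypothetical string-with-length type, for illustration only;
   it is not Gnulib's string descriptor type.  */
struct byte_span
{
  const char *data;
  size_t length;
};

/* Build a span from a pointer and an explicit byte count.  */
static struct byte_span
byte_span_make (const char *data, size_t length)
{
  assert (data != NULL || length == 0);
  struct byte_span s = { data, length };
  return s;
}

/* Compare two spans for equality, including any embedded NUL bytes.  */
static bool
byte_span_equal (struct byte_span a, struct byte_span b)
{
  return a.length == b.length
         && (a.length == 0 || memcmp (a.data, b.data, a.length) == 0);
}
```

Two five-byte buffers that differ only after an embedded NUL compare as equal with @code{strcmp}, which stops at the first NUL, but as different with the length-aware comparison.
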
@include c-locale.texi

@node Comparison of string APIs
@subsection Comparison of string APIs

This table summarizes the API functions available for strings, in POSIX
and in Gnulib.

@mindex c-strtod
@mindex c-strtold
@mindex c32snrtombs
@mindex c32srtombs
@mindex c32stombs
@mindex c32swidth
@mindex mbscasecmp
@mindex mbscasestr
@mindex mbschr
@mindex mbscspn
@mindex mbslen
@mindex mbsncasecmp
@mindex mbsnlen
@mindex mbsnrtoc32s
@mindex mbsnrtowcs
@mindex mbspbrk
@mindex mbspcasecmp
@mindex mbsrchr
@mindex mbsrtoc32s
@mindex mbsrtowcs
@mindex mbssep
@mindex mbsspn
@mindex mbsstr
@mindex mbstoc32s
@mindex mbstok_r
@mindex mbstowcs
@mindex mbswidth
@mindex mbs_endswith
@mindex mbs_startswith
@mindex stpcpy
@mindex stpncpy
@mindex strcasecmp
@mindex strcasestr
@mindex strcspn
@mindex strdup
@mindex string-desc
@mindex strncasecmp
@mindex strncat
@mindex strndup
@mindex strnlen
@mindex strpbrk
@mindex strsep
@mindex strstr
@mindex strtod
@mindex strtof
@mindex strtoimax
@mindex strtok_r
@mindex strtol
@mindex strtold
@mindex strtoll
@mindex strtoul
@mindex strtoull
@mindex strtoumax
@mindex str_endswith
@mindex str_startswith
@mindex unicase/u32-casecmp
@mindex unistr/u32-mbsnlen
@mindex unistr/u32-endswith
@mindex unistr/u32-startswith
@mindex unistr/u32-stpcpy
@mindex unistr/u32-stpncpy
@mindex unistr/u32-strcat
@mindex unistr/u32-strchr
@mindex unistr/u32-strcmp
@mindex unistr/u32-strcoll
@mindex unistr/u32-strcpy
@mindex unistr/u32-strcspn
@mindex unistr/u32-strdup
@mindex unistr/u32-strlen
@mindex unistr/u32-strncat
@mindex unistr/u32-strncmp
@mindex unistr/u32-strncpy
@mindex unistr/u32-strnlen
@mindex unistr/u32-strpbrk
@mindex unistr/u32-strrchr
@mindex unistr/u32-strspn
@mindex unistr/u32-strstr
@mindex unistr/u32-strtok
@mindex uniwidth/u32-strwidth
@mindex wcpcpy
@mindex wcpncpy
@mindex wcscasecmp
@mindex wcscat
@mindex wcschr
@mindex wcscmp
@mindex wcscoll
@mindex wcscpy
@mindex wcscspn
@mindex wcsdup
@mindex wcslen
@mindex wcsncasecmp
@mindex wcsncat
@mindex wcsncmp
@mindex wcsncpy
@mindex wcsnlen
@mindex wcsnrtombs
@mindex wcspbrk
@mindex wcsrchr
@mindex wcsrtombs
@mindex wcsspn
@mindex wcsstr
@mindex wcstok
@mindex wcswidth
@mindex wcsxfrm

@multitable @columnfractions .17 .17 .17 .17 .16 .16
@headitem unibyte strings only
@tab assume C locale
@tab multibyte strings
@tab multibyte strings with NULs
@tab wide character strings
@tab 32-bit wide character strings

@item @code{strlen}
@tab @code{strlen}
@tab @code{mbslen}
@tab @code{sd_length}
@tab @code{wcslen}
@tab @code{u32_strlen}

@item @code{strnlen}
@tab @code{strnlen}
@tab @code{mbsnlen}
@tab --
@tab @code{wcsnlen}
@tab @code{u32_strnlen}, @code{u32_mbsnlen}

@item @code{strcmp}
@tab @code{strcmp}
@tab @code{strcmp}
@tab @code{sd_cmp}
@tab @code{wcscmp}
@tab @code{u32_strcmp}

@item @code{strncmp}
@tab @code{strncmp}
@tab @code{strncmp}
@tab --
@tab @code{wcsncmp}
@tab @code{u32_strncmp}

@item @code{strcasecmp}
@tab @code{strcasecmp}
@tab @code{mbscasecmp}
@tab --
@tab @code{wcscasecmp}
@tab @code{u32_casecmp}

@item @code{strncasecmp}
@tab @code{strncasecmp}
@tab @code{mbsncasecmp}, @code{mbspcasecmp}
@tab --
@tab @code{wcsncasecmp}
@tab @code{u32_casecmp}

@item @code{strcoll}
@tab @code{strcmp}
@tab @code{strcoll}
@tab --
@tab @code{wcscoll}
@tab @code{u32_strcoll}

@item @code{strxfrm}
@tab --
@tab @code{strxfrm}
@tab --
@tab @code{wcsxfrm}
@tab --

@item @code{strchr}
@tab @code{strchr}
@tab @code{mbschr}
@tab @code{sd_index}
@tab @code{wcschr}
@tab @code{u32_strchr}

@item @code{strrchr}
@tab @code{strrchr}
@tab @code{mbsrchr}
@tab @code{sd_last_index}
@tab @code{wcsrchr}
@tab @code{u32_strrchr}

@item @code{strstr}
@tab @code{strstr}
@tab @code{mbsstr}
@tab @code{sd_contains}
@tab @code{wcsstr}
@tab @code{u32_strstr}

@item @code{strcasestr}
@tab @code{strcasestr}
@tab @code{mbscasestr}
@tab --
@tab --
@tab --

@item
@c Special hack, to make this table line look reasonable in info mode.
@ifinfo
@code{str_
startswith}
@end ifinfo
@ifnotinfo
@code{str_startswith}
@end ifnotinfo
@tab
@ifinfo
@code{str_
startswith}
@end ifinfo
@ifnotinfo
@code{str_startswith}
@end ifnotinfo
@tab
@ifinfo
@code{mbs_
startswith}
@end ifinfo
@ifnotinfo
@code{mbs_startswith}
@end ifnotinfo
@tab
@ifinfo
@code{sd_
startswith}
@end ifinfo
@ifnotinfo
@code{sd_startswith}
@end ifnotinfo
@tab --
@tab
@ifinfo
@code{u32_
startswith}
@end ifinfo
@ifnotinfo
@code{u32_startswith}
@end ifnotinfo

@item @code{str_endswith}
@tab @code{str_endswith}
@tab @code{mbs_endswith}
@tab @code{sd_endswith}
@tab --
@tab @code{u32_endswith}

@item @code{strspn}
@tab @code{strspn}
@tab @code{mbsspn}
@tab --
@tab @code{wcsspn}
@tab @code{u32_strspn}

@item @code{strcspn}
@tab @code{strcspn}
@tab @code{mbscspn}
@tab --
@tab @code{wcscspn}
@tab @code{u32_strcspn}

@item @code{strpbrk}
@tab @code{strpbrk}
@tab @code{mbspbrk}
@tab --
@tab @code{wcspbrk}
@tab @code{u32_strpbrk}

@item @code{strtok_r}
@tab @code{strtok_r}
@tab @code{mbstok_r}
@tab --
@tab @code{wcstok}
@tab @code{u32_strtok}

@item @code{strsep}
@tab @code{strsep}
@tab @code{mbssep}
@tab --
@tab --
@tab --

@item @code{strcpy}
@tab @code{strcpy}
@tab @code{strcpy}
@tab @code{sd_copy}
@tab @code{wcscpy}
@tab @code{u32_strcpy}

@item @code{stpcpy}
@tab @code{stpcpy}
@tab @code{stpcpy}
@tab --
@tab @code{wcpcpy}
@tab @code{u32_stpcpy}

@item @code{strncpy}
@tab @code{strncpy}
@tab @code{strncpy}
@tab --
@tab @code{wcsncpy}
@tab @code{u32_strncpy}

@item @code{stpncpy}
@tab @code{stpncpy}
@tab @code{stpncpy}
@tab --
@tab @code{wcpncpy}
@tab @code{u32_stpncpy}

@item @code{strcat}
@tab @code{strcat}
@tab @code{strcat}
@tab @code{sd_concat}
@tab @code{wcscat}
@tab @code{u32_strcat}

@item @code{strncat}
@tab @code{strncat}
@tab @code{strncat}
@tab --
@tab @code{wcsncat}
@tab @code{u32_strncat}

@item @code{free}
@tab @code{free}
@tab @code{free}
@tab @code{sd_free}
@tab @code{free}
@tab @code{free}

@item @code{strdup}
@tab @code{strdup}
@tab @code{strdup}
@tab @code{sd_copy}
@tab @code{wcsdup}
@tab @code{u32_strdup}

@item @code{strndup}
@tab @code{strndup}
@tab @code{strndup}
@tab --
@tab --
@tab --

@item @code{mbswidth}
@tab @code{mbswidth}
@tab @code{mbswidth}
@tab --
@tab @code{wcswidth}
@tab @code{c32swidth}, @code{u32_strwidth}

@item @code{strtol}
@tab @code{strtol}
@tab @code{strtol}
@tab --
@tab --
@tab --

@item @code{strtoul}
@tab @code{strtoul}
@tab @code{strtoul}
@tab --
@tab --
@tab --

@item @code{strtoll}
@tab @code{strtoll}
@tab @code{strtoll}
@tab --
@tab --
@tab --

@item @code{strtoull}
@tab @code{strtoull}
@tab @code{strtoull}
@tab --
@tab --
@tab --

@item @code{strtoimax}
@tab @code{strtoimax}
@tab @code{strtoimax}
@tab --
@tab @code{wcstoimax}
@tab --

@item @code{strtoumax}
@tab @code{strtoumax}
@tab @code{strtoumax}
@tab --
@tab @code{wcstoumax}
@tab --

@item @code{strtof}
@tab --
@tab @code{strtof}
@tab --
@tab --
@tab --

@item @code{strtod}
@tab @code{c_strtod}
@tab @code{strtod}
@tab --
@tab --
@tab --

@item @code{strtold}
@tab @code{c_strtold}
@tab @code{strtold}
@tab --
@tab --
@tab --

@item @code{strfromf}
@tab --
@tab @code{strfromf}
@tab --
@tab --
@tab --

@item @code{strfromd}
@tab --
@tab @code{strfromd}
@tab --
@tab --
@tab --

@item @code{strfroml}
@tab --
@tab @code{strfroml}
@tab --
@tab --
@tab --

@item --
@tab --
@tab --
@tab --
@tab @code{mbstowcs}
@tab @code{mbstoc32s}

@item --
@tab --
@tab --
@tab --
@tab @code{mbsrtowcs}
@tab @code{mbsrtoc32s}

@item --
@tab --
@tab --
@tab --
@tab @code{mbsnrtowcs}
@tab @code{mbsnrtoc32s}

@item --
@tab --
@tab --
@tab --
@tab @code{wcstombs}
@tab @code{c32stombs}

@item --
@tab --
@tab --
@tab --
@tab @code{wcsrtombs}
@tab @code{c32srtombs}

@item --
@tab --
@tab --
@tab --
@tab @code{wcsnrtombs}
@tab @code{c32snrtombs}

@end multitable

@node Characters
@section Characters

A @emph{character} is the elementary unit that strings are made of.

What is a character? ``A character is an element of a character set''
is sort of a circular definition, but it highlights the fact that it is
not merely a number. Although many characters are visually represented
by a single glyph, there are characters that, for example, have a
different glyph when used at the end of a word than when used inside a
word. A character is also not the minimal rendered text processing
unit; that is a grapheme cluster and in general consists of one or more
characters. If you want to know more about the concept of character and
various concepts associated with characters, refer to the Unicode
standard.

For the representation in memory of a character, various types have been
in use, and some of them were failures: @code{char} and @code{wchar_t}
were invented for this purpose, but are not the right types.
@code{char32_t} is the right type (successor of @code{wchar_t}); and
@code{mbchar_t} (defined by Gnulib) is an alternative for specific kinds
of processing.

@menu
* The char type::
* The wchar_t type::
* The char32_t type::
* The mbchar_t type::
* Comparison of character APIs::
@end menu

@node The char type
@subsection The @code{char} type

The @code{char} type has been part of the C language since its beginning
in the 1970s, but, due to its limitation of 256 possible values, it is no
longer an adequate type for storing a character.

Technically, it is still adequate in unibyte locales. But since most
locales nowadays are multibyte locales, it makes no sense to write a
program that runs only in unibyte locales.

ISO C and POSIX standardized an API for characters of type @code{char},
in @code{<ctype.h>}. This API is nowadays useless and obsolete, when it
comes to general text processing.

The important lessons to remember are:

@cartouche
@emph{A @samp{char} is just the elementary storage unit for a string,
not a character.}
@end cartouche

@cartouche
@emph{Never use @code{<ctype.h>}!}
@end cartouche

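The first lesson can be illustrated with a per-byte letter test. The sketch below assumes an ASCII-compatible execution character set and UTF-8 input; the helpers @code{ascii_byte_is_letter} and @code{count_letter_bytes} are hypothetical and are not functions of Gnulib or of @code{<ctype.h>}. Because the test sees one byte at a time, it cannot recognize a letter that is stored as two bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte B is an ASCII letter.
   It looks at one byte only and ignores the locale.  */
static bool
ascii_byte_is_letter (char b)
{
  unsigned char u = (unsigned char) b;
  return (u >= 'A' && u <= 'Z') || (u >= 'a' && u <= 'z');
}

/* Count the bytes of the C string S that are ASCII letters.  */
static size_t
count_letter_bytes (const char *s)
{
  assert (s != NULL);
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (ascii_byte_is_letter (*s))
      n++;
  return n;
}
```

For a four-letter word whose last letter is LATIN SMALL LETTER E WITH ACUTE, stored in UTF-8 as the bytes 0xC3 0xA9, the per-byte count is 3: neither of the two bytes of the last letter is a letter by itself.
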
@node The wchar_t type
@subsection The @code{wchar_t} type

The ISO C and POSIX standard creators made an attempt to overcome the
dead end regarding the @code{char} type. They introduced
@itemize @bullet
@item
a type @samp{wchar_t}, designed to encapsulate a character,
@item
a ``wide string'' type @samp{wchar_t *}, with some API functions
declared in @posixheader{wchar.h}, and
@item
functions declared in @posixheader{wctype.h} that were meant to supplant
the ones in @posixheader{ctype.h}.
@end itemize

Unfortunately, this API and its implementation have numerous problems:

@itemize @bullet
@item
On Windows platforms and on AIX in 32-bit mode, @code{wchar_t} is a
16-bit type. This means that it can never accommodate an entire Unicode
character. Either the @code{wchar_t *} strings are limited to
characters in UCS-2 (the ``Basic Multilingual Plane'' of Unicode), or
-- if @code{wchar_t *} strings are encoded in UTF-16 -- a
@code{wchar_t} represents only half of a character in the worst case,
making the @posixheader{wctype.h} functions pointless.

@item
On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
and undocumented. This means that if you want to know any property of a
@code{wchar_t} character other than the properties defined by
@posixheader{wctype.h} (such as whether it's a dash, currency symbol,
paragraph separator, or similar), you have to convert it to
@code{char *} encoding first, by use of the function @posixfunc{wctomb}.

@item
When you read a stream of wide characters, through the functions
@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input
stream/file is not in the expected encoding, you have no way to
determine the invalid byte sequence and do some corrective action. If
you use these functions, your program becomes ``garbage in - more
garbage out'' or ``garbage in - abort''.
@end itemize

As a consequence, it is better to use multibyte strings. Such multibyte
strings can bypass limitations of the @code{wchar_t} type, if you use
functions defined in Gnulib and GNU libunistring for text processing.
They can also faithfully transport malformed characters that were
present in the input, without requiring the program to produce garbage
or abort.

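The statement that a 16-bit @code{wchar_t} can represent ``only half of a character'' follows from how UTF-16 encodes code points outside the Basic Multilingual Plane. The following sketch shows the arithmetic; the function name @code{utf16_encode} is hypothetical, and the code uses fixed-width integer types rather than @code{wchar_t} so that it behaves the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode code point CP as UTF-16 into OUT[0..1].
   Return the number of 16-bit units written (1 or 2), or 0 if CP is
   not a valid Unicode scalar value.  */
static size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
  return 2;
}
```

COPYRIGHT SIGN (U+00A9) fits in one 16-bit unit, whereas U+1F600 needs the two units 0xD83D 0xDE00. A classification function that receives only one of these two units cannot know which character it belongs to.
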
@node The char32_t type
@subsection The @code{char32_t} type

The ISO C and POSIX standard creators then introduced the
@code{char32_t} type. In ISO C 11, it was conceptually a ``32-bit wide
character'' type. In ISO C 23, its semantics were further
specified: A @code{char32_t} value is a Unicode code point.

Thus, the @code{char32_t} type is not affected by the problems that plague
the @code{wchar_t} type.

The @code{char32_t} type and its API are defined in the @code{<uchar.h>}
header file.

ISO C and POSIX specify only the basic functions for the @code{char32_t}
type, namely conversion of a single character (@func{mbrtoc32} and
@func{c32rtomb}). For convenience, Gnulib adds API for classification
and case conversion of characters.

GNU libunistring can also be used on @code{char32_t} values. Since
@code{char32_t} is the same as @code{uint32_t}, all @code{u32_*}
functions of GNU libunistring are applicable to arrays of
@code{char32_t} values.

On glibc systems, use of the 32-bit wide strings (@code{char32_t[]}) is
exactly as efficient as the use of the older wide strings
(@code{wchar_t[]}). This is possible because on glibc, @code{wchar_t}
values already always were 32-bit and Unicode code points.
@code{mbrtoc32} is just an alias of @code{mbrtowc}. The Gnulib
@code{*c32*} functions are optimized so that on glibc systems they
immediately redirect to the corresponding @code{*wc*} functions.

@mindex uchar-h-c23
Gnulib implements the ISO C 23 semantics of @code{char32_t} when you
import the @samp{uchar-h-c23} module. Without this module, it implements
only the ISO C 11 semantics; the effect is that on some platforms
(macOS, FreeBSD, NetBSD, Solaris) a @code{char32_t} value is the same
as a @code{wchar_t} value, not a Unicode code point. Thus, when you
want to pass @code{char32_t} values to GNU libunistring or to some
Unicode-centric Gnulib functions, you need the @samp{uchar-h-c23} module
in order to do so without portability problems.

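As an illustration of the single-character conversion function mentioned above, the following sketch converts a C string in the locale's encoding to an array of @code{char32_t} values with the standard function @code{mbrtoc32}. It assumes a platform that provides the C11 header @code{<uchar.h>}; the function name @code{to_char32} and its error convention are hypothetical choices of this sketch.

```c
#include <assert.h>
#include <locale.h>   /* setlocale, which the caller uses to pick the locale */
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Convert the C string S, in the current locale's encoding, to 32-bit
   wide characters.  Store at most MAX of them in OUT.  Return the
   number stored, or (size_t) -1 if S contains an encoding error.  */
static size_t
to_char32 (const char *s, char32_t *out, size_t max)
{
  assert (s != NULL && out != NULL);
  size_t remaining = strlen (s);
  size_t count = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (remaining > 0 && count < max)
    {
      char32_t c;
      size_t n = mbrtoc32 (&c, s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;   /* invalid or incomplete sequence */
      out[count++] = c;
      if (n == (size_t) -3)
        continue;             /* output from earlier input; no bytes used */
      if (n == 0)
        n = 1;                /* defensive: NUL cannot occur here */
      s += n;
      remaining -= n;
    }
  return count;
}
```

As the text above explains, whether the resulting values are Unicode code points on every platform depends on the ISO C 23 semantics, so portable code that relies on that should not assume it from this sketch alone.
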
@node The mbchar_t type
@subsection The @code{mbchar_t} type

Gnulib defines an alternate way to encode a multibyte character:
@code{mbchar_t}. Its main feature is the ability to process a string or
stream with some malformed characters without reporting an error.

The type @code{mbchar_t}, defined in @code{"mbchar.h"}, holds a
character in both the multibyte and the 32-bit wide character
representation. In case of a malformed character only the multibyte
representation is used.

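The idea of keeping both representations side by side can be sketched as follows. This is a hypothetical type written for this illustration under the description above; it is not the definition found in @code{"mbchar.h"}, and the names @code{demo_mbchar} and @code{demo_mbchar_equal} are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for illustration; not the definition from
   "mbchar.h".  */
struct demo_mbchar
{
  const char *bytes;   /* multibyte representation */
  size_t nbytes;       /* its length in bytes, at least 1 */
  bool valid;          /* false for a malformed character */
  uint32_t wide;       /* 32-bit wide character; used only if valid */
};

/* Two characters are equal if both are valid with the same wide
   value, or both are malformed with identical bytes.  */
static bool
demo_mbchar_equal (struct demo_mbchar a, struct demo_mbchar b)
{
  assert (a.nbytes > 0 && b.nbytes > 0);
  if (a.valid && b.valid)
    return a.wide == b.wide;
  if (!a.valid && !b.valid)
    return a.nbytes == b.nbytes
           && memcmp (a.bytes, b.bytes, a.nbytes) == 0;
  return false;
}
```

In this sketch a malformed byte still has a well-defined identity, namely its bytes, so a program can compare and copy it without having to report an error.
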
@menu
* Reading multibyte strings::
@end menu

@node Reading multibyte strings
@subsubsection Reading multibyte strings

@mindex mbfile
If you want to process (possibly multibyte) characters while reading
them from a @code{FILE *} stream, without reading them into a string
first, the @code{mbfile} module is made for this purpose.

@node Comparison of character APIs
@subsection Comparison of character APIs

This table summarizes the API functions available for characters, in
POSIX and in Gnulib.

@mindex c-ctype
@mindex c32isalnum
@mindex c32isalpha
@mindex c32isblank
@mindex c32iscntrl
@mindex c32isdigit
@mindex c32isgraph
@mindex c32islower
@mindex c32isprint
@mindex c32ispunct
@mindex c32isspace
@mindex c32isupper
@mindex c32isxdigit
@mindex c32tolower
@mindex c32toupper
@mindex c32width
@mindex c32_apply_mapping
@mindex c32_apply_type_test
@mindex c32_get_mapping
@mindex c32_get_type_test
@mindex c32tolower
@mindex c32toupper
@mindex isblank
@mindex iswblank
@mindex iswctype
@mindex iswdigit
@mindex iswpunct
@mindex iswxdigit
@mindex mbchar
@mindex towctrans
@mindex wctrans
@mindex wctype
@mindex wcwidth

@multitable @columnfractions .2 .2 .2 .2 .2
@headitem unibyte character
@tab assume C locale
@tab wide character
@tab 32-bit wide character
@tab mbchar_t character

@item @code{== '\0'}
@tab @code{== '\0'}
@tab @code{== L'\0'}
@tab @code{== 0}
@tab @code{mb_isnul}

@item @code{==}
@tab @code{==}
@tab @code{==}
@tab @code{==}
@tab @code{mb_equal}

@item @code{isalnum}
@tab @code{c_isalnum}
@tab @code{iswalnum}
@tab @code{c32isalnum}
@tab @code{mb_isalnum}

@item @code{isalpha}
@tab @code{c_isalpha}
@tab @code{iswalpha}
@tab @code{c32isalpha}
@tab @code{mb_isalpha}

@item @code{isblank}
@tab @code{c_isblank}
@tab @code{iswblank}
@tab @code{c32isblank}
@tab @code{mb_isblank}

@item @code{iscntrl}
@tab @code{c_iscntrl}
@tab @code{iswcntrl}
@tab @code{c32iscntrl}
@tab @code{mb_iscntrl}

@item @code{isdigit}
@tab @code{c_isdigit}
@tab @code{iswdigit}
@tab @code{c32isdigit}
@tab @code{mb_isdigit}

@item @code{isgraph}
@tab @code{c_isgraph}
@tab @code{iswgraph}
@tab @code{c32isgraph}
@tab @code{mb_isgraph}

@item @code{islower}
@tab @code{c_islower}
@tab @code{iswlower}
@tab @code{c32islower}
@tab @code{mb_islower}

@item @code{isprint}
@tab @code{c_isprint}
@tab @code{iswprint}
@tab @code{c32isprint}
@tab @code{mb_isprint}

@item @code{ispunct}
@tab @code{c_ispunct}
@tab @code{iswpunct}
@tab @code{c32ispunct}
@tab @code{mb_ispunct}

@item @code{isspace}
@tab @code{c_isspace}
@tab @code{iswspace}
@tab @code{c32isspace}
@tab @code{mb_isspace}

@item @code{isupper}
@tab @code{c_isupper}
@tab @code{iswupper}
@tab @code{c32isupper}
@tab @code{mb_isupper}

@item @code{isxdigit}
@tab @code{c_isxdigit}
@tab @code{iswxdigit}
@tab @code{c32isxdigit}
@tab @code{mb_isxdigit}

@item --
@tab --
@tab @code{wctype}
@tab @code{c32_get_type_test}
@tab --

@item --
@tab --
@tab @code{iswctype}
@tab @code{c32_apply_type_test}
@tab --

@item @code{tolower}
@tab @code{c_tolower}
@tab @code{towlower}
@tab @code{c32tolower}
@tab --

@item @code{toupper}
@tab @code{c_toupper}
@tab @code{towupper}
@tab @code{c32toupper}
@tab --

@item --
@tab --
@tab @code{wctrans}
@tab @code{c32_get_mapping}
@tab --

@item --
@tab --
@tab @code{towctrans}
@tab @code{c32_apply_mapping}
@tab --

@item --
@tab --
@tab @code{wcwidth}
@tab @code{c32width}
@tab @code{mb_width}

@end multitable