mirror of
https://https.git.savannah.gnu.org/git/gnulib.git
synced 2026-04-28 06:33:36 +00:00
130 lines
5.2 KiB
Plaintext
130 lines
5.2 KiB
Plaintext
@node Handling strings with NUL characters
|
|
@section Handling strings with NUL characters
|
|
|
|
@c Copyright (C) 2023--2026 Free Software Foundation, Inc.
|
|
|
|
@c Permission is granted to copy, distribute and/or modify this document
|
|
@c under the terms of the GNU Free Documentation License, Version 1.3 or
|
|
@c any later version published by the Free Software Foundation; with no
|
|
@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
|
|
@c copy of the license is at <https://www.gnu.org/licenses/fdl-1.3.en.html>.
|
|
|
|
@c Written by Bruno Haible.
|
|
|
|
Strings in C are usually represented by a character sequence with a
|
|
terminating NUL character. A @samp{char *}, pointer to the first byte
|
|
of this character sequence, is what gets passed around as function
|
|
argument or return value.
|
|
|
|
The major restriction of this string representation is that it cannot
|
|
handle strings that contain NUL characters: such strings will appear
|
|
shorter than they were meant to be. In most application areas, this is
|
|
not a problem, and the @code{char *} type is well usable.
|
|
|
|
A second problem of this string representation is that
|
|
taking a substring is not cheap:
|
|
it either requires a memory allocation
|
|
or a destructive modification of the string.
|
|
The former has a runtime cost;
|
|
the latter complicates the logic of the program.
|
|
This matters for application areas that analyze text, such as parsers.
|
|
|
|
In areas where strings with embedded NUL characters need to be handled
|
|
or where taking substrings is a recurrent operation,
|
|
the common approach is to use a @code{char *ptr} pointer variable
|
|
together with a @code{size_t nbytes} variable (or an @code{idx_t nbytes}
|
|
variable, if you want to avoid problems due to integer overflow). This
|
|
works fine in code that constructs or manipulates strings with embedded
|
|
NUL characters. But when it comes to @emph{storing} them, for example
|
|
in an array or as key or value of a hash table, one needs a type that
|
|
combines these two fields.
|
|
|
|
@mindex string-desc
|
|
@mindex xstring-desc
|
|
@mindex string-desc-quotearg
|
|
The Gnulib modules @code{string-desc}, @code{xstring-desc}, and
|
|
@code{string-desc-quotearg} provide such a type. We call it a
|
|
``string descriptor'' and name it @code{string_desc_t}.
|
|
|
|
The type @code{string_desc_t} is a struct that contains a pointer to the
|
|
first byte and the number of bytes of the memory region that make up the
|
|
string. An additional terminating NUL byte, that may be present in
|
|
memory, is not included in this byte count. This type implements the
|
|
same concept as @code{std::string_view} in C++, or the @code{String}
|
|
type in Java.
|
|
|
|
@code{string_desc_t} is a string descriptor to a string that cannot
|
|
be written to. There is also a type @code{rw_string_desc_t}, that is
|
|
a descriptor for a writable string.
|
|
@code{rw_string_desc_t} compares to @code{string_desc_t}, like the
|
|
pointer type @samp{char *} compares to the pointer type
|
|
@samp{const char *}.
|
|
|
|
A @code{string_desc_t} or @code{rw_string_desc_t}
|
|
can be passed to a function as an argument, or
|
|
can be the return value of a function. This is type-safe: If, by
|
|
mistake, a programmer passes a @code{string_desc_t} to a function that
|
|
expects a @code{char *} argument, or vice versa, or assigns a
|
|
@code{string_desc_t} value to a variable of type @code{char *}, or
|
|
vice versa, the compiler will report an error.
|
|
|
|
Unfortunately, @code{string_desc_t} and @code{rw_string_desc_t}
|
|
being different types, there is no implicit conversion from
|
|
@code{rw_string_desc_t} to @code{string_desc_t}. In places
|
|
where such a conversion is desired, the (inline) function
|
|
@code{sd_readonly} needs to be called.
|
|
|
|
Functions related to string descriptors are provided:
|
|
@itemize
|
|
@item
|
|
Side-effect-free operations in @code{"string-desc.h"},
|
|
@item
|
|
Memory-allocating operations in @code{"string-desc.h"},
|
|
@item
|
|
Memory-allocating operations with out-of-memory checking in
|
|
@code{"xstring-desc.h"},
|
|
@item
|
|
Operations with side effects in @code{"string-desc.h"}.
|
|
@end itemize
|
|
|
|
For outputting a string descriptor, the @code{*printf} family of
|
|
functions cannot be used directly. A format string directive such as
|
|
@code{"%.*s"} would not work:
|
|
@itemize
|
|
@item
|
|
it would stop the output at the first encountered NUL character,
|
|
@item
|
|
it would require to cast the number of bytes to @code{int}, and thus
|
|
would not work for strings longer than @code{INT_MAX} bytes.
|
|
@end itemize
|
|
@c @noindent Other format string directives don't work either, because
|
|
@c the only way to produce a NUL character in @code{*printf}'s output
|
|
@c is through a dedicated @code{%c} or @code{%lc} directive.
|
|
|
|
Therefore Gnulib offers
|
|
@itemize
|
|
@item
|
|
a function @code{sd_fwrite} that outputs a string descriptor to
|
|
a @code{FILE} stream,
|
|
@item
|
|
a function @code{sd_write} that outputs a string descriptor to
|
|
a file descriptor,
|
|
@item
|
|
and for those applications where the NUL characters should become
|
|
visible as @samp{\0}, a family of @code{quotearg} based functions, that
|
|
allow to specify the escaping rules in detail.
|
|
@end itemize
|
|
|
|
The functionality is thus split across three modules as follows:
|
|
@itemize
|
|
@item
|
|
The module @code{string-desc}, under LGPL, defines the type and
|
|
elementary functions.
|
|
@item
|
|
The module @code{xstring-desc}, under GPL, defines the memory-allocating
|
|
functions with out-of-memory checking.
|
|
@item
|
|
The module @code{string-desc-quotearg}, under GPL, defines the
|
|
@code{quotearg} based functions.
|
|
@end itemize
|