It is possible to convert between any two of the following encodings:
Alias for UTF-8.
Big-endian.
Little-endian.
Switches between BE and LE on FFFE/FEFF byte order marks, which can appear anywhere in the stream. The default is big-endian.
Analogous to UTF-16.
8-bit encoding as specified in Annex D.7 of the PDF spec. The codepoints 0x7f, 0x9f and 0xad are left undefined.
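The byte-order switching described above for the BOM-aware UTF-16 variant can be illustrated with a small stand-alone decoder. This is only a sketch and not pdfout's implementation: the names utf16_bom_decode and read_unit are hypothetical, errors are reported with a return value of -1 instead of a thrown exception, and it assumes that byte order marks are consumed rather than passed on to the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one 16-bit unit in the given byte order. */
static uint32_t
read_unit (const uint8_t *p, int big_endian)
{
  return big_endian ? ((uint32_t) p[0] << 8) | p[1]
                    : ((uint32_t) p[1] << 8) | p[0];
}

/* Hypothetical sketch, not part of pdfout: decode UTF-16 bytes into code
   points.  Decoding starts big-endian and toggles the byte order whenever
   a byte-swapped BOM (0xFFFE in the current order) is read.
   Assumption: BOMs are dropped from the output.
   Returns the number of code points stored in out, or -1 on malformed
   input or if out is too small.  */
int
utf16_bom_decode (const uint8_t *src, size_t srclen,
                  uint32_t *out, size_t outlen)
{
  int big_endian = 1;           /* the default is big-endian */
  size_t i = 0, n = 0;

  assert (src != NULL && out != NULL);
  if (srclen % 2 != 0)
    return -1;

  while (i < srclen)
    {
      uint32_t unit = read_unit (src + i, big_endian);
      i += 2;

      if (unit == 0xFEFF)       /* BOM in the current byte order */
        continue;
      if (unit == 0xFFFE)       /* swapped BOM: switch byte order */
        {
          big_endian = !big_endian;
          continue;
        }

      if (unit >= 0xD800 && unit <= 0xDBFF)
        {
          /* A high surrogate must be followed by a low surrogate. */
          uint32_t low;
          if (srclen - i < 2)
            return -1;
          low = read_unit (src + i, big_endian);
          i += 2;
          if (low < 0xDC00 || low > 0xDFFF)
            return -1;
          unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
      else if (unit >= 0xDC00 && unit <= 0xDFFF)
        return -1;              /* lone low surrogate */

      if (n == outlen)
        return -1;
      out[n++] = unit;
    }
  return (int) n;
}
```

Because the current byte order is tracked per unit, a stream may mix big-endian and little-endian segments as long as each switch is announced by a BOM.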
void pdfout_char_conv_buffer (fz_context *ctx, const char *fromcode,
const char *tocode, const char *src, int srclen,
fz_buffer *buf);
This is the most generic function; it is used internally by all functions below.
Convert the string src of length srclen from encoding fromcode to encoding tocode. The resulting bytes are appended to buf.
Throw FZ_ERROR_ABORT if a codepoint is not valid in the target encoding. Throw FZ_ERROR_GENERIC on all other errors.
char *pdfout_char_conv (fz_context *ctx, const char *fromcode, const char *tocode,
const char *src, int srclen, int *lengthp);
Like pdfout_char_conv_buffer, but returns its result as a char pointer. The length of the result is stored in *lengthp. The resulting string is always null-terminated.
pdf_obj *pdfout_utf8_to_str_obj (fz_context *ctx, pdf_document *doc,
const char *inbuf, int inbuf_len);
Convert UTF-8 string to PDF string object.
char *pdfout_str_obj_to_utf8 (fz_context *ctx, pdf_obj *string, int *len);
Convert PDF string object to UTF-8 string.
char *pdfout_pdf_to_utf8 (fz_context *ctx, const char *inbuf, int inbuf_len,
int *outbuf_len);
Convert PDF string (either PDFDOCENCODING or UTF-16BE) to null-terminated UTF-8.
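To tell the two input encodings apart, a reader can rely on the PDF convention that a UTF-16BE text string starts with the byte order mark FE FF. The following is a minimal sketch of that check. The name pdf_string_is_utf16be is hypothetical, and whether pdfout detects the encoding in exactly this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of pdfout: return 1 if the PDF string
   starts with the UTF-16BE byte order mark FE FF, else 0.  A string
   without the mark would be treated as the 8-bit PDF encoding.  */
int
pdf_string_is_utf16be (const char *s, int len)
{
  assert (s != NULL || len == 0);
  return len >= 2
    && (unsigned char) s[0] == 0xFE
    && (unsigned char) s[1] == 0xFF;
}
```

The length is checked first, so strings shorter than two bytes are never read past their end.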
char *pdfout_utf8_to_pdf (fz_context *ctx, const char *inbuf, int inbuf_len,
int *outbuf_len);
Convert UTF-8 to PDF string. If possible, use PDFDOCENCODING. If that fails, use UTF-16.
char *pdfout_check_utf8 (const char *s, size_t n);
Check if the string s of length n contains valid UTF-8. If it contains invalid UTF-8, return a pointer to the first invalid unit. Return NULL if the string is valid.
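The contract of such a check can be illustrated with a stand-alone validator. This is a sketch under stated assumptions, not pdfout's code: the name check_utf8_sketch is hypothetical, and it assumes that "first invalid unit" means the first byte of the offending sequence. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not part of pdfout: return NULL if the n bytes at
   s are valid UTF-8, otherwise a pointer to the first byte of the first
   invalid sequence.  */
const char *
check_utf8_sketch (const char *s, size_t n)
{
  const unsigned char *p = (const unsigned char *) s;
  size_t i = 0;

  assert (s != NULL || n == 0);

  while (i < n)
    {
      unsigned char c = p[i];
      size_t len, k;
      uint32_t cp, min;

      if (c < 0x80)             /* ASCII */
        {
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { len = 2; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { len = 3; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return s + i;           /* stray continuation or invalid lead byte */

      if (n - i < len)
        return s + i;           /* truncated sequence */

      for (k = 1; k < len; k++)
        {
          if ((p[i + k] & 0xC0) != 0x80)
            return s + i;       /* missing continuation byte */
          cp = (cp << 6) | (p[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and out-of-range values. */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return s + i;

      i += len;
    }
  return NULL;
}
```

Returning a pointer instead of a flag lets the caller report the byte offset of the error as the difference between the result and s.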
int pdfout_uctomb (fz_context *ctx, uint8_t *buf, ucs4_t uc, int n);
Convert the Unicode codepoint uc to UTF-8. Store the result in buf, which has length n. Throw if uc is an invalid codepoint or if n is too small. Always use n = 4, since a UTF-8 sequence is at most 4 bytes long.