path: root/contrib/libs/expat/CMakeLists.txt
author: dpotapov <dpotapov@yandex-team.com> 2023-01-16 21:39:14 +0300
committer: dpotapov <dpotapov@yandex-team.com> 2023-01-16 21:39:14 +0300
commit: 328635a6bd949596c49a33c9c2b67d00cc2704db (patch)
tree: 84104ccf9cd6c8cf47e1ac329076bf47dfabb052 /contrib/libs/expat/CMakeLists.txt
parent: bfa024664d4edef47218bc0af66af681cfad9a88 (diff)
charset: do not allow surrogate pairs in UTF-8
By [RFC3629 section 3](https://datatracker.ietf.org/doc/html/rfc3629#section-3):
```
The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
```
The current implementation of `ReadUTF8CharAndAdvance` accepts UTF-8-encoded surrogate code points such as 0xED 0xA0 0xBD or 0xED 0xB3 0x9A, leaving them in strings that external programs like `iconv` then cannot process.

This patch adds a `strict` template flag that disables this leniency. The flag is not enabled by default, because Arcadia already has hundreds of tests whose inputs contain such surrogate pairs; these tests break in strict mode, and there is a chance that prod might be affected too.

The SSE4 implementation doesn't perform any validation at all, so it is left unchanged.
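For illustration only, here is a minimal C++17 sketch of what a strict/lenient switch of this kind could look like. `DecodeOne` and its signature are hypothetical and are not the actual `ReadUTF8CharAndAdvance` interface; the sketch only assumes that strictness is chosen through a boolean template parameter, as the message describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper (NOT the Arcadia ReadUTF8CharAndAdvance signature).
// Decodes one code point from [p, end). On success it advances p past the
// sequence and returns the code point; on failure it leaves p untouched.
template <bool Strict>
std::optional<std::uint32_t> DecodeOne(const unsigned char*& p,
                                       const unsigned char* end) {
    if (p == end) {
        return std::nullopt;
    }
    const unsigned char lead = *p;
    std::size_t len = 0;
    std::uint32_t cp = 0;
    std::uint32_t minValue = 0;  // smallest value allowed for this length
    if (lead < 0x80) {
        len = 1; cp = lead; minValue = 0;
    } else if ((lead & 0xE0) == 0xC0) {
        len = 2; cp = lead & 0x1F; minValue = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        len = 3; cp = lead & 0x0F; minValue = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        len = 4; cp = lead & 0x07; minValue = 0x10000;
    } else {
        return std::nullopt;  // stray continuation byte or invalid lead byte
    }
    if (static_cast<std::size_t>(end - p) < len) {
        return std::nullopt;  // truncated sequence
    }
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) {
            return std::nullopt;  // not a continuation byte
        }
        cp = (cp << 6) | (p[i] & 0x3Fu);
    }
    if (cp < minValue || cp > 0x10FFFF) {
        return std::nullopt;  // overlong encoding or out of Unicode range
    }
    if constexpr (Strict) {
        // RFC 3629 section 3: U+D800..U+DFFF must not be encoded in UTF-8.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return std::nullopt;
        }
    }
    p += len;
    return cp;
}
```

With this sketch, `DecodeOne<false>` decodes the bytes 0xED 0xA0 0xBD to the surrogate value 0xD83D, while `DecodeOne<true>` rejects the same input and leaves the pointer where it was. Making the flag a template parameter keeps the lenient path free of the extra range check.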
Diffstat (limited to 'contrib/libs/expat/CMakeLists.txt')
0 files changed, 0 insertions, 0 deletions