KIKIMR-17061: limit allowed symbols for names in scheme

Previously, we used C function isalnum to check if the symbols in every path part of tables (and, I guess, other objects too) in schemes were alphanumeric. It depends on the global C locale of the process YDB is run in. YDB does not set the global locale for itself currently, so it should be "C", if nobody called setlocale yet (this assumption seems to be true for our code as of now). In theory we could have left the code as it was (technically I use the same "C" locale, but explicitly) and promise to ourselves that we will never ever call setlocale (or call std::locale::global) to set global locale to anything different from "C". However, I see a couple of reasons to push this change: 1) It is undefined behaviour to call C function isalnum with signed chars (x86, I believe, has signed chars) with values not representable as unsigned chars or EOF (C++ variant does not suffer from this issue). See: a) this note from cppreference: [https://en.cppreference.com/w/cpp/string/byte/isalnum] Notes Like all other functions from <cctype>, the behavior of std::isalnum is undefined if the argument's value is neither representable as unsigned char nor equal to EOF. To use these functions safely with plain chars (or signed chars), the argument should first be converted to unsigned char: bool my_isalnum(char ch) { return std::isalnum(static_cast<unsigned char>(ch)); } b) and this note from Microsoft which is a bit more specific in terms of what does "representable as unsigned char" mean exactly: [https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/isalnum-iswalnum-isalnum-l-iswalnum-l?view=msvc-170] The behavior of isalnum and _isalnum_l is undefined if c isn't EOF or in the range 0 through 0xFF, inclusive. When a debug CRT library is used and c isn't one of these values, the functions raise an assertion. 2) Even if we do control the code of YDB, it is still slightly better to explicitly set the expected locale. The reader can reason about the behaviour of such explicit function faster, without checking if somebody calls setlocale elsewhere. As you will see in the code, I propose to use std::isalnum from C++ instead of the plain isalnum from C with specific hardcoded locale ("C"). One thing I need to mention about the unit tests is that the old implementation of PathPartBrokenAt does pass them successfully. It happens because isalnum uses global C locale, not the global C++ locale (which theoretically might be different from the global C locale, see std::locale::global implementation for details) and I can't create mock C locale, we can choose only from the locales installed in the system. We can reasonable expect that there is some utf-8 locale installed in the system. I hope "C.UTF-8" is installed on every machine on which we would like these unit tests to work correctly. However, there is no such char value (and the old implementation classifies symbols char by char, not as code points) that is classified differently by isalnum in "C" locale and isalnum in the utf-8 locale. I could use KOI8-R (or some other 1 byte encoding that has different isalnum behaviour than the "C" locale) in tests to break the old implementation, but we can't rely on it being installed on every testing machine.
author: deminds <[email protected]> 2023-09-29 14:40:48 +0300
committer: deminds <[email protected]> 2023-09-29 15:12:14 +0300
commit: a543cac8b62dfcdad87f08d2080b78eb1e3a8e00 (patch)
tree: 16fced5d95a218cccccee2bad49f03218b40ca9c
parent: 958dddf914ab334e16df1d0f5d12c0eec7b96943 (diff)
3 files changed, 153 insertions, 1 deletions
diff --git a/ydb/core/base/path.cpp b/ydb/core/base/path.cpp
index 6510980b37a..4e7c91bc086 100644
--- a/ydb/core/base/path.cpp
+++ b/ydb/core/base/path.cpp
@@ -3,6 +3,8 @@
 #include <util/string/builder.h>
 #include <util/string/printf.h>
 
+#include <locale>
+
 namespace NKikimr {
 
 TVector<TString> SplitPath(TString path) {
@@ -206,7 +208,7 @@ bool IsPathPartContainsOnlyDots(const TString &part) {
 TString::const_iterator PathPartBrokenAt(const TString &part, const TStringBuf extraSymbols) {
     static constexpr TStringBuf basicSymbols = "-_.";
     for (auto it = part.begin(); it != part.end(); ++it) {
-        if (!isalnum(*it)
+        if (!std::isalnum(*it, std::locale::classic())
                 && !basicSymbols.Contains(*it)
                 && !extraSymbols.Contains(*it)) {
             return it;
diff --git a/ydb/core/base/ut/path_ut.cpp b/ydb/core/base/ut/path_ut.cpp
index 52e1af3a7a8..8cb4c4759dd 100644
--- a/ydb/core/base/ut/path_ut.cpp
+++ b/ydb/core/base/ut/path_ut.cpp
@@ -1,6 +1,9 @@
 #include "path.h"
 #include <library/cpp/testing/unittest/registar.h>
 
+#include <array>
+#include <locale>
+
 //#define UT_PERF_TEST
 
 using namespace NKikimr;
@@ -107,6 +110,101 @@ Y_UNIT_TEST_SUITE(Path) {
     }
 #endif
 
+    Y_UNIT_TEST(Name_EnglishAlphabet) {
+        const TString pathPart = "NameInEnglish";
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), pathPart.end());
+    }
+
+    Y_UNIT_TEST(Name_RussianAlphabet) {
+        const TString pathPart = "НазваниеНаРусском";
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), pathPart.begin());
+    }
+
+    class TLocaleGuard {
+    public:
+        explicit TLocaleGuard(const std::locale& targetLocale)
+            : OriginalLocale_(std::locale::global(targetLocale))
+        {
+        }
+        ~TLocaleGuard() {
+            std::locale::global(OriginalLocale_);
+        }
+    
+    private:
+        const std::locale OriginalLocale_;
+    };
+
+    Y_UNIT_TEST(Name_RussianAlphabet_SetLocale_C) {
+        TLocaleGuard localeGuard(std::locale("C"));
+        const TString pathPart = "НазваниеНаРусском";
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), pathPart.begin());
+    }
+
+    Y_UNIT_TEST(Name_RussianAlphabet_SetLocale_C_UTF8) {
+        try {
+            TLocaleGuard localeGuard(std::locale("C.UTF-8"));
+            const TString pathPart = "НазваниеНаРусском";
+            UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), pathPart.begin());
+        } catch (std::runtime_error) {
+            // basic utf-8 locale is absent in the system, abort the test
+        }
+    }
+
+    Y_UNIT_TEST(Name_AllSymbols) {
+        const auto isAllowed = [](char symbol) {
+            constexpr std::array<char, 3> allowedSymbols = {'-', '_', '.'};
+            return std::isalnum(symbol, std::locale::classic())
+                || std::find(allowedSymbols.begin(), allowedSymbols.end(), symbol) != allowedSymbols.end();
+        };
+
+        for (char symbol = std::numeric_limits<char>::min(); ; ++symbol) {
+            const TString pathPart(1, symbol);
+            UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), isAllowed(symbol) ? pathPart.end() : pathPart.begin());
+            
+            if (symbol == std::numeric_limits<char>::max()) {
+                break;
+            }
+        }
+    }
+
+    // This ctype facet classifies 'z' letter as not alphabetic.
+    // Code is taken from https://en.cppreference.com/w/cpp/locale/ctype_char.
+    struct TWeirdCtypeFacet : std::ctype<char>
+    {
+        static const mask* MakeTable()
+        {
+            // make a copy of the "C" locale table
+            static std::vector<mask> weirdTable(classic_table(), classic_table() + table_size);
+
+            // reclassify 'z'
+            weirdTable['z'] &= ~alpha;
+            return &weirdTable[0];
+        }
+    
+        TWeirdCtypeFacet(std::size_t refs = 0) : ctype(MakeTable(), false, refs) {}
+    };
+
+    Y_UNIT_TEST(Name_WeirdLocale_RegularName) {
+        TLocaleGuard localeGuard(std::locale(std::locale::classic(), new TWeirdCtypeFacet));
+        const TString regularName = "a";
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(regularName), regularName.end());
+
+    }
+
+    Y_UNIT_TEST(Name_WeirdLocale_WeirdName) {
+        TLocaleGuard localeGuard(std::locale(std::locale::classic(), new TWeirdCtypeFacet));
+        UNIT_ASSERT(!std::isalnum('z', std::locale()));
+
+        const TString weirdName = "z";
+        // path part should not be considered to be broken, we should ignore the global locale
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(weirdName), weirdName.end());
+    }
+
+    Y_UNIT_TEST(Name_ExtraSymbols) {
+        const TString pathPart = "this string contains whitespaces";
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart), std::find(pathPart.begin(), pathPart.end(), ' '));
+        UNIT_ASSERT_EQUAL(PathPartBrokenAt(pathPart, " "), pathPart.end());
+    }
 }
 
 }
diff --git a/ydb/core/tx/schemeshard/ut_base/ut_base.cpp b/ydb/core/tx/schemeshard/ut_base/ut_base.cpp
index d33a669908a..23b26cab78e 100644
--- a/ydb/core/tx/schemeshard/ut_base/ut_base.cpp
+++ b/ydb/core/tx/schemeshard/ut_base/ut_base.cpp
@@ -6,6 +6,8 @@
 #include <util/generic/size_literals.h>
 #include <util/string/cast.h>
 
+#include <locale>
+
 using namespace NKikimr;
 using namespace NSchemeShard;
 using namespace NSchemeShardUT_Private;
@@ -115,6 +117,56 @@ Y_UNIT_TEST_SUITE(TSchemeShardTest) {
                             NLs::ShardsInsideDomain(0)});
     }
 
+    Y_UNIT_TEST(PathName) {
+        TTestBasicRuntime runtime;
+        TTestEnv env(runtime);
+        ui64 txId = 100;
+
+        TestMkDir(runtime, ++txId, "/MyRoot", "NameInEnglish");
+        TestMkDir(runtime, ++txId, "/MyRoot", "НазваниеНаРусском", {NKikimrScheme::StatusSchemeError});
+
+        env.TestWaitNotification(runtime, xrange(txId - 1, txId + 1));
+
+        TestDescribeResult(DescribePath(runtime, "/MyRoot"),
+                           {NLs::Finished});
+        TestDescribeResult(DescribePath(runtime, "/MyRoot/NameInEnglish"),
+                           {NLs::Finished});
+        TestDescribeResult(DescribePath(runtime, "/MyRoot/НазваниеНаРусском"),
+                           {NLs::PathNotExist});
+    }
+
+    class TLocaleGuard {
+    public:
+        explicit TLocaleGuard(const std::locale& targetLocale)
+            : OriginalLocale_(std::locale::global(targetLocale))
+        {
+        }
+        ~TLocaleGuard() {
+            std::locale::global(OriginalLocale_);
+        }
+    
+    private:
+        const std::locale OriginalLocale_;
+    };
+
+    Y_UNIT_TEST(PathName_SetLocale) {
+        TTestBasicRuntime runtime;
+        TTestEnv env(runtime);
+        ui64 txId = 100;
+
+        try {
+            TLocaleGuard localeGuard(std::locale("C.UTF-8"));
+            TestMkDir(runtime, ++txId, "/MyRoot", "НазваниеНаРусском", {NKikimrScheme::StatusSchemeError});
+
+            env.TestWaitNotification(runtime, txId);
+
+            TestDescribeResult(DescribePath(runtime, "/MyRoot/НазваниеНаРусском"),
+                            {NLs::PathNotExist});
+        } catch (std::runtime_error) {
+            // basic utf-8 locale is absent in the system, abort the test
+        }
+    }
+
     using TRuntimeTxFn = std::function<void(TTestBasicRuntime&, ui64)>;
 
     void DropTwice(const TString& path, TRuntimeTxFn createFn, TRuntimeTxFn dropFn) {
author	deminds <[email protected]>	2023-09-29 14:40:48 +0300
committer	deminds <[email protected]>	2023-09-29 15:12:14 +0300
commit	a543cac8b62dfcdad87f08d2080b78eb1e3a8e00 (patch)
tree	16fced5d95a218cccccee2bad49f03218b40ca9c
parent	958dddf914ab334e16df1d0f5d12c0eec7b96943 (diff)