Add scheme inference documentation (#20012)

Co-authored-by: lopatinevgeny <lopatinevgeny@yandex-team.ru>
author: Nikolay Perfilov <pnv1@yandex-team.ru> 2025-07-21 04:17:31 +0300
committer: GitHub <noreply@github.com> 2025-07-21 08:17:31 +0700
commit: d7d2a8e5ca9fcbfd19004003a6d3532ed5af7d1a (patch)
tree: b8f05167ce542f5e282495564591becb32ae5162
parent: 0b136dd04fba520bfa85e85f3fe1b3ec27053387 (diff)
download: ydb-d7d2a8e5ca9fcbfd19004003a6d3532ed5af7d1a.tar.gz
8 files changed, 425 insertions, 0 deletions
diff --git a/ydb/docs/en/core/reference/ydb-cli/_includes/commands.md b/ydb/docs/en/core/reference/ydb-cli/_includes/commands.md
index 5555604077b..9e79f8d5701 100644
--- a/ydb/docs/en/core/reference/ydb-cli/_includes/commands.md
+++ b/ydb/docs/en/core/reference/ydb-cli/_includes/commands.md
@@ -68,6 +68,7 @@ Any command can be run from the command line with the `--help` option to get hel
 | [table ttl reset](../table-ttl-reset.md) | Resetting TTL parameters |
 | [tools copy](../tools-copy.md) | Copying tables |
 | [tools dump](../export-import/tools-dump.md) | Dumping a directory or table to the file system |
+| [tools infer csv](../tools-infer.md) | Generating a `CREATE TABLE` statement from a CSV file |
 | [tools rename](../commands/tools/rename.md) | Renaming tables |
 | [tools restore](../export-import/tools-restore.md) | Restoring data from the file system |
 | [topic create](../topic-create.md) | Creating a topic |
diff --git a/ydb/docs/en/core/reference/ydb-cli/export-import/_includes/import-file.md b/ydb/docs/en/core/reference/ydb-cli/export-import/_includes/import-file.md
index 3cb03ed0ae3..45c5f042d9b 100644
--- a/ydb/docs/en/core/reference/ydb-cli/export-import/_includes/import-file.md
+++ b/ydb/docs/en/core/reference/ydb-cli/export-import/_includes/import-file.md
@@ -10,6 +10,14 @@ If the table already includes data, it's replaced by imported data on primary ke
 
 The imported file must be in the [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. Line feeds aren't supported in the data field.
 
+{% note info %}
+
+If the table doesn't exist yet, you can use the [{{ ydb-cli }} tools infer csv](../../tools-infer.md) command to generate the `CREATE TABLE` statement based on an existing CSV file.
+
+If you try to import into a non-existent table, the command suggests running [{{ ydb-cli }} tools infer csv](../../tools-infer.md) with the appropriate options.
+
+{% endnote %}
+
 General format of the command:
 
 ```bash
diff --git a/ydb/docs/en/core/reference/ydb-cli/toc_i.yaml b/ydb/docs/en/core/reference/ydb-cli/toc_i.yaml
index fa4f218ed28..3754ef46562 100644
--- a/ydb/docs/en/core/reference/ydb-cli/toc_i.yaml
+++ b/ydb/docs/en/core/reference/ydb-cli/toc_i.yaml
@@ -37,6 +37,8 @@ items:
       href: table-ttl-reset.md
     - name: Deleting a table
       href: table-drop.md
+    - name: Generating CREATE TABLE from a file
+      href: tools-infer.md
   - name: Operations with data
     items:
     - name: Getting a query execution plan and AST
diff --git a/ydb/docs/en/core/reference/ydb-cli/tools-infer.md b/ydb/docs/en/core/reference/ydb-cli/tools-infer.md
new file mode 100644
index 00000000000..9879976cc23
--- /dev/null
+++ b/ydb/docs/en/core/reference/ydb-cli/tools-infer.md
@@ -0,0 +1,200 @@
+# Table schema inference from data files
+
+You can use the `{{ ydb-cli }} tools infer csv` command to generate a `CREATE TABLE` statement from a CSV data file. This can be helpful when you want to [import](./export-import/import-file.md) data into a database and the table has not been created yet.
+
+Command syntax:
+
+```bash
+{{ ydb-cli }} [global options...] tools infer csv [options...] <input files...>
+```
+
+- `global options` – [global options](commands/global-options.md).
+- `options` – [subcommand options](#options).
+
+To get the most up-to-date information about the command, use the `--help` option:
+
+```bash
+{{ ydb-cli }} tools infer csv --help
+```
+
+## Subcommand options {#options}
+
+Option Name | Description
+---|---
+`-p, --path` | Database path to the table that should be created. Default: `table`.
+`--columns` | Explicitly specifies table column names, as a comma-separated list.
+`--gen-columns` | Explicitly indicates that table column names should be generated automatically (`column1`, `column2`, ...).
+`--header` | Explicitly indicates that the first row in the CSV contains column names.
+`--rows-to-analyze` | Number of rows to analyze. Reading stops as soon as this many rows have been read. `0` means unlimited. Default: `500000`.
+`--execute` | Execute the `CREATE TABLE` statement immediately after generation.
+
+{% note info %}
+
+If none of the `--columns`, `--gen-columns`, or `--header` options are explicitly specified, the following algorithm is used:
+
+The values of the first row in the file are checked for the following conditions:
+
+* The values meet the [requirements for column names](../../yql/reference/syntax/create_table/index.md#column-naming-rules).
+* The types of the values in the first row are different from the data types in the other rows of the file.
+
+If both conditions are met, the values from the first row are used as the table's column names. Otherwise, column names are generated automatically (as `column1`, `column2`, etc.). See the [example](#example-default) below for more details.
+
+{% endnote %}
+
+## Column type inference algorithm {#column-type-inference}
+
+For each column, the command determines the least general type that fits all its values. The most general type is `Text`: if any value in a column is a string (for example, `abc`), the entire column will be inferred as `Text`.
+
+All integer values are inferred as `Int64` if they fit within the `Int64` range. If any value exceeds this range, the type is set to `Double`.
+
+Floating-point numbers are always inferred as `Double`.
+
+## Current limitation {#current-limitation}
+
+The first column is always chosen as the primary key. You may need to change the primary key to one that is more appropriate for your use case. For recommendations, see [{#T}](../../dev/primary-key/index.md).
+
+## Examples {#examples}
+
+{% include [ydb-cli-profile](../../_includes/ydb-cli-profile.md) %}
+
+### Column names in the first row, no options specified {#example-default}
+
+In the first row, `key` and `value` meet the column name requirements, and their types (`Text` and `Text`) differ from the types in the other rows (`Int64` and `Text`).
+The command therefore uses the first row of the file as column names.
+
+```bash
+$ cat data_with_header.csv
+key,value
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv data_with_header.csv
+CREATE TABLE table (
+    key Int64,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+{% note info %}
+
+The `WITH` block lists some useful table options. Uncomment the ones you need and remove the rest as appropriate.
+
+{% endnote %}
+
+### Column names in the first row, using `--header` option {#example-header}
+
+In this example, the values in the first row (`key` and `value`) have the same data type (`Text`) as the values in the other rows.
+Without the `--header` option, the command would therefore not use the first row as column names and would generate column names automatically instead.
+To use the first row as column names in this situation, specify the `--header` option explicitly.
+
+```bash
+$ cat data_with_header_text.csv
+key,value
+aaa,bbb
+ccc,ddd
+
+{{ ydb-cli }} tools infer csv data_with_header_text.csv --header
+CREATE TABLE table (
+    key Text,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Explicit column list {#example-columns}
+
+```bash
+cat ~/data_no_header.csv
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv -p newtable ~/data_no_header.csv --columns my_key,my_value
+CREATE TABLE newtable (
+    my_key Int64,
+    my_value Text,
+    PRIMARY KEY (my_key)
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Automatically generate column names {#example-gen-columns}
+
+```bash
+cat ~/data_no_header.csv
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv -p newtable ~/data_no_header.csv --gen-columns
+CREATE TABLE newtable (
+    column1 Int64,
+    column2 Text,
+    PRIMARY KEY (column1) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Executing the generated statement using the `--execute` option {#example-execute}
+
+In this example, the `CREATE TABLE` statement is executed right after generation.
+
+```bash
+$ cat data_with_header.csv
+key,value
+123,abc
+456,def
+
+{{ ydb-cli }} -p quickstart tools infer csv data_with_header.csv --execute
+Executing request:
+
+CREATE TABLE table (
+    key Int64,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+
+Query executed successfully.
+```
diff --git a/ydb/docs/ru/core/reference/ydb-cli/_includes/commands.md b/ydb/docs/ru/core/reference/ydb-cli/_includes/commands.md
index 1558550c6af..93e9d3641fc 100644
--- a/ydb/docs/ru/core/reference/ydb-cli/_includes/commands.md
+++ b/ydb/docs/ru/core/reference/ydb-cli/_includes/commands.md
@@ -68,6 +68,7 @@ table attribute drop | Deleting an attribute from a row-oriented
 [table ttl reset](../table-ttl-reset.md) | Resetting TTL parameters for row-oriented and column-oriented tables
 [tools copy](../tools-copy.md) | Copying tables
 [tools dump](../export-import/tools-dump.md) | Dumping a directory or tables to the file system
+[tools infer csv](../tools-infer.md) | Generating a `CREATE TABLE` SQL statement from a CSV file
 {% if ydb-cli == "ydb" %}
 [tools pg-convert](../../../postgresql/import.md#pg-convert) | Converting a PostgreSQL dump produced by the pg_dump utility into a format that YDB understands
 {% endif %}
diff --git a/ydb/docs/ru/core/reference/ydb-cli/export-import/_includes/import-file.md b/ydb/docs/ru/core/reference/ydb-cli/export-import/_includes/import-file.md
index d2b8285e329..1e5bc6c04ec 100644
--- a/ydb/docs/ru/core/reference/ydb-cli/export-import/_includes/import-file.md
+++ b/ydb/docs/ru/core/reference/ydb-cli/export-import/_includes/import-file.md
@@ -10,6 +10,14 @@
 
 The imported file must be in the [UTF-8](https://ru.wikipedia.org/wiki/UTF-8) encoding. Line breaks inside a data field are not supported.
 
+{% note info %}
+
+If the table has not been created yet, you can use the [{{ ydb-cli }} tools infer csv](../../tools-infer.md) command to generate the `CREATE TABLE` text from an existing CSV file.
+
+If you try to import into a non-existent table, the command suggests running [{{ ydb-cli }} tools infer csv](../../tools-infer.md) with the appropriate options.
+
+{% endnote %}
+
 General format of the command:
 
 ```bash
diff --git a/ydb/docs/ru/core/reference/ydb-cli/toc_i.yaml b/ydb/docs/ru/core/reference/ydb-cli/toc_i.yaml
index 99940dd8047..fce56e2b145 100644
--- a/ydb/docs/ru/core/reference/ydb-cli/toc_i.yaml
+++ b/ydb/docs/ru/core/reference/ydb-cli/toc_i.yaml
@@ -37,6 +37,8 @@ items:
       href: table-ttl-reset.md
     - name: Deleting a table
       href: table-drop.md
+    - name: Generating a table creation script
+      href: tools-infer.md
   - name: Working with data
     items:
     - name: Getting a query execution plan and AST
diff --git a/ydb/docs/ru/core/reference/ydb-cli/tools-infer.md b/ydb/docs/ru/core/reference/ydb-cli/tools-infer.md
new file mode 100644
index 00000000000..462c3ef67b6
--- /dev/null
+++ b/ydb/docs/ru/core/reference/ydb-cli/tools-infer.md
@@ -0,0 +1,203 @@
+# Generating a table creation script
+
+The `{{ ydb-cli }} tools infer csv` command generates a table creation script from an existing CSV data file.
+
+General format of the command:
+
+```bash
+{{ ydb-cli }} [global options...] tools infer csv [options...] <input files...>
+```
+
+* `global options`: [global options](commands/global-options.md).
+* `options`: [subcommand options](#options).
+
+To get a description of the command options, use the `--help` option:
+
+```bash
+{{ ydb-cli }} tools infer csv --help
+```
+
+## Subcommand options {#options}
+
+Option name | Description
+---|---
+`-p, --path` | Database path at which the new table should be created. Default: `table`.
+`--columns` | Comma-separated list of table column names.
+`--gen-columns` | Table column names should be generated automatically (`column1`, `column2`, ...).
+`--header` | Table column names should be read from the first row of the CSV file.
+`--rows-to-analyze` | Number of leading rows of the CSV file to analyze for automatic column type detection. `0` means all rows in the file are read and analyzed. Default: `500000`.
+`--execute` | Create the table once the script has been generated.
+
+{% note info %}
+
+If none of the `--columns`, `--gen-columns`, or `--header` options is explicitly specified, the following algorithm is used:
+
+The first row of the file is taken, and its values are checked against the following conditions:
+
+* The values meet the [requirements for column names](../../yql/reference/syntax/create_table/index.md#column-naming-rules).
+* The types of the values in the first row differ from the types of the data values in the other rows of the file.
+
+If both conditions are met, the values from the first row of the file are used as the table's column names. Otherwise, column names are generated automatically (as `column1`, `column2`, and so on). See the [example](#example-default) below for more details.
+
+{% endnote %}
+
+## Column type inference algorithm {#column-type-inference}
+
+For each column, the command determines the least general type that fits all its values. The most general type is the string type (`Text`): if at least one string value (for example, `abc`) occurs among numeric values, the whole column is inferred as `Text`.
+
+All integers are inferred as `Int64` if they fit within its range. If the range is exceeded, the `Double` type is used.
+
+Floating-point numbers are also inferred as `Double`.
+
+## Current limitation {#current-limitation}
+
+The first column is chosen as the table's primary key. If necessary, you can change the set of primary key columns to the one you need.
+
+[{#T}](../../dev/primary-key/index.md)
+
+## Примеры {#examples}
+
+{% include [ydb-cli-profile](../../_includes/ydb-cli-profile.md) %}
+
+### Column names are given in the first row of the CSV file {#example-default}
+
+In this example, none of the `--columns`, `--gen-columns`, or `--header` options is explicitly specified.
+The `key` and `value` values from the first row of the file meet the column name requirements, and their types (`Text` and `Text`) differ from the types in the other rows (`Int64` and `Text`). Therefore, `key` and `value` are chosen as column names.
+
+```bash
+$ cat data_with_header.csv
+key,value
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv data_with_header.csv
+CREATE TABLE table (
+    key Int64,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+{% note info %}
+
+The generated script automatically includes a `WITH` block with additional options for the table being created. All options except `STORE` have default values and are commented out.
+You can set the values of the additional options you need yourself.
+
+{% endnote %}
+
+### Column names in the first row of the CSV file, using the `--header` option {#example-header}
+
+In this example, the `key` and `value` values in the first row have the same data type (`Text`) as the values in the other rows. Therefore, without the `--header` option, the command would not use the first row as column names and would generate them automatically instead.
+To use the first row as column names in this case, specify the `--header` option explicitly:
+
+```bash
+$ cat data_with_header_text.csv
+key,value
+aaa,bbb
+ccc,ddd
+
+{{ ydb-cli }} tools infer csv data_with_header_text.csv --header
+CREATE TABLE table (
+    key Text,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Explicitly specifying the list of column names {#example-columns}
+
+```bash
+cat ~/data_no_header.csv
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv -p newtable ~/data_no_header.csv --columns my_key,my_value
+CREATE TABLE newtable (
+    my_key Int64,
+    my_value Text,
+    PRIMARY KEY (my_key)
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Automatically generating column names {#example-gen-columns}
+
+```bash
+cat ~/data_no_header.csv
+123,abc
+456,def
+
+{{ ydb-cli }} tools infer csv -p newtable ~/data_no_header.csv --gen-columns
+CREATE TABLE newtable (
+    column1 Int64,
+    column2 Text,
+    PRIMARY KEY (column1) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+```
+
+### Executing the generated query with the `--execute` option {#example-execute}
+
+In this example, the `CREATE TABLE` query is executed right after generation.
+
+```bash
+$ cat data_with_header.csv
+key,value
+123,abc
+456,def
+
+{{ ydb-cli }} -p quickstart tools infer csv data_with_header.csv --execute
+Executing request:
+
+CREATE TABLE table (
+    key Int64,
+    value Text,
+    PRIMARY KEY (key) -- First column is chosen. Probably need to change this.
+)
+WITH (
+    STORE = ROW -- or COLUMN
+    -- Other useful table options to consider:
+    --, AUTO_PARTITIONING_BY_SIZE = ENABLED
+    --, AUTO_PARTITIONING_BY_LOAD = ENABLED
+    --, UNIFORM_PARTITIONS = 100 -- Initial number of partitions
+    --, AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100
+    --, AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 1000
+);
+
+Query executed successfully.
+```
+
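
Note on the documented behavior: the column-name detection and column type inference rules described in tools-infer.md above can be sketched in Python roughly as follows. This is an illustrative sketch only, not the CLI's implementation. The function names, the `NAME_RE` pattern (a stand-in for the real column naming rules), and the reading of "types differ" as "differ in at least one column" are all assumptions. Python's `int()` and `float()` also accept some spellings (for example `1_000` or `nan`) that the actual command may treat differently.

```python
import csv
import io
import re

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
# Types ordered from least general to most general.
RANK = {"Int64": 0, "Double": 1, "Text": 2}
# Assumed approximation of the column naming rules (hypothetical).
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def infer_value_type(value):
    """Return the least general type name that fits a single CSV field."""
    try:
        number = int(value)
    except ValueError:
        pass
    else:
        # Integers outside the Int64 range fall back to Double.
        return "Int64" if INT64_MIN <= number <= INT64_MAX else "Double"
    try:
        float(value)
        return "Double"
    except ValueError:
        return "Text"


def infer_column_types(rows):
    """For each column, keep the most general type seen among its values."""
    types = []
    for row in rows:
        for i, value in enumerate(row):
            t = infer_value_type(value)
            if i >= len(types):
                types.append(t)
            elif RANK[t] > RANK[types[i]]:
                types[i] = t
    return types


def looks_like_header(rows):
    """Apply the two documented conditions to the first row."""
    if len(rows) < 2:
        return False
    first, rest = rows[0], rows[1:]
    names_ok = all(NAME_RE.fullmatch(v) for v in first)
    first_types = [infer_value_type(v) for v in first]
    return names_ok and first_types != infer_column_types(rest)


def infer_schema(text):
    """Return (column name, type) pairs for CSV text, with no options given."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if looks_like_header(rows):
        names, data = rows[0], rows[1:]
    else:
        names = [f"column{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return list(zip(names, infer_column_types(data)))


print(infer_schema("key,value\n123,abc\n456,def\n"))
print(infer_schema("key,value\naaa,bbb\nccc,ddd\n"))
```

With the two sample files from the examples, the first call treats `key,value` as a header, while the second does not, because the first row has the same types as the data rows. That is the case where the documentation recommends passing `--header` explicitly.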