author | alextarazanov <alextarazanov@yandex-team.com> | 2022-11-07 10:47:59 +0300 |
---|---|---|
committer | alextarazanov <alextarazanov@yandex-team.com> | 2022-11-07 10:47:59 +0300 |
commit | 442a3c1a91f48ba45072ac76f4baf4f6c038957a (patch) | |
tree | 751ad4ce378790b5dc4d9d28827a5e7b7b085a68 | |
parent | 23f0e57aceec4880bbbb4495ea2365bdd7cd5b66 (diff) | |
download | ydb-442a3c1a91f48ba45072ac76f4baf4f6c038957a.tar.gz |
[review] [YDB] Check SelfHeal translate
Local build log shows no errors
-rw-r--r-- | ydb/docs/en/core/_assets/pencil.svg | 3 |
-rw-r--r-- | ydb/docs/en/core/maintenance/manual/selfheal.md | 241 |
-rw-r--r-- | ydb/docs/en/core/maintenance/manual/toc_i.yaml | 2 |
-rw-r--r-- | ydb/docs/ru/core/maintenance/manual/toc_i.yaml | 2 |
4 files changed, 141 insertions, 107 deletions
diff --git a/ydb/docs/en/core/_assets/pencil.svg b/ydb/docs/en/core/_assets/pencil.svg
new file mode 100644
index 00000000000..be1c4581d60
--- /dev/null
+++ b/ydb/docs/en/core/_assets/pencil.svg
@@ -0,0 +1,3 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" viewBox="-2 -1 16 16"><g stroke="none" fill="currentColor" stroke-width="1px">
+ <path d="M10.022 7.172L4.578 13.61a.505.505 0 0 1-.446.175.812.812 0 0 1-.087.005H.5a.498.498 0 0 1-.202-.043.439.439 0 0 1-.298-.415v-3.44c0-.124.048-.237.127-.32a.477.477 0 0 1 .043-.06l5.257-6.195 4.595 3.855zm1.291-1.527L6.721 1.792 8.01.274a.467.467 0 0 1 .068-.066A.428.428 0 0 1 8.721.1l3.89 3.262a.483.483 0 0 1 .019.723.372.372 0 0 1-.022.028l-1.295 1.532zM1 10.574v2.216h2.51a2.231 2.231 0 0 0-2.23-2.216H1z" stroke="none" fill="currentColor" stroke-width="1px"></path>
+</g></svg>
diff --git a/ydb/docs/en/core/maintenance/manual/selfheal.md b/ydb/docs/en/core/maintenance/manual/selfheal.md
index 012c02908ee..5defdfbf0fb 100644
--- a/ydb/docs/en/core/maintenance/manual/selfheal.md
+++ b/ydb/docs/en/core/maintenance/manual/selfheal.md
@@ -1,110 +1,141 @@
-# Enabling/disabling SelfHeal {#selfheal}
+# Working with SelfHeal
 
-During cluster operation, individual block store volumes used by YDB or entire nodes may fail. To maintain the cluster's uptime and fault tolerance when it's impossible to promptly fix the failed nodes or volumes, YDB provides SelfHeal.
+While a clusters are running, entire nodes or individual block devices that {{ ydb-short-name }} runs on can fail.
 
-SelfHeal includes two parts. Detecting faulty disks and moving them carefully to avoid data loss and disintegration of storage groups.
+SelfHeal ensures a cluster's continuous performance and fault tolerance if malfunctioning nodes or devices cannot be repaired quickly.
 
-SelfHeal is enabled by default.
-Below are instructions how to enable or disable SelfHeal.
-
-1. Enabling detection
-
-    Open the page
-
-    ```http://localhost:8765/cms#show=config-items-25```
-
-    It can be enabled via the viewer -> Cluster Management System -> CmsConfigItems
-
-    Status field: Enable
-
-    Or via the CLI
-
-    * Go to any node
-
-    * Create a file with modified configurations
-
-        Sample config.txt file
-
-        ```
-        Actions {
-            AddConfigItem {
-                ConfigItem {
-                    Config {
-                        CmsConfig {
-                            SentinelConfig {
-                                Enable: true
-                            }
-                        }
-                    }
-                }
-            }
-        }
-        ```
-
-    * Update the config on the cluster
-
-        ```bash
-        kikimr admin console configs update config.txt
-        ```
-
-2. Enable SelfHeal
-
-    ```bash
-    kikimr -s <endpoint> admin bs config invoke --proto 'Command{EnableSelfHeal{Enable: true}}'
-    ```
+SelfHeal can:
+* Detect faulty system elements.
+* Transfer faulty elements carefully without data loss and disintegration of storage groups.
 
-Disabled in a similar way by setting the value to false.
-
-### SelfHeal settings
-
-viewer -> Cluster Management System -> CmsConfigItems
-If there are no settings yet, click Create, if there are, click the "pencil" icon in the corner.
-
-* **Status**: Enables/disables Self Heal in the CMS.
-* **Dry run**: Enables/disables the mode in which the CMS doesn't change the BSC setting.
-* **Config update interval (sec.)**: BSC config update interval.
-* **Retry interval (sec.)**: Config update retry interval.
-* **State update interval (sec.)**: PDisk state update interval, the State is what we're monitoring (through a whiteboard, for example)
-* **Timeout (sec.)**: PDisk state update timeout
-* **Change status retries**: The number of retries to change the PDisk Status in the BSC, the Status is what is stored in the BSC (ACTIVE, FAULTY, BROKEN, and so on).
-* **Change status retry interval (sec.)**: Interval between retries to change the PDisk Status in the BSC. The CMS is monitoring the disk state with the **State update interval**. If the disk remains in the same state for several **Status update interval** cycles, the CMS changes its Status in the BSC.
-Next are the settings for the number of update cycles through which the CMS will change the disk Status. If the disk State is Normal, the disk is switched to the ACTIVE Status. Otherwise, the disk is switched to the FAULTY status. The 0 value disables changing the Status for the state (this is done for Unknown by default).
-For example, with the default settings, if the CMS is monitoring the state of the Initial disk for 5 Status update interval cycles of 60 seconds each, the disk Status will be changed to FAULTY.
-* **Default state limit**: For States with no setting specified, this default value can be used. For unknown PDisk States that have no setting, this value is used, too. This value is used if no value is set for States such as Initial, InitialFormatRead, InitialSysLogRead, InitialCommonLogRead, and Normal.
-* **Initial**: PDisk starts initializing. Transition to FAULTY.
-* **InitialFormatRead**: PDisk is reading its format. Transition to FAULTY.
-* **InitialFormatReadError**: PDisk has received an error when reading its format. Transition to FAULTY.
-* **InitialSysLogRead**: PDisk is reading the system log. Transition to FAULTY.
-* **InitialSysLogReadError**: PDisk has received an error when reading the system log. Transition to FAULTY.
-* **InitialSysLogParseError**: PDisk has received an error when parsing and checking the consistency of the system log. Transition to FAULTY.
-* **InitialCommonLogRead**: PDisk is reading the common VDisk log. Transition to FAULTY.
-* **InitialCommonLogReadError**: PDisk has received an error when reading the common VDisk log. Transition to FAULTY.
-* **InitialCommonLogParseError**: PDisk has received an error when parsing and checking the consistency of the common log. Transition to FAULTY.
-* **CommonLoggerInitError**: PDisk has received an error when initializing internal structures to be logged to the common log. Transition to FAULTY.
-* **Normal**: PDisk has completed initialization and is running normally. Transition to ACTIVE will occur after this number of Cycles (that is, by default, if the disk is Normal for 5 minutes, it's switched to ACTIVE).
-* **OpenFileError**: PDisk has received an error when opening a disk file. Transition to FAULTY.
-* **Missing**: The node responds, but this PDisk is missing from its list. Transition to FAULTY.
-* **Timeout**: The node didn't respond within the specified timeout. Transition to FAULTY.
-* **NodeDisconnected**: The node has disconnected. Transition to FAULTY.
-* **Unknown**: Something unexpected, for example, the TEvUndelivered response to the state request. Transition to FAULTY.
-
-## Enabling/disabling donor disks
-
-If donor disks are disabled, when moving the VDisk, its data is lost and has to be restored according to the selected erasure.
-
-The recovery operation is more expensive than regular data transfers. Data loss also occurs, which may lead to data loss when going beyond the failure model.
-
-To prevent the above problems, there are donor disks.
-
-When transferring disks with donor disks enabled, the old VDisk remains alive until the new one transfers all the data from it to itself.
-
-The donor disk is the old VDisk after the transfer, which continues to store its data and only responds to read requests from the new VDisk.
-
-When receiving a request to read data that the new VDisk has not yet transferred, it redirects the request to the donor disk.
-
-To enable the donor disks, run the following command:
-
-`$ kikimr admin bs config invoke --proto 'Command { UpdateSettings { EnableDonorMode: true } }'`
-
-Similarly, when changing the setting to `false`, the command disables the mode.
+SelfHeal is enabled by default.
 
+## Enabling and disabling SelfHeal {#on-off}
+
+{% list tabs %}
+
+- Enable SelfHeal
+
+  1. To enable fault detection, go to `http://localhost:8765/cms#show=config-items-25`.
+  1. Go to any node.
+  1. Create an updated configuration file with the parameter `SentinelConfig { Enable: true }`.
+
+     Sample `config.txt` file:
+
+     ```text
+     Actions {
+         AddConfigItem {
+             ConfigItem {
+                 Config {
+                     CmsConfig {
+                         SentinelConfig {
+                             Enable: true
+                         }
+                     }
+                 }
+             }
+         }
+     }
+     ```
+
+  1. Run the command:
+
+     ```bash
+     kikimr admin console configs update config.txt
+     ```
+
+  1. To enable data transfer, run the command:
+
+     ```bash
+     kikimr -s <endpoint> admin bs config invoke --proto 'Command{EnableSelfHeal{Enable: true}}'
+     ```
+
+- Disable SelfHeal
+
+  1. To disable fault detection, go to `http://localhost:8765/cms#show=config-items-25`.
+  1. Go to any node.
+  1. Create an updated configuration file with the parameter `SentinelConfig { Enable: false }`.
+
+     Sample `config.txt` file:
+
+     ```text
+     Actions {
+         AddConfigItem {
+             ConfigItem {
+                 Config {
+                     CmsConfig {
+                         SentinelConfig {
+                             Enable: false
+                         }
+                     }
+                 }
+             }
+         }
+     }
+     ```
+
+  1. Run the command:
+
+     ```bash
+     kikimr admin console configs update config.txt
+     ```
+
+  1. To disable data transfer, run the command:
+
+     ```bash
+     kikimr -s <endpoint> admin bs config invoke --proto 'Command{EnableSelfHeal{Enable: false}}'
+     ```
+
+{% endlist %}
+
+## SelfHeal settings {#settings}
+
+You can configure SelfHeal in **Viewer** → **Cluster Management System** → **CmsConfigItems**.
+
+To create the initial settings, click **Create**. If you want to update the current settings, click .
+
+You can use the following settings:
+
+| **Parameter** | **Description** |
+|:---------------------------------------- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Status** | Enabling and disabling SelfHeal in CMS. |
+| **Dry run** | Enables/disables the mode in which the CMS doesn't change the BSC setting. |
+| **Config update interval (sec.)** | BSC configuration update interval. |
+| **Retry interval (sec.)** | Interval of configuration update attempts. |
+| **State update interval (sec.)** | PDisk state update interval. |
+| **Timeout (sec.)** | PDisk state update timeout. |
+| **Change status retries** | Number of retries to change the PDisk status for BSC (`ACTIVE`, `FAULTY`, `BROKEN`, and so on). |
+| **Change status retry interval (sec.)** | Delay between retries to update the PDisk status in BSC. CMS monitors the status of the disk with the interval **State update inverval**. If the disk remains in one **Status update interval** state during several cycles, the CMS changes its status to BSC.<br>Next are the settings for the number of update cycles after which the CMS changes the disk status. If the disk state is `Normal`, the disk status changes to `ACTIVE`. In other states, the disk switches to `FAULTY`.<br>The `0` value disables status changes for the state (by default, this is set for `Unknown`).<br>For example, with the default settings, if the CMS detects the `Initial` disk state for five `Status update interval` cycles which are 60 seconds each, the disk status changes to `FAULTY`. |
+| **Default state limit** | For states with no setting specified, this value can be used by default. This value is also used for unknown PDisk states that don't have any settings. It's used if no value is set for states such as `Initial`, `InitialFormatRead`, `InitialSysLogRead`, `InitialCommonLogRead`, and `Normal`. |
+| **Initial** | PDisk starts initializing. Transition to `FAULTY`. |
+| **InitialFormatRead** | PDisk is reading its format. Transition to `FAULTY`. |
+| **InitialFormatReadError** | PDisk received an error when reading its format. Transition to `FAULTY`. |
+| **InitialSysLogRead** | PDisk is reading the system log. Transition to `FAULTY`. |
+| **InitialSysLogReadError** | PDisk received an error when reading the system log. Transition to `FAULTY`. |
+| **InitialSysLogParseError** | PDisk received an error when parsing and checking the consistency of the system log. Transition to `FAULTY`. |
+| **InitialCommonLogRead** | PDisk is reading the common VDisk log. Transition to `FAULTY`. |
+| **InitialCommonLogReadError** | PDisk received an error when reading the common VDisk log. Transition to `FAULTY`. |
+| **InitialCommonLogParseError** | PDisk received an error when parsing and checking the consistency of the common log. Transition to `FAULTY`. |
+| **CommonLoggerInitError** | PDisk received an error when initializing internal structures to be logged to the common log. Transition to `FAULTY`. |
+| **Normal** | PDisk completed initialization and is running normally. Transition to `ACTIVE` will occur after a specified number of cycles (for example, if the disk is `Normal` for 5 minutes, it switches to `ACTIVE`). |
+| **OpenFileError** | PDisk received an error when opening a disk file. Transition to `FAULTY`. |
+| **Missing** | The node responds, but this PDisk is missing from its list. Transition to `FAULTY`. |
+| **Timeout** | The node didn't respond within the specified timeout. Transition to `FAULTY`. |
+| **NodeDisconnected** | The node has disconnected. Transition to `FAULTY`. |
+| **Unknown** | Unexpected response, for example, `TEvUndelivered` to the state request. Transition to `FAULTY`. |
+
+## Working with donor disks {#disks}
+
+To prevent data loss when moving a VDisk, enable donor disks:
+
+```bash
+kikimr admin bs config invoke --proto 'Command { UpdateSettings { EnableDonorMode: true } }'
+```
+
+To disable donor disks, set `EnableDonorMode` to `false` in the same command:
+
+```bash
+kikimr admin bs config invoke --proto 'Command { UpdateSettings { EnableDonorMode: false } }'
+```
+
+The donor disk is the previous VDisk after the data transfer, which continues to store its data and only responds to read requests from the new VDisk. When data is transfered with donor disks enabled, previous VDisks continue to function until the data is fully moved to the new disks.
\ No newline at end of file
diff --git a/ydb/docs/en/core/maintenance/manual/toc_i.yaml b/ydb/docs/en/core/maintenance/manual/toc_i.yaml
index 8f62cd8cc4c..73c1dd41965 100644
--- a/ydb/docs/en/core/maintenance/manual/toc_i.yaml
+++ b/ydb/docs/en/core/maintenance/manual/toc_i.yaml
@@ -11,7 +11,7 @@ items:
   href: adding_storage_groups.md
 - name: Safe restart and shutdown of nodes
   href: node_restarting.md
-- name: Enabling/disabling SelfHeal
+- name: Working with SelfHeal
   href: selfheal.md
 - name: Enabling/disabling Scrubbing
   href: scrubbing.md
diff --git a/ydb/docs/ru/core/maintenance/manual/toc_i.yaml b/ydb/docs/ru/core/maintenance/manual/toc_i.yaml
index c47a3143d31..5df3fa3408f 100644
--- a/ydb/docs/ru/core/maintenance/manual/toc_i.yaml
+++ b/ydb/docs/ru/core/maintenance/manual/toc_i.yaml
@@ -11,7 +11,7 @@ items:
   href: adding_storage_groups.md
 - name: Безопасные рестарт и выключение узлов
   href: node_restarting.md
-- name: Включение и выключение SelfHeal
+- name: Работа с SelfHeal
   href: selfheal.md
 - name: Включение/выключение Scrubbing
   href: scrubbing.md