author    alextarazanov <alextarazanov@yandex-team.com>  2022-08-10 15:43:40 +0300
committer alextarazanov <alextarazanov@yandex-team.com>  2022-08-10 15:43:40 +0300
commit    47782bc45c02bda33735f8d275730cb6227199f8 (patch)
tree      efb4dfb5be95c6a8abb5aa3b990b4629cd45e86c
parent    d305bd814a40254aef0919b960bc9f24a7e9569f (diff)
download  ydb-47782bc45c02bda33735f8d275730cb6227199f8.tar.gz
[review] Check translate Blobstorage configuration
-rw-r--r--  ydb/docs/en/core/administration/production-storage-config.md                                      78
-rw-r--r--  ydb/docs/en/core/concepts/cluster/_includes/distributed_storage/distributed_storage_interface.md  10
-rw-r--r--  ydb/docs/en/core/deploy/_includes/index.md                                                         1
-rw-r--r--  ydb/docs/en/core/deploy/toc_i.yaml                                                                 2
-rw-r--r--  ydb/docs/ru/core/deploy/_includes/index.md                                                         2
5 files changed, 87 insertions, 6 deletions
diff --git a/ydb/docs/en/core/administration/production-storage-config.md b/ydb/docs/en/core/administration/production-storage-config.md
new file mode 100644
index 00000000000..6e2a89f39ec
--- /dev/null
+++ b/ydb/docs/en/core/administration/production-storage-config.md
@@ -0,0 +1,78 @@
+# BlobStorage production configurations
+
+To ensure the required fault tolerance of {{ ydb-short-name }}, configure the [cluster disk subsystem](../concepts/cluster/distributed_storage.md) properly: select the appropriate [fault tolerance mode](#fault-tolerance) and [hardware configuration](#requirements) for your cluster.
+
+## Fault tolerance modes {#fault-tolerance}
+
+We recommend using the following [fault tolerance modes](../cluster/topology.md) for {{ ydb-short-name }} production installations:
+
+* `block-4-2`: For a cluster hosted in a single availability zone.
+* `mirror-3-dc`: For a cluster hosted in three availability zones.
+
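+The chosen mode is reflected in the cluster configuration. The fragment below is a minimal, hedged sketch of what this might look like for `mirror-3-dc`; the key names (`static_erasure`, `erasure_species`, and the surrounding structure) are assumptions based on typical {{ ydb-short-name }} configuration files and should be verified against the Configuration reference.
+
+```yaml
+# Sketch: selecting the fault tolerance mode in the cluster config.
+# Key names such as static_erasure and erasure_species are assumptions;
+# verify them against the Configuration reference for your version.
+static_erasure: mirror-3-dc            # mode assumed for the static group
+domains_config:
+  domain:
+  - name: Root
+    storage_pool_types:
+    - kind: ssd
+      pool_config:
+        box_id: 1
+        erasure_species: mirror-3-dc   # mode assumed for dynamic storage groups
+        pdisk_filter:
+        - property:
+          - type: SSD
+        vdisk_kind: Default
+```
+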
+The failure model of {{ ydb-short-name }} is based on the concepts of a fail domain and a fail realm.
+
+Fail domain
+
+: A set of hardware that may fail concurrently.
+
+ For example, a fail domain includes disks of the same server (as all server disks may be unavailable if the server PSU or network controller is down). A fail domain also includes servers located in the same server rack (as the entire hardware in the rack may be unavailable if there is a power outage or some issue with the network hardware in the same rack).
+
+ A failure of any fail domain is handled automatically, without shutting down the system.
+
+Fail realm
+
+: A set of fail domains that may fail concurrently.
+
+ An example of a fail realm is hardware located in the same data center that may fail as a result of a natural disaster.
+
+Usually a fail domain is a server rack, while a fail realm is a data center.
+
+When creating a [storage group](../concepts/databases.md#storage-groups), {{ ydb-short-name }} groups VDisks that are located on PDisks from different fail domains. For `block-4-2` mode, the group's PDisks should be distributed across at least 8 fail domains, and for `mirror-3-dc` mode, across 3 fail realms with at least 3 fail domains in each of them.
+
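+To make this geometry concrete, the hedged fragment below sketches how host placement might be described for `mirror-3-dc`, assuming that fail realms and fail domains are derived from the data center and rack labels assigned to each host. The host names are placeholders, and the `hosts`/`location` field names are assumptions to be checked against the Configuration reference.
+
+```yaml
+# Sketch (abridged): host placement for mirror-3-dc: 3 fail realms (data centers)
+# with at least 3 fail domains (racks) in each. Field names such as location,
+# data_center, and rack are assumptions; verify them against the Configuration reference.
+hosts:
+- host: node-1.example.net
+  host_config_id: 1
+  location:
+    data_center: 'DC-A'
+    rack: 'A-1'
+- host: node-2.example.net
+  host_config_id: 1
+  location:
+    data_center: 'DC-A'
+    rack: 'A-2'
+- host: node-3.example.net
+  host_config_id: 1
+  location:
+    data_center: 'DC-A'
+    rack: 'A-3'
+# ... the same pattern repeats for DC-B and DC-C, keeping the same number of racks in each
+```
+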
+## Hardware configuration {#requirements}
+
+If a disk fails, {{ ydb-short-name }} can automatically reconfigure the affected storage group so that the VDisk located on the failed hardware is replaced with a new VDisk, which the system tries to place on hardware that is operational while the group is being reconfigured. The same rule applies as when creating a group: the new VDisk is created in a fail domain different from the fail domains of every other VDisk in the group (and, for `mirror-3-dc`, in the same fail realm as the failed VDisk).
+
+This causes problems when a cluster's hardware is spread across only the minimum required number of fail domains:
+
+* If the entire fail domain is down, reconfiguration no longer makes sense, since a new VDisk can only be located in the fail domain that is down.
+* If part of a fail domain is down, reconfiguration is possible, but the load that was previously handled by the failed hardware will only be redistributed across hardware in the same fail domain.
+
+If the number of fail domains in a cluster exceeds the minimum required for creating storage groups by at least one (that is, 9 domains for `block-4-2` and 4 domains in each fail realm for `mirror-3-dc`), then, when some hardware fails, the load can be redistributed across all the hardware that is still running.
+
+The system can work with fail domains of any size. However, if there are few domains and they contain different numbers of disks, the number of storage groups that you can create will be limited, and some hardware in the largest fail domains may be underutilized. If the hardware is fully utilized, a significant imbalance in fail domain sizes may make reconfiguration impossible.
+
+> For example, there are 15 racks in a cluster with `block-4-2` fault tolerance mode. The first of the 15 racks hosts 20 servers and the other 14 racks host 10 servers each. To fully utilize all the 20 servers from the first rack, {{ ydb-short-name }} will create groups so that 1 disk from this largest fail domain is used in each group. As a result, if any other fail domain's hardware is down, the load can't be distributed to the hardware in the first rack.
+
+{{ ydb-short-name }} can group disks of different vendors, capacities, and speeds. The resulting characteristics of a group are determined by the worst characteristics of the hardware serving it, so the best results are usually achieved with identical hardware. When building large clusters, keep in mind that hardware from the same batch is more likely to share a manufacturing defect and fail simultaneously.
+
+Therefore, we recommend the following optimal hardware configurations for production installations:
+
+* **A cluster hosted in 1 availability zone**: It uses `block-4-2` fault tolerance mode and consists of 9 or more racks with the same number of identical servers in each rack.
+* **A cluster hosted in 3 availability zones**: It uses `mirror-3-dc` fault tolerance mode and is distributed across 3 data centers with 4 or more racks in each of them, each rack hosting the same number of identical servers.
+
+See also [{#T}](#reduced).
+
+## Redundancy recovery {#rebuild}
+
+Automatic reconfiguration of storage groups reduces the risk of data loss in the event of multiple failures, provided the failures occur at intervals long enough for redundancy to be restored. By default, reconfiguration starts one hour after {{ ydb-short-name }} detects a failure.
+
+Once a group is reconfigured, a new VDisk is automatically populated with data to restore the required storage redundancy in the group. This increases the load on other VDisks in the group and the network. To reduce the impact of redundancy recovery on the system performance, the total data replication speed is limited both on the source and target VDisks.
+
+The time it takes to restore redundancy depends on the amount of data and on hardware performance. For example, replication on fast NVMe SSDs may take about an hour, while on large HDDs it may take more than 24 hours. For reconfiguration to be possible at all, a cluster must have free slots for creating VDisks in different fail domains. When deciding how many slots to keep free, factor in the probability of hardware failure, the time it takes to replicate the data, and the time needed to replace the failed hardware.
+
+## Simplified hardware configurations {#reduced}
+
+If it's not possible to use the [recommended amount](#requirements) of hardware, you can split the servers within a single rack into two dummy fail domains. In such a configuration, the failure of one rack means the failure of two domains rather than one; in [both fault tolerance modes](#fault-tolerance), {{ ydb-short-name }} keeps running when two domains fail. With dummy fail domains, the minimum number of racks in a cluster is 5 for `block-4-2` mode and 2 per data center for `mirror-3-dc` mode.
+
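+Assuming, as in the sketch above, that fail domains follow the rack labels assigned to hosts, one way to illustrate such dummy fail domains is to give the two halves of a physical rack distinct labels. This is only a sketch of the idea, with placeholder host names and assumed field names, not a prescribed configuration.
+
+```yaml
+# Sketch: splitting one physical rack into two dummy fail domains by assigning
+# different rack labels to its halves (labels are arbitrary; field names are assumed).
+hosts:
+- host: node-1.example.net    # upper half of rack 1
+  host_config_id: 1
+  location:
+    data_center: 'DC-A'
+    rack: '1a'
+- host: node-2.example.net    # lower half of rack 1
+  host_config_id: 1
+  location:
+    data_center: 'DC-A'
+    rack: '1b'
+```
+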
+## Fault tolerance level {#reliability}
+
+The table below describes fault tolerance levels for different fault tolerance modes and hardware configurations of a {{ ydb-short-name }} cluster:
+
+Fault tolerance<br>mode | Fail<br>domain | Fail<br>realm | Number of<br>data centers | Number of<br>server racks | Fault tolerance<br>level
+:--- | :---: | :---: | :---: | :---: | :---
+`block-4-2` | Rack | Data center | 1 | 9 or more | Can withstand the failure of 2 racks
+`block-4-2` | Half a rack | Data center | 1 | 5 or more | Can withstand the failure of 1 rack
+`block-4-2` | Server | Data center | 1 | Doesn't matter | Can withstand the failure of 2 servers
+`mirror-3-dc` | Rack | Data center | 3 | 4 in each data center | Can withstand the failure of a data center and 1 rack in one of the two other data centers
+`mirror-3-dc` | Server | Data center | 3 | Doesn't matter | Can withstand the failure of a data center and 1 server in one of the two other data centers
diff --git a/ydb/docs/en/core/concepts/cluster/_includes/distributed_storage/distributed_storage_interface.md b/ydb/docs/en/core/concepts/cluster/_includes/distributed_storage/distributed_storage_interface.md
index 22d0f87a7ca..0f0937693af 100644
--- a/ydb/docs/en/core/concepts/cluster/_includes/distributed_storage/distributed_storage_interface.md
+++ b/ydb/docs/en/core/concepts/cluster/_includes/distributed_storage/distributed_storage_interface.md
@@ -25,7 +25,7 @@ When performing reads, the blob ID is specified, which can be arbitrary, but pre
Blobs are written in a logical entity called *group*. A special actor called DS proxy is created on every node for each group that is written to. This actor is responsible for performing all operations related to the group. The actor is created automatically through the NodeWarden service that will be described below.
-Physically, a group is a set of multiple physical devices (OS block devices) that are located on different nodes so that the failure of one device correlates as little as possible with the failure of another device. These devices are usually located in different racks or different datacenters. On each of these devices, some space is allocated for the group, which is managed by a special service called *VDisk*. Each VDisk runs on top of a block storage device from which it is separated by another service called *PDisk*. Blobs are broken into fragments based on *erasure coding* with these fragments written to VDisks. Before splitting into fragments, optional encryption of the data in the group can be performed.
+Physically, a group is a set of multiple physical devices (OS block devices) that are located on different nodes so that the failure of one device correlates as little as possible with the failure of another device. These devices are usually located in different racks or different datacenters. On each of these devices, some space is allocated for the group, which is managed by a special service called *VDisk*. Each VDisk runs on top of a block storage device from which it is separated by another service called *PDisk*. Blobs are broken into fragments based on [erasure coding](https://en.wikipedia.org/wiki/Erasure_code) with these fragments written to VDisks. Before splitting into fragments, optional encryption of the data in the group can be performed.
This scheme is shown in the figure below.
@@ -39,7 +39,7 @@ A group can be treated as a set of VDisks:
Each VDisk within a group has a sequence number, and disks are numbered 0 to N-1, where N is the number of disks in the group.
-In addition, disks are combined into fail domains, and fail domains are combined into fail realms. As a rule, each fail domain comprises exactly one disk (although, in theory, it may have more but this has found no practical application), and multiple fail realms are only used for groups that host their data in three datacenters at once. Thus, in addition to a group sequence number, each VDisk is assigned an ID that consists of a fail realm index, the index that a fail domain has in a fail realm, and the index that a VDisk has in the fail domain. In string form, this ID is written as `VDISK[GroupId:GroupGeneration:FailRealm:FailDomain:VDisk]`.
+In addition, the disks of a group are combined into fail domains, and fail domains into fail realms. Each fail domain usually contains exactly one disk (in theory it may contain more, but this is not used in practice), while multiple fail realms are only used for groups whose data is stored in three data centers at once. Thus, in addition to a group sequence number, each VDisk is assigned an ID that consists of a fail realm index, the index that a fail domain has in a fail realm, and the index that a VDisk has in the fail domain. In string form, this ID is written as `VDISK[GroupId:GroupGeneration:FailRealm:FailDomain:VDisk]`.
All fail realms have the same number of fail domains, and all fail domains include the same number of disks. The number of the fail realms, the number of the fail domains inside the fail realm, and the number of the disks inside the fail domain make up the geometry of the group. The geometry depends on the way the data is encoded in the group. For example, for block-4-2 numFailRealms = 1, numFailDomainsInFailRealm >= 8 (only 8 fail realms are used in practice), numVDisksInFailDomain >= 1 (strictly 1 fail domain is used in practice). For mirror-3-dc numFailRealms >= 3, numFailDomainsInFailRealm >= 3, and numVDisksInFailDomain >= 1 (3x3x1 are used).
@@ -58,7 +58,7 @@ A special concept of a *subgroup* is introduced for each blob. It is an ordered
Each disk in the subgroup corresponds to a disk in the group, but is limited by the allowed number of stored blobs. For example, for block-4-2 encoding with four data parts and two parity parts, the functional purpose of the disks in a subgroup is as follows:
| Number in the subgroup | Possible PartIds |
-| ------------------- | ------------------- |
+|-------------------|-------------------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
@@ -68,7 +68,7 @@ Each disk in the subgroup corresponds to a disk in the group, but is limited by
| 6 | 1,2,3,4,5,6 |
| 7 | 1,2,3,4,5,6 |
-In this case, PartID=1..4 corresponds to the data parts (which are obtained by splitting the original blob into 4 equal parts), and PartID=5..6 are parity fragments. Disks numbered 6 and 7 in the subgroup are called *handoff disks*. Any part, either one or more, can be written to them. Disks 0..5 can only store the corresponding blob parts.
-In practice, when performing writes, the system tries to write 6 parts to the first 6 disks of the subgroup and, in the vast majority of cases, these attempts are successful. However, if any of the disks is not available, a write operation cannot succeed, which is when handoff disks kick in receiving the parts belonging to the disks that did not respond in time. It may turn out that several fragments of the same blob are sent to the same handoff disk as a result of complicated brakes and races. This is acceptable although it makes no sense in terms of storage: each fragment must have its own unique disk.
+In this case, PartId=1..4 corresponds to data fragments (resulting from dividing the original blob into 4 equal parts), while PartId=5..6 stands for parity fragments. Disks numbered 6 and 7 in the subgroup are called *handoff disks*. Any part, either one or more, can be written to them. You can only write the respective blob parts to disks 0..5.
+In practice, when performing writes, the system tries to write 6 parts to the first 6 disks of the subgroup and, in the vast majority of cases, these attempts are successful. However, if any of these disks is unavailable, the write to it cannot succeed, which is when the handoff disks kick in, receiving the parts intended for the disks that did not respond in time. As a result of complex slowdowns and races, several fragments of the same blob may end up on the same handoff disk. This is acceptable, although suboptimal in terms of storage: ideally, each fragment should reside on its own disk.
diff --git a/ydb/docs/en/core/deploy/_includes/index.md b/ydb/docs/en/core/deploy/_includes/index.md
index c8326c85fd1..bf2f6ecdecf 100644
--- a/ydb/docs/en/core/deploy/_includes/index.md
+++ b/ydb/docs/en/core/deploy/_includes/index.md
@@ -7,5 +7,6 @@ This section provides information on deploying and configuring multi-node YDB cl
* [Deployment in Kubernetes](../orchestrated/concepts.md).
* [Deployment on virtual and physical servers](../manual/deploy-ydb-on-premises.md).
* [Configuration](../configuration/config.md).
+* [BlobStorage production configurations](../../administration/production-storage-config.md).
Step-by-step scenarios for rapidly deploying a local single-node cluster for development and testing are given in the [Getting started](../../getting_started/self_hosted/index.md) section.
diff --git a/ydb/docs/en/core/deploy/toc_i.yaml b/ydb/docs/en/core/deploy/toc_i.yaml
index 6f8313141da..2cc76d971d6 100644
--- a/ydb/docs/en/core/deploy/toc_i.yaml
+++ b/ydb/docs/en/core/deploy/toc_i.yaml
@@ -5,4 +5,6 @@ items:
href: manual/deploy-ydb-on-premises.md
- name: Configuration
href: configuration/config.md
+- name: BlobStorage production configurations
+ href: ../administration/production-storage-config.md
diff --git a/ydb/docs/ru/core/deploy/_includes/index.md b/ydb/docs/ru/core/deploy/_includes/index.md
index 4d80c846807..018c1bcf30c 100644
--- a/ydb/docs/ru/core/deploy/_includes/index.md
+++ b/ydb/docs/ru/core/deploy/_includes/index.md
@@ -7,6 +7,6 @@
* [Развертывание в Kubernetes](../orchestrated/concepts.md).
* [Развертывание на виртуальных и железных серверах](../manual/deploy-ydb-on-premises.md).
* [Конфигурирование](../configuration/config.md).
-- [Промышленные конфигурации BlobStorage](../../administration/production-storage-config.md).
+* [Промышленные конфигурации BlobStorage](../../administration/production-storage-config.md).
Пошаговые сценарии быстрого развертывания локального одноузлового кластера для целей разработки и тестирования функциональности приведены в разделе [Начало работы](../../getting_started/self_hosted/index.md).