Added the Troubleshooting section (#10888)

Co-authored-by: Ivan Blinkov <[email protected]>
ydb-platform · Nov 28, 2024 · 7827174
1 parent 776b371
commit 7827174

Showing 69 changed files with 1,049 additions and 1 deletion.
diff --git a/ydb/docs/en/core/concepts/glossary.md b/ydb/docs/en/core/concepts/glossary.md
@@ -101,6 +101,16 @@ Together, these mechanisms allow {{ ydb-short-name }} to provide [strict consist
 
 The implementation of distributed transactions is covered in a separate article [{#T}](../contributor/datashard-distributed-txs.md), while below there's a list of several [related terms](#distributed-transaction-implementation).
 
+### Interactive transactions {#interactive-transaction}
+
+The term **interactive transactions** refers to transactions that are split into multiple queries and involve data processing by an application between these queries. For example:
+
+1. Select some data.
+1. Process the selected data in the application.
+1. Update some data in the database.
+1. Commit the transaction in a separate query.
+
+
 ### Multi-version concurrency control {#mvcc}
 
 [**Multi-version concurrency control**](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) or **MVCC** is a method {{ ydb-short-name }} uses to allow multiple concurrent transactions to access the database simultaneously without interfering with each other. It is described in more detail in a separate article [{#T}](mvcc.md).
@@ -255,6 +265,20 @@ The **actor system interconnect** or **interconnect** is the [cluster's](#cluste
 
 A **Local** is an [actor service](#actor-service) running on each [node](#node). It directly manages the [tablets](#tablet) on its node and interacts with [Hive](#hive). It registers with Hive and receives commands to launch tablets.
 
+#### Actor system pool {#actor-system-pool}
+
+The **actor system pool** is a [thread pool](https://en.wikipedia.org/wiki/Thread_pool) used to run [actors](#actor). Each [node](#node) operates multiple pools to coarsely separate resources between different types of activities. A typical set of pools includes:
+
+- **System**: A pool that handles internal operations within a {{ ydb-short-name }} node. It serves system [tablets](#tablet), [state storage](#state-storage), [distributed storage](#distributed-storage) I/O, and so on.
+
+- **User**: A pool dedicated to user-generated load, such as running non-system tablets or queries executed by the [KQP](#kqp).
+
+- **Batch**: A pool for tasks without strict execution deadlines, including heavy queries handled by the [KQP](#kqp) and background operations like backups, data compaction, and garbage collection.
+
+- **IO**: A pool for tasks involving blocking operations, such as authentication or writing logs to files.
+
+- **IC**: A pool for [interconnect](#actor-system-interconnect), responsible for system calls related to data transfers across the network, data serialization, and message splitting and merging.
+
 ### Tablet implementation {#tablet-implementation}
 
 A [**tablet**](#tablet) is an [actor](#actor) with a persistent state. It includes a set of data for which this tablet is responsible and a finite state machine through which the tablet's data (or state) changes. The tablet is a fault-tolerant entity because tablet data is stored in a [Distributed storage](#distributed-storage) that survives disk and node failures. The tablet is automatically restarted on another [node](#node) if the previous one is down or overloaded. The data in the tablet changes in a consistent manner because the system infrastructure ensures that there is no more than one [tablet leader](#tablet-leader) through which changes to the tablet data are carried out.
@@ -558,7 +582,7 @@ MiniKQL is a low-level language. The system's end users only see queries in the
 
 #### KQP {#kqp}
 
-**KQP** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response.
+**KQP** or **Query Processor** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response.
 
 ### Global schema {#global-schema}
 

diff --git a/ydb/docs/en/core/dev/index.md b/ydb/docs/en/core/dev/index.md
@@ -27,4 +27,6 @@ Main resources:
   - [{#T}](../postgresql/intro.md)
   - [{#T}](../reference/kafka-api/index.md)
 
+- [{#T}](troubleshooting/index.md)
+
 If you're interested in developing {{ ydb-short-name }} core or satellite projects, refer to the [documentation for contributors](../contributor/index.md).
diff --git a/ydb/docs/en/core/dev/toc_p.yaml b/ydb/docs/en/core/dev/toc_p.yaml
@@ -18,6 +18,11 @@ items:
     path: primary-key/toc_p.yaml
 - name: Secondary indexes
   href: secondary-indexes.md
+- name: Troubleshooting
+  href: troubleshooting/index.md
+  include:
+    mode: link
+    path: troubleshooting/toc_p.yaml
 - name: Query plans optimization
   href: query-plans-optimization.md
 - name: Batch upload

diff --git a/ydb/docs/en/core/dev/troubleshooting/index.md b/ydb/docs/en/core/dev/troubleshooting/index.md
@@ -0,0 +1,5 @@
+# Troubleshooting
+
+This section of the {{ ydb-short-name }} documentation provides guidance on troubleshooting issues related to {{ ydb-short-name }} databases and the applications that interact with them.
+
+- [{#T}](performance/index.md)
diff --git a/...ocs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png b/...ocs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png
diff --git a/...e/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png b/...e/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png
diff --git a/.../en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png b/.../en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png
diff --git a/...cs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png b/...cs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png
diff --git a/...docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png b/...docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png
diff --git a/...troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png b/...troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png
diff --git a/...ev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png b/...ev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png
diff --git a/...docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png b/...docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png
diff --git a/.../dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png b/.../dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png
diff --git a/...cs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md b/...cs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md
@@ -0,0 +1,59 @@
+1. Use **Diagnostics** in the [Embedded UI](../../../../../reference/embedded-ui/index.md) to analyze CPU utilization in all pools:
+
+    1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database.
+
+    1. On the **Navigation** tab, ensure the required database is selected.
+
+    1. Open the **Diagnostics** tab.
+
+    1. On the **Info** tab, click the **CPU** button and see if any pools show high CPU usage.
+
+        ![](../_assets/embedded-ui-cpu-system-pool.png)
+
+1. Use Grafana charts to analyze CPU utilization in all pools:
+
+    1. Open the **[CPU](../../../../../reference/observability/metrics/grafana-dashboards.md#cpu)** dashboard in Grafana.
+
+    1. See if the following charts show any spikes:
+
+        - **CPU by execution pool** chart
+
+            ![](../_assets/cpu-by-pool.png)
+
+        - **User pool - CPU by host** chart
+
+            ![](../_assets/cpu-user-pool.png)
+
+        - **System pool - CPU by host** chart
+
+            ![](../_assets/cpu-system-pool.png)
+
+        - **Batch pool - CPU by host** chart
+
+            ![](../_assets/cpu-batch-pool.png)
+
+        - **IC pool - CPU by host** chart
+
+            ![](../_assets/cpu-ic-pool.png)
+
+        - **IO pool - CPU by host** chart
+
+            ![](../_assets/cpu-io-pool.png)
+
+1. If the spike is in the user pool, analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the **DB overview** dashboard in Grafana:
+
+    - **Requests** chart
+
+        ![](../_assets/requests.png)
+
+    - **Request size** chart
+
+        ![](../_assets/request-size.png)
+
+    - **Response size** chart
+
+        ![](../_assets/response-size.png)
+
+    Also, see all of the charts in the **Operations** section of the **DataShard** dashboard.
+
+1. If the spike is in the batch pool, check if there are any backups running.
diff --git a/...docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md b/...docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md
@@ -0,0 +1,18 @@
+1. Open the **[Distributed Storage Overview](../../../../../reference/observability/metrics/grafana-dashboards.md)** dashboard in Grafana.
+
+1. On the **DiskTimeAvailable and total Cost relation** chart, see if the **Total Cost** spikes cross the **DiskTimeAvailable** level.
+
+    ![](../_assets/disk-time-available--disk-cost.png)
+
+    This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies.
+
+1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds.
+
+    ![](../_assets/microbursts.png)
+
+    {% note info %}
+
+    This chart might show load microbursts that are not visible in the average usage cost on the **DiskTimeAvailable and total Cost relation** chart.
+
+    {% endnote %}
+
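The comparison described in the steps above can be sketched in code. This Python example is illustrative only: it assumes the two series from the **DiskTimeAvailable and total Cost relation** chart have already been exported as equally spaced samples. The function name `overloaded_samples` and the sample values are hypothetical and are not part of YDB or Grafana.

```python
def overloaded_samples(disk_time_available, total_cost):
    """Return indexes of samples where total cost exceeds available disk time."""
    if len(disk_time_available) != len(total_cost):
        raise ValueError("both series must have the same number of samples")
    return [
        i
        for i, (available, cost) in enumerate(zip(disk_time_available, total_cost))
        if cost > available
    ]


# Hypothetical values in the chart's conventional units.
available = [100, 100, 100, 100]
cost = [40, 120, 95, 101]
print(overloaded_samples(available, cost))
```

Samples flagged this way correspond to the moments when the blue line crosses the green one on the chart. Short bursts between samples can still go unnoticed, which is why the **Total burst duration** chart is checked separately.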
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md
@@ -0,0 +1,14 @@
+# CPU bottleneck
+
+High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes.
+
+{{ ydb-short-name }} nodes primarily consume CPU resources for running [actors](../../../../concepts/glossary.md#actor). On each node, actors are executed using multiple [actor system pools](../../../../concepts/glossary.md#actor-system-pool). The resource consumption of each pool is measured separately, which makes it possible to identify which kind of activity changed its behavior.
+
+## Diagnostics
+
+<!-- The include is added to allow partial overrides in overlays  -->
+{% include notitle [#](_includes/cpu-bottleneck.md) %}
+
+## Recommendation
+
+Add more [database nodes](../../../../concepts/glossary.md#database-node) to the cluster or allocate more CPU cores to the existing nodes. If that's not possible, consider distributing CPU cores between pools differently.
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md
@@ -0,0 +1,29 @@
+# Disk space
+
+A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively.
+
+## Diagnostics
+
+1. See if the **[DB overview > Storage](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** charts in Grafana show any spikes.
+
+1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage.
+
+    {% note tip %}
+
+    Use the **Out of Space** filter to list only the storage groups with full disks.
+
+    {% endnote %}
+
+    ![](_assets/storage-groups-disk-space.png)
+
+{% note info %}
+
+It is also recommended to use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information.
+
+{% endnote %}
+
+## Recommendations
+
+Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.
+
+If the cluster doesn't have spare storage groups, configure them first. Add more [storage nodes](../../../../concepts/glossary.md#storage-node), if necessary.
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md
@@ -0,0 +1,58 @@
+# Insufficient memory (RAM)
+
+If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured.
+
+If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory.
+
+{% note warning %}
+
+If {{ ydb-short-name }} nodes are running on servers with swap enabled, disable it. {{ ydb-short-name }} is a distributed system, so if a node restarts due to lack of memory, the client will simply connect to another node and continue accessing data as if nothing happened. Swap would allow the query to continue on the same node but with degraded performance from increased disk I/O, which is generally less desirable.
+
+{% endnote %}
+
+Even though the reasons and mechanics of performance degradation due to insufficient memory might differ, the symptoms of increased latencies during query execution and data retrieval are similar in all cases.
+
+Additionally, it may matter which components within the {{ ydb-short-name }} process consume the memory.
+
+## Diagnostics
+
+1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} version upgrades and other planned maintenance. This could reveal nodes terminated by the OOM killer and restarted by `systemd`.
+
+    1. Open the [Embedded UI](../../../../reference/embedded-ui/index.md).
+
+    1. On the **Nodes** tab, look for nodes that have low uptime.
+
+    1. Choose a recently restarted node and log in to the server hosting it. Run the `dmesg` command to check if the kernel has recently activated the OOM killer mechanism.
+
+        Look for lines like these:
+
+            [ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000
+            [ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0
+
+    Additionally, review the `ydbd` logs for relevant details.
+
+
+1. Determine whether memory usage reached 100% of capacity.
+
+    1. Open the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana.
+
+    1. Analyze the charts in the **Memory** section.
+
+1. Determine whether the user load on {{ ydb-short-name }} has increased. Analyze the following charts on the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana:
+
+    - **Requests** chart
+    - **Request size** chart
+    - **Response size** chart
+
+1. Determine whether new releases or data access changes occurred in your applications working with {{ ydb-short-name }}.
+
+## Recommendation
+
+Consider the following solutions for addressing insufficient memory:
+
+- If the load on {{ ydb-short-name }} has increased due to new usage patterns or increased query rate, try optimizing the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes.
+
+- If the load on {{ ydb-short-name }} has not changed but nodes are still restarting, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit for the nodes. For more information about memory management in {{ ydb-short-name }}, see [{#T}](../../../../reference/configuration/index.md#memory-controller).
+
+
+
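The `dmesg` check from the diagnostics above can be automated. The following Python example is a minimal sketch, assuming Python 3 is available on the host. The helper name `find_oom_kills` is hypothetical, and the pattern only covers the `Out of memory: Killed process` line format quoted above, which may differ between kernel versions.

```python
import re

# Matches kernel log lines such as:
# [ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:...
OOM_KILL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Out of memory: Killed process "
    r"(?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(dmesg_text):
    """Return (seconds_since_boot, pid, process_name) for each OOM kill found."""
    kills = []
    for line in dmesg_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append(
                (float(match.group("ts")), int(match.group("pid")), match.group("name"))
            )
    return kills
```

To use it, pass in the kernel log text, for example the `stdout` of `subprocess.run(["dmesg"], capture_output=True, text=True)`. Reading the kernel log may require elevated privileges on some systems.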
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md
@@ -0,0 +1,15 @@
+# I/O bandwidth
+
+A high rate of read and write operations can overwhelm the disk subsystem, leading to increased data access latencies. When the system cannot read or write data quickly enough, queries that rely on disk access will experience delays.
+
+## Diagnostics
+
+<!-- The include is added to allow partial overrides in overlays  -->
+{% include notitle [io-bandwidth](./_includes/io-bandwidth.md) %}
+
+## Recommendations
+
+Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.
+
+In cases of high microburst rates, balancing the load across storage groups might help.
+
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml
@@ -0,0 +1,9 @@
+items:
+    - name: CPU
+      href: cpu-bottleneck.md
+    - name: Memory
+      href: insufficient-memory.md
+    - name: I/O bandwidth
+      href: io-bandwidth.md
+    - name: Disk space
+      href: disk-space.md