Commit
Added the Troubleshooting section (#10888)
Co-authored-by: Ivan Blinkov <[email protected]>
1 parent 776b371, commit 7827174
Showing 69 changed files with 1,049 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# Troubleshooting | ||
|
||
This section of the {{ ydb-short-name }} documentation provides guidance on troubleshooting issues related to {{ ydb-short-name }} databases and the applications that interact with them. | ||
|
||
- [{#T}](performance/index.md) |
Binary file added
BIN
+80.8 KB
...ocs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png
Binary file added
BIN
+109 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png
Binary file added
BIN
+76.9 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png
Binary file added
BIN
+78.5 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png
Binary file added
BIN
+49.2 KB
...e/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png
Binary file added
BIN
+47.8 KB
.../en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png
Binary file added
BIN
+71.1 KB
...cs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png
Binary file added
BIN
+77.9 KB
...docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png
Binary file added
BIN
+79.1 KB
...troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png
Binary file added
BIN
+347 KB
...ev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png
Binary file added
BIN
+60 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png
Binary file added
BIN
+112 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png
Binary file added
BIN
+129 KB
ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png
Binary file added
BIN
+120 KB
...docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png
Binary file added
BIN
+347 KB
.../dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png
59 changes: 59 additions & 0 deletions
59
...cs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
1. Use **Diagnostics** in the [Embedded UI](../../../../../reference/embedded-ui/index.md) to analyze CPU utilization in all pools: | ||
|
||
1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database. | ||
|
||
1. On the **Navigation** tab, ensure the required database is selected. | ||
|
||
1. Open the **Diagnostics** tab. | ||
|
||
1. On the **Info** tab, click the **CPU** button and see if any pools show high CPU usage. | ||
|
||
![](../_assets/embedded-ui-cpu-system-pool.png) | ||
|
||
1. Use Grafana charts to analyze CPU utilization in all pools: | ||
|
||
1. Open the **[CPU](../../../../../reference/observability/metrics/grafana-dashboards.md#cpu)** dashboard in Grafana. | ||
|
||
1. See if the following charts show any spikes: | ||
|
||
- **CPU by execution pool** chart | ||
|
||
![](../_assets/cpu-by-pool.png) | ||
|
||
- **User pool - CPU by host** chart | ||
|
||
![](../_assets/cpu-user-pool.png) | ||
|
||
- **System pool - CPU by host** chart | ||
|
||
![](../_assets/cpu-system-pool.png) | ||
|
||
- **Batch pool - CPU by host** chart | ||
|
||
![](../_assets/cpu-batch-pool.png) | ||
|
||
- **IC pool - CPU by host** chart | ||
|
||
![](../_assets/cpu-ic-pool.png) | ||
|
||
- **IO pool - CPU by host** chart | ||
|
||
![](../_assets/cpu-io-pool.png) | ||
|
||
1. If the spike is in the user pool, analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the **DB overview** dashboard in Grafana: | ||
|
||
- **Requests** chart | ||
|
||
![](../_assets/requests.png) | ||
|
||
- **Request size** chart | ||
|
||
![](../_assets/request-size.png) | ||
|
||
- **Response size** chart | ||
|
||
![](../_assets/response-size.png) | ||
|
||
Also, see all of the charts in the **Operations** section of the **DataShard** dashboard. | ||
|
||
2. If the spike is in the batch pool, check if there are any backups running. |
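The pool-by-pool triage above can be sketched in a few lines. This is only an illustration: the 0.9 threshold, the sample values, and the `pools_over_threshold` helper are hypothetical and are not part of {{ ydb-short-name }} or Grafana; only the pool names and the follow-up actions come from the steps above.

```python
def pools_over_threshold(cpu_by_pool, threshold=0.9):
    """Return the names of pools whose peak utilization (0.0 to 1.0) reaches the threshold."""
    return sorted(
        pool
        for pool, samples in cpu_by_pool.items()
        if samples and max(samples) >= threshold
    )


# Follow-up actions taken from the diagnostic steps above.
NEXT_STEP = {
    "User": "review the Requests, Request size, and Response size charts",
    "Batch": "check whether any backups are running",
}

# Hypothetical utilization samples per actor system pool (assumption, not real data).
samples = {
    "System": [0.20, 0.25, 0.22],
    "User": [0.55, 0.97, 0.60],
    "Batch": [0.10, 0.12, 0.11],
    "IC": [0.30, 0.31, 0.29],
    "IO": [0.05, 0.04, 0.06],
}

hot_pools = pools_over_threshold(samples)
for pool in hot_pools:
    print(pool, "->", NEXT_STEP.get(pool, "inspect the pool's CPU by host chart"))
```

In practice the samples would come from the same metrics that back the **CPU** dashboard, and the threshold would be chosen per cluster.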
18 changes: 18 additions & 0 deletions
18
...docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
1. Open the **[Distributed Storage Overview](../../../../../reference/observability/metrics/grafana-dashboards.md)** dashboard in Grafana. | ||
|
||
1. On the **DiskTimeAvailable and total Cost relation** chart, see if the **Total Cost** spikes cross the **DiskTimeAvailable** level. | ||
|
||
![](../_assets/disk-time-available--disk-cost.png) | ||
|
||
This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies. | ||
|
||
1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds. | ||
|
||
![](../_assets/microbursts.png) | ||
|
||
{% note info %} | ||
|
||
This chart might show microbursts of load that the average usage cost in the **DiskTimeAvailable and total Cost relation** chart does not reveal. | ||
|
||
{% endnote %} | ||
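To see why an averaged cost can hide a microburst, consider the following minimal sketch. The sample values and the `overloaded_samples` helper are hypothetical and only illustrate the comparison described above; they do not reproduce how {{ ydb-short-name }} computes these metrics.

```python
def overloaded_samples(total_cost, disk_time_available):
    """Return indices of samples where the usage cost exceeds the available capacity."""
    return [
        i
        for i, (cost, capacity) in enumerate(zip(total_cost, disk_time_available))
        if cost > capacity
    ]


# Hypothetical samples in the chart's conventional units (assumption, not real data).
disk_time_available = [100, 100, 100, 100, 100, 100]
total_cost = [40, 45, 180, 50, 42, 38]

bursts = overloaded_samples(total_cost, disk_time_available)
average_cost = sum(total_cost) / len(total_cost)

print("overloaded sample indices:", bursts)
print("average cost:", round(average_cost, 1))
```

Here a single sample exceeds the capacity while the average over the window stays well below it, which is the situation the burst chart is meant to expose.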
|
14 changes: 14 additions & 0 deletions
14
ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# CPU bottleneck | ||
|
||
High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes. | ||
|
||
{{ ydb-short-name }} nodes primarily consume CPU resources for running [actors](../../../../concepts/glossary.md#actor). On each node, actors are executed using multiple [actor system pools](../../../../concepts/glossary.md#actor-system-pools). The resource consumption of each pool is measured separately, which makes it possible to identify which kind of activity changed its behavior. | ||
|
||
## Diagnostics | ||
|
||
<!-- The include is added to allow partial overrides in overlays --> | ||
{% include notitle [#](_includes/cpu-bottleneck.md) %} | ||
|
||
## Recommendation | ||
|
||
Add more [database nodes](../../../../concepts/glossary.md#database-node) to the cluster or allocate more CPU cores to the existing nodes. If that's not possible, consider distributing CPU cores between pools differently. |
29 changes: 29 additions & 0 deletions
29
ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# Disk space | ||
|
||
A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively. | ||
|
||
## Diagnostics | ||
|
||
1. See if the **[DB overview > Storage](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** charts in Grafana show any spikes. | ||
|
||
1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage. | ||
|
||
{% note tip %} | ||
|
||
Use the **Out of Space** filter to list only the storage groups with full disks. | ||
|
||
{% endnote %} | ||
|
||
![](_assets/storage-groups-disk-space.png) | ||
|
||
{% note info %} | ||
|
||
It is also recommended to use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information. | ||
|
||
{% endnote %} | ||
|
||
## Recommendations | ||
|
||
Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database. | ||
|
||
If the cluster doesn't have spare storage groups, configure them first. Add more [storage nodes](../../../../concepts/glossary.md#storage-node), if necessary. |
58 changes: 58 additions & 0 deletions
58
ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Insufficient memory (RAM) | ||
|
||
If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured. | ||
|
||
If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory. | ||
|
||
{% note warning %} | ||
|
||
If {{ ydb-short-name }} nodes are running on servers with swap enabled, disable it. {{ ydb-short-name }} is a distributed system, so if a node restarts due to lack of memory, the client will simply connect to another node and continue accessing data as if nothing happened. Swap would allow the query to continue on the same node but with degraded performance from increased disk I/O, which is generally less desirable. | ||
|
||
{% endnote %} | ||
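Whether swap is enabled on a Linux server can be read from `/proc/meminfo`, where the `SwapTotal` line reports the configured swap size in kB. The following is a minimal sketch; the `swap_total_kb` and `swap_enabled` helper names are illustrative, and the sample text is made up.

```python
def swap_total_kb(meminfo_text):
    """Return the SwapTotal value in kB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return None


def swap_enabled(meminfo_text):
    """Return True if the text reports a non-zero amount of configured swap."""
    total = swap_total_kb(meminfo_text)
    return total is not None and total > 0


# Made-up /proc/meminfo excerpts used for illustration.
WITH_SWAP = "MemTotal:       16384000 kB\nSwapTotal:       2097148 kB\nSwapFree:        2097148 kB\n"
WITHOUT_SWAP = "MemTotal:       16384000 kB\nSwapTotal:             0 kB\nSwapFree:              0 kB\n"

print(swap_enabled(WITH_SWAP))
print(swap_enabled(WITHOUT_SWAP))
```

On a real server, the input would be the contents of `/proc/meminfo`, for example `swap_enabled(open("/proc/meminfo").read())`.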
|
||
Even though the reasons and mechanics of performance degradation due to insufficient memory might differ, the symptoms of increased latencies during query execution and data retrieval are similar in all cases. | ||
|
||
It can also matter which components within the {{ ydb-short-name }} process consume the memory. | ||
|
||
## Diagnostics | ||
|
||
1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} version upgrades and other planned maintenance. This could reveal nodes terminated by the OOM killer and restarted by `systemd`. | ||
|
||
1. Open the [Embedded UI](../../../../reference/embedded-ui/index.md). | ||
|
||
1. On the **Nodes** tab, look for nodes that have low uptime. | ||
|
||
1. Choose a recently restarted node and log in to the server hosting it. Run the `dmesg` command to check if the kernel has recently activated the OOM killer mechanism. | ||
|
||
Look for lines like these: | ||
|
||
[ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000 | ||
[ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0 | ||
|
||
Additionally, review the `ydbd` logs for relevant details. | ||
|
||
|
||
1. Determine whether memory usage reached 100% of capacity. | ||
|
||
1. Open the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana. | ||
|
||
1. Analyze the charts in the **Memory** section. | ||
|
||
1. Determine whether the user load on {{ ydb-short-name }} has increased. Analyze the following charts on the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana: | ||
|
||
- **Requests** chart | ||
- **Request size** chart | ||
- **Response size** chart | ||
|
||
1. Determine whether new releases or data access changes occurred in your applications working with {{ ydb-short-name }}. | ||
|
||
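The `dmesg` check from the first diagnostic step can be scripted. The following is a minimal sketch that scans `dmesg` output for `Out of memory: Killed process` lines in the format of the sample above. The `find_oom_kills` helper and the default process names (`ydb`, as in the sample, and `ydbd`) are assumptions for illustration; adjust them to the process name used in your deployment.

```python
import re

# Matches the kernel message "Out of memory: Killed process <pid> (<name>)",
# following the format of the sample dmesg output in the diagnostics above.
OOM_KILL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Out of memory: Killed process "
    r"(?P<pid>\d+) \((?P<name>[^)]+)\).*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(dmesg_text, process_names=("ydb", "ydbd")):
    """Return one dict per OOM kill of a process whose name is in process_names."""
    kills = []
    for line in dmesg_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match and match.group("name") in process_names:
            kills.append({
                # dmesg timestamps are seconds since kernel boot by default.
                "seconds_since_boot": float(match.group("ts")),
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "anon_rss_kb": int(match.group("anon_rss_kb")),
            })
    return kills


# The sample lines from the diagnostics section.
SAMPLE = (
    "[ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),"
    "cpuset=user.slice,mems_allowed=0,global_oom,"
    "task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000\n"
    "[ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, "
    "anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0\n"
)

for kill in find_oom_kills(SAMPLE):
    print(kill)
```

On a server, the input text could be captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; reading the kernel log may require elevated privileges on some systems.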
## Recommendation | ||
|
||
Consider the following solutions for addressing insufficient memory: | ||
|
||
- If the load on {{ ydb-short-name }} has increased due to new usage patterns or increased query rate, try optimizing the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes. | ||
|
||
- If the load on {{ ydb-short-name }} has not changed but nodes are still restarting, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit for the nodes. For more information about memory management in {{ ydb-short-name }}, see [{#T}](../../../../reference/configuration/index.md#memory-controller). | ||
|
||
|
||
|
15 changes: 15 additions & 0 deletions
15
ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# I/O bandwidth | ||
|
||
A high rate of read and write operations can overwhelm the disk subsystem, leading to increased data access latencies. When the system cannot read or write data quickly enough, queries that rely on disk access will experience delays. | ||
|
||
## Diagnostics | ||
|
||
<!-- The include is added to allow partial overrides in overlays --> | ||
{% include notitle [io-bandwidth](./_includes/io-bandwidth.md) %} | ||
|
||
## Recommendations | ||
|
||
Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database. | ||
|
||
In cases of high microburst rates, balancing the load across storage groups might help. | ||
|
9 changes: 9 additions & 0 deletions
9
ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
items: | ||
- name: CPU | ||
href: cpu-bottleneck.md | ||
- name: Memory | ||
href: insufficient-memory.md | ||
- name: I/O bandwidth | ||
href: io-bandwidth.md | ||
- name: Disk space | ||
href: disk-space.md |