Working on tech review comments
anton-bobkov committed Oct 29, 2024
1 parent 9bf7038 commit bbf9f83
Showing 24 changed files with 147 additions and 97 deletions.
Original file line number Diff line number Diff line change
@@ -7,7 +7,8 @@ The CPU resources are mainly used by the actor system. Depending on the type, al
- **System**: A pool that is designed for running quick internal operations in YDB (it serves system tablets, state storage, distributed storage I/O, and erasure coding).

- **User**: A pool that serves the user load (user tablets, queries run in the Query Processor).
Batch: A pool that serves tasks with no strict limit on the execution time, background operations like garbage collection and heavy queries run in the Query Processor.

- **Batch**: A pool that serves tasks with no strict limit on the execution time, background operations like backups, garbage collection, and heavy queries run in the Query Processor.

- **IO**: A pool responsible for performing any tasks with blocking operations (such as authentication or writing logs to a file).
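
For reference, the pools above correspond to executor pools in the `actor_system_config` section of the static configuration. The following is only an illustrative sketch; the thread counts are placeholders and must be sized for the actual host:

```yaml
# Illustrative sketch: thread counts are placeholders, not recommendations.
actor_system_config:
  executor:
    - name: System
      type: BASIC
      threads: 2
    - name: User
      type: BASIC
      threads: 3
    - name: Batch
      type: BASIC
      threads: 2
    - name: IO
      type: IO
      threads: 1
```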

@@ -4,8 +4,6 @@ A lack of available disk space can prevent the database from storing new data, r

## Diagnostics

<!-- TODO: Mention the limits metric, if it's operational -->

1. See if the **DB overview > Storage** charts in Grafana show any spikes.

1. In [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage.
@@ -20,7 +18,7 @@ A lack of available disk space can prevent the database from storing new data, r

{% note info %}

You can also use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) in your application to get this information.
It is also recommended to use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information.

{% endnote %}

@@ -2,18 +2,35 @@

If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured.

If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory. This can result in increased latencies during query execution and data retrieval.
If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory.

{% note info %}

It's recommended to disable swap on {{ ydb-short-name }} servers.

{% endnote %}
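
A quick way to verify the current swap state on a Linux server is a sketch like the following (it assumes the standard `util-linux` tools; to make the change persistent, also remove swap entries from `/etc/fstab`):

```bash
# List active swap devices; empty output means swap is already disabled
swapon --show

# Turn off all swap until the next reboot (requires root privileges)
sudo swapoff -a
```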

Even though the reasons and mechanics of performance degradation due to insufficient memory might differ, the symptoms of increased latencies during query execution and data retrieval are similar in all cases.

Additionally, it may also matter which components within the {{ ydb-short-name }} process consume the memory.

## Diagnostics

1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} upgrades.

{% note info %}

This step might reveal nodes terminated by OOM killer and restarted by {{ ydb-short-name }}.

{% endnote %}

1. Open [Embedded UI](../../../../reference/embedded-ui/index.md).

1. On the **Nodes** tab, look for nodes that have low uptime.

1. Log in to the recently restarted nodes and run the `dmesg` command to diagnose the reasons for the restart.
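
    For example, the kernel log can be filtered for OOM killer events with a sketch like this (the exact message wording varies between kernel versions):

    ```bash
    # -T prints human-readable timestamps; the pattern covers common OOM killer messages
    dmesg -T | grep -iE 'out of memory|oom-killer|oom_reaper'
    ```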


1. Determine whether memory usage reached 100%.

1. Open the **DB overview** dashboard in Grafana.
@@ -0,0 +1,9 @@
items:
- name: CPU
href: cpu-bottleneck.md
- name: Memory
href: insufficient-memory.md
- name: I/O bandwidth
href: io-bandwidth.md
- name: Disk space
href: disk-space.md
19 changes: 8 additions & 11 deletions ydb/docs/en/core/dev/troubleshooting/performance/index.md
@@ -25,15 +25,15 @@ Database performance issues can be classified into several categories based on t

- **Hardware infrastructure issues**.

- **[Network issues](infrastructure/network.md)**. Insufficient bandwidth or network congestion in data centers can significantly affect {{ ydb-short-name }} performance.
- **[Network issues](infrastructure/network.md)**. Network congestion in data centers and especially between data centers can significantly affect {{ ydb-short-name }} performance.

- **[Data center outages](infrastructure/dc-outage.md)**: Disruptions in data center operations that can cause service or data unavailability. These outages may result from various factors, such as power failures, natural disasters, or cyber-attacks. A common fault-tolerant setup for {{ ydb-short-name }} spans three data centers or availability zones (AZs). {{ ydb-short-name }} can continue operating without interruption, even if one data center and a server rack in another are lost. However, it will initiate the relocation of tablets from the offline AZ to the remaining online nodes, temporarily leading to higher query latencies. Distributed transactions involving tablets that are moving to other nodes might experience increased latencies.

- **[Data center maintenance and drills](infrastructure/dc-drills.md)**. Planned maintenance or drills, exercises conducted to prepare personnel for potential emergencies or outages, can also affect query performance. Depending on the maintenance scope or drill scenario, some {{ ydb-short-name }} servers might become unavailable, which leads to the same impact as an outage.

- **[Server hardware issues](infrastructure/hardware.md)**. Malfunctioning CPUs, memory modules, or network cards significantly degrade database performance, up to the total unavailability of the affected server, until the faulty hardware is replaced.

- **Insufficient resources**. These issues refer to situations when the workload demands more physical resources — such as CPU, memory, disk space, and network bandwidth — than allocated to a database.
- **Insufficient resources**. These issues refer to situations when the workload demands more physical resources — such as CPU, memory, disk space, and network bandwidth — than allocated to a database. In some cases, suboptimal allocation of resources, for example, poorly configured control groups (cgroups), may also result in insufficient resources for {{ ydb-short-name }} and increase query latencies even though physical hardware resources are still available on the database server.

- **[CPU bottlenecks](hardware/cpu-bottleneck.md)**. High CPU usage can result in slow query processing and increased response times. When CPU resources are limited, the database may struggle to handle complex queries or large transaction loads.

@@ -45,21 +45,20 @@ Database performance issues can be classified into several categories based on t

- **OS issues**

- **Hardware resource allocation issues**. Suboptimal allocation of resources, for example poorly configured control groups (cgroups), may result in insufficient resources for {{ ydb-short-name }} and increase query latencies even though physical hardware resources are still available on the database server.

- **[System clock drift](system/system-clock-drift.md)**. If the system clocks on the {{ ydb-short-name }} servers start to drift apart, it will lead to increased distributed transaction latencies. In severe cases, {{ ydb-short-name }} might even refuse to process distributed transactions and return errors.

- Other processes running on the same nodes as YDB, such as antiviruses, observability agents, etc.
- Other processes running on the same servers or virtual machines as {{ ydb-short-name }}, such as antiviruses, observability agents, etc.

- Kernel misconfiguration.

- **YDB-related issues**
- **{{ ydb-short-name }}-related issues**

- **[Rolling restart](system/ydb-updates.md)**. {{ ydb-short-name }} is a distributed system that supports rolling restart, when database administrators update {{ ydb-short-name }} nodes one by one. This helps keep the {{ ydb-short-name }} cluster up and running during the update process or some {{ ydb-short-name }} configuration changes. However, when a YDB node is being restarted, Hive moves the tables that run on this node to other nodes, and that may lead to increased latencies for queries that are processed by the moving tables.
- **[Rolling restart](ydb/ydb-updates.md)**. Database administrators (DBAs) can keep the {{ ydb-short-name }} cluster up and running during the update process or some {{ ydb-short-name }} configuration changes. This is possible because {{ ydb-short-name }} is a distributed system that supports rolling restart, and DBAs can update {{ ydb-short-name }} nodes one by one. However, when a {{ ydb-short-name }} node is being restarted, [Hive](../../../concepts/glossary.md#hive) moves the tablets that run on this node to other nodes, and that may lead to increased latencies for queries that are processed by the moving tablets.

- Actor system pools misconfiguration.

- SDK usage issues (maybe worth being a separate category).
- SDK usage issues.

- **Schema design issues**. These issues stem from inefficient decisions made during the creation of tables and indices. They can significantly impact query performance.

@@ -73,15 +72,13 @@ If any known changes occurred in the system around the time the performance issu

1. [Overloaded shards](schemas/overloaded-shards.md)
1. [Excessive tablet splits and merges](schemas/splits-merges.md)
1. [Frequent tablet moves between nodes](system/tablets-moved.md)
1. [Frequent tablet moves between nodes](ydb/tablets-moved.md)
1. Insufficient hardware resources:
- [Disk I/O bandwidth](hardware/io-bandwidth.md)
- [Disk space](hardware/disk-space.md)
- [Insufficient CPU](hardware/cpu-bottleneck.md)
1. [Hardware issues](infrastructure/hardware.md) and [data center outages](infrastructure/dc-outage.md)
1. [Network issues](infrastructure/network.md)
1. [{{ ydb-short-name }} updates](system/ydb-updates.md)
1. [{{ ydb-short-name }} updates](ydb/ydb-updates.md)
1. [System clock drift](system/system-clock-drift.md)



@@ -7,3 +7,10 @@ To determine if one of the data centers of the {{ ydb-short-name }} cluster is n
![](../_assets/cluster-nodes.png)

If all of the nodes in one of the DCs (data centers) are unavailable, this data center is most likely offline.

{% note info %}

Also analyze the **Rack** column to check whether {{ ydb-short-name }} nodes are unavailable in one or several racks of a DC. This might indicate that those racks are offline.

{% endnote %}

@@ -10,17 +10,15 @@ You can also use the **Healthcheck** in [Embedded UI](../../../../reference/embe

- **Storage issues**

On the **Storage** tab, select the **Degraded** filter to list storage groups or nodes that contain degraded or failed storage.
1. On the **Storage** tab, select the **Degraded** filter to list storage groups or nodes that contain degraded or failed storage.

1. Check for any degradation in the storage system performance on the **Distributed Storage Overview** and **PDisk Device single disk** dashboards in Grafana.

- **Network issues**

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [network issues](./_includes/network.md) %}

- **Availability of nodes on racks**

On the **Nodes** tab, see if nodes on specific racks are not available. Analyze the health indicators in the **Host** and **Rack** columns.

## Recommendations

Contact the support team of your data center.
Contact the responsible party for the affected hardware to resolve the underlying issue. If you are part of a larger organization, this could be an in-house team managing low-level infrastructure. Otherwise, contact the cloud service or hosting provider's support service.
@@ -0,0 +1,9 @@
items:
- name: Network issues
href: network.md
- name: Data center outages
href: dc-outage.md
- name: Data center maintenance and drills
href: dc-drills.md
- name: Hardware issues
href: hardware.md
@@ -0,0 +1,5 @@
items:
- name: Transaction lock invalidation
href: transaction-lock-invalidation.md
- name: OVERLOADED errors
href: overloaded-errors.md
@@ -1,3 +1,10 @@
{% note info %}

This procedure applies only to row-oriented tables.

{% endnote %}


1. Analyze the **Overloaded shard count** chart in the **DB overview** Grafana dashboard.

![](../_assets/overloaded-shards-dashboard.png)
@@ -42,6 +49,12 @@

If the table does not have these options, see [Recommendations for table configuration](../overloaded-shards.md#table-config).

2. Analyze whether primary key values increment monotonically.
1. Analyze whether primary key values increment monotonically:

- Check the data type of the primary key column. {{ ydb-short-name }} `serial` data types are used for autoincrementing values.

- Check the application logic.

- Calculate the difference between the minimum and maximum values of the primary key column. Then compare this value to the number of rows in a given table. If these values match, the primary key might be incrementing monotonically.

If they do, see [Recommendations for the imbalanced primary key](../overloaded-shards.md#pk-recommendations).
If primary key values do increase monotonically, see [Recommendations for the imbalanced primary key](../overloaded-shards.md#pk-recommendations).
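
The min/max comparison above can be expressed as a single query. This is only a sketch with a hypothetical table `my_table` and a numeric primary key column `id`; substitute the actual table and column names:

```yql
-- If key_range is close to row_count, the primary key likely grows monotonically.
SELECT
    MAX(id) - MIN(id) AS key_range,
    COUNT(*) AS row_count
FROM my_table;
```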
@@ -9,23 +9,25 @@

1. Check whether the user load increased when the tablet splits and merges spiked.

<!-- TODO: Add user load charts -->
[//]: # (TODO: Add user load charts)

- Review the diagrams on the **DataShard** dashboard in Grafana for any changes in the volume of data read or written by queries.

- Examine the **Requests** chart on the **Query engine** dashboard in Grafana for any spikes in the number of requests.

1. To identify recently split or merged tables, follow these steps:
1. To identify recently split or merged tablets, follow these steps:

1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Nodes** tab and select a node.
1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), click the **Developer UI** link in the upper right corner.

1. Open the **Tablets** tab.
1. Navigate to **Node Table Monitor** > **All tablets of the cluster**.

1. Sort the tablets by the **Uptime** column and review tablets, which uptime values coincide with the spikes on the **Split / Merge partitions** chart.
1. To show only data shard tablets, in the **TabletType** filter, specify `DataShard`.

1. To identify the table associated with the DataShard, hover over the Tablet link in the DataShard row and click the **Developer UI** icon.
![](../_assets/node-tablet-monitor-data-shard.png)

![](../_assets/splits-merges-tablets-devui.png)
1. Sort the tablets by the **ChangeTime** column and review the tablets whose change time values coincide with the spikes on the **Split / Merge partitions** chart.

1. To identify the table associated with the data shard, in the data shard row, click the link in the **TabletID** column.

1. On the **Tablets** page, click the **App** link.

@@ -0,0 +1,5 @@
items:
- name: Overloaded shards
href: overloaded-shards.md
- name: Excessive tablet splits and merges
href: splits-merges.md
@@ -0,0 +1,3 @@
items:
- name: System clock drift
href: system-clock-drift.md

This file was deleted.

54 changes: 18 additions & 36 deletions ydb/docs/en/core/dev/troubleshooting/performance/toc_p.yaml
@@ -1,43 +1,25 @@
items:
- name: Infrastructure issues
items:
- name: Network issues
href: infrastructure/network.md
- name: Data center outages
href: infrastructure/dc-outage.md
- name: Data center maintenance and drills
href: infrastructure/dc-drills.md
- name: Hardware issues
href: infrastructure/hardware.md
include:
mode: link
path: infrastructure/toc_p.yaml
- name: Reaching resource limits
items:
- name: CPU
href: hardware/cpu-bottleneck.md
- name: Memory
href: hardware/insufficient-memory.md
- name: I/O bandwidth
href: hardware/io-bandwidth.md
- name: Disk space
href: hardware/disk-space.md
include:
mode: link
path: hardware/toc_p.yaml
- name: OS issues
items:
- name: System clock drift
href: system/system-clock-drift.md
include:
mode: link
path: system/toc_p.yaml
- name: YDB configuration issues
items:
- name: Rolling restart
href: system/ydb-updates.md
- name: Frequent tablet moves between nodes
href: system/tablets-moved.md
include:
mode: link
path: ydb/toc_p.yaml
- name: Schema design issues
items:
- name: Overloaded shards
href: schemas/overloaded-shards.md
- name: Excessive tablet splits and merges
href: schemas/splits-merges.md
include:
mode: link
path: schemas/toc_p.yaml
- name: Client application issues
items:
- name: Transaction lock invalidation
href: queries/transaction-lock-invalidation.md
- name: OVERLOADED errors
href: queries/overloaded-errors.md
include:
mode: link
path: queries/toc_p.yaml
@@ -16,16 +16,6 @@

![cpu balancer](../_assets/cpu-balancer.jpg)

1. Additionally, to see the recently moved tablets, follow these steps:

1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Nodes** tab and select a node.

1. Open the **Tablets** tab.

1. Hover over the Tablet link in the Hive row and click the **Developer UI** icon.

1. On the **Tablets** page, click the **App** link.

1. Click the **Balancer** button.
1. Additionally, to see the recently moved tablets, click the **Balancer** button.

The **Balancer** window will appear. The list of recently moved tablets is displayed in the **Latest tablet moves** section.
@@ -2,8 +2,9 @@

{{ ydb-short-name }} automatically balances the load by moving tablets from overloaded nodes to other nodes. This process is managed by [Hive](../../../../concepts/glossary.md#hive). When Hive moves tablets, queries affecting those tablets might experience increased latencies while they wait for the tablet to get initialized on the new node.

<!-- This information is taken from a draft topic Concepts > Hive. -->
<!-- TODO: When the above-mentioned topic is merged, remove the info from here and add a link. -->
[//]: # (This information is taken from a draft topic Concepts > Hive.)
[//]: # (TODO: When the above-mentioned topic is merged, remove the info from here and add a link.)

{{ ydb-short-name }} considers usage of the following hardware resources for balancing nodes:

- CPU
@@ -0,0 +1,5 @@
items:
- name: Rolling restart
href: ydb-updates.md
- name: Frequent tablet moves between nodes
href: tablets-moved.md