
With multiple seata-server instances, the business application's TCC commit method receives multiple retry commit requests #7090

Open
wxrqforever opened this issue Jan 2, 2025 · 11 comments
Comments

@wxrqforever

wxrqforever commented Jan 2, 2025

1. Background
There are two seata-server instances and a single client business application instance. I deliberately threw an exception in the commit method of a TCC action:

public void commit(BusinessActionContext actionContext) {
    // deliberately throw an ArithmeticException so that phase-two commit always fails
    int i = 1 / 0;
}

At this point the status field of the global transaction in global_table is 3, and the server retries the transaction commit. I then noticed that the client application received retry requests from both seata-server instances.
Server-side logs:
Instance A:
12:00:32.264 INFO --- [ RetryCommitting_1_1] [.core.rpc.netty.ChannelManager] [ getChannel] [xxxxxx:2387511350085918818] : Choose [id: 0x6613e85e, L:/xxxxxx:9527 - R:/xxxxxx:54702] on the same IP[xxxxxx] as alternative of xxxxxx: xxxxxx:46452
12:00:32.442 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : receive msg[single]: BranchCommitResponse{xid='xxxxxx:2387511350085918818', branchId=2387511350085918823, branchStatus=PhaseTwo_CommitFailed_Retryable, resultCode=Success, msg='null'}, clientIp: xxxxxx, vgroup: xxxxxx
12:00:32.442 ERROR --- [ RetryCommitting_1_1] [server.coordinator.DefaultCore] [bda$doGlobalCommit$1] [11.166.244.1:9527:2387511350085918818] : Committing global transaction[11.166.244.1:9527:2387511350085918818] failed, caused by branch transaction[2387511350085918823] commit failed, will retry later.
Instance B:
12:00:24.130 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : receive msg[single]: BranchCommitResponse{xid='xxxxxx:2387511350085918818', branchId=2387511350085918823, branchStatus=PhaseTwo_CommitFailed_Retryable, resultCode=Success, msg='null'}, clientIp: xxxxxx, vgroup: xxxxxx
12:00:24.130 ERROR --- [ RetryCommitting_1_1] [server.coordinator.DefaultCore] [bda$doGlobalCommit$1] [xxxxxx:9527:2387511350085918818] : Committing global transaction[xxxxxx:2387511350085918818] failed, caused by branch transaction[2387511350085918823] commit failed, will retry later.

Server-side configuration:
#Transaction rule configuration, only for the server
server.recovery.committingRetryPeriod=60000
server.recovery.asynCommittingRetryPeriod=1000
server.recovery.rollbackingRetryPeriod=60000
server.recovery.timeoutRetryPeriod=1000
server.maxCommitRetryTimeout=-1
server.maxRollbackRetryTimeout=-1
server.rollbackRetryTimeoutUnlockEnable=false
server.distributedLockExpireTime=10000
server.session.branchAsyncQueueSize=5000
server.session.enableBranchAsyncRemove=false
server.enableParallelRequestHandle=true
server.enableParallelHandleBranch=false
server.applicationDataLimit=64000
server.applicationDataLimitCheck=false
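
For context, my understanding is that the retry is driven by a fixed-delay scheduled task on each server instance, guarded by the distributed lock only for the duration of a single run. Below is a minimal sketch of that pattern; the class and method names are illustrative assumptions, not Seata's actual implementation:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RetryCommittingSketch {
    // Hypothetical lock facade; in a real deployment it would be backed by the
    // distributed_lock table configured above.
    interface DistributedLocker {
        boolean acquire(String lockKey, long expireMillis);
        void release(String lockKey);
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final DistributedLocker locker;

    RetryCommittingSketch(DistributedLocker locker) {
        this.locker = locker;
    }

    void start(long retryPeriodMillis, long lockExpireMillis) {
        // Each server instance schedules its own task; the period is relative to that
        // instance's start time, so two instances are usually out of phase with each other.
        scheduler.scheduleAtFixedRate(() -> {
            if (!locker.acquire("RetryCommitting", lockExpireMillis)) {
                return; // another instance is running the retry right now
            }
            try {
                retryCommittingSessions(); // re-send BranchCommit to the RM(s)
            } finally {
                locker.release("RetryCommitting"); // lock is held only while the task runs
            }
        }, 0, retryPeriodMillis, TimeUnit.MILLISECONDS);
    }

    private void retryCommittingSessions() {
        // load CommitRetrying global sessions and re-drive phase-two commit
    }
}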

2. Question
My question is: with multiple seata-server instances, when the TC initiates retries, shouldn't only one TC actually take effect and issue the retry? Is it expected that multiple servers all send retry requests here? Does the distributed lock that coordinates TC retries simply not take effect in this scenario?

@funky-eyes
Contributor

funky-eyes commented Jan 2, 2025

First, has the distributed lock table been created, and has the corresponding configuration been added? If that configuration is missing, the distributed lock will not be enabled to coordinate the servers:
store.db.distributedLockTable=distributed_lock
If the configuration center type is file:

seata:
  store:
    db:
      distributed-lock-table: distributed_lock

@wxrqforever
Author

distributed-lock-table

The configuration and the table are both in place, and the distributed lock table does contain rows from time to time:

#Transaction storage configuration, only for the server. The file, db, and redis configuration values are optional.
store.mode=db
store.lock.mode=db
store.session.mode=db

#These configurations are required if the `store mode` is `db`. If `store.mode,store.lock.mode,store.session.mode` are not equal to `db`, you can remove the configuration block.
store.db.datasource=druid
store.db.dbType=mysql
store.db.driverClassName=com.mysql.jdbc.Driver
store.db.url=xxxxx
store.db.user=xxxxx
store.db.password= xxxxx
store.db.minConn=5
store.db.maxConn=30
store.db.globalTable=global_table
store.db.branchTable=branch_table
store.db.distributedLockTable=distributed_lock
store.db.vgroupTable=vgroup-table
store.db.queryLimit=100
store.db.lockTable=lock_table
store.db.maxWait=5000

My guess is that the two instances started their scheduled tasks at slightly different times, roughly 8 seconds apart as the logs above show. The timeout currently configured is 1 second and the retry run itself completes very quickly, so it is not that the distributed lock failed; rather, both instances legitimately acquired the lock that the previous instance had already released after finishing its run.

@funky-eyes
Contributor

The SELECT FOR UPDATE query also checks the row's expiration time, and the expiration is 60 seconds, so the problem you describe does not exist.
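
For illustration only, here is a sketch of how such a row-based lock with an expire column could be acquired via SELECT ... FOR UPDATE. The table and column names are assumptions based on the distributed_lock configuration above; this is not Seata's actual DAO code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DistributedLockSketch {

    /**
     * Try to take the lock row for lockKey; returns true if this instance now owns it.
     * Assumes a pre-seeded row in a table like distributed_lock(lock_key, lock_value, expire).
     */
    public boolean tryAcquire(Connection conn, String lockKey, String owner, long expireMillis) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT lock_value, expire FROM distributed_lock WHERE lock_key = ? FOR UPDATE")) {
            select.setString(1, lockKey);
            long now = System.currentTimeMillis();
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next() && rs.getLong("expire") > now) {
                    // another instance holds the lock and it has not expired yet
                    conn.rollback();
                    return false;
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE distributed_lock SET lock_value = ?, expire = ? WHERE lock_key = ?")) {
                update.setString(1, owner);
                update.setLong(2, now + expireMillis);
                update.setString(3, lockKey);
                update.executeUpdate();
            }
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}

Note that when the task releases the lock as soon as it finishes, the expire value only matters if an instance crashes while still holding the lock.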

@funky-eyes
Contributor

Your logs are clearly not within the same second; your client failed to commit, which simply caused subsequent retries.

@wxrqforever
Author

Your logs are clearly not within the same second; your client failed to commit, which simply caused subsequent retries.

I don't think I fully understand what you mean; from what I observed, that is not what is happening. The retries here are initiated by the seata-server scheduled task, not by the client, and the logs above are from the two seata-server instances respectively. My reasoning:
1. After I restarted the business application container, the application kept receiving calls to the commit method.
2. The original call frequency was once per second; after I changed server.recovery.committingRetryPeriod=60000, commit was triggered once per minute instead, and I also found the corresponding scheduled-task code.

What I was trying to say above is: serverA and serverB do not start at the same time. If serverB starts a few seconds later than serverA, then serverA finishes its scheduled-task run and releases the distributed lock during that interval, so serverB can still acquire the lock when its own run fires. I also don't follow your point about "SELECT FOR UPDATE plus a 60-second expiration time on the row": isn't that value 1 second by default in the code? My current configuration is 1 second as well.

@funky-eyes
Contributor

Scheduled tasks are driven per run; as long as runs do not execute concurrently, the behavior meets expectations, and there is no idempotency problem.

@funky-eyes
Contributor

Set the period back to once per second and you will be able to see whether there is any concurrent execution.

@wxrqforever
Author

Scheduled tasks are driven per run; as long as runs do not execute concurrently, the behavior meets expectations, and there is no idempotency problem.

Yes, that is what I meant as well. Could you also take a look at this earlier issue:
#7047

@funky-eyes
Contributor

Is your understanding based on periodicity, e.g. a scheduled retry every 60 seconds, so that there should be only one attempt within each 60-second window? With multiple instances deployed, each instance's 60-second window starts at a different time, so there are actually multiple retries within the same 60 seconds. Seata's distributed lock is only meant to prevent concurrent execution; it does not limit the number of attempts within a period.
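
As a concrete illustration using the timestamps in the logs above: instance B's RetryCommitting run fires at 12:00:24, acquires the lock, re-sends the BranchCommit, and releases the lock when it finishes; instance A's run fires at 12:00:32, finds the lock free, and retries the same branch again. The client therefore receives two BranchCommit requests within the same window even though the two runs never overlapped.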

@wxrqforever
Author

Is your understanding based on periodicity, e.g. a scheduled retry every 60 seconds, so that there should be only one attempt within each 60-second window? With multiple instances deployed, each instance's 60-second window starts at a different time, so there are actually multiple retries within the same 60 seconds. Seata's distributed lock is only meant to prevent concurrent execution; it does not limit the number of attempts within a period.

Yes. When I first looked at the symptoms and saw the two calls, my gut feeling was that the distributed lock was not working or that the lock duration was too short, so I even increased server.distributedLockExpireTime, but the behavior stayed the same. Only after reading the code did I confirm that the lock is released as soon as the scheduled task finishes, and since the instances start at different times, this phenomenon occurs.

@funky-eyes funky-eyes added the status: to-be-discussed To be discussed label Jan 3, 2025
@funky-eyes
Contributor

This is indeed worth discussing. The community originally implemented this feature only to prevent concurrency and did not consider this aspect. We will discuss it at the next bi-weekly meeting.
