The RCP task (exactly-once) frequently reports an error that a certain label has no node #55423

Open
huoarter opened this issue Jan 24, 2025 · 0 comments
Labels
type/bug Something isn't working

huoarter commented Jan 24, 2025

Steps to reproduce the behavior (Required)

Expected behavior (Required)

Real behavior (Required)

The RCP task (exactly-once) frequently reports an error that a certain label has no node. The error occurs in the getNodeId method, as identified through code analysis:

private Long getNodeId(TransactionOperation txnOperation, String label) throws UserException {
    Long nodeId;
    // Save label->be hashmap when beginning a transaction, so that subsequent operators can send to the same BE
    if (TXN_BEGIN.equals(txnOperation)) {
        Long chosenNodeId = GlobalStateMgr.getCurrentState().getNodeMgr()
                .getClusterInfo().getNodeSelector().seqChooseBackendOrComputeId();
        nodeId = chosenNodeId;
        // txnNodeMap is an LRU cache, which atomically removes unused entries
        accessTxnNodeMapWithWriteLock(txnNodeMap -> txnNodeMap.put(label, chosenNodeId));
    } else {
        nodeId = accessTxnNodeMapWithReadLock(txnNodeMap -> txnNodeMap.get(label));
    }

    if (nodeId == null) {
        throw new UserException(String.format(
                "Transaction with op[%s] and label[%s] has no node.", txnOperation.getValue(), label));
    }

    return nodeId;
}

Printing txnNodeMap when nodeId == null shows that entries are missing from the map. txnNodeMap is a LinkedHashMap created with accessOrder=true, so every get() moves the accessed entry to the end of the map's internal linked list. A lookup is therefore a structural modification, not a pure read. accessTxnNodeMapWithReadLock takes a read lock, but a read lock is shared: several threads can call get() at the same time, and their unsynchronized relinking can corrupt the list and lose entries.
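The reordering on read can be observed even on a single thread. The following is a minimal sketch, not StarRocks code: the class name, method name, and labels are made up for illustration. It only shows that get() on an access-ordered LinkedHashMap changes the iteration order, which is why a shared read lock is not enough protection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AccessOrderDemo {
    // Builds a small access-ordered map, reads one key, and returns the
    // resulting key order. All names and values here are hypothetical.
    static List<String> keysAfterGet(String keyToRead) {
        Map<String, Long> map = new LinkedHashMap<>(16, 0.75f, true);
        map.put("label-a", 1L);
        map.put("label-b", 2L);
        map.put("label-c", 3L);
        // With accessOrder=true this call relinks the entry to the tail
        // of the internal linked list, i.e. it mutates the map.
        map.get(keyToRead);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // prints [label-b, label-c, label-a]
        System.out.println(keysAfterGet("label-a"));
    }
}
```

Because the read itself rewrites list pointers, two threads holding the read lock at once are in effect two concurrent writers.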

One possible fix is to wrap txnNodeMap in Collections.synchronizedMap so that every access, including get(), is serialized:

 private final Map<String, Long> txnNodeMap = Collections.synchronizedMap(new LinkedHashMap<>
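The declaration above is cut off in the report, so the constructor arguments and the eviction rule are not known. As a hedged sketch of what a synchronized, access-ordered LRU map could look like, assuming a hypothetical capacity constant MAX_ENTRIES and a removeEldestEntry override (neither is taken from the StarRocks source):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class SynchronizedLruSketch {
    // Hypothetical capacity, kept tiny so eviction is easy to observe.
    static final int MAX_ENTRIES = 2;

    // Returns an access-ordered LRU map whose every method, including
    // get(), runs under the wrapper's single mutex.
    static Map<String, Long> newTxnNodeMap() {
        return Collections.synchronizedMap(
                new LinkedHashMap<String, Long>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                        // Evict the least recently accessed entry once full.
                        return size() > MAX_ENTRIES;
                    }
                });
    }

    public static void main(String[] args) {
        Map<String, Long> txnNodeMap = newTxnNodeMap();
        txnNodeMap.put("label-a", 10001L);
        txnNodeMap.put("label-b", 10002L);
        txnNodeMap.get("label-a");         // label-a becomes most recently used
        txnNodeMap.put("label-c", 10003L); // evicts label-b, the eldest entry
        System.out.println(txnNodeMap.containsKey("label-b")); // prints false
    }
}
```

With this wrapper the lookup path no longer relies on the shared read lock for safety. An alternative with the same effect would be to take the write lock for get() as well, since an access-ordered lookup is a mutation.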

REF: https://mail.openjdk.org/pipermail/client-libs-dev/2024-October/023429.html

There is LinkedHashMap field created with accessOrder=true.

    private final LinkedHashMap<PixelsKey, ImageSoftReference> map
            = new LinkedHashMap<>(16, 0.75f, true);

Access to it is guarded with ReentrantReadLock.

public Image getImage(final PixelsKey key){
    final ImageSoftReference ref;
    lock.readLock().lock();
    try {
        ref = map.get(key);
    } finally {
        lock.readLock().unlock();
    }
    return ref == null ? null : ref.get();
}

BUT there is a catch: LinkedHashMap.get method for such a case - can
cause structural modification.

StarRocks version (Required)

  • 3.3.7
huoarter added the type/bug (Something isn't working) label on Jan 24, 2025