Questions about model evaluation #32
Hi @KAGAII, make sure you update neural-cherche. Here is the full ColBERT evaluation script:

```python
from neural_cherche import models, rank, retrieve, utils

device = "cpu"  # or "mps" or "cuda"
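
# Load the BEIR arguana dataset: documents, queries and relevance judgments.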
documents, queries, qrels = utils.load_beir(
"arguana",
split="test",
)
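
# First-stage retriever: lexical BM25 over the title and text fields.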
retriever = retrieve.BM25(
key="id",
on=["title", "text"],
)
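
# Second-stage ranker: a ColBERT late-interaction model.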
ranker = rank.ColBERT(
key="id",
on=["title", "text"],
model=models.ColBERT(
model_name_or_path="raphaelsty/neural-cherche-colbert",
device=device,
).to(device),
)
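
# Encode every document and index it in the BM25 retriever.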
retriever = retriever.add(
documents_embeddings=retriever.encode_documents(
documents=documents,
)
)
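
# Retrieve the top 30 BM25 candidates for every query.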
candidates = retriever(
queries_embeddings=retriever.encode_queries(
queries=queries,
),
k=30,
tqdm_bar=True,
)
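
# Re-rank the candidates with ColBERT and keep the top 10 per query.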
batch_size = 32
scores = ranker(
documents=candidates,
queries_embeddings=ranker.encode_queries(
queries=queries,
batch_size=batch_size,
tqdm_bar=True,
),
documents_embeddings=ranker.encode_candidates_documents(
candidates=candidates,
documents=documents,
batch_size=batch_size,
tqdm_bar=True,
),
k=10,
)
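
# Evaluate the run: nDCG@10 plus hits@1 through hits@10.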
scores = utils.evaluate(
scores=scores,
qrels=qrels,
queries=queries,
metrics=["ndcg@10"] + [f"hits@{k}" for k in range(1, 11)],
)
print(scores)
```

This yields:

```python
{
    "ndcg@10": 0.3686831610778578,
    "hits@1": 0.01386748844375963,
    "hits@2": 0.27889060092449924,
    "hits@3": 0.40061633281972264,
    "hits@4": 0.4861325115562404,
    "hits@5": 0.5562403697996918,
    "hits@6": 0.6194144838212635,
    "hits@7": 0.6556240369799692,
    "hits@8": 0.6887519260400616,
    "hits@9": 0.7218798151001541,
    "hits@10": 0.74884437596302,
}
```

These are good scores, and the run takes about 3 minutes on an mps device. The results you get are due to duplicate queries, which are now handled by neural-cherche's evaluation. EDIT: sorry, I just saw that you mentioned SparseEmbed and not ColBERT; running the benchmark now.
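For reference, here is a minimal sketch of what that SparseEmbed benchmark could look like, assuming `retrieve.SparseEmbed` and `models.SparseEmbed` mirror the key/on/model interface of the pipeline above; the exact signatures and the `batch_size` arguments are assumptions, not confirmed API:

```python
from neural_cherche import models, retrieve, utils

device = "cpu"  # or "mps" or "cuda"

documents, queries, qrels = utils.load_beir(
    "arguana",
    split="test",
)

# Single-stage sparse retriever; assumes retrieve.SparseEmbed follows the
# same key/on/model pattern as retrieve.BM25 and rank.ColBERT above.
retriever = retrieve.SparseEmbed(
    key="id",
    on=["title", "text"],
    model=models.SparseEmbed(
        model_name_or_path="raphaelsty/neural-cherche-sparse-embed",
        device=device,
    ).to(device),
)

retriever = retriever.add(
    documents_embeddings=retriever.encode_documents(
        documents=documents,
        batch_size=32,
    )
)

scores = retriever(
    queries_embeddings=retriever.encode_queries(
        queries=queries,
        batch_size=32,
    ),
    k=100,
)

print(
    utils.evaluate(
        scores=scores,
        qrels=qrels,
        queries=queries,
        metrics=["map", "ndcg@10", "ndcg@100", "recall@10", "recall@100"],
    )
)
```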
@KAGAII There is definitely something wrong with SparseEmbed right now. We recently updated SparseEmbed, but we may need to roll it back to the previous version @arthur-75. I'll make an update in the coming days.
Thank you for your prompt reply; looking forward to the new version!
When I used the pre-trained model 'raphaelsty/neural-cherche-sparse-embed' to evaluate on the arguana dataset with a retrieval k value of 100, the results were very poor:

```python
{
    "map": 0.033567943638956016,
    "ndcg@10": 0.042417859280348115,
    "ndcg@100": 0.08691780846498275,
    "recall@10": 0.09815078236130868,
    "recall@100": 0.32147937411095306,
}
```

As shown above, ndcg@10 is only 4.2%.
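As context for the duplicate-queries explanation above, a quick hypothetical check for repeated query strings in the split (this assumes `queries` is a plain list of strings, as it is used in the scripts above):

```python
from collections import Counter

from neural_cherche import utils

documents, queries, qrels = utils.load_beir("arguana", split="test")

# Count query strings that occur more than once; duplicated queries can
# skew per-query metrics when an evaluation does not account for them.
counts = Counter(queries)
duplicated = sum(n for n in counts.values() if n > 1)
print(f"{duplicated} of {len(queries)} query occurrences are duplicates")
```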