[v2] Refactor to Use Enums Instead of Literals #1784

sam-hey · 2025-01-13T09:38:08Z

Description:

Currently, the codebase relies on literals to describe scoring functions and other constants (magic literals), which can make the code harder to understand and maintain. Using literals for values like scoring functions instead of Enums makes the codebase less clear and reduces the benefits of modern IDE features such as autocompletion and error detection.

In line with the refactoring principle outlined in Refactoring Guru's "Replace Magic Number with Symbolic Constant", I propose refactoring the code to replace these magic literals with Enums. This change will enhance code clarity, facilitate easier development, and prevent potential errors caused by the incorrect use of raw literals.

For example, using literals like cosine or dot as scoring function identifiers makes the code harder to inspect and add new functions to, especially as the project grows. The same applies to other literals used in the project, such as ISO_LANGUAGE_SCRIPT.

VSCode does not provide autocompletion for Literal types by default when using the Python extension.

Proposed Solution:

Introduce an Enum to represent all possible scoring functions.
Refactor all models and methods that currently use raw literals to instead use Enums (refer to PR #1759).

Benefits:

Increased Readability: Using Enums or symbolic constants will make it immediately clear what values are being used and their purpose.
Easier Maintenance: Modifying or extending the codebase becomes simpler when Enums are used, as they centralize value management and eliminate the risk of incorrect usage of raw literals.
Better IDE Support: IDEs will provide autocompletion and error checking when using Enums, leading to fewer bugs and faster development.
Consistent Usage: With named constants or Enums, the same value will always be referred to by the same name, reducing the chance of errors caused by inconsistent usage of literals.

Please provide any feedback or thoughts on this proposed change.

Thx a lot!

Start of the Discussion in: #1759

The text was updated successfully, but these errors were encountered:

KennethEnevoldsen · 2025-01-13T11:50:17Z

So will just outline the counter argument. Since you have outlined the reason to use Enum quite well I will here focus more on Literals.

vs code does autocomplete literals:

You can also see here that it in fact detects the error that "" is not valid.

Literal are not magic constants in the same way that string are. To take a few pointers from the blog:

The symbolic constant can serve as live documentation of the meaning of its value.

literals in MTEB does have this feature e.g. all domains (which is type hinted) is contained in the list of DOMAINS. The same is the case for "cosine" which is in DISTANCE_METRICS.

It’s much easier to change the value of a constant than to search for this number throughout the entire codebase

I agree with this point. No way to change across the repo rename "cosine" to "Cosine Distance".

Reduce duplicate use of a number or string in the code. This is especially important when the value is complicated and long (such as 3.14159 or 0xCAFEBABE).

This doesn't really apply, it would just be

import ...
domains=[DOMAINS.PROGRAMMING]

which is more verbose than

domains = ["programming"]

The pros/cons of using literal (just to get them stated):

pros:
- simplicity: it is just a string, everyone knows how to work with them. This is especially important in a code base where we want people to quickly come in and contribute (which is one of the sucesses of MTEB)
- json native: As this is mostly for documentation in json files, it is clear how it is transformed to and from json
cons
- ~~lack of validations~~: resolved by using a pydantic object and most type checkers also check literals

In a way a literal can be seen as a compromise between the simple (but unruly) string and the strict and verbose Enum.

A few questions:

how does influence when writing to the object to json?
How does this influence backward compatibility (I assume it is not backwards compatible)

My general opinion is that for specialized codebased (a few long time maintainers) Enums would be the correct solution. I am unsure if it is the correct solution here with the main concerns being loss of backward compatibility without enough upside and loss of simplicity making it harder for new users to contribute.

Would love to hear the opinion of other maintainers:
@x-tabdeveloping, @isaac-chung, @Muennighoff

sam-hey · 2025-01-13T12:28:16Z

How does this affect JSON serialization of the object?

The enum is converted to a string during serialization, which is why the mapping for SentenceTransformers functions correctly.

How does this impact backward compatibility? (I assume it is not backward compatible)

As far as I can tell, it is fully compatible with the previous implementation. However, potential issues could arise if someone has overridden the available scoring functions or if ModelMeta is passed, as this introduces a minor change. Nonetheless, such cases are likely uncommon.
No, test had to be updated!

sam-hey · 2025-01-13T12:34:58Z

Although it was argued that Enums are not used, this is actually not the case.

mteb/encoder_interface.py

class PromptType(str, Enum):
    query = "query"
    passage = "passage"

sam-hey · 2025-01-13T13:43:13Z

Maybe I can also address this point, as I am quite new to the project:

Loss of simplicity, making it harder for new users to contribute.

I found it significantly more difficult to add a new model with the Literals compared to using just enums

KennethEnevoldsen · 2025-01-13T16:00:51Z

I found it significantly more difficult to add a new model with the Literals compared to using just enums

That is quite nice to know.

for other readers, here is just a small overview of Enums with pydantic objects:

from pydantic import BaseModel
from enum import Enum


class MyEnum(str, Enum):
    option1 = "option 1"
    option2 = "option 2"

class Tester(BaseModel):
    var: MyEnum

tester = Tester(var="option 1")
# works which is nice 

# the print is a bit less readable:
# Tester(var=<MyEnum.option1: 'option 1'>)

# though the value needs a bit of handling on the user-side
tester.var.value 

# converting to json:
tester.model_dump_json()
# '{"var":"option 1"}'

An important note: we currently have two major merges coming up (v2.0.0 and MIEB), this will likely cause notably merge conflicts so it might be worth waiting with this change until post MIEB -> v2.0.0 and MAIN -> v2.0.0.

(cc @gowitheflow-1998)

sam-hey · 2025-01-16T14:22:15Z

Closing this as there seems to be no interest in transitioning to Enums

isaac-chung · 2025-01-16T14:44:59Z

I think there is, at least to me. Kenneth's comment above simply suggests that we wait until after the major efforts in merging a handful of branches. If you have some bandwidth, I'd encourage you to check out outstanding issues tagged in #1791 and/or #1405.

sam-hey · 2025-01-16T17:10:57Z

Closed it due to the following:

Discussion on GitHub:

Given that there hasn't been strong support in #1784, I would remove Enums for now (keeping them consistent with other metadata objects).

Good to know there is support for this change.

KennethEnevoldsen mentioned this issue Jan 13, 2025

[v2] add similarity_fn in ModelMeta #1759

Open

2 tasks

KennethEnevoldsen changed the title ~~[v2] Refactor to Use Enums Instead of Literals: Removing Magic Literals~~ [v2] Refactor to Use Enums Instead of Literals Jan 13, 2025

sam-hey closed this as completed Jan 16, 2025

isaac-chung reopened this Jan 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v2] Refactor to Use Enums Instead of Literals #1784

[v2] Refactor to Use Enums Instead of Literals #1784

sam-hey commented Jan 13, 2025

KennethEnevoldsen commented Jan 13, 2025

sam-hey commented Jan 13, 2025

sam-hey commented Jan 13, 2025

sam-hey commented Jan 13, 2025

KennethEnevoldsen commented Jan 13, 2025

sam-hey commented Jan 16, 2025

isaac-chung commented Jan 16, 2025

sam-hey commented Jan 16, 2025

[v2] Refactor to Use Enums Instead of Literals #1784

[v2] Refactor to Use Enums Instead of Literals #1784

Comments

sam-hey commented Jan 13, 2025

Description:

Proposed Solution:

Benefits:

KennethEnevoldsen commented Jan 13, 2025

sam-hey commented Jan 13, 2025

sam-hey commented Jan 13, 2025

sam-hey commented Jan 13, 2025

KennethEnevoldsen commented Jan 13, 2025

sam-hey commented Jan 16, 2025

isaac-chung commented Jan 16, 2025

sam-hey commented Jan 16, 2025