Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] Refactor to Use Enums Instead of Literals #1784

Open
sam-hey opened this issue Jan 13, 2025 · 8 comments
Open

[v2] Refactor to Use Enums Instead of Literals #1784

sam-hey opened this issue Jan 13, 2025 · 8 comments

Comments

@sam-hey
Copy link
Contributor

sam-hey commented Jan 13, 2025

Description:

Currently, the codebase relies on literals to describe scoring functions and other constants (magic literals), which can make the code harder to understand and maintain. Using literals for values like scoring functions instead of Enums makes the codebase less clear and reduces the benefits of modern IDE features such as autocompletion and error detection.

In line with the refactoring principle outlined in Refactoring Guru's "Replace Magic Number with Symbolic Constant", I propose refactoring the code to replace these magic literals with Enums. This change will enhance code clarity, facilitate easier development, and prevent potential errors caused by the incorrect use of raw literals.

For example, using literals like cosine or dot as scoring function identifiers makes the code harder to inspect and add new functions to, especially as the project grows. The same applies to other literals used in the project, such as ISO_LANGUAGE_SCRIPT.

VSCode does not provide autocompletion for Literal types by default when using the Python extension.
No Autocomplete
image

Proposed Solution:

  1. Introduce an Enum to represent all possible scoring functions.
  2. Refactor all models and methods that currently use raw literals to instead use Enums (refer to PR #1759).

Benefits:

  • Increased Readability: Using Enums or symbolic constants will make it immediately clear what values are being used and their purpose.
  • Easier Maintenance: Modifying or extending the codebase becomes simpler when Enums are used, as they centralize value management and eliminate the risk of incorrect usage of raw literals.
  • Better IDE Support: IDEs will provide autocompletion and error checking when using Enums, leading to fewer bugs and faster development.
  • Consistent Usage: With named constants or Enums, the same value will always be referred to by the same name, reducing the chance of errors caused by inconsistent usage of literals.

Please provide any feedback or thoughts on this proposed change.

Thx a lot!

Start of the Discussion in: #1759

@KennethEnevoldsen
Copy link
Contributor

So will just outline the counter argument. Since you have outlined the reason to use Enum quite well I will here focus more on Literals.


vs code does autocomplete literals:

Screenshot 2025-01-13 at 12 16 42

You can also see here that it in fact detects the error that "" is not valid.

Literal are not magic constants in the same way that string are. To take a few pointers from the blog:

The symbolic constant can serve as live documentation of the meaning of its value.

literals in MTEB does have this feature e.g. all domains (which is type hinted) is contained in the list of DOMAINS. The same is the case for "cosine" which is in DISTANCE_METRICS.

It’s much easier to change the value of a constant than to search for this number throughout the entire codebase

I agree with this point. No way to change across the repo rename "cosine" to "Cosine Distance".

Reduce duplicate use of a number or string in the code. This is especially important when the value is complicated and long (such as 3.14159 or 0xCAFEBABE).

This doesn't really apply, it would just be

import ...
domains=[DOMAINS.PROGRAMMING]

which is more verbose than

domains = ["programming"]

The pros/cons of using literal (just to get them stated):

  • pros:
    • simplicity: it is just a string, everyone knows how to work with them. This is especially important in a code base where we want people to quickly come in and contribute (which is one of the sucesses of MTEB)
    • json native: As this is mostly for documentation in json files, it is clear how it is transformed to and from json
  • cons
    • lack of validations: resolved by using a pydantic object and most type checkers also check literals

In a way a literal can be seen as a compromise between the simple (but unruly) string and the strict and verbose Enum.


A few questions:

  • how does influence when writing to the object to json?
  • How does this influence backward compatibility (I assume it is not backwards compatible)

My general opinion is that for specialized codebased (a few long time maintainers) Enums would be the correct solution. I am unsure if it is the correct solution here with the main concerns being loss of backward compatibility without enough upside and loss of simplicity making it harder for new users to contribute.

Would love to hear the opinion of other maintainers:
@x-tabdeveloping, @isaac-chung, @Muennighoff

@sam-hey
Copy link
Contributor Author

sam-hey commented Jan 13, 2025

How does this affect JSON serialization of the object?

The enum is converted to a string during serialization, which is why the mapping for SentenceTransformers functions correctly.

How does this impact backward compatibility? (I assume it is not backward compatible)

As far as I can tell, it is fully compatible with the previous implementation. However, potential issues could arise if someone has overridden the available scoring functions or if ModelMeta is passed, as this introduces a minor change. Nonetheless, such cases are likely uncommon.
No, test had to be updated!

@sam-hey
Copy link
Contributor Author

sam-hey commented Jan 13, 2025

Although it was argued that Enums are not used, this is actually not the case.

mteb/encoder_interface.py

class PromptType(str, Enum):
    query = "query"
    passage = "passage"

@KennethEnevoldsen KennethEnevoldsen changed the title [v2] Refactor to Use Enums Instead of Literals: Removing Magic Literals [v2] Refactor to Use Enums Instead of Literals Jan 13, 2025
@sam-hey
Copy link
Contributor Author

sam-hey commented Jan 13, 2025

Maybe I can also address this point, as I am quite new to the project:

Loss of simplicity, making it harder for new users to contribute.

I found it significantly more difficult to add a new model with the Literals compared to using just enums

@KennethEnevoldsen
Copy link
Contributor

I found it significantly more difficult to add a new model with the Literals compared to using just enums

That is quite nice to know.

for other readers, here is just a small overview of Enums with pydantic objects:

from pydantic import BaseModel
from enum import Enum


class MyEnum(str, Enum):
    option1 = "option 1"
    option2 = "option 2"

class Tester(BaseModel):
    var: MyEnum

tester = Tester(var="option 1")
# works which is nice 

# the print is a bit less readable:
# Tester(var=<MyEnum.option1: 'option 1'>)

# though the value needs a bit of handling on the user-side
tester.var.value 

# converting to json:
tester.model_dump_json()
# '{"var":"option 1"}'

An important note: we currently have two major merges coming up (v2.0.0 and MIEB), this will likely cause notably merge conflicts so it might be worth waiting with this change until post MIEB -> v2.0.0 and MAIN -> v2.0.0.

(cc @gowitheflow-1998)

@sam-hey
Copy link
Contributor Author

sam-hey commented Jan 16, 2025

Closing this as there seems to be no interest in transitioning to Enums

@sam-hey sam-hey closed this as completed Jan 16, 2025
@isaac-chung isaac-chung reopened this Jan 16, 2025
@isaac-chung
Copy link
Collaborator

I think there is, at least to me. Kenneth's comment above simply suggests that we wait until after the major efforts in merging a handful of branches. If you have some bandwidth, I'd encourage you to check out outstanding issues tagged in #1791 and/or #1405.

@sam-hey
Copy link
Contributor Author

sam-hey commented Jan 16, 2025

Closed it due to the following:

Discussion on GitHub:

Given that there hasn't been strong support in #1784, I would remove Enums for now (keeping them consistent with other metadata objects).

Good to know there is support for this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants