fix: add bge-m3 ModelMeta #1821

Merged 1 commit on Jan 16, 2025

Conversation

Samoed
Collaborator

@Samoed Samoed commented Jan 15, 2025

ref #1803

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Samoed Samoed changed the title from "fix: add bge" to "fix: add bge-m3 ModelMeta" Jan 15, 2025
# https://huggingface.co/BAAI/bge-m3/discussions/29
bgem3_languages = [
"afr_Latn", # af
# als
Collaborator

I'm a bit unsure why these are commented out?

Collaborator Author

I've taken these language codes from the discussion, but for the commented-out ones I either can't find them in the language mapping or am not sure which entries they correspond to.
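
The check described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `split_known_codes` and `example_mapping` are hypothetical names, and the mapping contents are made up for the example (only the `af` to `afr_Latn` pair and the unmatched `als` code come from the snippet above).

```python
def split_known_codes(codes, lang_mapping):
    """Partition short language codes into mapped and unmapped ones.

    `lang_mapping` is assumed to be a dict from a short code to a
    language_Script identifier; the real mapping may be shaped differently.
    """
    known = [code for code in codes if code in lang_mapping]
    missing = [code for code in codes if code not in lang_mapping]
    return known, missing


# Hypothetical, illustrative data only.
example_mapping = {"af": "afr_Latn", "en": "eng_Latn"}
known, missing = split_known_codes(["af", "als", "en"], example_mapping)

# Codes in `missing` are the ones that would stay commented out
# until someone resolves them by hand.
print(known, missing)
```

Listing the unmapped codes explicitly makes it easier to resolve them one by one rather than leaving them commented out indefinitely.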

Collaborator

Oooh, okay. ChatGPT usually does a remarkable job at matching these. There is also a Python library that can do this for you; wait a sec, I'll find it.

Collaborator

@Samoed It's called ISO639 and it feels like magic

Collaborator Author

I think we don't have all of these languages in LANG_MAPPING:

LANG_MAPPING = {

Collaborator Author

Hm. It seems that LANG_MAPPING is used only in the MTEB class. I think it should be removed in v2.

Collaborator

hmm yea interesting

Collaborator Author

I can't find an easy way to get the script from the language code, so I'll leave it as is for now.

Collaborator

Hmm, again, I think LLMs can be a good friend in doing that. If you have the name of the language, you're also probably a Google search away from the solution. And most languages use Latin, Arabic, or Cyrillic script anyway, so there are some sensible defaults to go with.
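
The "sensible defaults" idea could look something like the sketch below. It assumes the `language_Script` format seen in the snippet above (three-letter language code plus a four-letter ISO 15924 script code). `with_script`, `DEFAULT_SCRIPT`, and `SCRIPT_OVERRIDES` are hypothetical names, and the override table is deliberately tiny and illustrative, not a complete or authoritative mapping.

```python
# Assumed fallback: Latin script for anything not listed explicitly.
DEFAULT_SCRIPT = "Latn"

# Illustrative overrides only; a real table would need to be verified per language.
SCRIPT_OVERRIDES = {
    "rus": "Cyrl",  # Russian
    "ukr": "Cyrl",  # Ukrainian
    "ara": "Arab",  # Arabic
}


def with_script(lang_code, overrides=SCRIPT_OVERRIDES, default=DEFAULT_SCRIPT):
    """Build a language_Script identifier, falling back to the default script."""
    return f"{lang_code}_{overrides.get(lang_code, default)}"


print(with_script("afr"))
print(with_script("rus"))
```

A default like this can silently be wrong for languages written in other scripts, so any code produced through the fallback would still need a manual check.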

@x-tabdeveloping
Collaborator

Looks great otherwise, feel free to merge!

@Samoed Samoed merged commit 4ac59bc into main Jan 16, 2025
11 checks passed
@Samoed Samoed deleted the add_bgem3 branch January 16, 2025 10:11
2 participants