Missing transcripts #150
Comments
Hi @xenova, thanks for reporting this. I am afraid this can't be fixed without introducing breaking changes, as we apparently can no longer consider the language code a reliable identifier. Have you encountered multiple instances of this happening? How big of a problem is this? 🤔
Oh wow, that's quite surprising! I have downloaded more than 1 million transcripts for an ML project I'm working on (https://www.github.com/xenova/sponsorblock-ml) and only ran into this problem once, so it is most likely not that big of an issue.
Thanks for reporting back! It's good to know that this isn't too much of an issue. I have never encountered it myself, although I have scraped quite a few transcripts. I might just leave this as is. To fix this we would have to return a I guess the only practical option is adding the Any thoughts on this?
Right, this is definitely a simple problem with an anything-but-simple solution. As you mentioned, the most important thing is not to break modules which depend on this code, so your second option seems quite practical. I have seen implementations (in django I believe) of a "MultiDict" (or something like that) which acts exactly like a dictionary (allowing for indexing), but allows for duplicate keys. This is normally implemented by mapping keys to a list, and when indexing, you just return the first element. Another way is to implement it with an auxiliary dictionary that maps keys to the index of their first appearance (so that you can still index normally), while still allowing you to iterate over the container if you need a specific item. For example, you could have a multidict:
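To make the first variant (keys mapped to a list, indexing returns the first element) a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration only: the `MultiDict` class, its `add` and `getlist` methods, and the sample language codes and labels are hypothetical and are not part of this library's API.

```python
class MultiDict:
    """Hypothetical dict-like container that tolerates duplicate keys."""

    def __init__(self, pairs=()):
        self._values = {}  # key -> list of all values stored under that key
        self._items = []   # every (key, value) pair, in insertion order
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        # Duplicates are appended instead of overwriting the earlier value.
        self._values.setdefault(key, []).append(value)
        self._items.append((key, value))

    def __getitem__(self, key):
        # Plain indexing behaves like a normal dict: the first value wins,
        # and a missing key raises KeyError.
        return self._values[key][0]

    def getlist(self, key):
        # All values stored under a key (empty list if the key is unknown).
        return list(self._values.get(key, []))

    def __contains__(self, key):
        return key in self._values

    def __len__(self):
        # Counts every stored pair, including duplicates.
        return len(self._items)

    def items(self):
        # Iterate over every pair, so a specific duplicate can be found.
        return iter(self._items)


# Made-up sample data: two entries share the language code "en".
transcripts = MultiDict([
    ("en", "English"),
    ("en", "English (second track)"),
    ("de", "German"),
])

print(transcripts["en"])          # first entry stored under "en"
print(transcripts.getlist("en"))  # both entries stored under "en"
```

Existing code that only indexes by language code would keep working unchanged, while callers that care about duplicates could use `getlist` or iterate over `items()`.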
When fetching transcripts for https://www.youtube.com/watch?v=gdsUKphmB3Y, I only get a subset of the available transcripts.
Using library:
On YouTube: