Save transcripts in the DB #927

Closed
marcospri opened this issue May 18, 2023 · 0 comments · Fixed by #1072


marcospri commented May 18, 2023

Problem: Via currently calls the YouTube API to get the video transcript every time someone loads Via's YouTube video player page. This risks getting us blocked by YouTube for making too many requests to their API; it makes page loads slower than necessary (because every page load must wait for network requests to the YouTube API); and it makes page loads less reliable than necessary (whenever we make a network request to a third-party service there's a chance the request will fail). It also means that if our code for getting transcripts from the YouTube API breaks (say YouTube changes or removes their API) then all existing video assignments will break, as each assignment load will crash when trying to get the transcript.

Solution: Via's get_transcript() method should save the video transcript in Via's DB after retrieving it from the YouTube API. The next time get_transcript() is called, if the transcript for this particular video is already in the DB, it should return the transcript from the DB and not make a YouTube API request.

Note on DB keys

A single YouTube video can have multiple transcripts, for example transcripts in different languages, so we can't just use the video ID as the key when saving transcripts in the DB. Instead we need to use (video_id, transcript_id) where transcript_id is some sort of transcript identifier.

The way the code currently works (based on the youtube-transcript-api library) is that transcript IDs are simple language codes like "en", for example:

def get_transcript(self, video_id):
    return YouTubeTranscriptApi.get_transcript(video_id, languages=("en",))

So DB keys will be a combination of a YouTube video ID and a language code, e.g. {"video_id": "Bw8b9YV0EPA", "transcript_id": "en"}, and the transcript will be whatever a successful call to YouTubeTranscriptApi.get_transcript("Bw8b9YV0EPA", languages=("en",)) returns.
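
One way the composite key could look in a table, sketched with an in-memory SQLite database purely for illustration. The table name, column names, the choice to store the transcript as JSON text, and the save_transcript()/load_transcript() helpers are all assumptions, not Via's actual schema or code:

```python
import json
import sqlite3

# In-memory SQLite is used only so the sketch is self-contained;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transcript (
        video_id TEXT NOT NULL,
        transcript_id TEXT NOT NULL,
        transcript TEXT NOT NULL,
        PRIMARY KEY (video_id, transcript_id)
    )
    """
)


def save_transcript(video_id, transcript_id, transcript):
    """Store one transcript under its (video_id, transcript_id) key."""
    conn.execute(
        "INSERT INTO transcript (video_id, transcript_id, transcript) VALUES (?, ?, ?)",
        (video_id, transcript_id, json.dumps(transcript)),
    )


def load_transcript(video_id, transcript_id):
    """Return the stored transcript, or None if this key has not been saved."""
    row = conn.execute(
        "SELECT transcript FROM transcript WHERE video_id = ? AND transcript_id = ?",
        (video_id, transcript_id),
    ).fetchone()
    return json.loads(row[0]) if row else None


# Two transcripts for the same video can coexist because the key includes the ID.
save_transcript("Bw8b9YV0EPA", "en", [{"text": "Hello", "start": 0.0, "duration": 1.5}])
save_transcript("Bw8b9YV0EPA", "es", [{"text": "Hola", "start": 0.0, "duration": 1.5}])
```

With the video ID alone as the key, the second save would have collided with the first; with the pair as the key, each language gets its own row.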

Known Issues

  • We're planning to replace the third-party youtube-transcript-api library with our own code, and our code will probably use a different form of transcript ID than the language codes that youtube-transcript-api uses (e.g. "en"). Using language codes as IDs is a flaw in youtube-transcript-api: a single YouTube video can have multiple transcripts with the same language code (see Missing transcripts jdepoix/youtube-transcript-api#150). When we change the transcript IDs we'll end up invalidating any existing cached transcripts in Via's DB, as these will use the old-style IDs and so the lookup code will no longer find them.

    This is fine: the transcripts for these videos will simply be re-fetched once.

    We could do extra work to migrate the IDs of these transcripts in the DB but I doubt it's worth the effort: we can just let Via re-fetch them.

    We could also do extra work to delete the invalidated transcripts in the DB but again I don't think it's worth the effort.

  • When we replace youtube-transcript-api we may also inadvertently change the transcripts of existing assignments from one English transcript to another, if the assignment is for a YouTube video that has multiple English transcripts. This is because our new code will pick a different English transcript by default than what youtube-transcript-api picks. This is fine: we discussed it in a project meeting (Thu 20th July 2023, Sean, Marcos, Rob, Alejandro, Nairi, Lee, Dan) and decided to accept the one-off risk of changing the transcripts of some existing assignments.
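
The invalidation described in the first known issue can be sketched as follows, assuming a dict keyed by (video_id, transcript_id) stands in for the DB. The new-style ID "en.example" is invented for illustration, since the issue does not say what form the new IDs will take:

```python
# A dict keyed by (video_id, transcript_id) stands in for Via's DB.
# It starts with one entry saved under an old-style language-code ID.
cache = {("Bw8b9YV0EPA", "en"): ["old-style cached transcript"]}
fetches = []


def fetch(video_id, transcript_id):
    """Stub for the real fetch; records each call so re-fetches can be counted."""
    fetches.append((video_id, transcript_id))
    return [f"fetched {transcript_id}"]


def get_transcript(video_id, transcript_id):
    """Look up by the full key and fetch only on a miss."""
    key = (video_id, transcript_id)
    if key not in cache:
        cache[key] = fetch(video_id, transcript_id)
    return cache[key]


NEW_ID = "en.example"  # hypothetical new-style transcript ID

get_transcript("Bw8b9YV0EPA", NEW_ID)  # miss: the old "en" entry does not match
get_transcript("Bw8b9YV0EPA", NEW_ID)  # hit: saved under the new-style key
```

The old-style row is never found again and simply stays behind unused, while the new-style key costs exactly one extra fetch per video.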

@seanh seanh self-assigned this Jun 29, 2023
@seanh seanh changed the title Cache video transcripts Save transcripts in the DB Jul 25, 2023
@seanh seanh added the Backend label Jul 25, 2023