Save transcripts in the DB #927

marcospri · 2023-05-18T12:50:12Z

Problem: Via currently calls the YouTube API to get the video transcript every time someone loads Via's YouTube video player page. This risks getting us blocked by YouTube for making too many requests to their API; it makes page loads slower than necessary (because every page load must wait for network requests to the YouTube API); and it makes page loads less reliable than necessary (any network request to a third-party service can fail). It also means that if our code for getting transcripts from the YouTube API breaks (say YouTube changes or removes their API) then all existing video assignments will break, because each assignment load will crash when trying to get the transcript.

Solution: Via's get_transcript() method should save the video transcript in Via's DB after retrieving it from the YouTube API. The next time get_transcript() is called, if the transcript for this particular video is already in the DB, it should return the transcript from the DB and not make a YouTube API request.
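A minimal sketch of the intended behaviour, not Via's actual code. The class name, the `fetch` callable and the in-memory dict standing in for the DB are all hypothetical; the `(video_id, transcript_id)` key is explained in the note on DB keys.

```python
class CachingTranscriptService:
    """Check the store first and only call the API on a miss."""

    def __init__(self, fetch, store=None):
        # `fetch` is a hypothetical callable that wraps the YouTube API request.
        self._fetch = fetch
        # A plain dict stands in for Via's DB in this sketch.
        self._store = {} if store is None else store

    def get_transcript(self, video_id, transcript_id="en"):
        key = (video_id, transcript_id)
        if key in self._store:
            # Already saved: no network request is made.
            return self._store[key]
        transcript = self._fetch(video_id, transcript_id)
        self._store[key] = transcript
        return transcript


calls = []


def fake_fetch(video_id, transcript_id):
    # Records each "API request" so we can count them.
    calls.append((video_id, transcript_id))
    return [{"text": "hello", "start": 0.0, "duration": 1.5}]


service = CachingTranscriptService(fake_fetch)
first = service.get_transcript("Bw8b9YV0EPA")
second = service.get_transcript("Bw8b9YV0EPA")
```

The second call is served from the store, so `fake_fetch` runs only once even though the transcript is requested twice.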

Note on DB keys

A single YouTube video can have multiple transcripts, for example transcripts in different languages, so we can't just use the video ID as the key when saving transcripts in the DB. Instead we need to use (video_id, transcript_id) where transcript_id is some sort of transcript identifier.

The code currently works on top of the youtube-transcript-api library, where transcript IDs are simple language codes like "en", for example:

def get_transcript(self, video_id):
    return YouTubeTranscriptApi.get_transcript(video_id, languages=("en",))

So DB keys will be a combination of a YouTube video ID and a language code, e.g. {"video_id": "Bw8b9YV0EPA", "transcript_id": "en"} (and the transcript will be whatever a successful call to YouTubeTranscriptApi.get_transcript("Bw8b9YV0EPA", languages=("en",)) returns).
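One way the composite key could look in a table, as a sketch only. It uses the standard-library sqlite3 module as a stand-in for Via's DB, and the table name, column names and helper functions are assumptions rather than the real schema. The transcript is stored as JSON text.

```python
import json
import sqlite3


def create_table(db):
    # The primary key is the (video_id, transcript_id) pair, not video_id alone.
    db.execute(
        "CREATE TABLE IF NOT EXISTS transcript ("
        " video_id TEXT NOT NULL,"
        " transcript_id TEXT NOT NULL,"
        " transcript TEXT NOT NULL,"
        " PRIMARY KEY (video_id, transcript_id))"
    )


def save_transcript(db, video_id, transcript_id, transcript):
    db.execute(
        "INSERT INTO transcript (video_id, transcript_id, transcript)"
        " VALUES (?, ?, ?)",
        (video_id, transcript_id, json.dumps(transcript)),
    )
    db.commit()


def load_transcript(db, video_id, transcript_id):
    row = db.execute(
        "SELECT transcript FROM transcript"
        " WHERE video_id = ? AND transcript_id = ?",
        (video_id, transcript_id),
    ).fetchone()
    # None signals a miss, so the caller knows to fetch from the API.
    return None if row is None else json.loads(row[0])


db = sqlite3.connect(":memory:")
create_table(db)
# Two transcripts for the same video can coexist under different IDs.
save_transcript(db, "Bw8b9YV0EPA", "en", [{"text": "Hello", "start": 0.0, "duration": 1.5}])
save_transcript(db, "Bw8b9YV0EPA", "es", [{"text": "Hola", "start": 0.0, "duration": 1.5}])
english = load_transcript(db, "Bw8b9YV0EPA", "en")
spanish = load_transcript(db, "Bw8b9YV0EPA", "es")
missing = load_transcript(db, "Bw8b9YV0EPA", "fr")
```

With video_id alone as the key, the second save would collide with the first; the pair keeps each transcript addressable.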

Known Issues

We're planning to replace the third-party youtube-transcript-api library with our own code, and our code will probably use a different form of transcript ID than the language codes that youtube-transcript-api uses (e.g. "en"). Language codes are a flaw in youtube-transcript-api: a single YouTube video can have multiple transcripts with the same language code (see Missing transcripts jdepoix/youtube-transcript-api#150). When we change the transcript IDs we'll invalidate any existing cached transcripts in Via's DB, because these will use the old-style IDs and so the lookup code will no longer find them.

This is fine: the transcripts for these videos will simply be re-fetched once.

We could do extra work to migrate the IDs of these transcripts in the DB but I doubt it's worth the effort: we can just let Via re-fetch them.
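To illustrate why no migration is needed, here is a small sketch with a dict standing in for the DB. The new-style ID value is purely hypothetical, since the future ID format has not been decided.

```python
# A dict stands in for Via's DB; it already holds one old-style entry.
cache = {("Bw8b9YV0EPA", "en"): ["old-style cached transcript"]}
fetches = []


def fetch(video_id, transcript_id):
    # Hypothetical stand-in for the request to YouTube.
    fetches.append((video_id, transcript_id))
    return ["freshly fetched transcript"]


def get_transcript(video_id, transcript_id):
    key = (video_id, transcript_id)
    if key not in cache:
        cache[key] = fetch(video_id, transcript_id)
    return cache[key]


# Hypothetical new-style ID: the old "en" row no longer matches the lookup.
new_id = "hypothetical-new-style-id"
get_transcript("Bw8b9YV0EPA", new_id)  # miss: re-fetched once
get_transcript("Bw8b9YV0EPA", new_id)  # hit: served from the cache
```

The old-style row is never found again but is otherwise harmless; it just stays in the store unused.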

We could also do extra work to delete the invalidated transcripts in the DB but again I don't think it's worth the effort.

When we replace youtube-transcript-api we may also inadvertently change the transcripts of existing assignments from one English transcript to another, if the assignment is for a YouTube video that has multiple English transcripts. This is because our new code will pick a different English transcript by default than what youtube-transcript-api picks. This is fine: we discussed it in a project meeting (Thu 20th July 2023, Sean, Marcos, Rob, Alejandro, Nairi, Lee, Dan) and decided to accept the one-off risk of changing the transcripts of some existing assignments.


marcospri mentioned this issue May 18, 2023

Backend support for youtube caption annotations hypothesis/lms#5361

Closed

seanh added the project: video annotations label Jun 15, 2023

seanh self-assigned this Jun 29, 2023

seanh mentioned this issue Jul 11, 2023

Cache YouTube transcripts in the DB #1072

Merged

seanh changed the title ~~Cache video transcripts~~ Save transcripts in the DB Jul 25, 2023

seanh added the Backend label Jul 25, 2023

seanh mentioned this issue Jul 25, 2023

Add a database to Via #1119

Closed

seanh closed this as completed in #1072 Aug 1, 2023


marcospri commented May 18, 2023 • edited by seanh
