Problem: Via currently calls the YouTube API to get the video transcript every time someone loads Via's YouTube video player page. This risks getting us blocked by YouTube for making too many requests to their API; it makes page loads slower than necessary (because every page load must wait for network requests to the YouTube API); and it makes page loads more unreliable than necessary (whenever we make network requests to a third-party service there's a chance the request will fail). It also means that if our code for getting transcripts from the YouTube API breaks (say YouTube change or remove their API) then all existing video assignments will break, as each assignment load will crash when trying to get the transcript.
Solution: Via's get_transcript() method should save the transcript in Via's DB after retrieving it from the YouTube API. The next time get_transcript() is called, if the transcript for this particular video is already in the DB, it should return the transcript from the DB and not make a YouTube API request.
Note on DB keys
A single YouTube video can have multiple transcripts, for example transcripts in different languages, so we can't just use the video ID as the key when saving transcripts in the DB. Instead we need to use (video_id, transcript_id) where transcript_id is some sort of transcript identifier.
The way the code currently works (based on the youtube-transcript-api library) is that transcript IDs are simple language codes like "en".
So DB keys will be a combination of a YouTube video ID and a language code, e.g. {"video_id": "Bw8b9YV0EPA", "transcript_id": "en"} (and the transcript will be whatever a successful call to YouTubeTranscriptApi.get_transcript("Bw8b9YV0EPA", languages=("en",)) returns).
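A minimal sketch of the read-through caching described above, keyed on (video_id, transcript_id). This is not Via's actual code: it uses an in-memory SQLite table as a stand-in for Via's DB, the table layout and the make_db() helper are hypothetical, and the fetch argument stands in for whatever function actually calls the YouTube API.

```python
import json
import sqlite3
from typing import Callable

# A transcript is assumed here to be a JSON-serialisable list of dicts.
Transcript = list[dict]


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for Via's DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript ("
        " video_id TEXT NOT NULL,"
        " transcript_id TEXT NOT NULL,"
        " transcript TEXT NOT NULL,"
        " PRIMARY KEY (video_id, transcript_id))"
    )
    return conn


def get_transcript(
    conn: sqlite3.Connection,
    video_id: str,
    transcript_id: str,
    fetch: Callable[[str, str], Transcript],
) -> Transcript:
    """Return the cached transcript, calling fetch() only on a cache miss."""
    row = conn.execute(
        "SELECT transcript FROM transcript"
        " WHERE video_id = ? AND transcript_id = ?",
        (video_id, transcript_id),
    ).fetchone()
    if row is not None:
        # Cache hit: no request to the third-party API is made.
        return json.loads(row[0])

    # Cache miss: fetch once, then store. If fetch() raises, nothing is saved.
    transcript = fetch(video_id, transcript_id)
    conn.execute(
        "INSERT INTO transcript (video_id, transcript_id, transcript)"
        " VALUES (?, ?, ?)",
        (video_id, transcript_id, json.dumps(transcript)),
    )
    conn.commit()
    return transcript
```

With this shape, a second call for ("Bw8b9YV0EPA", "en") is served from the DB, while a call for the same video with a different transcript_id is a separate cache entry and triggers its own fetch.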
Known Issues
We're planning to replace the third-party youtube-transcript-api library with our own code, and our code will probably use a different form of transcript ID than the language codes that youtube-transcript-api uses (e.g. "en"). Using language codes as IDs is a flaw in youtube-transcript-api: a single YouTube video can have multiple transcripts with the same language code (see Missing transcripts jdepoix/youtube-transcript-api#150). When we change the transcript IDs we'll end up invalidating any existing cached transcripts in Via's DB, as these will use the old-style IDs and so the lookup code will no longer find them.
This is fine: the transcripts for these videos will simply be re-fetched once.
We could do extra work to migrate the IDs of these transcripts in the DB but I doubt it's worth the effort: we can just let Via re-fetch them.
We could also do extra work to delete the invalidated transcripts in the DB but again I don't think it's worth the effort.
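To illustrate why a change of ID format invalidates the cache without breaking anything, here is a small sketch. A plain dict stands in for the DB table, and the new-style ID shown is purely hypothetical, since the new format has not been decided.

```python
# A dict keyed by (video_id, transcript_id) stands in for Via's DB table.
cache = {
    ("Bw8b9YV0EPA", "en"): [{"text": "transcript cached under the old-style ID"}],
}


def lookup(video_id: str, transcript_id: str):
    """Return the cached transcript, or None on a cache miss."""
    return cache.get((video_id, transcript_id))


# Hypothetical new-style ID; the real format is still to be decided.
NEW_STYLE_ID = "en.example-variant"

old_hit = lookup("Bw8b9YV0EPA", "en")          # found under the old-style key
new_miss = lookup("Bw8b9YV0EPA", NEW_STYLE_ID)  # not found: triggers one re-fetch
```

The old row is simply never matched again, so the only cost is one extra fetch per video plus some unused rows left in the table.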
When we replace youtube-transcript-api we may also inadvertently change the transcripts of existing assignments from one English transcript to another, if the assignment is for a YouTube video that has multiple English transcripts. This is because our new code will pick a different English transcript by default than what youtube-transcript-api picks. This is fine: we discussed it in a project meeting (Thu 20th July 2023, Sean, Marcos, Rob, Alejandro, Nairi, Lee, Dan) and decided to accept the one-off risk of changing the transcripts of some existing assignments.