Auto-Generated Captions API #191

Open
yashrajbharti opened this issue Dec 1, 2024 · 3 comments

yashrajbharti commented Dec 1, 2024

Introduction

The challenge we aim to address is the significant accessibility gap in web video content. Research shows that only 0.5% of web videos currently include captions, leaving a large portion of online video inaccessible to individuals who rely on them. To solve this, we propose adding an autogenerate attribute to the <track> element, enabling browsers to automatically generate captions for videos. This will improve accessibility and encourage widespread adoption of captions across the web, particularly for content creators who may lack the resources to manually create captions.
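
For illustration, the markup could look something like this (the file name and label are placeholders, and the boolean form of the attribute is illustrative; the exact semantics are specified in the explainer):

```html
<!-- Illustrative sketch: the autogenerate attribute (shown here as a boolean)
     asks the browser to generate captions for this track instead of loading
     them from a caption file. -->
<video controls src="lecture.mp4">
  <track kind="captions" srclang="en" label="English (auto-generated)" autogenerate>
</video>
```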

Read the complete Explainer

Feedback (Choose One)

Please provide all feedback below.

I welcome feedback in this thread, but encourage you to file bugs against the Explainer.


Crissov commented Dec 2, 2024

  • Why should this be an opt-in feature?
  • Why does the proposed attribute name contain a hyphen, when autocapitalize, autocomplete and autoplay don’t?
  • Should this be designed to also apply to still images or to tables (replacing summary)?


yashrajbharti commented Dec 2, 2024

> Why should this be an opt-in feature?

Requiring an opt-in for auto-generated captions ensures that developers retain control over the user experience and avoid introducing unintended side effects. For example:

  • Auto-generated captions may not meet quality expectations in all languages or for all types of content.
  • Certain video creators may prefer manual captions to maintain higher accuracy or context-specific relevance.
  • An opt-in approach allows websites to assess the impact of auto-generated captions before widespread adoption, addressing accessibility improvements incrementally.
  • It ensures backward compatibility with existing implementations where developers or platforms rely on manual captioning or have specific requirements for <track> elements without auto-generation (see the markup sketch at the end of this comment).

> Why does the proposed attribute name contain a hyphen, when autocapitalize, autocomplete and autoplay don’t?

Alternatives such as autogenerate or autocaption could be considered during discussion; I have changed the attribute name to autogenerate.

> Should this be designed to also apply to still images or to tables (replacing summary)?

The scope of this proposal is limited to video content and captions, as it addresses a significant accessibility gap (only 0.5% of web videos include captions). Expanding the functionality to still images or tables would dilute the focus of this API, but such use cases could be explored in separate proposals tailored to their unique requirements.
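
To make the opt-in and backward-compatibility points concrete, here is an illustrative sketch (file names are placeholders; the boolean form of autogenerate is assumed):

```html
<!-- Existing markup keeps working unchanged: without autogenerate,
     only the manually authored captions are used. -->
<video controls src="talk.mp4">
  <track kind="captions" srclang="en" src="talk-en.vtt" label="English" default>
</video>

<!-- Opt-in: the author explicitly requests browser-generated captions
     for a video that ships without a caption file. -->
<video controls src="interview.mp4">
  <track kind="captions" srclang="en" label="English (auto-generated)" autogenerate>
</video>
```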

yashrajbharti (Author) commented

To prototype the look and feel of auto-generated captions and support my explainer, I created a vanilla JavaScript solution that demonstrates captions on the fly. You can try it by uploading a video and playing it. The demo also includes examples with preloaded videos and real-time caption generation. I also made a Chrome extension, which can be loaded from the Captions on the Fly repo, to explore how it would work in practice.

The captions have three styles:

| Style | Description |
| --- | --- |
| Static | Captions replace the previous text and display only the latest transcribed line. |
| Scroll | Captions scroll as new lines are added, retaining old transcriptions for context. |
| Append | Captions are appended below the previous ones, keeping up to two lines of transcription visible, with older lines scrollable. |

This solution is designed to highlight different UX approaches to handling live captions dynamically. All styles are draggable and can be positioned anywhere on the screen.
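
For readers who want a sense of how such a demo can be wired up, here is a minimal, self-contained sketch of the Append style. It is not the code from the Captions on the Fly repo; it assumes the Web Speech API's SpeechRecognition (which transcribes microphone input, so the video's audio needs to reach the microphone) and only illustrates the rendering approach:

```html
<!-- Minimal illustration of the "Append" caption style; not the repo's code. -->
<video id="player" controls width="640"></video>
<input type="file" id="file" accept="video/*">
<div id="captions" style="height: 2.6em; overflow-y: auto; font: 1.1em/1.3 sans-serif;"></div>

<script>
  const video = document.getElementById('player');
  const captions = document.getElementById('captions');

  // Let the user pick a local video file to play.
  document.getElementById('file').addEventListener('change', (e) => {
    video.src = URL.createObjectURL(e.target.files[0]);
  });

  // SpeechRecognition is prefixed in Chrome.
  const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognition = new SpeechRecognition();
  recognition.continuous = true;      // keep listening across utterances
  recognition.interimResults = false; // only append finalized lines

  recognition.onresult = (event) => {
    // Append each finalized transcript as a new caption line ("Append" style):
    // roughly two lines stay visible, older lines remain scrollable.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        const line = document.createElement('p');
        line.style.margin = '0'; // avoid default <p> margins so two lines fit
        line.textContent = event.results[i][0].transcript;
        captions.appendChild(line);
        captions.scrollTop = captions.scrollHeight; // keep the newest line in view
      }
    }
  };

  // Start and stop transcription with playback.
  let listening = false;
  video.addEventListener('play', () => {
    if (!listening) { recognition.start(); listening = true; }
  });
  video.addEventListener('pause', () => {
    if (listening) { recognition.stop(); listening = false; }
  });
</script>
```

This mirrors the Append behaviour described in the table above; the Static and Scroll styles would differ only in whether a new line replaces or scrolls the previous content.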

I also hope to adapt this concept to Chrome's built-in Gemini Nano API, enabling real-time, text-based answers from video/audio. The processed data could be cached directly in the client's browser for efficiency. I feel this approach is particularly valuable for live streams, where manually adding captions in real time is impractical.
