From 9624d78b403025281c9f7391fc54956790bd409f Mon Sep 17 00:00:00 2001 From: bills Date: Mon, 13 Jan 2025 13:35:06 +0000 Subject: [PATCH 01/10] =?UTF-8?q?Create=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...n-our-internal-meetings-in-a-week.markdown | 244 ++++++++++++++++++ 1 file changed, 244 insertions(+) create mode 100644 source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown new file mode 100644 index 00000000..c6092232 --- /dev/null +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -0,0 +1,244 @@ +--- +title: I built an AI prototype that can participate in our internal meetings (in + a week) +kind: article +author: Atharva Raykar +created_at: 2025-01-13 00:00:00 UTC +layout: post +--- +[PS, this is currently slop, but I'll build on this.] + +Imagine dropping an AI assistant straight into your Google Meet calls. Not just text chat - it can hear you speak, understand context, and talk back naturally. That's what I built in about a week, with no prior experience in audio programming or Node.js. Here's how it went down. + +## The Big Picture + +Most of us are familiar with LLMs (Large Language Models) through interfaces like ChatGPT - you type something, wait a bit, and get text back. But the latest models can do so much more. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. The challenge is: how do we actually plug this intelligence into our existing tools? + +That's what this project explores. I built a bot that: +- Joins Google Meet calls like a regular participant +- Listens to everything being said +- Takes notes automatically +- Responds verbally when addressed directly +- Can potentially handle meeting-related tasks like setting reminders or assigning action items + +The interesting part isn't just what it does, but how it has to do it. The system needs to juggle multiple streams of audio, handle real-time processing, and maintain context across an entire conversation. Let's break it down. + +## System Overview + +Looking at the diagram above, we have three main components: + +1. **Browser Automation**: Using Puppeteer to control a Chrome instance that joins the Meet call +2. **Audio Pipeline**: Converting between different audio formats and managing virtual devices +3. **Gemini Integration**: Handling the actual AI interactions through WebSocket connections + +Each of these parts has its own challenges. Let's dive into them one by one. + +## The Browser Challenge + +Google Meet wasn't exactly designed with bots in mind. 
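Before any of the Meet-specific work, the browser itself has to come up in a way that lets us wire in fake audio devices without Chrome throwing permission prompts at us. Here's a minimal sketch of that launch step — the exact flag set is my assumption rather than something lifted verbatim from the prototype:

```javascript
const puppeteer = require("puppeteer");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    // Meet plus real audio generally wants a visible (or xvfb-backed) display.
    headless: false,
    args: [
      "--use-fake-ui-for-media-stream",                 // auto-accept mic/camera permission dialogs
      "--autoplay-policy=no-user-gesture-required",     // let Meet's audio play without a click
      "--disable-blink-features=AutomationControlled",  // look slightly less like a bot
    ],
  });
  const page = await browser.newPage();
  return { browser, page };
}
```
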
To get our assistant into a call, we need to: +- Launch a browser programmatically +- Navigate through Meet's UI +- Handle permissions for microphone and camera +- Capture the audio stream +- Feed our AI's responses back in + +Here's a snippet of how we handle the join process: + +```javascript +async joinMeeting(meetLink) { + await this.page.goto(meetLink, { waitUntil: "networkidle0" }); + + // Click through the initial dialog + await this.page.waitForSelector("::-p-text(Got it)"); + await this.page.click("::-p-text(Got it)"); + + // Enter the bot's name + const nameInputSelector = 'input[aria-label="Your name"]'; + await this.page.waitForSelector(nameInputSelector); + await this.page.type(nameInputSelector, "Lenso"); + + // Find and click the join button + const joinButtonSelectors = [ + "button[data-join-button]", + 'button[aria-label="Ask to join"]', + 'button[jsname="Qx7uuf"]' + ]; + + // Try multiple selectors because Meet's UI can be inconsistent + let joinButton = null; + for (const selector of joinButtonSelectors) { + joinButton = await this.page.$(selector); + if (joinButton) break; + } + + await joinButton.evaluate((b) => b.click()); +} +``` + +The code looks simple, but getting here involved a lot of trial and error. Meet's UI elements don't always have consistent selectors, and the timing of operations is crucial. + +## The Audio Pipeline + +This is where things get interesting. We need to: +1. Capture the WebM audio stream from Meet +2. Convert it to 16kHz PCM format that Gemini expects +3. Take Gemini's responses and convert them to 24kHz PCM +4. Feed that back into Meet through a virtual audio device + +Here's how we set up the audio processing: + +```javascript +class AudioProcessor { + constructor() { + this.inputStream = new BufferToStreamTransform(); + this.outputStream = new BufferToStreamTransform(); + + // Set up ffmpeg conversion pipeline + const command = ffmpeg() + .input(this.inputStream) + .inputFormat("webm") + // Output format settings for Gemini's requirements + .outputOptions([ + "-acodec pcm_s16le", // 16-bit PCM + "-ar 16000", // 16kHz sample rate + "-ac 1", // Mono + "-f s16le", // Raw PCM format + ]) + .on("error", (err) => { + log("FFmpeg error:", err); + this.outputStream.emit("error", err); + }); + + command.pipe(this.outputStream); + } +} +``` + +The trickiest part was handling the virtual audio devices. We use PulseAudio to create a virtual microphone that can both play our AI's responses and capture them for Meet: + +```javascript +async createVirtualSource(sourceName = "virtual_mic") { + // Create a null sink + const { stdout: sinkStdout } = await execAsync( + `pactl load-module module-null-sink sink_name=${sourceName}` + ); + + // Create a remap source + const { stdout: remapStdout } = await execAsync( + `pactl load-module module-remap-source ` + + `source_name=${sourceName}_input ` + + `master=${sourceName}.monitor` + ); + + // Set as default source + await execAsync(`pactl set-default-source ${sourceName}_input`); +} +``` + +## The AI Integration + +Now for the fun part - making our bot actually intelligent. We use Gemini (Google's multimodal AI model) through a WebSocket connection for real-time communication. 
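Concretely, that connection is a WebSocket session plus a JSON "setup" message that tells Gemini which model to use, which modalities we want back, the system prompt, and the tools it may call. A rough sketch — the endpoint, model name and message fields reflect my reading of the live API at the time and may have drifted, and `systemInstruction` and `noteTool` are the objects shown further down:

```javascript
const WebSocket = require("ws");

// Assumed endpoint for the bidirectional (live) API; check the current docs.
const url =
  "wss://generativelanguage.googleapis.com/ws/" +
  "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent" +
  `?key=${process.env.GEMINI_API_KEY}`;

const ws = new WebSocket(url);

ws.on("open", () => {
  // The first message configures the session.
  ws.send(JSON.stringify({
    setup: {
      model: "models/gemini-2.0-flash-exp",
      generationConfig: { responseModalities: ["AUDIO"] },
      systemInstruction,                              // the prompt object shown below
      tools: [{ functionDeclarations: [noteTool] }],  // the note-taking tool shown below
    },
  }));
});

// Meeting audio goes up as base64-encoded 16kHz PCM chunks.
function sendAudioChunk(pcmBuffer) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [
        { mimeType: "audio/pcm;rate=16000", data: pcmBuffer.toString("base64") },
      ],
    },
  }));
}
```
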
The bot needs to: +- Process incoming audio continuously +- Understand when it's being addressed +- Generate appropriate responses +- Manage tool calls for tasks like note-taking + +Here's how we set up the AI's personality: + +```javascript +const systemInstruction = { + parts: [{ + text: `You are a helpful assistant named Lenso who works for nilenso, + a software cooperative. When you hear someone speak: + 1. Listen carefully to their words + 2. Use the note_down tool to record the essence of what they are saying + 3. DO NOT RESPOND unless directly addressed by name + + Remember that you're in a Google Meet call, so multiple people can talk + to you. When you hear a new voice, ask who that person is.` + }] +}; +``` + +The tool system is particularly interesting. Instead of just chatting, the AI can perform actions: + +```javascript +const noteTool = { + name: "note_down", + description: "Notes down what was said.", + parameters: { + type: "object", + properties: { + conversational_snippet: { + type: "string", + description: "JSON STRING representation of what was said" + } + } + } +}; +``` + +## Real-world Performance + +So how does it actually work in practice? Pretty well, with some caveats: + +1. **Audio Quality**: The multiple conversions between audio formats can sometimes impact quality, but it's generally good enough for understanding. + +2. **Latency**: There's a noticeable delay between someone speaking and the AI responding, mainly due to: + - Audio processing time + - Network latency to Gemini + - Text-to-speech generation + +3. **Context Management**: The AI is surprisingly good at: + - Remembering who's who in the conversation + - Understanding when it's being addressed + - Taking relevant notes without interrupting + +4. **Integration Challenges**: The biggest issues aren't with the AI itself, but with the integration points: + - Browser automation can break if Meet's UI changes + - Audio device management needs proper setup + - Error handling across the pipeline is complex + +## Limitations + +## Future Potential + +This prototype barely scratches the surface. Some interesting possibilities: + +1. **Enhanced Tools**: + - Calendar integration for scheduling + - Task management system integration + - Real-time translation + - Meeting summarization + +2. **Better Context**: + - Screen sharing understanding + - Presentation content analysis + - Gesture recognition + - Emotion detection + +3. **Improved Interaction**: + - More natural interruption handling + - Better turn-taking in conversations + - Proactive suggestions + - Multiple personality modes + +## Key Takeaways + +1. **Multimodal is the Future**: Text-only interfaces are limiting. Natural conversation with AI feels fundamentally different. + +2. **Integration is Everything**: The AI capabilities exist, but making them work with existing tools is the real challenge. + +3. **Real-time is Hard**: Managing streams of data, handling interruptions, and maintaining context in real-time adds significant complexity. + +4. **Tools are Powerful**: Giving AI the ability to perform actions rather than just generate responses opens up huge possibilities. + +Building this was a fascinating exploration of how we might integrate AI more naturally into our daily tools. The technology is there - now it's about making it work seamlessly in the real world. + +Want to try it yourself? 
The code is all JavaScript/Node.js, and you'll need: +- A Gemini API key +- PulseAudio for audio device management +- FFmpeg for audio processing +- A Google account for Meet access + +Just remember: the future of AI isn't just about smarter models - it's about better integration. From 61119212ae370f7cd5a204dace091e8b936c7174 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:29:04 +0000 Subject: [PATCH 02/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...n-our-internal-meetings-in-a-week.markdown | 167 ++++++++---------- 1 file changed, 73 insertions(+), 94 deletions(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index c6092232..1d092b01 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -6,13 +6,19 @@ author: Atharva Raykar created_at: 2025-01-13 00:00:00 UTC layout: post --- -[PS, this is currently slop, but I'll build on this.] +The funny thing about artificial intelligence is that the astonishing amount of intelligence we have today is terribly underutilised. The bottleneck is integration, not intelligence. -Imagine dropping an AI assistant straight into your Google Meet calls. Not just text chat - it can hear you speak, understand context, and talk back naturally. That's what I built in about a week, with no prior experience in audio programming or Node.js. Here's how it went down. +In our weekly all-hands meeting, we usually assign a person to take notes of what's being discussed and spell out the outcomes, owners and action items when the meeting ends. Sometimes this person may pull out important context from previous meetings by looking at older notes. It's valuable grunt work. -## The Big Picture +Why not drop an AI assistant straight into our Google Meet calls? -Most of us are familiar with LLMs (Large Language Models) through interfaces like ChatGPT - you type something, wait a bit, and get text back. But the latest models can do so much more. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. The challenge is: how do we actually plug this intelligence into our existing tools? +I'm not quite satisfied with how AI integrations in meetings are mostly about summarising things after the fact. The process of ensuring that a meeting goes well as it happens is far more valuable than a summary. It's about ensuring things stay focused, and the right information and context is available to all participants. + +Today's AI can hear you speak, understand context, and talk back naturally. Here's how I built this in about a week. + +_video demo goes here_ + +LLMs (Large Language Models) are mainstream because of interfaces like ChatGPT - you type something, wait a bit, and get text back. But fewer people know that many models natively work with audio. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. 
The challenge is: how do we actually plug this intelligence into our existing tools? That's what this project explores. I built a bot that: - Joins Google Meet calls like a regular participant @@ -20,29 +26,30 @@ That's what this project explores. I built a bot that: - Takes notes automatically - Responds verbally when addressed directly - Can potentially handle meeting-related tasks like setting reminders or assigning action items +- So many more possibilities actually. More on this later when I wax philosophical at the end. -The interesting part isn't just what it does, but how it has to do it. The system needs to juggle multiple streams of audio, handle real-time processing, and maintain context across an entire conversation. Let's break it down. +Let me give you a sketch of how I made this. ## System Overview Looking at the diagram above, we have three main components: -1. **Browser Automation**: Using Puppeteer to control a Chrome instance that joins the Meet call +1. **Browser Automation**: Using Puppeteer to control a Google Chrome instance that joins the Meet call 2. **Audio Pipeline**: Converting between different audio formats and managing virtual devices -3. **Gemini Integration**: Handling the actual AI interactions through WebSocket connections +3. **Google Gemini Integration**: Handling the actual AI interactions through WebSocket connections Each of these parts has its own challenges. Let's dive into them one by one. ## The Browser Challenge -Google Meet wasn't exactly designed with bots in mind. To get our assistant into a call, we need to: +Google Meet wasn't exactly designed with bots in mind. The official APIs don't let you do much. To get our assistant into our call, I came up with this: - Launch a browser programmatically - Navigate through Meet's UI - Handle permissions for microphone and camera - Capture the audio stream - Feed our AI's responses back in -Here's a snippet of how we handle the join process: +Here's a sketch of how we handle the join process: ```javascript async joinMeeting(meetLink) { @@ -75,44 +82,15 @@ async joinMeeting(meetLink) { } ``` -The code looks simple, but getting here involved a lot of trial and error. Meet's UI elements don't always have consistent selectors, and the timing of operations is crucial. +The code is simple, but getting here involved some trial and error. Meet's UI elements don't always have consistent selectors, and the timing of operations is crucial. ## The Audio Pipeline This is where things get interesting. We need to: -1. Capture the WebM audio stream from Meet +1. Capture the WebM audio stream from Meet (I used puppeteer-stream for this, which is a package that uses the Chrome extension API to expose browser audio) 2. Convert it to 16kHz PCM format that Gemini expects 3. Take Gemini's responses and convert them to 24kHz PCM -4. 
Feed that back into Meet through a virtual audio device - -Here's how we set up the audio processing: - -```javascript -class AudioProcessor { - constructor() { - this.inputStream = new BufferToStreamTransform(); - this.outputStream = new BufferToStreamTransform(); - - // Set up ffmpeg conversion pipeline - const command = ffmpeg() - .input(this.inputStream) - .inputFormat("webm") - // Output format settings for Gemini's requirements - .outputOptions([ - "-acodec pcm_s16le", // 16-bit PCM - "-ar 16000", // 16kHz sample rate - "-ac 1", // Mono - "-f s16le", // Raw PCM format - ]) - .on("error", (err) => { - log("FFmpeg error:", err); - this.outputStream.emit("error", err); - }); - - command.pipe(this.outputStream); - } -} -``` +4. Feed that back into Meet through a virtual audio device (set up with PulseAudio) The trickiest part was handling the virtual audio devices. We use PulseAudio to create a virtual microphone that can both play our AI's responses and capture them for Meet: @@ -135,6 +113,8 @@ async createVirtualSource(sourceName = "virtual_mic") { } ``` +The browser automation effectively thinks that it's getting audio from the system microphone, but it's a mock microphone. I'm using `pacat` to feed audio bytes from Gemini's API to "speak" into the microphone. If I had the time, I'd have much cleaner and better ways to do this, but I wanted a proof of concept out in a week. Using `pacat` involved some hacks when I wanted to allow the user to interrupt our bot. + ## The AI Integration Now for the fun part - making our bot actually intelligent. We use Gemini (Google's multimodal AI model) through a WebSocket connection for real-time communication. The bot needs to: @@ -148,18 +128,28 @@ Here's how we set up the AI's personality: ```javascript const systemInstruction = { parts: [{ - text: `You are a helpful assistant named Lenso who works for nilenso, - a software cooperative. When you hear someone speak: - 1. Listen carefully to their words - 2. Use the note_down tool to record the essence of what they are saying - 3. DO NOT RESPOND unless directly addressed by name - - Remember that you're in a Google Meet call, so multiple people can talk - to you. When you hear a new voice, ask who that person is.` + text: `You are a helpful assistant named Lenso (who works for nilenso, a software cooperative). +When you hear someone speak: +1. Listen carefully to their words +2. Use the ${this.noteTool.name} tool to record the essence of what they are saying +3. DO NOT RESPOND. If you have to, just say "ack". + +You may respond only under these circumstances: +- You were addressed by name, and specifically asked a question. +- In these circumstances, DO NOT USE ANY TOOL. + +Remember that you're in a Google Meet call, so multiple people can talk to you. Whenever you hear a new voice, ask who that person is, make note and then only answer the question. +Make sure you remember who you're responding to. + +ALWAYS use the ${this.noteTool.name} tool when nobody is addressing you directly. Only respond to someone when you are addressed by name.` }] }; ``` +I spent a cool ten minutes to make this prompt. Anyone who has built an AI application knows the importance of prompt engineering (nb, link to that research paper about it), so consider the fact that the meeting bot proof of concept is nowhere near the level of intelligence it actually could be having. + +Oh, and I haven't even done any evals. But hey, I made this in a week. 
If this was something that's far more serious, I'd seriously emphasise the increased importance of engineering maturity when baking intelligence into your product. + The tool system is particularly interesting. Instead of just chatting, the AI can perform actions: ```javascript @@ -178,67 +168,56 @@ const noteTool = { }; ``` -## Real-world Performance +The way the Gemini API works is that it will send us a "function call" with the arguments. I can extract this call, and actually do it in our system (for now I dump notes in a text file) and return the response back to the model if needed and continue generation. -So how does it actually work in practice? Pretty well, with some caveats: +What's great about a live API like this is that it's a two-way street. The model can be listening or talking back while also simultaneously performing actions. I really like that you can interrupt it and steer the conversation. The client and server is constantly pushing events to each other and reacting on them, rather than going through a single-turn request-response cycle. -1. **Audio Quality**: The multiple conversions between audio formats can sometimes impact quality, but it's generally good enough for understanding. +## Limitations -2. **Latency**: There's a noticeable delay between someone speaking and the AI responding, mainly due to: - - Audio processing time - - Network latency to Gemini - - Text-to-speech generation +So are we there yet? Is it possible to have these AI employees join our meetings and just do things? -3. **Context Management**: The AI is surprisingly good at: - - Remembering who's who in the conversation - - Understanding when it's being addressed - - Taking relevant notes without interrupting +Given that I could get this far in a week, I think it's only a matter of time. There's a few notable limitations to address though: -4. **Integration Challenges**: The biggest issues aren't with the AI itself, but with the integration points: - - Browser automation can break if Meet's UI changes - - Audio device management needs proper setup - - Error handling across the pipeline is complex +* The security situation is currently quite bad. The more we give models access to the real world outside, the more we expose it to malicious prompt injection attacks that can hijack the model and make it do bad things. Models are currently very gullible. We can't build serious agents without mitigating this problem. +* Currently, the API only supports 15 minute interactions. After which the model has to reconnect and lose context. We also know that context windows (ie, the effective "memory" of a language model in an interaction) degrades as it gets more crowded over time. This can potentially be mitigated through good quality data retrievals baked into the integration. +* The bot does not currently match voice to person. This context needs to be fed to it somehow. One way I could think of is to observe the blue boxes when someone is speaking to infer who the voice belongs to. +* The models are currently forced to respond to everything it hears. I am currently working around it by cutting off the audio stream whenever it responds with a note-taking tool call. +* I don't like how it pronounces my name. -## Limitations +## Costs? + +I used the Gemini Flash 2.0 Experimental model, which is available for free for developers. We still don't know how much it costs in production. + +But I can speculate. Gemini 1.5 Flash, the previous model in the series is $0.075/million input tokens. 
Google's docs say that audio data takes up 32 tokens per second. -## Future Potential +Even if we assume that the new model is 10x more expensive, our meeting bot would cost less than a dollar for actively participating in an hour-long meeting. -This prototype barely scratches the surface. Some interesting possibilities: +Intelligence is cheap. -1. **Enhanced Tools**: - - Calendar integration for scheduling - - Task management system integration - - Real-time translation - - Meeting summarization +## Beyond the scrappy fiddle -2. **Better Context**: - - Screen sharing understanding - - Presentation content analysis - - Gesture recognition - - Emotion detection +This prototype barely scratches the surface. Off the top of my head, I can think of all of these things that are possible to implement with the technology we have today. -3. **Improved Interaction**: - - More natural interruption handling - - Better turn-taking in conversations - - Proactive suggestions - - Multiple personality modes +- Integrate it with calendar apps for scheduling. +- Let it set up reminders for you. +- Allow it to browse the web and scrape information for you during the meeting. +- Recall what was said in previous meetings. +- Screen sharing understanding. +- Critique a proposal. +- Delegate to an inference-time reasoning model to solve hard problems. -## Key Takeaways +## Reflections on the state of AI -1. **Multimodal is the Future**: Text-only interfaces are limiting. Natural conversation with AI feels fundamentally different. +Firstly, multimodality is a huge value unlock waiting to happen. Text-only interfaces are limiting. Natural conversation with AI feels quite different. Humans use a lot of show-and-tell to work with each other. -2. **Integration is Everything**: The AI capabilities exist, but making them work with existing tools is the real challenge. +More importantly, integration is everything. A lot of the intelligence we have created often goes to waste, because it exists in a vacuum, unable to interact with the world around it. They lack the necessary sensors and actuators (to borrow terminology I once read in Norvig and Russell's seminal AI textbook). -3. **Real-time is Hard**: Managing streams of data, handling interruptions, and maintaining context in real-time adds significant complexity. +It's not enough that the models we have are smart. They need to be easy and natural to work with in order to provide value to businesses and society. That means we need to go beyond chatting with text emitter. -4. **Tools are Powerful**: Giving AI the ability to perform actions rather than just generate responses opens up huge possibilities. +## Appendix -Building this was a fascinating exploration of how we might integrate AI more naturally into our daily tools. The technology is there - now it's about making it work seamlessly in the real world. +"Talk is cheap. Show me the code!" -Want to try it yourself? The code is all JavaScript/Node.js, and you'll need: -- A Gemini API key -- PulseAudio for audio device management -- FFmpeg for audio processing -- A Google account for Meet access +Okay, here you go: _link_ -Just remember: the future of AI isn't just about smarter models - it's about better integration. +But this is prototype quality code. Please do not let this go anywhere near production! 
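One more sketch for the curious, because it's the least obvious bit of plumbing: how Gemini's audio gets "spoken" into the virtual microphone with `pacat`. This assumes the null sink is named `virtual_mic` as in the earlier snippet and that the model's audio arrives as 24kHz 16-bit mono PCM; the interruption bit is a simplified stand-in for the hacks mentioned earlier, not the actual implementation:

```javascript
const { spawn } = require("child_process");

// Play raw PCM into the null sink; its monitor feed is what Meet "hears" as the mic.
const pacat = spawn("pacat", [
  "--playback",
  "--device=virtual_mic", // the null sink created by createVirtualSource
  "--format=s16le",       // assuming 16-bit little-endian samples
  "--rate=24000",         // Gemini's output sample rate
  "--channels=1",
]);

// Each audio chunk coming back from Gemini is written straight to pacat's stdin.
function playGeminiAudio(pcmBuffer) {
  pacat.stdin.write(pcmBuffer);
}

// Crude interruption: kill the pacat process, dropping whatever audio is still buffered.
function stopSpeaking() {
  pacat.kill("SIGKILL");
}
```
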
From 179f08877ce1978ed7f4e3b63fd26fe78d71adf8 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:30:41 +0000 Subject: [PATCH 03/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 ++ 1 file changed, 2 insertions(+) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index 1d092b01..b65fa7ea 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -32,6 +32,8 @@ Let me give you a sketch of how I made this. ## System Overview +_TODO: Diagram_ + Looking at the diagram above, we have three main components: 1. **Browser Automation**: Using Puppeteer to control a Google Chrome instance that joins the Meet call From 6ebf390e7caaf5f9c5fce1dfd2d361c431cc95f9 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:33:18 +0000 Subject: [PATCH 04/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index b65fa7ea..02ad3d13 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -18,7 +18,7 @@ Today's AI can hear you speak, understand context, and talk back naturally. Here _video demo goes here_ -LLMs (Large Language Models) are mainstream because of interfaces like ChatGPT - you type something, wait a bit, and get text back. But fewer people know that many models natively work with audio. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. The challenge is: how do we actually plug this intelligence into our existing tools? +LLMs (Large Language Models) are mainstream because of interfaces like ChatGPT - you type something, wait a bit, and get text back. Far fewer people know that models can also natively work with audio. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. The challenge is: how do we actually plug this intelligence into our existing tools? That's what this project explores. 
I built a bot that: - Joins Google Meet calls like a regular participant From e9ed9b637ba1d651b76eacb1f9f0b342b87085b9 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:34:48 +0000 Subject: [PATCH 05/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index 02ad3d13..cc32c89d 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -115,7 +115,7 @@ async createVirtualSource(sourceName = "virtual_mic") { } ``` -The browser automation effectively thinks that it's getting audio from the system microphone, but it's a mock microphone. I'm using `pacat` to feed audio bytes from Gemini's API to "speak" into the microphone. If I had the time, I'd have much cleaner and better ways to do this, but I wanted a proof of concept out in a week. Using `pacat` involved some hacks when I wanted to allow the user to interrupt our bot. +The browser automation effectively thinks that it's getting audio from the system microphone, but it's a mock microphone. I'm using `pacat` to feed audio bytes from Gemini's API to "speak" into the microphone. If I had the time, I'd have much cleaner and better ways to do this, but I wanted a proof of concept out in a week. Using the simplistic `pacat` also called for some ugly hacks to allow users to interrupt our bot. ## The AI Integration From 3ca40b401b55c7a058f3c20b2794e4299a5748a7 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:36:04 +0000 Subject: [PATCH 06/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index cc32c89d..6499c35d 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -150,7 +150,7 @@ ALWAYS use the ${this.noteTool.name} tool when nobody is addressing you directly I spent a cool ten minutes to make this prompt. Anyone who has built an AI application knows the importance of prompt engineering (nb, link to that research paper about it), so consider the fact that the meeting bot proof of concept is nowhere near the level of intelligence it actually could be having. -Oh, and I haven't even done any evals. 
But hey, I made this in a week. If this was something that's far more serious, I'd seriously emphasise the increased importance of engineering maturity when baking intelligence into your product. +Oh, and I haven't even done any evals. But hey, I made this in a week. If this was something that's far more serious, I'd seriously emphasise the increased importance of engineering maturity when baking intelligence into your product. (this should link to the govind article. govind pls wrap this up) The tool system is particularly interesting. Instead of just chatting, the AI can perform actions: From 5ce1d289515a723da309abec35d4a950edc0435c Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:36:51 +0000 Subject: [PATCH 07/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index 6499c35d..517496cb 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -170,7 +170,7 @@ const noteTool = { }; ``` -The way the Gemini API works is that it will send us a "function call" with the arguments. I can extract this call, and actually do it in our system (for now I dump notes in a text file) and return the response back to the model if needed and continue generation. +The way the Gemini API works is that it will send us a "function call" with the arguments. I can extract this call, and actually perform it in our system (for now I dump notes in a text file) and return the response back to the model if needed and continue generation. What's great about a live API like this is that it's a two-way street. The model can be listening or talking back while also simultaneously performing actions. I really like that you can interrupt it and steer the conversation. The client and server is constantly pushing events to each other and reacting on them, rather than going through a single-turn request-response cycle. 
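To make that round trip concrete, here's roughly what handling a `note_down` call looks like on our side — assuming `ws` is the open WebSocket session, and with the caveat that the exact message field names are my best reading of the live API rather than gospel:

```javascript
const fs = require("fs");

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());

  // The server pushes a toolCall message whenever the model decides to use a tool.
  const calls = (msg.toolCall && msg.toolCall.functionCalls) || [];
  for (const call of calls) {
    if (call.name !== "note_down") continue;

    // Do the real-world side effect: for now, append the note to a plain text file.
    fs.appendFileSync("meeting-notes.txt", call.args.conversational_snippet + "\n");

    // Tell the model the call went through so it can keep going.
    ws.send(JSON.stringify({
      toolResponse: {
        functionResponses: [
          { id: call.id, name: call.name, response: { result: "noted" } },
        ],
      },
    }));
  }
});
```
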
From cf7fc4858089e2aa465fc7830889af8b3a3802d6 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:37:46 +0000 Subject: [PATCH 08/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index 517496cb..555fa786 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -178,7 +178,7 @@ What's great about a live API like this is that it's a two-way street. The model So are we there yet? Is it possible to have these AI employees join our meetings and just do things? -Given that I could get this far in a week, I think it's only a matter of time. There's a few notable limitations to address though: +Given how far I could get in a week, I think it's only a matter of time that we'll see more AI employees show up in meetings. There's a few notable limitations to address though: * The security situation is currently quite bad. The more we give models access to the real world outside, the more we expose it to malicious prompt injection attacks that can hijack the model and make it do bad things. Models are currently very gullible. We can't build serious agents without mitigating this problem. * Currently, the API only supports 15 minute interactions. After which the model has to reconnect and lose context. We also know that context windows (ie, the effective "memory" of a language model in an interaction) degrades as it gets more crowded over time. This can potentially be mitigated through good quality data retrievals baked into the integration. From 8fb41d30ef5e30d499702ebf254732b47ca6a453 Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:38:25 +0000 Subject: [PATCH 09/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index 555fa786..c49362b8 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -188,7 +188,7 @@ Given how far I could get in a week, I think it's only a matter of time that we' ## Costs? -I used the Gemini Flash 2.0 Experimental model, which is available for free for developers. We still don't know how much it costs in production. 
+I used the Gemini Flash 2.0 Experimental model, which is free to try for development purposes. We still don't know how much it costs in production. But I can speculate. Gemini 1.5 Flash, the previous model in the series is $0.075/million input tokens. Google's docs say that audio data takes up 32 tokens per second. From 6075f7fdebd9e76a1b39da74e7d76fca17b933dd Mon Sep 17 00:00:00 2001 From: bills Date: Tue, 14 Jan 2025 12:39:18 +0000 Subject: [PATCH 10/10] =?UTF-8?q?Update=20Post=20=E2=80=9C2025-01-13-i-bui?= =?UTF-8?q?lt-an-ai-prototype-that-can-participate-in-our-internal-meeting?= =?UTF-8?q?s-in-a-week=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-can-participate-in-our-internal-meetings-in-a-week.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown index c49362b8..8aacebc5 100644 --- a/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown +++ b/source/_posts/2025-01-13-i-built-an-ai-prototype-that-can-participate-in-our-internal-meetings-in-a-week.markdown @@ -214,7 +214,7 @@ Firstly, multimodality is a huge value unlock waiting to happen. Text-only inter More importantly, integration is everything. A lot of the intelligence we have created often goes to waste, because it exists in a vacuum, unable to interact with the world around it. They lack the necessary sensors and actuators (to borrow terminology I once read in Norvig and Russell's seminal AI textbook). -It's not enough that the models we have are smart. They need to be easy and natural to work with in order to provide value to businesses and society. That means we need to go beyond chatting with text emitter. +It's not enough that the models we have are smart. They need to be easy and natural to work with in order to provide value to businesses and society. That means we need to go beyond our current paradigm of chatting with text emitters. ## Appendix