# Technical Specifications
## Processing Pipeline
```mermaid
graph TD
    A[Input Video] --> B[Audio Processing]
    B --> C[Transcription]
    B --> D[Video Segmentation]
    C --> E[Topic Modeling]
    E --> F[Segment Analysis]
    D --> F
    F --> G[Final Output]

    subgraph "Audio Processing"
        B1[Normalize Audio] --> B2[Convert to Mono]
        B2 --> B3[Resample 16kHz]
        B3 --> B4[Apply Filters]
    end

    subgraph "AI Services"
        C1[Deepgram/Groq] --> C2[Speech to Text]
        F1[Gemini] --> F2[Visual Analysis]
    end
```
## Component Architecture
```mermaid
flowchart TB
    subgraph Core["Core Processing"]
        direction TB
        main[Main Pipeline] --> checkpoint[Checkpoint System]
        checkpoint --> processor[Process Manager]
    end

    subgraph Audio["Audio Processing"]
        direction TB
        ffmpeg[FFmpeg] --> normalize[Audio Normalization]
        normalize --> silence[Silence Detection]
        silence --> convert[Format Conversion]
    end

    subgraph AI["AI Services"]
        direction TB
        transcribe[Transcription APIs] --> topic[Topic Modeling]
        topic --> visual[Visual Analysis]
    end

    subgraph Storage["File Management"]
        direction TB
        project[Project Structure] --> cache[Cache System]
        cache --> results[Results Storage]
    end

    Core --> Audio
    Core --> AI
    Core --> Storage
```
## Detailed Specifications
### Audio Processing Parameters
#### FFmpeg Configuration
```bash
ffmpeg -i {input} \
    -af "highpass=f=200,acompressor=threshold=-12dB:ratio=4:attack=5:release=50" \
    -ar 16000 \
    -ac 1 \
    -c:a aac \
    -b:a 128k \
    {output}
```
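A minimal sketch of how the pipeline might shell out to this command from Python. The wrapper function name and paths are illustrative, not the project's actual API:

```python
import subprocess

def preprocess_audio(input_path: str, output_path: str) -> None:
    """Apply the high-pass/compressor filter chain, then downmix to 16 kHz mono AAC."""
    filters = "highpass=f=200,acompressor=threshold=-12dB:ratio=4:attack=5:release=50"
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", input_path,
            "-af", filters,
            "-ar", "16000",  # resample to 16 kHz for the transcription APIs
            "-ac", "1",      # downmix to mono
            "-c:a", "aac",
            "-b:a", "128k",
            output_path,
        ],
        check=True,
    )
```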
#### Audio Normalization
```bash
ffmpeg-normalize {input} \
    -pr \
    -tp -9.0 \
    -nt rms \
    -prf "highpass=f=100" \
    -prf "dynaudnorm=p=0.4:s=15" \
    -pof "lowpass=f=8000" \
    -ar 48000 \
    -c:a pcm_s16le \
    --keep-loudness-range-target
```
#### Silence Detection
```python
SILENCE_PARAMS = {
"duration": "1.5", # Minimum silence duration in seconds
"threshold": "-25" # Silence threshold in dB
}
```
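These values map onto FFmpeg's `silencedetect` filter. A hedged sketch of applying them and parsing the results; the helper below is illustrative, not the project's actual implementation:

```python
import re
import subprocess

def detect_silences(input_path: str) -> list[tuple[float, float]]:
    """Run silencedetect and parse (start, end) pairs from ffmpeg's stderr log."""
    # Uses SILENCE_PARAMS as defined above.
    af = (
        f"silencedetect=noise={SILENCE_PARAMS['threshold']}dB"
        f":d={SILENCE_PARAMS['duration']}"
    )
    proc = subprocess.run(
        ["ffmpeg", "-i", input_path, "-af", af, "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # ffmpeg reports "silence_start: <t>" and "silence_end: <t>" on stderr.
    starts = [float(t) for t in re.findall(r"silence_start: ([\d.]+)", proc.stderr)]
    ends = [float(t) for t in re.findall(r"silence_end: ([\d.]+)", proc.stderr)]
    return list(zip(starts, ends))
```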
### AI Model Configurations
#### Deepgram Transcription
```python
DEEPGRAM_CONFIG = {
"model": "nova-2",
"language": "en",
"features": {
"topics": True,
"intents": True,
"smart_format": True,
"punctuate": True,
"paragraphs": True,
"utterances": True,
"diarize": True,
"filler_words": True,
"sentiment": True
}
}
```
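A minimal sketch of sending audio with these options, assuming the `deepgram-sdk` v3 Python client. The exact call chain differs between SDK versions, so treat it as illustrative:

```python
import os

from deepgram import DeepgramClient, PrerecordedOptions  # assumes deepgram-sdk v3

def transcribe_deepgram(audio_path: str):
    """Send a local audio file to Deepgram with the feature set listed above."""
    client = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
    options = PrerecordedOptions(
        model="nova-2",
        language="en",
        topics=True,
        intents=True,
        smart_format=True,
        punctuate=True,
        paragraphs=True,
        utterances=True,
        diarize=True,
        filler_words=True,
        sentiment=True,
    )
    with open(audio_path, "rb") as f:
        payload = {"buffer": f.read()}
    # Response carries the transcript plus topic/intent/sentiment payloads.
    return client.listen.prerecorded.v("1").transcribe_file(payload, options)
```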
#### Groq Transcription
```python
GROQ_CONFIG = {
"model": "whisper-large-v3",
"temperature": 0.2,
"response_format": "verbose_json",
"language": "en"
}
```
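A sketch of the equivalent call with the official `groq` Python package, whose audio API is OpenAI-compatible. The wrapper function is illustrative:

```python
import os

from groq import Groq  # assumes the official groq package

def transcribe_groq(audio_path: str):
    """Transcribe with Whisper large-v3 via Groq's audio transcription endpoint."""
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(
            file=(os.path.basename(audio_path), f.read()),
            model="whisper-large-v3",
            temperature=0.2,
            response_format="verbose_json",  # includes per-segment timestamps
            language="en",
        )
```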
#### Topic Modeling (LDA)
```python
LDA_PARAMS = {
"num_topics": 5,
"random_state": 100,
"chunksize": 100,
"passes": 10,
"per_word_topics": True,
"minimum_probability": 0.0
}
```
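These are gensim `LdaModel` keyword arguments. A self-contained sketch of fitting the model over tokenized transcript segments; the preprocessing here is deliberately simplified:

```python
from gensim import corpora
from gensim.models import LdaModel

def model_topics(documents: list[list[str]]) -> LdaModel:
    """Fit LDA over tokenized transcript segments using LDA_PARAMS from above."""
    dictionary = corpora.Dictionary(documents)           # token -> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors
    return LdaModel(corpus=corpus, id2word=dictionary, **LDA_PARAMS)
```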
#### Gemini Visual Analysis
```python
GEMINI_CONFIG = {
"model": "gemini-1.5-pro-latest",
"analysis_prompt": """
Analyze this video segment.
The transcript for this segment is: '{transcript}'.
Describe the main subject matter, key visual elements,
and how they relate to the transcript.
"""
}
```
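A sketch of driving this analysis with the `google-generativeai` package. The upload-and-poll flow is an assumption about the integration, not taken from the project code; `genai.configure(api_key=...)` is assumed to have been called:

```python
import time

import google.generativeai as genai  # assumes the google-generativeai package

def analyze_segment(video_path: str, transcript: str) -> str:
    """Upload a segment and ask Gemini to relate its visuals to the transcript."""
    video_file = genai.upload_file(video_path)
    while video_file.state.name == "PROCESSING":  # video needs server-side processing
        time.sleep(2)
        video_file = genai.get_file(video_file.name)
    model = genai.GenerativeModel(GEMINI_CONFIG["model"])
    prompt = GEMINI_CONFIG["analysis_prompt"].format(transcript=transcript)
    response = model.generate_content([video_file, prompt])
    return response.text
```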
## Recovery System
```mermaid
stateDiagram-v2
    [*] --> PROJECT_CREATED: Init
    PROJECT_CREATED --> AUDIO_PROCESSED: Process Audio
    AUDIO_PROCESSED --> TRANSCRIPTION_COMPLETE: Transcribe
    TRANSCRIPTION_COMPLETE --> TOPIC_MODELING_COMPLETE: Model Topics
    TOPIC_MODELING_COMPLETE --> SEGMENTS_IDENTIFIED: Identify Segments
    SEGMENTS_IDENTIFIED --> VIDEO_ANALYZED: Analyze Segments
    VIDEO_ANALYZED --> PROCESS_COMPLETE: Complete

    state "Error Recovery" as error {
        Failure --> LoadCheckpoint
        LoadCheckpoint --> ResumeProcess
        ResumeProcess --> ReturnToLastState
    }
```
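A minimal sketch of the checkpoint mechanics this state machine implies. State names follow the diagram; the JSON persistence format is an assumption:

```python
import json
from pathlib import Path

STATES = [
    "PROJECT_CREATED", "AUDIO_PROCESSED", "TRANSCRIPTION_COMPLETE",
    "TOPIC_MODELING_COMPLETE", "SEGMENTS_IDENTIFIED", "VIDEO_ANALYZED",
    "PROCESS_COMPLETE",
]

def save_checkpoint(project_dir: Path, state: str, data: dict) -> None:
    """Persist the last completed state so a failed run can resume from it."""
    payload = {"state": state, "data": data}
    (project_dir / "checkpoint.json").write_text(json.dumps(payload))

def load_checkpoint(project_dir: Path) -> dict:
    """Return the saved checkpoint, or the initial state when none exists."""
    path = project_dir / "checkpoint.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"state": "PROJECT_CREATED", "data": {}}
```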
## File Structure
```mermaid
graph TD
    A[Project Root] --> B[src/]
    A --> C[tests/]
    A --> D[docs/]
    B --> E[video_topic_splitter/]
    E --> F[__init__.py]
    E --> G[audio.py]
    E --> H[transcription.py]
    E --> I[topic_modeling.py]
    E --> J[video_analysis.py]
    E --> K[project.py]
    E --> L[core.py]
    E --> M[cli.py]
    C --> N[test_audio.py]
    C --> O[test_transcription.py]
    C --> P[test_topic_modeling.py]
```
## Output Format
### Segment Analysis JSON Structure
```json
{
"segment_id": 1,
"start_time": 0.0,
"end_time": 120.5,
"transcript": "...",
"topic": {
"id": 2,
"keywords": ["..."],
"confidence": 0.85
},
"visual_analysis": {
"description": "...",
"key_elements": ["..."],
"transcript_correlation": 0.92
}
}
```
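A sketch of building and serializing this structure with Python dataclasses; the class names are illustrative:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Topic:
    id: int
    keywords: list[str]
    confidence: float

@dataclass
class VisualAnalysis:
    description: str
    key_elements: list[str]
    transcript_correlation: float

@dataclass
class Segment:
    segment_id: int
    start_time: float
    end_time: float
    transcript: str
    topic: Topic
    visual_analysis: VisualAnalysis

# Serializes to the JSON schema shown above.
segment = Segment(1, 0.0, 120.5, "...", Topic(2, ["..."], 0.85),
                  VisualAnalysis("...", ["..."], 0.92))
print(json.dumps(asdict(segment), indent=2))
```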
## Project Organization
```mermaid
graph LR
    A[Input] --> B{Project Manager}
    B --> C[Audio Pipeline]
    B --> D[Transcription Service]
    B --> E[Topic Analysis]
    B --> F[Visual Processing]
    C --> G{Checkpoint System}
    D --> G
    E --> G
    F --> G
    G --> H[Results]
    G --> I[Recovery]
```
## Performance Considerations
### Resource Usage
- Audio Processing: ~2x input duration
- Transcription: API dependent
- Topic Modeling: O(n*k*i) where:
  - n = document length
  - k = number of topics
  - i = iteration count
### Recommended Specifications
- CPU: 4+ cores
- RAM: 8GB minimum
- Storage: 3x input video size
- GPU: Optional, improves video processing
## Attribution
The technical specifications above were provided by the user. Claude (Anthropic) implemented them as working code and generated this documentation, including the diagrams, from the implementation details.