Speech, transcript, & subtitles clarification

FitzFrobozz · November 2, 2024, 5:35pm

(Forgive me if I have overlooked something in the Manual, I’m going off of Speech to text and Subtitles.)

I have two questions about something that is confusing me a little.

TLDR: main goal is really to just get a transcript I just generated into text/JSON/similar form rather than XML (whose purpose in this context I’m unclear on). My initial thought was that I would somehow move the transcript into subtitles first, but I hit a wall while trying to do so.

Issue summary:

I have added an MP3 to a fresh new kdenlive project and (using Whisper) created a transcription via the Speech tools, but am confused about:

not understanding how to export the transcript to a text (or similar) file. (I’ll settle for exporting subtitles, but again, I don’t have those in the timeline yet.)
and
not understanding how to create timestamped subtitles from the transcript in the timeline

More details:

The project currently only has two tracks, video and audio. I have created subtitles in the Timeline in other projects, but not from the transcript. Am I supposed to use the magic wand button and use speech recognition, instead, even though I’ve created the speech transcript?
While troubleshooting, when I insert selections (from the Speech box) into the Timeline, they are added to the audio track and I don’t understand what that represents (my hope was to have them inserted into the subtitle section instead).
In an attempt to bypass the subtitles thing, saving text from the Speech box creates an XML file rather than something “cleaner”. (I was looking for either text or JSON or something similar; not sure if I’m supposed to clean it up myself to get it into the formats that I prefer.)
Which begs the question, what do people use the XML for? (Sorry, still relatively new to kdenlive & video editing in general.)

[Edited to add: on preview, I’m now thinking that transcripts and subtitles are not intended to work together in the way I was thinking, earlier.]

Eugen_Mohr · November 4, 2024, 4:46pm

Yes, the documentation is not so clear. There are 2 different things about speech to text.

Transcription: This is done in the Speech Editor window. Click on a clip in the Project Bin, then go to the Speech Editor window/tab and click Start recognition. This will transcribe the clip with timestamps.
Once transcription is done you have 2 possibilities:

right-click on the text, choose Select All and then Copy/Ctrl+C, and you can paste the text into any editor for further use.
select text and you can either create new sequences from this selection or insert it into the timeline. This is like editing/cutting according to the spoken words. It allows you to re-arrange a clip in a different order, or to take only the part of the clip with the correct text. This technique can be powerful for editing interviews or blogs.

Subtitle: This is done in the timeline on the subtitle track by clicking on speech recognition (the magic wand icon). You can export/import subtitles.

To conclude: transcription and subtitles have nothing to do with each other; these are technically 2 different tasks/engines inside Kdenlive.

FitzFrobozz · November 8, 2024, 9:33pm

Thanks so much for the reply and clarification, Eugen_Mohr! I really appreciate it.

I did manage to more or less master both things after experimentation and such, and have a decent workflow down now.

One thing that still eludes me: Is it possible to export a transcription to anything other than raw XML? I’d like to have transcripts that I can ultimately get into CSV with timestamps, and I’m struggling a bit with the XML. (For now, I’m doing that with Subtitles, but that has disadvantages in that the text is more broken up, compared to transcript text.)
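For anyone taking the same subtitle detour, here is a minimal sketch of turning an exported subtitle file into CSV rows. It assumes the subtitles were saved in the standard SubRip (.srt) layout (index line, timing line, text lines, blank line); the function names `srt_to_rows` and `rows_to_csv` and the sample text are made up for illustration, and nothing here reads Kdenlive's transcript XML.

```python
import csv
import io
import re

# Matches a SubRip timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def srt_to_rows(srt_text):
    """Return a list of (start, end, text) tuples parsed from SubRip text."""
    rows = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                # Everything after the timing line is the cue text;
                # multi-line cues are joined with a single space.
                text = " ".join(lines[i + 1:])
                rows.append((match.group(1), match.group(2), text))
                break
    return rows


def rows_to_csv(rows):
    """Serialize the rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start", "end", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample, not real Kdenlive output.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello and welcome.

2
00:00:03,600 --> 00:00:06,000
Today we talk
about editing.
"""

print(rows_to_csv(srt_to_rows(SAMPLE)))
```

Because SubRip timestamps contain a comma, the `csv` module quotes them automatically, so the file still opens cleanly in a spreadsheet.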

(If not, I’m wondering if I should think about creating an FR for more transcription output options.)

Eugen_Mohr · November 9, 2024, 9:44am

You mean being able to export the transcription with a leading timestamp to, let’s say:

HTML
JSON
CSV

In the following format:

TimeStamp (tab) text

So, you have the timestamp and the text in one line with a tab-stop as divider.

Correct?
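The one-line layout described above (timestamp, tab, text) could be sketched roughly as follows. This is only an illustration: the input is assumed to be a list of (start time in seconds, text) pairs, and the helper names `format_timestamp` and `to_tab_lines` are invented here, not anything Kdenlive provides.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_tab_lines(segments):
    """Return one 'timestamp<TAB>text' line per (start_seconds, text) pair."""
    lines = []
    for start, text in segments:
        # Collapse tabs and newlines inside the text so the tab divider
        # stays unambiguous and each segment stays on one line.
        clean = " ".join(text.split())
        lines.append(f"{format_timestamp(start)}\t{clean}")
    return "\n".join(lines)


# Hypothetical transcript segments, for illustration only.
segments = [
    (0.0, "Hello and welcome."),
    (3.6, "Today we talk\tabout editing."),
    (3725.25, "Thanks for watching."),
]

print(to_tab_lines(segments))
```

A tab works well as the divider because spoken text rarely contains one, whereas commas are common and would need quoting in CSV.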

FitzFrobozz · November 9, 2024, 9:09pm

For the theoretical FR? Yeah, that about sums up what I would personally look for. A simple export with those options would be really nice.

If you were feeling extra ambitious, could also consider mimicking Guide Export, e.g. do something like:

Save As: Text File, JSON, CSV, HTML
Format strings (in a text field): {{timecode}} or {{realtimecode}}, {{speechtext}} etc. (I invented this last one, could call it whatever)

…which would permit adding arbitrary custom elements to the output, as is currently possible when exporting guides. But that might feel like overkill at this stage.
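As a rough illustration of the format-string idea, here is a small sketch of how such placeholders might be filled in. It is purely hypothetical: `render` and `export_lines` are made-up names, {{speechtext}} is the invented placeholder from above, and this is not how Kdenlive's guide export is actually implemented.

```python
import re

# Matches placeholders of the form {{name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template, values):
    """Replace each {{name}} with values[name]; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


def export_lines(template, segments):
    """Render the template once per transcript segment."""
    return [render(template, segment) for segment in segments]


# Hypothetical segments; the keys mirror the placeholder names.
segments = [
    {"timecode": "00:00:01:00", "speechtext": "Hello and welcome."},
    {"timecode": "00:00:03:15", "speechtext": "Today we talk about editing."},
]

for line in export_lines("{{timecode}}\t{{speechtext}}", segments):
    print(line)
```

Leaving unknown placeholders untouched (rather than raising an error) makes typos in the format string visible in the output.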