This program should be incorporated into KDE Linux. With it you can use speech to enter text into any application, simply by pressing a shortcut key that you set up. Its background daemon uses less than 600 kilobytes of memory. It requires ydotool, which is available in Arch Linux. I am using it presently and to my delight it works flawlessly.
Where does voxd come from? I can't find it in Arch's main package repos, AUR, or Flathub.
I guess it is this one:
Looks like a cool app. This fellow should probably stick it on Flathub rather than producing four different package formats in his own GitHub repo. Less work for everyone, and higher visibility and availability.
I will write him that on Reddit.
There's a non-obvious catch to this. ydotool is amazing, but there's a serious limitation to all apps of this kind, imposed by the present input stack on Wayland: you can only run one. That means that if you were to include this in KDE, it'd instantly break a multitude of other stuff, or at best it'd be useless in those cases. Since they are all mutually exclusive, there's a bigger problem that really needs to be solved before stuff like this gets bundled in.
To be very clear, I'm not against this idea, I'm just pointing out a little-known and not-very-obvious but totally huge hurdle that all these projects face. I'll leave this with a quote from the developer of input-remapper (an outstanding project which was really hurt by this limitation), who knows more about this topic than me and most of us:
Development effort should be focused on improving the situation in linux in general: allowing multiple mapping tools to do their thing on top of each other, and allowing injecting any character (instead of any keycode)
A quick thing I wanted to mention: GitHub - taj-ny/InputActions: Mouse and touchpad gestures for Hyprland, Plasma 6 Wayland has been coexisting quite well with ydotool for me (as in, I have them simultaneously reading and writing keypresses and it works). As I understand it (which is not very much at all), it's actually sending input via KWin. Possibly food for thought for a dev who might think about approaching KDE integration with input tools like voxd?
Sorry for the double post; I thought that might be useful and I shouldn't have left it out last time.
I am using ydotool on KDE and it breaks nothing. I am also on Arch Linux, which is similar to KDE Linux.
I spoke with the developer and he said that he would get right on a Flatpak version.
Then you didn't test the criteria I mentioned:
Sorry, you are right. But as I don't use input gestures, I don't see the point. Of course, this is my scenario, as I prefer the simplicity of voice input.
Well, let's say (to pick an extreme example for the purpose of illustration) that someone, right now, is using a tongue mouse because they have non-working hands, and they're using, say, input-remapper to do the required mapping. If KDE then integrates ydotool, that person probably no longer has a working input device.
It's not an obvious pitfall, as you and I both noticed, so the reason I mentioned it was to avoid someone falling into it later, and also so that if some developer reads your post and thinks "hey, that's a good idea" (because it is), then they will know that, in order to make it happen, they first need to develop the "missing link" software which will allow routing of specific inputs to specific applications on Wayland (mentioned in the quote from the dev of input-remapper).
Basically: we need a "PipeWire for input devices" before this is even possible without breaking a lot of important stuff. So I agree with you, ydotool integration would be great, but if a dev wants to add it, they need to create that "PipeWire for input devices" first.
BTW, kdotool (mentioned above) does work in parallel with ydotool, because it doesn't generate input through an input device; it generates KWin scripts and runs them. So have fun with that one!
If it helps, I can quickly describe one case where I use input gestures. I often use an underpowered 11-inch laptop with KDE and Karousel (a scrollable tiling KWin script simulating the likes of Niri), with a horizontal three-finger scroll to switch tabs or applications. I adopted this gesture from ChromeOS (my 11" laptop is itself a Chromebook running Linux): three-finger horizontal scroll to move to the next application (or tab, as originally in ChromeOS) to the left or right. Personally, I don't see this use case competing with any place where I use voice input, because I mostly only use voice input for dictation. The closest thing this gesture competes with is Alt-Tab, but the gesture can be more ergonomic, especially on the small device, because I am already resting on the touchpad in anticipation of scrolling.
You are both right, of course. I should have been more specific and said I do not see the point for myself. However, I did say that this is my scenario. I am a mere end user, not a developer. I just thought that the inclusion of voice typing in KDE Linux would be a brilliant feature.
There are likely also other options for integration, e.g. similar (or even identical) to how virtual keyboards work.
I would not be surprised if other platforms use such an approach as well.
They don't.
The problem I've been discussing is not ydotool-specific; it applies to all input on Wayland.
The problem I'm trying to point out is that such a mechanism does not yet exist.
Other platforms don't worry about it because they just let everything see everything, but Wayland's security model introduces a special new problem for us.
Like, let's say we have voice control through some hypothetical software, and we have input gestures, and a virtual keyboard. How do you route the keys pressed by the voice control into the input-gesture software so that you can verbally hold Shift while you swipe, while preventing the application from seeing that virtual Shift "keypress" we made?
This is the job of that "PipeWire for input devices", as I called it above. What I'm trying to say, in more technical terms, is that we need software which will permit the user to design a directed, asynchronous graph of input devices, processing software, and target applications, so that the right inputs get to the right place in the right order and combination.
Without this, our best hope is that the mixture of input apps we use doesn't actively conflict, but there's no way to make them actively work together.
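To make that a bit more concrete, here is a toy Python sketch of the kind of graph such a routing layer might let a user express. To be clear, nothing like this exists today; every class and name below is hypothetical, invented only to illustrate the "verbal Shift while swiping" example above.

```python
# Hypothetical sketch only: no such routing layer exists. All names are
# invented to illustrate a user-defined directed graph that decides which
# inputs reach which consumer, and in what order.

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InputEvent:
    source: str    # e.g. "voice-daemon", "touchpad"
    kind: str      # e.g. "key", "gesture", "text"
    payload: str   # e.g. "Shift down", "3-finger swipe", "hello world"


@dataclass
class Node:
    name: str
    handler: Callable[[InputEvent], Optional[InputEvent]]
    outputs: List["Node"] = field(default_factory=list)

    def feed(self, ev: InputEvent) -> None:
        out = self.handler(ev)
        if out is not None:              # a handler may swallow the event
            for nxt in self.outputs:
                nxt.feed(out)


# Sinks: where events finally land.
def app_sink(ev):
    print(f"focused app sees : {ev.payload}")

def gesture_sink(ev):
    print(f"gesture tool sees: {ev.payload}")

focused_app = Node("focused-application", handler=app_sink)
gesture_tool = Node("gesture-tool", handler=gesture_sink)


def route(ev: InputEvent) -> Optional[InputEvent]:
    """Send synthetic modifiers and gestures to the gesture tool only,
    so the focused application never sees the 'verbal Shift' press."""
    if ev.kind in ("key", "gesture"):
        gesture_tool.feed(ev)
        return None                      # hide from downstream (the app)
    return ev                            # dictated text goes straight through

router = Node("router", handler=route, outputs=[focused_app])

# The user would wire this graph up; here we just simulate three events.
router.feed(InputEvent("voice-daemon", "key", "Shift down"))      # gesture tool only
router.feed(InputEvent("touchpad", "gesture", "3-finger swipe"))  # gesture tool only
router.feed(InputEvent("voice-daemon", "text", "hello world"))    # focused app only
```

The hard part is of course not the graph itself but making something like it sit below the compositor with the necessary focus handling, ordering, and security, which is exactly the missing piece.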
They do.
Maybe not perfectly yet, but the Plasma keyboard initiative seems to have made great progress. Many embedded-Linux devices use them, or other implementations, as their primary means of entering text or numbers.
On Linux (X11 and Wayland) the mechanism is called "Input Methods".
Programs that can provide text can register as such and get "connected" with text input fields when necessary, at which point they get information about the current content and other values, e.g. whether this is a numerical input field, an email address, etc.
The most common use cases are on-screen keyboards and tools for entering Asian language characters.
See this topic for a link to a blog and discussions on several other ideas.
A frontend to Voxd (and/or other voice-to-text systems) could register as such and provide text input capability.
I am pretty sure that all platforms have similar APIs/systems, even if they do such things differently.
Key features are provision of the existing text (without the input method having to use OCR to extract it from captured pixel data) and context (what the text input is for).
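For illustration, here is a rough Python mock of that flow: the input method gets activated with the field's existing text and content type, and hands back a string to commit. This is not real protocol code; all the class and method names are invented, and only the concepts (activate, surrounding text, content type, committing a string) are loosely modelled on what input-method protocols actually exchange.

```python
# Hypothetical mock of the input-method flow described above. Names are
# invented; only the concepts (activate, surrounding text, content type,
# commit) mirror what real input-method protocols pass around.

from dataclasses import dataclass
from typing import Optional


@dataclass
class TextFieldContext:
    surrounding_text: str   # existing content, so no OCR of pixels is needed
    cursor: int             # cursor position within surrounding_text
    content_type: str       # e.g. "normal", "number", "email"


class VoiceInputMethod:
    """A voice-typing frontend registered (conceptually) as an input method."""

    def __init__(self) -> None:
        self.ctx: Optional[TextFieldContext] = None

    # The compositor side would call this when a text field gains focus.
    def activate(self, ctx: TextFieldContext) -> None:
        self.ctx = ctx
        print(f"activated on a {ctx.content_type!r} field, "
              f"existing text: {ctx.surrounding_text!r}")

    def deactivate(self) -> None:
        self.ctx = None

    # Called when the speech recognizer finishes a phrase; returns the
    # string that should be committed into the focused field.
    def on_transcription(self, text: str) -> str:
        if self.ctx and self.ctx.content_type == "number":
            # crude example of using the context: keep only the digits
            text = "".join(ch for ch in text if ch.isdigit())
        return text


# Simulated session: the "compositor" side here is just direct calls.
im = VoiceInputMethod()
im.activate(TextFieldContext(surrounding_text="Dear ", cursor=5, content_type="normal"))
print("commit:", im.on_transcription("thanks for the explanation"))

im.activate(TextFieldContext(surrounding_text="", cursor=0, content_type="number"))
print("commit:", im.on_transcription("four two, I mean 42"))
```

In the real thing, registration and the activate/commit round trips go through the compositor rather than direct calls like this.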
I would be surprised if Android and iOS allow apps unrestricted access to each other's input and output.
That is a different, more complex use case than sending keys/text to the currently focused window the way keyboard input does.
Definitely worth exploring, but it will likely be much more challenging than using existing integration points.
Edit: see also this effort for moving input methods forward
Possibly off-topic, but is it possible to have more than one active at the same time?
The compositor will likely have to be an arbiter, so that two don't send text to the same input field at the same time.
If you have an on-screen keyboard and a voice-to-text system, you probably want some UI for the user to decide whether they are typing or dictating, e.g. a "microphone button" on the on-screen keyboard or a desktop shortcut to activate dictation.