This program should be incorporated into KDE Linux. With it you can use speech to enter text into any application, simply by pressing a shortcut key that you set up. Its background daemon uses less than 600 kilobytes of memory. It requires ydotool, which is available in Arch Linux. I am using it presently and to my delight it works flawlessly.
Where does voxd come from? I can't find it in Arch's main package repos, AUR, or Flathub.
I guess it is this one:
Looks like a cool app. This fellow should probably stick it on Flathub rather than producing four different package formats in his own GitHub repo. Less work for everyone, and higher visibility and availability.
I will write him that on Reddit.
There's a non-obvious catch to this. ydotool is amazing, but there's a serious limitation to all apps of this kind, imposed by the present input stack on Wayland: you can only run one. That means that if you were to include this in KDE, it'd instantly break a multitude of other stuff, or at best it'd be useless in those cases. Since they are all mutually exclusive, there's a bigger problem that really needs to be solved before stuff like this gets bundled in.
To be very clear, I'm not against this idea, I'm just pointing out a little-known and not-very-obvious but totally huge hurdle that all these projects face. I'll leave this with a quote from the developer of input-remapper (an outstanding project which was really hurt by this limitation), who knows more about this topic than me and most of us:
Development effort should be focused on improving the situation in linux in general: allowing multiple mapping tools to do their thing on top of each other, and allowing injecting any character (instead of any keycode)
A quick thing I wanted to mention: GitHub - taj-ny/InputActions: Mouse and touchpad gestures for Hyprland, Plasma 6 Wayland has been coexisting quite well with ydotool for me (as in, I have them simultaneously reading and writing keypresses and it works). As I understand it (which is not very much at all), it's actually sending input via KWin. Possibly food for thought for a dev who might think about approaching KDE integration with input tools like voxd?
Sorry for the double post; I thought that might be useful and I shouldn't have left it out last time.
I am using ydotool on KDE and it breaks nothing. I am also on Arch Linux, which is similar to KDE Linux.
I spoke with the developer and he said that he would get right on a Flatpak version.
Then you didn't test the criteria I mentioned:
Sorry, you are right. But as I don't use input gestures, I don't see the point. Of course, this is my scenario, as I prefer the simplicity of voice input.
Well, let's say (to pick an extreme example for the purpose of illustration) that someone, right now, is using a tongue mouse because they have non-working hands, and they're using, say, input-remapper to do the required mapping. If KDE then integrates ydotool, that person probably no longer has a working input device.
It's not an obvious pitfall, as you and I both noticed, so the reason I mentioned it was to avoid someone falling into it later, and also so that if some developer reads your post and thinks "hey, that's a good idea" (because it is), then they will know that, in order to make it happen, they first need to develop the "missing link" software which will allow routing of specific inputs to specific applications on Wayland (mentioned in the quote from the dev of input-remapper).
Basically: we need a "PipeWire for input devices" before this is even possible without breaking a lot of important stuff. So I agree with you, ydotool integration would be great, but if a dev wants to add it, they need to create that "PipeWire for input devices" first.
BTW, kdotool (mentioned above) does work in parallel with ydotool, because it doesn't generate input through an input device; it generates KWin scripts and runs them. So have fun with that one!
If it helps, I can quickly describe one case where I use input gestures. I often use an underpowered 11-inch laptop with KDE and Karousel (a scrollable tiling KWin script simulating the likes of Niri), with a horizontal three-finger scroll to switch tabs or applications. I adopted this gesture from ChromeOS (my 11" laptop is itself a Chromebook running Linux): three-finger horizontal scroll to move to the next application (or tab, as originally in ChromeOS) to the left or right. Personally, I don't see this use case competing with any place where I use voice input, because I mostly only use voice input for dictation. The closest thing this gesture competes with is Alt-Tab, but the gesture can be more ergonomic, especially on the small device, because I am already resting on the touchpad in anticipation of scrolling.
You are both right, of course. I should have been more specific and said I do not see the point for myself. However, I did say that this is my scenario. I am a mere end user, not a developer. I just thought that the inclusion of voice typing in KDE Linux would be a brilliant feature.
There are likely also other options for integration, e.g. similar (or even identical) to how virtual keyboards work.
I would not be surprised if other platforms use such an approach as well.
They don't.
The problem I've been discussing is not ydotool-specific; it applies to all input on Wayland.
The problem I'm trying to point out is that such a mechanism does not yet exist.
Other platforms don't worry about it because they just let everything see everything, but Wayland's security model introduces a special new problem for us.
Like, let's say we have voice control through some hypothetical software, and we have input gestures, and a virtual keyboard. How do you route the keys pressed by the voice control into the input-gesture software so that you can verbally hold Shift while you swipe, while preventing the application from seeing that virtual Shift "keypress" we made?
This is the job of that "PipeWire for input devices", as I called it above. What I'm trying to say, in more technical terms, is that we need software which will permit the user to design a directed, asynchronous graph of input devices, processing software, and target applications, so that the right inputs get to the right place in the right order and combination.
Without this, our best hope is that the mixture of input apps we use doesn't actively conflict, but there's no way to make them actively work together.
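To make that a bit more concrete, here is a toy Python sketch of the kind of graph such a routing layer might let a user express. To be clear, nothing like this exists today; every class and name below is hypothetical, invented only to illustrate the "verbal Shift while swiping" example above.

```python
# Hypothetical sketch only: no such routing layer exists. All names are
# invented to illustrate a user-defined directed graph that decides which
# inputs reach which consumer, and in what order.

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InputEvent:
    source: str    # e.g. "voice-daemon", "touchpad"
    kind: str      # e.g. "key", "gesture", "text"
    payload: str   # e.g. "Shift down", "3-finger swipe", "hello world"


@dataclass
class Node:
    name: str
    handler: Callable[[InputEvent], Optional[InputEvent]]
    outputs: List["Node"] = field(default_factory=list)

    def feed(self, ev: InputEvent) -> None:
        out = self.handler(ev)
        if out is not None:              # a handler may swallow the event
            for nxt in self.outputs:
                nxt.feed(out)


# Sinks: where events finally land.
def app_sink(ev):
    print(f"focused app sees : {ev.payload}")

def gesture_sink(ev):
    print(f"gesture tool sees: {ev.payload}")

focused_app = Node("focused-application", handler=app_sink)
gesture_tool = Node("gesture-tool", handler=gesture_sink)


def route(ev: InputEvent) -> Optional[InputEvent]:
    """Send synthetic modifiers and gestures to the gesture tool only,
    so the focused application never sees the 'verbal Shift' press."""
    if ev.kind in ("key", "gesture"):
        gesture_tool.feed(ev)
        return None                      # hide from downstream (the app)
    return ev                            # dictated text goes straight through

router = Node("router", handler=route, outputs=[focused_app])

# The user would wire this graph up; here we just simulate three events.
router.feed(InputEvent("voice-daemon", "key", "Shift down"))      # gesture tool only
router.feed(InputEvent("touchpad", "gesture", "3-finger swipe"))  # gesture tool only
router.feed(InputEvent("voice-daemon", "text", "hello world"))    # focused app only
```

The hard part is of course not the graph itself but making something like it sit below the compositor with the necessary focus handling, ordering, and security, which is exactly the missing piece.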
They do.
Maybe not perfectly yet, but the Plasma keyboard initiative seems to have made great progress. Many embedded-Linux devices use them, or other implementations, as their primary means of entering text or numbers.
On Linux (X11 and Wayland) the mechanism is called "Input Methods".
Programs that can provide text can register as such and get "connected" with text input fields when necessary, at which point they get information about the current content and other values, e.g. whether this is a numerical input field, an email address, etc.
The most common use cases are on-screen keyboards and tools for entering Asian language characters.
See this topic for a link to a blog and discussions on several other ideas.
A frontend to Voxd (and/or other voice-to-text systems) could register as such and provide text input capability.
I am pretty sure that all platforms have similar APIs/systems, even if they do such things differently.
Key features are provision of the existing text (without the input method having to use OCR to extract it from captured pixel data) and context (what the text input is for).
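For illustration, here is a rough Python mock of that flow: the input method gets activated with the field's existing text and content type, and hands back a string to commit. This is not real protocol code; all the class and method names are invented, and only the concepts (activate, surrounding text, content type, committing a string) are loosely modelled on what input-method protocols actually exchange.

```python
# Hypothetical mock of the input-method flow described above. Names are
# invented; only the concepts (activate, surrounding text, content type,
# commit) mirror what real input-method protocols pass around.

from dataclasses import dataclass
from typing import Optional


@dataclass
class TextFieldContext:
    surrounding_text: str   # existing content, so no OCR of pixels is needed
    cursor: int             # cursor position within surrounding_text
    content_type: str       # e.g. "normal", "number", "email"


class VoiceInputMethod:
    """A voice-typing frontend registered (conceptually) as an input method."""

    def __init__(self) -> None:
        self.ctx: Optional[TextFieldContext] = None

    # The compositor side would call this when a text field gains focus.
    def activate(self, ctx: TextFieldContext) -> None:
        self.ctx = ctx
        print(f"activated on a {ctx.content_type!r} field, "
              f"existing text: {ctx.surrounding_text!r}")

    def deactivate(self) -> None:
        self.ctx = None

    # Called when the speech recognizer finishes a phrase; returns the
    # string that should be committed into the focused field.
    def on_transcription(self, text: str) -> str:
        if self.ctx and self.ctx.content_type == "number":
            # crude example of using the context: keep only the digits
            text = "".join(ch for ch in text if ch.isdigit())
        return text


# Simulated session: the "compositor" side here is just direct calls.
im = VoiceInputMethod()
im.activate(TextFieldContext(surrounding_text="Dear ", cursor=5, content_type="normal"))
print("commit:", im.on_transcription("thanks for the explanation"))

im.activate(TextFieldContext(surrounding_text="", cursor=0, content_type="number"))
print("commit:", im.on_transcription("four two, I mean 42"))
```

In the real thing, registration and the activate/commit round trips go through the compositor rather than direct calls like this.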
I would be surprised if Android and iOS allow apps unrestricted access to each other's input and output.
That is a different, more complex use case than sending keys/text to the currently focused window the way keyboard input does.
Definitely worth exploring, but it will likely be much more challenging than using existing integration points.
Edit: see also this effort for moving input methods forward
Possibly off-topic, but is it possible to have more than one active at the same time?
The compositor will likely have to be an arbiter, so that two don't send text to the same input field at the same time.
If you have an on-screen keyboard and a voice-to-text system, you probably want some UI for the user to decide whether they are typing or dictating, e.g. a "microphone button" on the on-screen keyboard or a desktop shortcut to activate dictation.