kwin-mcp: MCP server for GUI automation using KWin's EIS interface and virtual sessions

Hi everyone,

I’ve built an open-source MCP server called kwin-mcp that enables AI agents to perform full GUI automation on KDE Plasma 6 desktops. It runs inside completely isolated KWin Wayland sessions — no windows appear on your host display, no input leaks out.

How it works

Each session creates three layers of isolation:

  1. A private D-Bus bus via dbus-run-session
  2. A virtual KWin Wayland compositor via kwin_wayland --virtual
  3. Input injection scoped to that compositor via KWin’s EIS D-Bus interface
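For concreteness, here's a minimal Python sketch of how the first two layers compose. Only `dbus-run-session` and `kwin_wayland --virtual` come from the description above; any extra flags are hypothetical.

```python
# Sketch of the first two isolation layers: a private D-Bus bus wrapping a
# virtual (headless) KWin compositor. The exact flags kwin-mcp passes are an
# assumption; only `dbus-run-session` and `kwin_wayland --virtual` are given.
import subprocess

def build_session_cmd(extra_kwin_args=()):
    """Build the command line for an isolated virtual KWin session."""
    return [
        "dbus-run-session", "--",     # layer 1: private D-Bus bus
        "kwin_wayland", "--virtual",  # layer 2: headless compositor
        *extra_kwin_args,
    ]

def launch_session():
    # Layer 3 (EIS input injection) happens over this session's private bus
    # once the compositor is up; not shown here.
    return subprocess.Popen(build_session_cmd())

print(build_session_cmd())
```

Because the compositor lives inside its own bus, everything it spawns (and everything the server injects) stays scoped to that session.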

The server then exposes 29 tools through MCP (Model Context Protocol): mouse input, keyboard, multi-touch gestures, screenshots via ScreenShot2 D-Bus, accessibility tree queries via AT-SPI2, clipboard, window management, and generic D-Bus calls.
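As a rough illustration of what a tool surface like this boils down to (the tool names and signatures below are hypothetical, not kwin-mcp's actual API), an MCP server is essentially a dispatch table of named, typed operations:

```python
# Hypothetical sketch of an MCP tool surface; names and signatures are
# illustrative only, not kwin-mcp's real API.
def mouse_click(x: int, y: int, button: str = "left") -> dict:
    # The real server would inject input via KWin's EIS interface here.
    return {"tool": "mouse_click", "x": x, "y": y, "button": button}

def screenshot() -> dict:
    # The real server would call the ScreenShot2 D-Bus interface here.
    return {"tool": "screenshot"}

TOOLS = {"mouse_click": mouse_click, "screenshot": screenshot}

def call_tool(name: str, **kwargs) -> dict:
    """Dispatch an MCP tool call by name, as an MCP client request would."""
    return TOOLS[name](**kwargs)

print(call_tool("mouse_click", x=100, y=200))
```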

Why KWin?

KWin is the only Wayland compositor I’m aware of that exposes an EIS (Emulated Input Server) interface via D-Bus. This is critical because it provides a clean path for input injection without triggering XDG RemoteDesktop portal authorization dialogs. Since we own the isolated session, we can bypass the portal entirely.

What you can do with it

  • AI-driven GUI testing: Let Claude Code or other MCP clients interact with KDE applications
  • Headless automation: Run GUI workflows in CI/CD without a display
  • Accessibility inspection: Query the AT-SPI2 widget tree programmatically

Real-world use case: E2E testing a KDE Plasma dock

I’m actually using kwin-mcp myself to develop krema, a dock for KDE Plasma 6. I develop it with Claude Code, and kwin-mcp handles automated E2E testing — launching the dock in an isolated KWin session, clicking icons, verifying window previews, testing drag-and-drop, all without touching my actual desktop.

Here’s krema running on my desktop:

This is the kind of workflow kwin-mcp enables: write a KDE app, and let an AI agent test its GUI automatically in an isolated environment.

Screenshots & performance

Screenshot capture runs at ~30-70ms per frame through KWin’s ScreenShot2 D-Bus interface. Any action tool accepts a screenshot_after_ms parameter for burst frame capture — you can capture multiple frames after a click to observe UI transitions without extra round-trips.
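A minimal sketch of the burst-timing idea, assuming `screenshot_after_ms` accepts a list of millisecond offsets (the real parameter shape may differ):

```python
import time

def burst_capture(capture_fn, screenshot_after_ms):
    """Capture one frame per millisecond offset after an action, in order.

    `capture_fn` stands in for a ScreenShot2-backed screenshot call; the
    parameter shape is an assumption based on the post.
    """
    start = time.monotonic()
    frames = []
    for offset in sorted(screenshot_after_ms):
        # Sleep only the time remaining until this offset.
        remaining = offset / 1000 - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
        frames.append(capture_fn())
    return frames

# e.g. click(...), then burst_capture(take_screenshot, [50, 150, 400])
```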

Installation

pip install kwin-mcp
# or
uv tool install kwin-mcp

Requirements: KDE Plasma 6+, Python 3.12+, at-spi2-core

Limitations

  • KDE Plasma 6+ only — relies on KWin-specific D-Bus interfaces
  • US QWERTY layout for direct keyboard_type; Unicode input works via wtype
  • AT-SPI2 coverage varies by application

I’d love feedback from anyone who works on KWin internals or has experience with the EIS protocol. Are there plans to make EIS more broadly available across KDE applications?


This is really neat to see. I’ve been working on my own MCP server to do the same thing, letting claude-desktop and claude-code “see and use” my desktop. My initial trial via the xdg-portal was pretty poor at moving the mouse around, and the desktop had to mode-switch between screen capture and movement, so I’m keen to look into your approach more. I’d moved on to other plugin features (this was one of many), but as soon as I free up I want to try incorporating your approach as well.

Another big problem: I normally work on an 11520x2160 desktop spanning three monitors, and it struggled to localize coordinates at that scale when snapshotting each screen to “see” the results of movement attempts. I hadn’t thought to look into D-Bus interaction directly; I was thinking of xdg-portal mostly for the permissions aspect, so that access to a user’s desktop is explicitly “requested” the same way an app like Zoom would request it, to allay some of the concern about your AI watching you as creepily as Microsoft made Recall sound.

Thanks for sharing!


What beautiful timing! Just today I asked Claude to research the state of LLM GUI app control on Linux/KDE, because I want to build a hands-free, voice-based UI for KDE in the near future.

Definitely going to be keeping an eye on this!


Cool to hear someone else is working on the same problem! My original motivation was pretty narrow — I needed sandboxed E2E testing for a KDE Plasma dock I’m developing (krema), so kwin-mcp was designed around isolated virtual sessions from the start. But reading your comment, I’m realizing this could be adapted for real desktop control too with some work, and that’s something I hadn’t seriously considered.

On the multi-monitor front, I’m on a single monitor and the sandbox approach meant I never had to think about coordinate spaces across displays, but it’s a good problem to have on the radar if I go down the real-desktop path.
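For what it’s worth, the per-monitor mapping itself is mechanical once you have each output’s geometry; here’s a sketch with an illustrative three-monitor layout (three 3840x2160 panels side by side, adding up to 11520x2160):

```python
# Map a global desktop coordinate to (monitor, monitor-local coordinate).
# Monitor geometries here are illustrative, not queried from a compositor.
MONITORS = [
    {"name": "DP-1", "x": 0,    "y": 0, "w": 3840, "h": 2160},
    {"name": "DP-2", "x": 3840, "y": 0, "w": 3840, "h": 2160},
    {"name": "DP-3", "x": 7680, "y": 0, "w": 3840, "h": 2160},
]

def localize(gx, gy, monitors=MONITORS):
    """Return (monitor name, local x, local y) for a global desktop point."""
    for m in monitors:
        if m["x"] <= gx < m["x"] + m["w"] and m["y"] <= gy < m["y"] + m["h"]:
            return m["name"], gx - m["x"], gy - m["y"]
    raise ValueError(f"({gx}, {gy}) is outside every monitor")

print(localize(4000, 100))  # → ('DP-2', 160, 100)
```

The hard part is getting accurate per-output geometry in the first place, not the arithmetic.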

One thing I’ve found from actual usage is that LLM vision has real limits for fine details — like when I’m trying to verify a 2-5px shadow offset on my dock, Claude Code just can’t see the difference from a screenshot. I ended up having to zoom in and create image diffs between shadow-on/off states to make it work, and I’m considering adding dedicated visual comparison tools for this kind of thing. Curious how you’re handling the precision problem on your end with such a large display area.
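A sketch of the diff idea, with frames represented as plain nested grayscale lists (a real tool would decode actual screenshots): diff two frames pixel-wise, then report the bounding box of changed pixels so that region can be cropped and zoomed for the model.

```python
# Pixel-diff two frames and return the bounding box of changed pixels.
# Frames are nested lists of grayscale values; real screenshots would need
# image decoding first.
def diff_bbox(a, b, threshold=0):
    """Return (min_x, min_y, max_x, max_y) of pixels differing by > threshold,
    or None if the frames are identical within the threshold."""
    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(a, b)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if abs(pa - pb) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)

before = [[0, 0, 0, 0], [0, 0, 0, 0]]
after  = [[0, 9, 9, 0], [0, 0, 9, 0]]
print(diff_bbox(before, after))  # → (1, 0, 2, 1)
```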

Nice timing indeed! Voice-based UI is an interesting angle — I actually tried something similar with voxtype but hit the CJK wall as a Korean speaker. Turns out Korean input via virtual keyboard requires understanding the xkb structure and IME internal state, which made it way harder than expected.

Under the hood kwin-mcp relies on AT-SPI2 which gives you structured widget data (role, name, actions, bounding box), so that part of the stack might be relevant for mapping voice commands to UI elements. Would be curious to see what you come up with.
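A toy sketch of that mapping, with a made-up widget tree shaped like the AT-SPI2 fields above (role, name, actions, bounding box); the resolver is deliberately naive:

```python
# Map a voice command to a widget using AT-SPI2-style records. The fields
# mirror what AT-SPI2 exposes (role, name, actions, bounding box); the tree
# itself and the matching logic are made up for illustration.
WIDGETS = [
    {"role": "push button", "name": "Save",   "actions": ["click"], "bbox": (10, 10, 80, 30)},
    {"role": "push button", "name": "Cancel", "actions": ["click"], "bbox": (100, 10, 80, 30)},
    {"role": "text",        "name": "Search", "actions": ["focus"], "bbox": (10, 60, 200, 30)},
]

def find_target(command, widgets=WIDGETS):
    """Naive resolver: pick the first widget whose name appears in the command."""
    for w in widgets:
        if w["name"].lower() in command.lower():
            return w
    return None

def click_point(widget):
    """Center of the widget's (x, y, w, h) bounding box, i.e. where to click."""
    x, y, w, h = widget["bbox"]
    return x + w // 2, y + h // 2

target = find_target("click the save button")
print(click_point(target))  # → (50, 25)
```

A real resolver would want fuzzy matching and role filtering, but the structured data makes the problem tractable in a way raw screenshots don’t.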

My original impetus was much like your own: testing the GUI components of my MCP server’s own management workflow (navigating menus, workspaces, dialogs, etc.). But as you said, it wasn’t very good at accurately grabbing or moving things, especially across such a large desktop area. I even added a component to pipe screenshots to my local Ollama instance so a visual LLM model could summarize them for more local processing, but that had its own issues between Ollama and the random cutting-edge vision models.

D-Bus was the first thing I thought of when you mentioned tying into it, since it should also give you geometry from the DE for window placement and such, and maybe that can enhance its ability to determine placement and the accuracy of the mouse for clicking, grabbing, moving, etc. Of course, there’s also the whole weirdness with Wayland and the absolute-positioning mess that no one can figure out or agree on how to fix…

The reason I thought to use xdg-portal was mostly permissions: “ProgramXYZ wants to control your desktop” prompts for explicit human acceptance (the robot wants to take over your desktop, you sure?). But also because, in theory, it would allow a multi-platform approach that uses the same screen-sharing-style capture and control everywhere, avoiding having to debug windoze and mac specifically as outliers. While I am Linux-focused, I don’t want to thumb my nose at the poor bastards using windoze or mac either.

I was thinking to look into xdg more to see if there’s any better way than analog human methods (nudging the mouse until it’s inside the box, viewing while moving), but I don’t think there is. Ultimately I back-shelved this to revisit later; that said, doing things the old-fashioned way without a dedicated QA team has significantly slowed my testing of the UI elements for my MCP application. Your post gave me some new ideas, so as soon as I’m done with other core bits I plan to revisit this with renewed vigor.