OCR for Spectacle

I think it would be a useful new feature to add optical character recognition to Spectacle, as more websites make it impossible to select text on them. Something like the Text Extractor utility from Microsoft PowerToys. Admittedly, I am unsure how hard it would be to implement, but Spectacle seems to be the most fitting option for something like this.

Here is my own OCR script. It uses the Baidu OCR API, with Spectacle taking a regional screenshot. Other APIs should work similarly. Both X11 and Wayland are supported.

#!/usr/bin/env bash
client_id="abcdefg"
client_secret="hijklmn"

TMPFILE=$(mktemp)
TMPFILE2=$(mktemp)
trap 'rm -f "$TMPFILE";rm -f "$TMPFILE2"' EXIT
# -r: select a region, -o: output file, -b: background mode (no GUI), -n: no notification
spectacle -r -o "$TMPFILE" -b -n
# [ "$XDG_SESSION_TYPE" = "x11" ] && xclip -selection clipboard -t image/png -o 2>/dev/null > $TMPFILE
# [ "$XDG_SESSION_TYPE" = "wayland" ] && wl-paste -n -t image/png 2>/dev/null > $TMPFILE
if [ -s "$TMPFILE" ]; then
    base64 -w0 "$TMPFILE" > "$TMPFILE2"
    auth_host="https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=${client_id}&client_secret=${client_secret}"
    json_response=$(curl -s --retry-delay 1 --connect-timeout 10 --max-time 10 "$auth_host")
    token=$(perl -MCpanel::JSON::XS -ne 'print decode_json($_)->{access_token}' <<< "$json_response")
    if [ -n "$token" ]; then
        ocr_host="https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token=${token}"
        result=$(curl -s --retry-delay 1 --connect-timeout 10 --max-time 30 "$ocr_host" --data-urlencode "image@${TMPFILE2}" -H 'Content-Type:application/x-www-form-urlencoded')
        words_result_num=$(perl -MCpanel::JSON::XS -ne 'print decode_json($_)->{words_result_num}' <<< "$result")
        # Default to 0 so an error response without words_result_num does not break the comparison
        [ "${words_result_num:-0}" -gt 0 ] && {
            [ "$XDG_SESSION_TYPE" = "x11" ] && perl -MCpanel::JSON::XS -ne 'my $arrayref=decode_json($_)->{words_result};foreach my $i(@$arrayref){print $i->{words}}' <<< "$result" | xclip -selection clipboard
            [ "$XDG_SESSION_TYPE" = "wayland" ] && perl -MCpanel::JSON::XS -ne 'my $arrayref=decode_json($_)->{words_result};foreach my $i(@$arrayref){print $i->{words}}' <<< "$result" | wl-copy
        }
        kdialog --passivepopup "OCR finished" 7 --title "OCR"
    fi
fi

Why rely on an external/online service when you can safely use Tesseract locally on your private/official documents?

For example ocrmypdf can do a pretty decent job:

Simply install ocrmypdf along with the Tesseract language package you need.

For example, on Arch/Manjaro you can install it from AUR:

yay -S ocrmypdf tesseract-data-eng

Then run:
ocrmypdf -l eng -f --sidecar output.txt input.pdf output.pdf

And finally open your output.pdf in Okular.

NB: For RTL (e.g. Arabic, Hebrew,…) content, you need to open that output.pdf in a Chromium-based browser, because otherwise copy/paste from the OCR text layer reverses the order of the generated characters.

Indeed, any kind of integration that got added to Spectacle would be using Tesseract locally, not an online service.

Of course there are existing OCR solutions, but what I suggested was the integration itself. I don’t want to save the whole screen as a PDF and then run ocrmypdf. I want to capture only, e.g., 2 lines of text and have the text on my clipboard in 1 or 2 clicks. I think this use case is not unique to me, and Spectacle seems to be the application most suited for a feature like this.
If I wanted to use PDFs I could just print whole web pages as PDF; no need for Spectacle for that.
Maybe I explained it poorly in the post, but I mentioned the PowerToys application as an example of seamless integration.

ocrmypdf is just an example that makes it easy for tesseract to import PDF files, because tesseract doesn’t yet support reading PDFs.

tesseract supports png, jpeg/jpg, tiff… as input formats, and can produce OCR content as text, [searchable PDF/A-xa] pdf, hocr…
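Since Tesseract reads PNG directly and can write plain text, the "few lines of text to the clipboard" workflow can be sketched without any PDF step. This is only a rough sketch under assumptions: spectacle, tesseract (with the chosen language data) and wl-copy or xclip are installed, and `OCR_LANG`, `clipboard_cmd` and `ocr_region_to_clipboard` are hypothetical names made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: region screenshot -> local Tesseract -> clipboard.
# OCR_LANG is a hypothetical variable; it defaults to English.

# Print the clipboard command matching the current session type.
clipboard_cmd() {
    case "${XDG_SESSION_TYPE:-}" in
        wayland) echo "wl-copy" ;;
        *)       echo "xclip -selection clipboard" ;;
    esac
}

ocr_region_to_clipboard() {
    local lang="${OCR_LANG:-eng}"
    local img
    img=$(mktemp --suffix=.png) || return 1
    # -r: region, -b: background mode, -n: no notification, -o: output file
    spectacle -r -b -n -o "$img"
    if [ -s "$img" ]; then
        # Using "stdout" as the output base makes tesseract print the text.
        # $(clipboard_cmd) is left unquoted on purpose so it splits into words.
        tesseract "$img" stdout -l "$lang" 2>/dev/null | $(clipboard_cmd)
    fi
    rm -f "$img"
}
```

Calling `ocr_region_to_clipboard` from a script bound to a global shortcut would get close to the one-click behaviour, though recognition quality still depends on the installed language data.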

Local OCR looks like a toy when it meets Chinese characters, so that’s why a reliable online service is needed.

If that is to be implemented, both “a reliable online” and a local service should be available, I think.
Opening up a large privacy gap for all user groups for the gain of one, albeit large, is not a good idea.
Also, there should be a fat warning before screenshot contents leave the local machine.

The thought of a third party looking at my documents is just frightening. Not knowing who they are is even more frightening.
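The warning idea could look roughly like the following in a script-based setup. It is a sketch, not how Spectacle would do it: `OCR_BACKEND` is a hypothetical setting ("local" or "online"), the function names are made up, and the wording of the dialog is only a placeholder. kdialog's `--warningcontinuecancel` returns a non-zero status when the user cancels.

```shell
#!/usr/bin/env bash
# Sketch: ask for confirmation before a screenshot leaves the machine.
# OCR_BACKEND is a hypothetical setting; "local" never needs confirmation.

needs_confirmation() {
    [ "${1:-local}" != "local" ]
}

confirm_upload() {
    kdialog --title "OCR" --warningcontinuecancel \
        "The screenshot will be sent to an online OCR service. Continue?"
}

# Returns 0 if OCR may proceed, 1 if the user declined the upload.
guard_upload() {
    local backend="${1:-local}"
    if needs_confirmation "$backend" && ! confirm_upload; then
        echo "Upload cancelled, nothing was sent." >&2
        return 1
    fi
    return 0
}
```

A script would then run `guard_upload "$OCR_BACKEND" || exit 1` before any curl call, so the local path stays silent and only the online path prompts.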

Local OCR looks like a toy when it meets Chinese characters

I don’t know Chinese; maybe the provided trained model needs additional training for more fonts.

I have used Tesseract heavily on many long scanned documents written in Arabic and French plus English, and the results are quite good, especially with the ability to add an OCR text layer to the final PDF.

Hi medin.

Thanks for the info about Tesseract. Works great.

I plan to make a service menu out of it, like I do with everything.

Vektor

We can have multiple OCR backends, just like we have multiple Phonon backends.

Maybe we can have a KAI daemon, with multiple backends for OCR, voice recognition, etc.
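To make the multiple-backend idea concrete, here is a small dispatch sketch at script level. Everything in it is an assumption for illustration: the `OCR_BACKEND` and `OCR_LANG` variables, the function names, and the stubbed online backend; it does not reflect any existing KDE API.

```shell
#!/usr/bin/env bash
# Sketch: pick an OCR backend by name, similar in spirit to pluggable backends.
# All names here are hypothetical.

# Local backend: print recognized text from an image file.
ocr_tesseract() {
    tesseract "$1" stdout -l "${OCR_LANG:-eng}" 2>/dev/null
}

# Online backend: left as a stub in this sketch.
ocr_online() {
    echo "online backend not configured" >&2
    return 1
}

# Map a backend name to the function implementing it.
ocr_backend_function() {
    case "$1" in
        tesseract|online) echo "ocr_$1" ;;
        *) return 1 ;;
    esac
}

# Run the selected backend on an image; returns 2 for an unknown backend.
run_ocr() {
    local fn
    fn=$(ocr_backend_function "${OCR_BACKEND:-tesseract}") || {
        echo "unknown OCR backend: ${OCR_BACKEND}" >&2
        return 2
    }
    "$fn" "$1"
}
```

With this shape, adding another engine means adding one function and one name to the `case` list, while callers keep using `run_ocr image.png`.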

Agreed, OCR should be an integral part of screen capturing. Even my phone offers to extract text, when it recognizes some, instead of just taking a picture.

so far it takes two steps to get text out of a screenshot for me with: