Automatically OCR Documents with Hazel and PDFpen

Optical Character Recognition (OCR) is a magical thing. Normally, when you scan something, it is stored simply as a flat image. OCR is the process of converting scanned images of typed or printed (and sometimes handwritten) text into machine-readable text. Once a scan has been OCR’ed (OCR’ed is a verb, right?), it becomes searchable, which is great for identifying and sorting documents. You can also go a step further and use automation tools like Hazel to perform tasks based on the text inside a document.

Once you start combining automation with OCR, you can see why it’s important to me that the documents on my computer are OCR’ed. Documents I scan myself are typically OCR’ed by my scanner: my ScanSnap iX500 comes with software that performs the OCR during the scanning process. There are also portable scanning apps, like PDFpen Scan+ for iOS, that will perform OCR. For documents that I download or receive from others, however, I typically use PDFpen to OCR the document for me.


I’ve set up a Hazel rule to monitor my Downloads folder and use PDFpen to automatically OCR any PDF. Once a PDF is readable by the computer, there are any number of actions I can take to sort it, and I’ve set up another dozen or so Hazel rules to do just that (listen to MPU episode 79 on Hazel for more information). But before Hazel knows what to do with a document, it must know what that document says, so the OCR is a critical first step. The rule looks for any PDF in my Downloads folder that has not been tagged as already OCR’ed. For each match, an AppleScript launches PDFpen to OCR the document and quits PDFpen when it finishes. The file is then tagged as OCR’ed so Hazel doesn’t keep repeating the same action on it. The AppleScript is a modified version of MacSparky’s PDFpen OCR Folder Action Script.
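The matching itself is configured in Hazel's rule editor, not written in code, but the logic is simple enough to spell out. As a rough sketch only (the handler name needsOCR and the tag name "OCRed" are placeholders of mine, not anything Hazel defines), the rule's conditions amount to something like this:

```applescript
-- Hypothetical helper: models the Hazel rule's conditions, not Hazel's API.
-- fileName is the file's name; tagList is a list of the file's tag names.
on needsOCR(fileName, tagList)
	-- Match only PDFs that do not yet carry the "already OCR'ed" tag.
	return (fileName ends with ".pdf") and (tagList does not contain "OCRed")
end needsOCR

needsOCR("invoice.pdf", {}) -- true: a PDF with no tags yet
needsOCR("invoice.pdf", {"OCRed"}) -- false: already processed
needsOCR("photo.png", {}) -- false: not a PDF
```

Adding the tag after the OCR step is what makes the second condition false on the next pass, so each file is processed only once.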

Here’s the script (note that if you use PDFpen instead of PDFpen Pro, you’ll have to change the application name in the script accordingly):

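-- Hazel passes the matched file to an embedded script as theFile.
tell application "PDFpenPro 6"
    open theFile as alias
    tell document 1
        ocr
        -- Wait until PDFpen reports that the OCR pass has finished.
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
    -- Quit PDFpen once the document has been saved and closed.
    quit
end tell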