DPCUC
Featured Software
September 2009

Submitted by
Dan Delong


TopOCR – free & fast recognition to text, pdf or html

Optical Character Reading applications are still problematic. Moreover, the original programming for most big name OCRs was likely written over a decade ago. Some scanners and multi-function printer/copiers come with “crippled” OCR software that accepts only paper documents, and not image files. In addition, really good OCR software, like OmniPage, costs hundreds of dollars, and even it can’t equal some of the features in TopOCR.

TopOCR is a new, free Optical Character Reader (OCR), small enough for installation on portable devices (PDAs and cell phones). Its small size does not diminish its usefulness on personal computers either. Because TopOCR is so accurate and speedy, it may end up being the first choice for many users. Its particular talent is the ability to extract just the text from digital images, despite the presence of mixed content (pictures, captions and columns). And it does not matter whether the image files were made with a dedicated digital camera, a scanner, or a cell phone camera. Just imagine snapping photos of printed material with your small, hand-held, wireless device, perhaps an iPhone. And, let’s say that device holds TopOCR software. Then, just like Secret Agent 007, you can send extracted messages immediately, in quick bursts of encrypted text. Average citizens (non-spies) might like this feature too. The text contained in the photographed images (and, with some effort, selected parts of the image itself) can be emailed off instantly, right from the camera or cell phone. Or, run the software on any Windows computer.

TopOCR’s interface presents the user with two docked windows. One pane (left) is for the image to be OCR’d, and it comes with lots of image editing features. The other pane (right) is for the OCR’d text results, and it comes with lots of word processing features. On first try, the program recognizes the text in an image, whether in columns or as photo captions, keeping our left to right reading convention intact, while excluding any images present in the scan. Left in its default setting, the extracted text automatically appears in the right pane.

Again, the left pane (below) contains the loaded image. Every time the image is edited (sizing, contrast/brightness, dewarp, binarize, …), a new window appears in this pane showing the result. The user is then able to revert to earlier edits by clicking through these windows. Users might also wish to edit some images in this left pane in order to improve problematic OCR results.
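For readers curious about what an edit such as binarize does: it reduces every pixel to pure black or pure white, which can help printed characters stand out from a grey or unevenly lit page. The short Python sketch below shows the general idea using a fixed brightness threshold. It is not TopOCR's own code; the `binarize` function, the threshold of 128 and the tiny sample "page" are all made up for illustration.

```python
def binarize(gray, threshold=128):
    """Return a black/white copy of a grayscale image.

    gray is a list of rows, each row a list of brightness values (0-255).
    Pixels at or above the threshold become 255 (white); the rest become 0 (black).
    The default threshold of 128 is an arbitrary assumption, not a TopOCR setting.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# A tiny hypothetical 3x4 "scan": dark text strokes on a greyish page.
page = [
    [200, 190, 40, 210],
    [195, 35, 30, 205],
    [180, 185, 45, 220],
]

bw = binarize(page)
for row in bw:
    # Show black pixels as '#' and white pixels as '.'
    print("".join("#" if px == 0 else "." for px in row))
```

A real image editor would likely choose the threshold from the image itself rather than use a fixed value, but the effect on the page is the same kind of clean-up.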

The right pane (below) initially shows only the text contained in the image on the left side. The embedded text editor in the right pane allows for further formatting, like making frames and tables to contain images and text. In this case the portraits of the men on the left were highlighted and cut/pasted into some table cells on the right. Columns for text below each man’s portrait were also made this way, by adding first a row to the table (just like a WYSIWYG web editor) and filling it with cut/paste text.

[Figure: TopOCR operational windows, with the left pane and the right pane labelled]

Both right and left windows may be saved. However, the right pane allows saving in the most useful formats: as pdf, as text (with or without line breaks), as rich text, or as html. The PDF (Portable Document Format) file is completely searchable, as is the html file. Although TopOCR currently imports only bmp, tiff, jpg or gif formats, images that it saves in the html format are actually downgraded to 8-bit PNG files (256 colours). The likely reason for this quality downgrade is to make cell phone emailing much faster, thereby keeping bandwidth costs to a minimum. However, files saved using the pdf format still retain their embedded high resolution images.
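A little arithmetic shows why a 256-colour image is cheaper to send. An 8-bit image needs one byte per pixel, while a full-colour (24-bit) image needs three. The Python sketch below compares the raw, uncompressed pixel data for a hypothetical photo of about 3 megapixels. The image dimensions and the `raw_size_bytes` helper are assumptions for illustration only, and real PNG and JPG files are compressed, so actual file sizes will differ.

```python
def raw_size_bytes(width, height, bits_per_pixel):
    """Size of the uncompressed pixel data, ignoring headers and any palette."""
    return width * height * bits_per_pixel // 8


# Hypothetical photo of roughly 3 megapixels (2048 x 1536 pixels).
width, height = 2048, 1536

true_colour = raw_size_bytes(width, height, 24)  # 3 bytes per pixel
palette = raw_size_bytes(width, height, 8)       # 1 byte per pixel, 256 colours

print("24-bit:", true_colour, "bytes")
print(" 8-bit:", palette, "bytes")
print("ratio :", true_colour // palette)
```

Before any compression, the 8-bit version carries one third of the pixel data, which is consistent with the guess that the downgrade is meant to save bandwidth.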

The program’s web site provides lots of tips for taking better quality cell phone shots of documents, and gives some advice about using the supplied image editor to increase final OCR accuracy. My own cell phone did not have the required camera of 3 megapixels or higher, nor did I test the results in good lighting conditions (as advised). Higher resolution images, taken with a regular digital camera, gave excellent results. And, the processing of these large images was extremely fast.

An added bonus with this program is the linkage of the OCR’d text results to Microsoft’s built-in “Sam” for “text-to-speech” translation. This feature lets you listen to the text directly from within the program, or save the speech as a WAV file. This sound file is playable using any Windows audio player (or may be put on an audio CD to play in the car). Additionally, a button for converting the WAV file to MP3 is offered, but the LAME encoder required did not seem to be installed and linked to the program correctly. However, you may paste, load or type any text into the text editor pane, and listen to “Sam”, the narrator.

Text to speech

The interface also supports dragging and dropping of files.

A word about installation: The installer will make changes to the registry and will place a shortcut on the desktop, along with a listing in All Programs. However, the program itself will also run by clicking on the main executable, found in its Programs folder. If you wish to run TopOCR from a USB key, just copy the TopOCR folder to the key. Afterwards, if you wish, use Add/Remove Programs in Control Panel to remove the original installation from your computer.



System requirements:

Platform: Win 98 or later (512 MB RAM)

Version: 3.1, 2008

Price: free

Language: English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish.

Download Size: 8.09 MB

Installed Size: 16 MB

Rating:

Download Site: here



