OCR Using Microsoft Document Imaging APIs in C#.NET


OCR: Optical Character Recognition

OCR means extracting the text content from images, i.e. getting the text out of an image file, a scanned PDF, a scanned Word document, or a similar type of document.

Steps to implement OCR using Microsoft Document Imaging 

  1. To use the Microsoft Office Document Imaging (MODI) API, install either Microsoft Office 2007 or SharePoint Designer 2007. Of the two, SharePoint Designer 2007 is free.
  2. Create a console application using Visual Studio.
  3. Add a reference to the Microsoft Office Document Imaging 12.0 Type Library, which is a COM object.
  4. Write the OCR code.
  5. Execute the application. It will prompt for the image file path.
  6. The actual image used as input.
  7. Type the image file path and press Enter.
  8. The text extracted from the image is shown as output.
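
The code for step 4 was originally shown as a screenshot, so the listing below is not the original code. It is a minimal sketch of what such a console application might look like, assuming the COM reference from step 3 has been added and exposes the MODI namespace. The namespace and class names (ModiOcrSketch, Program) and the prompt text are made up for illustration, and English is assumed as the recognition language.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

namespace ModiOcrSketch // hypothetical name, not from the original post
{
    class Program
    {
        static void Main()
        {
            // Step 5: prompt for the image file path.
            Console.Write("Enter the image file path: ");
            string path = (Console.ReadLine() ?? string.Empty).Trim().Trim('"');

            if (!File.Exists(path))
            {
                Console.WriteLine("File not found: " + path);
                return;
            }

            // MODI.Document comes from the COM reference added in step 3.
            MODI.Document document = new MODI.Document();
            try
            {
                // Load the image and run recognition on it.
                // Assumption: the text is English; the two flags ask MODI
                // to auto-orient and straighten the image first.
                document.Create(path);
                document.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

                // Step 8: print the recognized text of the first page.
                MODI.Image image = (MODI.Image)document.Images[0];
                Console.WriteLine(image.Layout.Text);

                // Close without saving changes back to the image file.
                document.Close(false);
            }
            catch (COMException ex)
            {
                // MODI reports failures (for example, no text found) as COM errors.
                Console.WriteLine("OCR failed: " + ex.Message);
            }
            finally
            {
                // Release the underlying COM object.
                Marshal.ReleaseComObject(document);
            }
        }
    }
}
```

Only the first page is read here. For a multi-page TIFF, the same Layout.Text property could be read for each entry in document.Images.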

Note: The text extracted from an image might not be accurate. If a character in the image is blurred, MODI cannot recognize it exactly and substitutes some other character. The accuracy of the result therefore depends heavily on the quality of the image.
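
One way to see where recognition was uncertain is to look at the confidence value MODI attaches to each recognized word. The sketch below is an illustration under assumptions, not part of the original walkthrough: it takes the image path as a command-line argument, and the class name and the threshold of 800 are arbitrary choices for the example (MODI documents the confidence scale as running from 0 to 999).

```csharp
using System;
using System.Runtime.InteropServices;

class LowConfidenceWords // hypothetical name, for illustration only
{
    // Assumption: an arbitrary cut-off chosen for this example.
    const int ConfidenceThreshold = 800;

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: LowConfidenceWords <image file path>");
            return;
        }

        MODI.Document document = new MODI.Document();
        try
        {
            document.Create(args[0]);
            document.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            MODI.Image image = (MODI.Image)document.Images[0];

            // List every word on the first page that MODI was unsure about.
            foreach (MODI.Word word in image.Layout.Words)
            {
                if (word.RecognitionConfidence < ConfidenceThreshold)
                {
                    Console.WriteLine("{0} (confidence {1})",
                        word.Text, word.RecognitionConfidence);
                }
            }

            document.Close(false);
        }
        catch (COMException ex)
        {
            Console.WriteLine("OCR failed: " + ex.Message);
        }
        finally
        {
            Marshal.ReleaseComObject(document);
        }
    }
}
```

Words flagged this way are good candidates for manual review, or a sign that the image should be rescanned at a higher quality.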