Project Milestone 06: Methods Outline

Complete Methods Outline

Our project focuses on building an accessible web application for low-vision users by integrating Optical Character Recognition (OCR) and AI-powered image captioning technologies. We will use Tesseract OCR to extract text from images and PDFs, converting this information into spoken content through text-to-speech features. Additionally, we will leverage pre-trained models like BLIP and Donut from Hugging Face to generate descriptive captions for images and layouts within documents. These captions will enhance the user experience by providing context and descriptions otherwise inaccessible to visually impaired individuals. For development, we will use web frameworks such as Gradio and Streamlit, allowing users to easily upload documents, receive audio outputs, and navigate structured document content. The Tesseract OCR Training Dataset and the large SynthText dataset will be used to fine-tune and validate our models, ensuring accurate extraction and robust performance across diverse formats. Post-processing techniques will clean extracted text and refine captions for coherence and accuracy. We will iteratively test the web application, incorporating feedback from peers to improve accessibility, layout, and functionality. Performance optimization and cloud-based solutions will be explored to mitigate any delays in processing. Our ultimate goal is to create a seamless and immersive content consumption experience for low-vision users.

Software and Implementation Plan

Optical Character Recognition (OCR) for Text Extraction: To extract text from images and PDFs, Tesseract OCR will be used due to its open-source nature and high accuracy in recognizing characters from various languages and fonts. Tesseract allows seamless conversion of scanned documents and image-based text into readable digital content. This feature is particularly crucial for low-vision users who rely on screen readers and text-to-speech applications to access written information.
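
As a minimal sketch of this step, the snippet below shows how text could be pulled from an uploaded image or PDF. The pytesseract wrapper and the pdf2image library are assumptions for illustration (the plan only fixes Tesseract itself); because Tesseract operates on images, PDF pages are rasterized first.

    import pytesseract
    from PIL import Image
    from pdf2image import convert_from_path  # rasterizes PDF pages; requires poppler on the system

    def extract_text_from_image(image_path: str, lang: str = "eng") -> str:
        # Run Tesseract on a single image and return the recognized text.
        return pytesseract.image_to_string(Image.open(image_path), lang=lang)

    def extract_text_from_pdf(pdf_path: str, lang: str = "eng") -> str:
        # Convert each PDF page to an image, then OCR the pages one by one.
        pages = convert_from_path(pdf_path, dpi=300)
        return "\n\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)

Either helper returns plain text that can be handed directly to the post-processing and text-to-speech steps.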

AI-Powered Image Captioning for Better Understanding: To enhance image comprehension, pre-trained deep learning models from Hugging Face, such as BLIP or Donut, will be employed to generate descriptive captions. These models use advanced machine learning techniques to analyze visual content and provide meaningful descriptions of images.
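
A minimal captioning sketch using a pre-trained BLIP checkpoint from Hugging Face is shown below; the specific checkpoint name (Salesforce/blip-image-captioning-base) and the generation settings are illustrative choices, not final decisions.

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def caption_image(image_path: str) -> str:
        # Encode the image, generate caption tokens, and decode them back to text.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(output_ids[0], skip_special_tokens=True)

Donut would be loaded through the same from_pretrained pattern but targets structured documents rather than natural images, which is why both models remain under consideration.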

Web Application Development: The web application will be built using an interactive framework that allows users to upload images and PDFs for processing. The two main candidates are Gradio and Streamlit. Gradio is particularly suited to wrapping machine learning models in an intuitive user interface, while Streamlit provides a more structured layout that fits applications requiring detailed document processing and navigation of extracted information.
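
For example, a first Gradio prototype could wire the OCR helper above to a simple upload-and-read interface; the component choices below are a sketch, not the final layout.

    import gradio as gr
    import pytesseract

    def process_upload(image):
        # OCR the uploaded page; captioning and text-to-speech would be added alongside this.
        return pytesseract.image_to_string(image)

    demo = gr.Interface(
        fn=process_upload,
        inputs=gr.Image(type="pil", label="Upload a scanned page or image"),
        outputs=gr.Textbox(label="Extracted text"),
        title="Accessible Document Reader (prototype)",
    )

    if __name__ == "__main__":
        demo.launch()

A Streamlit version would expose the same helpers through st.file_uploader and page-level layout instead of a single Interface object.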

Dataset to Use

For our project, we will be using the Tesseract OCR Training Dataset, a hand-labeled dataset designed for fine-tuning Tesseract's OCR capabilities; it includes comprehensive text samples and custom scripts that streamline retraining. We will also consider SynthText, which provides synthetic yet realistic text overlaid on varied backgrounds, for further fine-tuning and for stress-testing Tesseract's robustness on diverse document layouts.

SynthText.zip (size: 41 GB) contains the synthetic images together with their ground-truth annotations in gt.mat (MATLAB format). The file holds four cell arrays, each of size 1x858,750 (one entry per image): imnames (image file names), wordBB (word-level bounding boxes), charBB (character-level bounding boxes), and txt (the text strings rendered in each image).
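
A minimal sketch of inspecting these annotations with scipy is shown below; it assumes gt.mat has already been extracted from SynthText.zip into the working directory, and note that the file is large and is loaded fully into memory.

    import scipy.io

    gt = scipy.io.loadmat("gt.mat")
    imnames = gt["imnames"][0]  # image file names
    wordBB = gt["wordBB"][0]    # word-level bounding boxes per image
    charBB = gt["charBB"][0]    # character-level bounding boxes per image
    txt = gt["txt"][0]          # text strings rendered in each image

    print(len(imnames), "annotated images")
    print("first image:", imnames[0][0])
    print("its text:", [s.strip() for s in txt[0]])

The word- and character-level boxes give us ready-made ground truth for checking how well Tesseract localizes and reads text on cluttered backgrounds.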

Tools Used for Analysis

To process and analyze user-provided documents, our project will utilize a combination of OCR (Optical Character Recognition) and AI-powered image captioning tools. These tools will work together to provide a complete document accessibility solution for low-vision users.
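
As a rough sketch of how the pieces connect, the snippet below chains OCR output into spoken audio. The milestone does not fix a text-to-speech library, so gTTS is assumed here purely for illustration.

    import pytesseract
    from PIL import Image
    from gtts import gTTS

    def document_to_audio(image_path: str, out_path: str = "readout.mp3") -> str:
        # OCR the document, then save the recognized text as spoken audio.
        text = pytesseract.image_to_string(Image.open(image_path))
        if not text.strip():
            text = "No readable text was found in this document."
        gTTS(text=text, lang="en").save(out_path)
        return out_path

In the full application, the BLIP captions generated for embedded figures would be appended to the recognized text before the audio is produced.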

Possible Pitfalls:

While our approach integrates well-established tools, there are several potential challenges we anticipate: