Project Milestone 06: Methods Outline

Complete Methods Outline

Our project focuses on building an accessible web application for low-vision users by integrating Optical Character Recognition (OCR) and AI-powered image captioning technologies. We will use Tesseract OCR to extract text from images and PDFs, converting this information into spoken content through text-to-speech features. Additionally, we will leverage pre-trained models like BLIP and Donut from Hugging Face to generate descriptive captions for images and layouts within documents. These captions will enhance the user experience by providing context and descriptions otherwise inaccessible to visually impaired individuals. For development, we will use web frameworks such as Gradio and Streamlit, allowing users to easily upload documents, receive audio outputs, and navigate structured document content. The Tesseract OCR Training Dataset and the large SynthText dataset will be used to fine-tune and validate our models, ensuring accurate extraction and robust performance across diverse formats. Post-processing techniques will clean extracted text and refine captions for coherence and accuracy. We will iteratively test the web application, incorporating feedback from peers to improve accessibility, layout, and functionality. Performance optimization and cloud-based solutions will be explored to mitigate any delays in processing. Our ultimate goal is to create a seamless and immersive content consumption experience for low-vision users.

Software and Implementation Plan

Optical Character Recognition (OCR) for Text Extraction: To extract text from images and PDFs, Tesseract OCR will be used due to its open-source nature and high accuracy in recognizing characters from various languages and fonts. Tesseract allows seamless conversion of scanned documents and image-based text into readable digital content. This feature is particularly crucial for low-vision users who rely on screen readers and text-to-speech applications to access written information.
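
As a minimal sketch of this step, the snippet below shows how text could be pulled from an uploaded image or PDF. The pytesseract wrapper and the pdf2image library are assumptions for illustration (the plan only fixes Tesseract itself); because Tesseract operates on images, PDF pages are rasterized first.

    import pytesseract
    from PIL import Image
    from pdf2image import convert_from_path  # rasterizes PDF pages; requires poppler on the system

    def extract_text_from_image(image_path: str, lang: str = "eng") -> str:
        # Run Tesseract on a single image and return the recognized text.
        return pytesseract.image_to_string(Image.open(image_path), lang=lang)

    def extract_text_from_pdf(pdf_path: str, lang: str = "eng") -> str:
        # Convert each PDF page to an image, then OCR the pages one by one.
        pages = convert_from_path(pdf_path, dpi=300)
        return "\n\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)

Either helper returns plain text that can be handed directly to the post-processing and text-to-speech steps.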

AI-Powered Image Captioning for Better Understanding: To enhance image comprehension, pre-trained deep learning models from Hugging Face, such as BLIP or Donut, will be employed to generate descriptive captions. These models use advanced machine learning techniques to analyze visual content and provide meaningful descriptions of images.
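
A minimal captioning sketch using a pre-trained BLIP checkpoint from Hugging Face is shown below; the specific checkpoint name (Salesforce/blip-image-captioning-base) and the generation settings are illustrative choices, not final decisions.

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def caption_image(image_path: str) -> str:
        # Encode the image, generate caption tokens, and decode them back to text.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(output_ids[0], skip_special_tokens=True)

Donut would be loaded through the same from_pretrained pattern but targets structured documents rather than natural images, which is why both models remain under consideration.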

Web Application Development: The web application will be built using an interactive framework that allows users to upload images and PDFs for processing. The two main candidates are Gradio and Streamlit. Gradio is particularly suited to wrapping machine learning models in an intuitive user interface, while Streamlit provides a more structured layout that fits applications requiring detailed document processing and navigation of extracted information.
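
For example, a first Gradio prototype could wire the OCR helper above to a simple upload-and-read interface; the component choices below are a sketch, not the final layout.

    import gradio as gr
    import pytesseract

    def process_upload(image):
        # OCR the uploaded page; captioning and text-to-speech would be added alongside this.
        return pytesseract.image_to_string(image)

    demo = gr.Interface(
        fn=process_upload,
        inputs=gr.Image(type="pil", label="Upload a scanned page or image"),
        outputs=gr.Textbox(label="Extracted text"),
        title="Accessible Document Reader (prototype)",
    )

    if __name__ == "__main__":
        demo.launch()

A Streamlit version would expose the same helpers through st.file_uploader and page-level layout instead of a single Interface object.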

Dataset to Use

For our project, we will be using the Tesseract OCR Training Dataset, a hand-labeled dataset designed for fine-tuning Tesseract's OCR capabilities; it includes comprehensive text samples and custom scripts that streamline retraining. We will also consider SynthText, which provides synthetic yet realistic text overlaid on varied backgrounds, for further fine-tuning and for stress-testing Tesseract's robustness on diverse document layouts.

SynthText.zip (size: 41 GB) contains the synthetic images together with their ground-truth annotations in gt.mat (MATLAB format). The file holds four cell arrays, each of size 1x858,750 (one entry per image): imnames (image file names), wordBB (word-level bounding boxes), charBB (character-level bounding boxes), and txt (the text strings rendered in each image).
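
A minimal sketch of inspecting these annotations with scipy is shown below; it assumes gt.mat has already been extracted from SynthText.zip into the working directory, and note that the file is large and is loaded fully into memory.

    import scipy.io

    gt = scipy.io.loadmat("gt.mat")
    imnames = gt["imnames"][0]  # image file names
    wordBB = gt["wordBB"][0]    # word-level bounding boxes per image
    charBB = gt["charBB"][0]    # character-level bounding boxes per image
    txt = gt["txt"][0]          # text strings rendered in each image

    print(len(imnames), "annotated images")
    print("first image:", imnames[0][0])
    print("its text:", [s.strip() for s in txt[0]])

The word- and character-level boxes give us ready-made ground truth for checking how well Tesseract localizes and reads text on cluttered backgrounds.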

Tools Used for Analysis

To process and analyze user-provided documents, our project will utilize a combination of OCR (Optical Character Recognition) and AI-powered image captioning tools. These tools will work together to provide a complete document accessibility solution for low-vision users.
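
As a rough sketch of how the pieces connect, the snippet below chains OCR output into spoken audio. The milestone does not fix a text-to-speech library, so gTTS is assumed here purely for illustration.

    import pytesseract
    from PIL import Image
    from gtts import gTTS

    def document_to_audio(image_path: str, out_path: str = "readout.mp3") -> str:
        # OCR the document, then save the recognized text as spoken audio.
        text = pytesseract.image_to_string(Image.open(image_path))
        if not text.strip():
            text = "No readable text was found in this document."
        gTTS(text=text, lang="en").save(out_path)
        return out_path

In the full application, the BLIP captions generated for embedded figures would be appended to the recognized text before the audio is produced.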

Possible Pitfalls:

While our approach integrates well-established tools, there are several potential challenges we anticipate: