AI-Powered Document Reader for Low-Vision Users

Team Members

Jahazel Sanchez, Verrels Lukman Eugeneo, Asya Lyubavina, Jerry Onyango

Abstract

This project presents an AI-powered document reader designed to improve access to visual information for blind and low-vision (BLV) users. By combining Tesseract OCR with customizable text-to-speech output using Google Text-to-Speech (gTTS), our system extracts text from images and PDFs and delivers the content audibly through a lightweight web interface. The tool is built with accessibility in mind, emphasizing screen-reader compatibility, intuitive controls, and real-time feedback. Through informal testing on a range of document types, we find that traditional OCR alone can provide meaningful access when paired with thoughtful interface design and flexible voice output. This work demonstrates how readily available technologies can be combined to create more inclusive tools for document navigation and comprehension.

Introduction

Many existing tools for blind and low-vision (BLV) users rely heavily on static image captions or alt text, which often fall short in conveying layout, structure, or nuanced visual detail—especially in documents that contain complex formatting or embedded images. Low vision refers to a visual acuity of less than 6/18 but equal to or better than 3/60 in the better eye with best possible correction [8]. This classification underscores the challenges faced by individuals who, despite some residual vision, encounter significant difficulties in processing visual information. As Nair et al. (2023) note, "little work has been done to explore the specific interaction bottlenecks faced by BLV users in the image exploration process and propose potential solutions" beyond basic descriptions. To address this gap, we developed an AI-powered document reader that combines optical character recognition (OCR) with customizable text-to-speech features. Our system uses Tesseract OCR to extract textual content from uploaded images and PDFs, then converts the output into spoken audio using Google Text-to-Speech (gTTS).

We built an accessible web interface using Gradio, prioritizing screen-reader compatibility, intuitive navigation, and user customization. Unlike many existing academic examples that emphasize model performance, our system focuses on real-world usability—particularly for users who rely on auditory output or have limited vision.

Figure 1: Screenshot of the web interface.

While it does not aim to solve every accessibility challenge, it demonstrates how thoughtful integration of existing technologies can empower users to independently navigate visual information. Ultimately, we hope this work serves as a foundation for further development of adaptive, inclusive systems.

Ethical Sweep

General Considerations

At a high level, our work aims to improve accessibility for low-vision users by creating an AI-powered document reader that combines OCR and text-to-speech with an intuitive, inclusive interface. This tool has the potential to meaningfully enhance document comprehension, especially for materials that include complex layouts or images, which traditional OCR systems often struggle to interpret. However, the benefits hinge on careful and ethical implementation, especially regarding fairness, accuracy, and user privacy.

While non-ML alternatives (e.g., basic OCR piped directly into text-to-speech) exist, they typically fall short when interpreting image-heavy or structurally complex documents. Our approach pairs the same OCR foundation with customizable audio output and an accessible interface, which is better suited to delivering content in context.

To protect user privacy, all files are processed locally and are not uploaded to external cloud servers. While uploaded files are temporarily stored during use, future improvements could include automatic deletion after processing to strengthen privacy protections. We also made design choices that keep the system lightweight and customizable, reducing dependency on closed or proprietary platforms that might introduce bias or limit user control.

Data Curation and Use

Our tool relies primarily on the Tesseract OCR engine [1], which ships with pre-trained language data ("traineddata" files) for general-purpose text recognition across a wide variety of languages and layouts. In addition, we reference the SynthText dataset, which represents a small, synthetic fraction of the broader training material. Together, these sources offer a strong foundation for printed text extraction across different document types. However, they also have limitations when it comes to less conventional documents—such as handwritten notes, tactile documents like Braille, or highly cluttered layouts—which could reduce the effectiveness of our system in certain contexts.

Since our project focuses solely on OCR-based extraction (rather than full document layout understanding or image captioning), our main sources of potential bias relate to character recognition and language coverage. A key concern is that the standard Tesseract-trained models and datasets like SynthText are predominantly focused on English and Latin-based scripts, meaning users working with other languages or writing systems may experience reduced accuracy. Additionally, because SynthText is synthetically generated, it may not fully reflect the variability, noise, and imperfections found in real-world scanned documents, forms, or receipts.

These limitations highlight the importance of validating our tool with a wider range of real-world documents and expanding language and formatting support in future iterations.

Impact Assessment

Our main ethical focus is on how the tool might affect users if it were actually deployed in real-world settings. One consideration is user reliance: even though we’re designing this to support low-vision users, we wouldn’t want the system to unintentionally replace or diminish their existing strategies for navigating content. To address this, we discussed ideas like a “hints” or “progressive reveal” mode that would allow users to control the level of assistance they receive. However, due to time and project scope, we prioritized building a functional, reliable baseline tool first.

Another potential issue is how the tool would perform with documents involving high-stakes information—such as medical instructions, financial records, or legal documents—which require extremely high accuracy. We are not aiming for full reliability in these contexts at this stage. In future versions, we could imagine adding features like automatic confidence scoring, accuracy warnings, or prompts to verify critical information with a sighted assistant.
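As a rough illustration of how such confidence scoring might work, the sketch below uses Tesseract's per-word confidence values exposed through pytesseract's image_to_data; the helper name and the 60-point threshold are hypothetical choices, not part of the current system.

```python
# Hypothetical sketch: flag low-confidence words from Tesseract so the interface
# could warn users before they rely on the extracted text. Threshold is illustrative.
from PIL import Image
import pytesseract

def low_confidence_words(image_path: str, threshold: float = 60.0):
    """Return (word, confidence) pairs whose Tesseract confidence is below threshold."""
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # confidence is -1 for non-word boxes
        if word.strip() and 0 <= score < threshold:
            flagged.append((word, score))
    return flagged
```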

Finally, we recognize that the current system may not perform equally well across all document types—especially handwritten notes, complex diagrams, or non-English content. More testing and direct feedback from real-world users would be essential to understand these gaps. Long-term, it will be important to track how improvements impact different user groups to avoid unintentionally favoring certain types of documents or users over others.

Related Work

Several previous efforts have informed and shaped our project, yet each leaves gaps that our solution targets. For example, Smith's ImageAssist (2021) [4] enabled visually impaired users to explore document images through a screen reader-accessible interface with gesture-based region highlighting. However, it focused primarily on isolated, exploratory interactions rather than integrated interpretation of mixed textual-visual content. Our application advances beyond this by providing cohesive, contextually integrated descriptions for entire documents, filling an accessibility gap left by ImageAssist.

Vijayanarayanan et al. (2023) [5] combined optical character recognition (OCR) with text-to-speech to help users extract printed text from digital images. While their approach highlighted OCR's strengths in parsing clear, structured text, it overlooked the interpretation of complex visual elements and irregular document layouts. Our project addresses this limitation by incorporating HuggingFace transformer models [2] for image captioning and layout comprehension, so that textual and visual elements can both be included in the document's accessible description.

Bodi et al. (2021) [6] developed tools like NarrationBot and InfoBot to generate real-time contextual audio descriptions for dynamic media (e.g., video), helping blind users understand changes in visual scenes. However, their work remains focused on dynamic environments, leaving static, document-based interactions underexplored. Our solution adapts those principles of multimodal narration to static content, specifically targeting documents that combine text and embedded imagery.

Collectively, our project synthesizes these foundational insights while addressing their individual limitations, aiming to deliver a comprehensive, robust accessibility tool tailored to the needs of low-vision users interpreting complex visual-textual documents.

Methods

Our goal was to build a system that could interpret both textual and visual content in documents through the integration of optical character recognition (OCR) and transformer-based vision-language models (VLMs). To ensure a user-friendly and accessible interface, we also incorporated real-time interaction features using popular Python web frameworks. Below, we detail the specific tools, models, datasets, and technical strategies that powered our system.

To extract text from images and PDFs, we selected Tesseract OCR [1] due to its open-source accessibility and extensive community validation across various fonts and languages. Tesseract is particularly effective in converting scanned or image-based text into digital content, which is critical for users who depend on screen readers or text-to-speech technology. The OCR pipeline in Tesseract includes several preprocessing stages such as layout analysis, line segmentation, and blob detection. After segmenting the text into lines and characters, the system proceeds through a two-pass recognition process. In the first pass, it generates hypotheses for character sequences based on grouping and shape similarity. Words recognized with high confidence are then used to train an adaptive classifier that reprocesses the text in the second pass, leading to improved accuracy—especially for noisy or ambiguous input.
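For concreteness, a minimal sketch of this extraction step using pytesseract (the Python wrapper around Tesseract) is shown below; the function name and file path are illustrative rather than the project's exact code.

```python
# Minimal OCR sketch with pytesseract; assumes the Tesseract binary is installed.
from PIL import Image
import pytesseract

def extract_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on a single image and return the recognized text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=lang)

if __name__ == "__main__":
    print(extract_text("sample_page.png"))
```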

A major advantage of Tesseract lies in its integration of deep learning techniques. As of version 4.0, Tesseract incorporates Long Short-Term Memory (LSTM) networks, which are a type of Recurrent Neural Network (RNN) capable of modeling sequential dependencies. LSTMs are particularly useful in OCR tasks as they can maintain memory across longer spans of text, making them ideal for handling paragraph-level content with inconsistent formatting or irregular character placement. The internal architecture of an LSTM cell is designed around three primary gates—forget, input, and output—that control the flow of information through the cell. These gates use nonlinearities such as the sigmoid and hyperbolic tangent functions to regulate memory updates and output generation.

Figure 2: Architecture of an LSTM cell showing input, memory, and output gates, adapted from Dagshub [9].
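For reference, the standard LSTM cell updates illustrated in Figure 2 can be written as follows; this is the textbook formulation rather than notation specific to Tesseract's implementation.

```latex
% Standard LSTM cell equations (\sigma = sigmoid, \odot = element-wise product).
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)         % forget gate
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)         % input gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)         % output gate
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)  % candidate memory
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   % cell state update
h_t = o_t \odot \tanh(c_t)                        % hidden state / output
```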

The integration of these models was managed through an interactive web interface, which we developed using Gradio [3]. Gradio allowed for quick prototyping and flexible interface creation using Blocks, making it an ideal choice for accessibility-focused design. The resulting web application allows users to upload documents or images, processes them through the OCR pipeline, and outputs structured textual descriptions that can be read aloud using a text-to-speech module.
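A simplified sketch of what such a Blocks layout can look like is shown below; the component arrangement and the process_document() helper are placeholders, not the application's exact code.

```python
# Illustrative Gradio Blocks layout; process_document() is a placeholder that
# would run OCR and text-to-speech, returning a status message plus an audio file.
import gradio as gr

def process_document(file, voice):
    # Placeholder: OCR the uploaded file, synthesize speech, return status + audio path.
    return "Processing complete.", "outputs/sample_female_us.mp3"

with gr.Blocks(title="Image to Speech Converter") as demo:
    file_input = gr.File(label="Upload an image or PDF")
    voice_choice = gr.Dropdown(
        choices=["Female US", "Female AU", "UK English"], label="Voice type"
    )
    convert_btn = gr.Button("Convert")
    status = gr.Textbox(label="Processing status")
    audio_out = gr.Audio(label="Generated speech")
    convert_btn.click(
        process_document, inputs=[file_input, voice_choice], outputs=[status, audio_out]
    )

if __name__ == "__main__":
    demo.launch()
```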

INSERT FULL SYSTEM PIPELINE DIAGRAM HERE

To understand our models and measure the system's effectiveness, we used the pre-trained data from the SynthText dataset as well as Tesseract's own training corpus. The SynthText dataset contains over 800,000 synthetic text images that simulate real-world distortions, including varied backgrounds, fonts, and orientations. To test the system's performance, we conducted manual reviews to check the contextual relevance and clarity of the outputs.

Since we relied entirely on pre-trained models and built-in pipelines provided by the software, no new measurements, model training, post-processing, or audio adjustments were performed. The OCR outputs, image captions, and audio were generated and cleaned through the system’s own internal processes. Our work focused on utilizing these existing outputs to support user needs without additional modification.

Although we have not yet conducted formal user testing, in the future, we plan to incorporate feedback mechanisms within the application. Users will be able to flag incorrect outputs, and these reports will help us prioritize improvements in future versions. Our goal is to ensure that the system continues to evolve based on real-world needs, particularly those of the low-vision community.

Results

To evaluate our system's ability to extract and vocalize textual content from documents, we built and tested a fully functional web-based interface using Gradio [3]. The resulting tool, titled Image to Speech Converter, allows users to upload images or PDFs, extract embedded text using Tesseract OCR [1], and receive audio playback using Google Text-to-Speech (gTTS). Users can also select from different English voice types, including Female US, Female AU, and UK English. The system was tested using a range of document types, including clean text, text with visual noise, and longer passages.

Figure 3: OCR struggles with a noisy background, causing lower-quality text extraction.
Figure 4: OCR performs well under high-contrast and clean input conditions.

Each of these inputs generated an MP3 file stored in the outputs/ folder, confirming successful audio synthesis. These audio files are named based on the uploaded filename and the selected voice, e.g., sample_female_us.mp3. A screenshot of the output directory (shown below) verifies the system's ability to generate multiple outputs across testing sessions.

INSERT SCREENSHOT OF OUTPUT DIRECTORY HERE

The frontend was built with Gradio’s Blocks API, which provided flexibility in creating a modular interface. Features include a file upload component, voice type dropdown, audio preview, text display of processing status, and a download button for offline access to generated speech. The UI was further customized with CSS to improve readability and accessibility for users with low vision.

See below for the annotated UI:
Figure 5: Annotated screenshot showing the upload area, voice type selection, and clear file upload interface for users with low vision.
Figure 6: Annotated screenshot highlighting the ability to switch between light, dark, and system display themes.
Figure 7: Annotated screenshot demonstrating inclusivity with support for over 20 different languages.

On the backend, the system successfully handled both .jpg/.png image inputs and .pdf documents. PDF processing was handled by pdf2image, converting pages into images before passing them to Tesseract. Text was then sent to gTTS for synthesis using the selected language and accent. Voice customization was achieved using the tld parameter in gTTS, allowing regional variations like US (.com), UK (.co.uk), and Australia (.com.au).
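Putting these pieces together, a condensed sketch of this backend flow might look like the following; the VOICES mapping, output naming, and pdf_to_speech() helper are illustrative assumptions rather than the tool's exact implementation.

```python
# Illustrative backend flow: PDF pages -> images (pdf2image) -> text (Tesseract)
# -> MP3 audio (gTTS, with the tld parameter selecting a regional accent).
import os
import pytesseract
from pdf2image import convert_from_path
from gtts import gTTS

# Hypothetical mapping of voice labels to gTTS language/accent settings.
VOICES = {
    "Female US": {"lang": "en", "tld": "com"},
    "Female AU": {"lang": "en", "tld": "com.au"},
    "UK English": {"lang": "en", "tld": "co.uk"},
}

def pdf_to_speech(pdf_path: str, voice: str = "Female US") -> str:
    """Convert each PDF page to an image, OCR it, and synthesize one MP3 file."""
    pages = convert_from_path(pdf_path)  # requires poppler to be installed
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    settings = VOICES[voice]
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    out_name = f"{base}_{voice.lower().replace(' ', '_')}.mp3"
    os.makedirs("outputs", exist_ok=True)
    out_path = os.path.join("outputs", out_name)
    gTTS(text=text, lang=settings["lang"], tld=settings["tld"]).save(out_path)
    return out_path
```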

The extracted audio was clear and accurate for all clean inputs. As expected, performance degraded slightly with visually noisy inputs, reinforcing one of the known limitations of OCR models like Tesseract when faced with cluttered backgrounds or poor contrast. This finding supports the need for preprocessing techniques or fallback strategies in future versions of the tool.
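As one example of the preprocessing such a fallback might involve, the sketch below binarizes a noisy image with Pillow before running OCR; it is a hypothetical illustration rather than a step the current tool performs, and the threshold value is arbitrary.

```python
# Hypothetical preprocessing for noisy scans: grayscale conversion and simple
# thresholding with Pillow before handing the image to Tesseract.
from PIL import Image, ImageOps
import pytesseract

def ocr_with_preprocessing(image_path: str, threshold: int = 150) -> str:
    """Binarize a low-contrast or noisy image, then run Tesseract on the result."""
    gray = ImageOps.grayscale(Image.open(image_path))
    binary = gray.point(lambda px: 255 if px > threshold else 0)
    return pytesseract.image_to_string(binary)
```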

References

  1. Tesseract OCR. https://github.com/tesseract-ocr/tesseract
  2. HuggingFace Transformers. https://huggingface.co/transformers/
  3. Gradio. https://gradio.app/
  4. Smith, B. A. (2021). "ImageAssist: Enhancing Image Understanding for Low Vision Users." Proceedings of CHI.
  5. Vijayanarayanan et al. (2023). "Image Processing Based on Optical Character Recognition with Text-to-Speech for Visually Impaired." Journal of Scientific and Engineering Research, 6(4).
  6. Bodi et al. (2021). "Automated Video Description for Blind and Low Vision Users." Proceedings of the CHI Conference on Human Factors in Computing Systems.
  7. Morris, M. R., Zolyomi, A., Yao, C., Bahram, S., Bigham, J. P., & Kane, S. K. (2016). "With most of it being pictures now, I rarely use it": Understanding Twitter's Evolving Accessibility to Blind Users. Proceedings of CHI '16, 5506–5516.
  8. World Health Organization. (2019). World Report on Vision. Available at: https://www.who.int/publications/i/item/world-report-on-vision.
  9. Dagshub. (n.d.). Understanding RNN, LSTM, and Bidirectional LSTM: A Beginner's Guide. Available at: https://dagshub.com/blog/rnn-lstm-bidirectional-lstm/.