Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion

This paper describes part of an ongoing comprehensive research project that is aimed at generating a MathML format from images of mathematical expressions that have been extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size a...


Bibliographic Details
Main Authors: Nazemi, Azadeh, Murray, Iain, McMeekin, David
Format: Journal Article
Published: Information Processing Society of Japan 2014
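Subjects: graphics recognition; Mathematical Information Retrieval (MIR); math recognition; Support Vector Machine (SVM)
Online Access: http://hdl.handle.net/20.500.11937/36574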
_version_ 1848754807927472128
author Nazemi, Azadeh
Murray, Iain
McMeekin, David
author_facet Nazemi, Azadeh
Murray, Iain
McMeekin, David
author_sort Nazemi, Azadeh
building Curtin Institutional Repository
collection Online Access
description This paper describes part of an ongoing comprehensive research project that is aimed at generating a MathML format from images of mathematical expressions that have been extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size and encodes the mathematical notation and meaning. The MathML representation is then suitable for vocalization and accessible through assistive technologies. To achieve an accurate layout analysis of a scanned PDF document, all textual and non-textual components must be recognised, identified and tagged. These components may be text or mathematical expressions, and graphics in the form of images, figures, tables and/or diagrams. Mathematical expressions are among the most significant components of scanned scientific and engineering PDF documents and need to be machine readable for use with assistive technologies. This research is a work in progress and comprises several modules: detection and extraction of mathematical expressions, recursive primitive component extraction, non-alphanumeric symbol recognition, structural semantic analysis, and merging of primitive components to generate the MathML of the scanned PDF document. An optional module converts MathML to audio using a Text to Speech (TTS) engine to make the document accessible to vision-impaired users.
first_indexed 2025-11-14T08:46:17Z
format Journal Article
id curtin-20.500.11937-36574
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T08:46:17Z
publishDate 2014
publisher Information Processing Society of Japan
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-365742017-09-13T15:29:35Z Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion Nazemi, Azadeh Murray, Iain McMeekin, David graphics recognition Mathematical Information Retrieval (MIR) math recognition Support Vector Machine (SVM) This paper describes part of an ongoing comprehensive research project that is aimed at generating a MathML format from images of mathematical expressions that have been extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size and encodes the mathematical notation and meaning. The MathML representation is then suitable for vocalization and accessible through assistive technologies. To achieve an accurate layout analysis of a scanned PDF document, all textual and non-textual components must be recognised, identified and tagged. These components may be text or mathematical expressions, and graphics in the form of images, figures, tables and/or diagrams. Mathematical expressions are among the most significant components of scanned scientific and engineering PDF documents and need to be machine readable for use with assistive technologies. This research is a work in progress and comprises several modules: detection and extraction of mathematical expressions, recursive primitive component extraction, non-alphanumeric symbol recognition, structural semantic analysis, and merging of primitive components to generate the MathML of the scanned PDF document. An optional module converts MathML to audio using a Text to Speech (TTS) engine to make the document accessible to vision-impaired users. 2014 Journal Article http://hdl.handle.net/20.500.11937/36574 10.2197/ipsjtcva.6.132 Information Processing Society of Japan fulltext
spellingShingle graphics recognition
Mathematical Information Retrieval (MIR)
math recognition
Support Vector Machine (SVM)
Nazemi, Azadeh
Murray, Iain
McMeekin, David
Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title_full Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title_fullStr Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title_full_unstemmed Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title_short Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
title_sort mathematical information retrieval (mir) from scanned pdf and mathml conversion
topic graphics recognition
Mathematical Information Retrieval (MIR)
math recognition
Support Vector Machine (SVM)
url http://hdl.handle.net/20.500.11937/36574