Mathematical information retrieval (MIR) from
scanned PDF documents and MathML conversion
Nazemi, Azadeh (ORCID: 0000-0002-1138-309X), Murray, Iain and McMeekin, David A. (ORCID: 0000-0001-6445-1183)
(2014)
Mathematical information retrieval (MIR) from
scanned PDF documents and MathML conversion.
IPSJ Transactions on Computer Vision and Applications, 6. pp. 132-142.
ISSN 1882-6695
This paper describes part of an ongoing comprehensive research project that aims to generate MathML
from images of mathematical expressions extracted from scanned PDF documents.
A MathML representation of a scanned PDF document reduces the document's storage size and encodes both the
mathematical notation and its meaning. The MathML representation is then suitable for vocalization and accessible
through assistive technologies. To achieve an accurate layout analysis of a scanned PDF document,
all textual and non-textual components must be recognised, identified and tagged. These components may be text,
mathematical expressions, or graphics in the form of images, figures, tables and/or diagrams. Mathematical
expressions are among the most significant components of scanned scientific and engineering PDF documents and need
to be machine readable for use with assistive technologies. This research is a work in progress and comprises several
modules: detection and extraction of mathematical expressions, recursive primitive component extraction, recognition
of non-alphanumerical symbols, structural semantic analysis, and merging of primitive components to generate the
MathML of the scanned PDF document. An optional module converts MathML to audio using a Text to Speech (TTS)
engine to make the document accessible to vision-impaired users.
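As a rough illustration of the last two stages named in the abstract (merging recognised components into MathML, then vocalizing it), the following is a minimal sketch and not the authors' implementation. It assumes the earlier modules have already recognised the expression (x^2 + 1) / 2; the function names `build_fraction_mathml` and `speak` and the spoken phrasing are hypothetical, and the text handed to a TTS engine is simply returned as a string.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def build_fraction_mathml():
    """Build Presentation MathML for (x^2 + 1) / 2.

    Hypothetical stand-in for the merging module: the symbols and their
    layout are hard-coded here rather than recognised from an image.
    """
    math = ET.Element("math", {"xmlns": MATHML_NS})
    frac = ET.SubElement(math, "mfrac")
    numerator = ET.SubElement(frac, "mrow")
    sup = ET.SubElement(numerator, "msup")
    ET.SubElement(sup, "mi").text = "x"   # base
    ET.SubElement(sup, "mn").text = "2"   # exponent
    ET.SubElement(numerator, "mo").text = "+"
    ET.SubElement(numerator, "mn").text = "1"
    ET.SubElement(frac, "mn").text = "2"  # denominator
    return math


def speak(node):
    """Turn a small subset of MathML into a spoken-style string.

    The phrasing is an assumption chosen for this example only.
    """
    children = list(node)
    if node.tag in ("mi", "mn"):
        return node.text
    if node.tag == "mo":
        return {"+": "plus", "-": "minus"}.get(node.text, node.text)
    if node.tag == "msup":
        return f"{speak(children[0])} to the power {speak(children[1])}"
    if node.tag == "mfrac":
        return (f"fraction {speak(children[0])} over "
                f"{speak(children[1])} end fraction")
    # math, mrow and other containers: read the children in order
    return " ".join(speak(child) for child in children)


if __name__ == "__main__":
    tree = build_fraction_mathml()
    print(ET.tostring(tree, encoding="unicode"))
    print(speak(tree))
```

The string returned by `speak` is what a TTS engine would receive; explicit "fraction ... end fraction" delimiters are one common way to keep nested structure unambiguous in speech.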
Keywords: math recognition, graphics recognition, Mathematical Informati
Item Type:
Article (Published)
Refereed:
Yes
Uncontrolled Keywords:
mathematical recognition; graphics recognition; Mathematical Information Retrieval (MIR); Support Vector Machine (SVM)