Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation


Bibliographic Details
Main Authors: Nazemi, Azadeh; Murray, Iain; McMeekin, David
Format: Journal Article
Published: Canadian Center of Science and Education 2014
Subjects: document layout analysis; assistive technology; optical character recognition
Online Access: http://hdl.handle.net/20.500.11937/8128
_version_ 1848745565550018560
author Nazemi, Azadeh
Murray, Iain
McMeekin, David
author_facet Nazemi, Azadeh
Murray, Iain
McMeekin, David
author_sort Nazemi, Azadeh
building Curtin Institutional Repository
collection Online Access
description Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using a wide range of PDF viewing applications such as Adobe Reader and Apple Preview. Hence, the use of PDF has become pervasive. Since a scanned PDF is an image format, it is inaccessible to assistive technologies such as screen readers. Retrieving the information therefore requires Optical Character Recognition (OCR). OCR software processes the scanned PDF file and, through text extraction, generates an editable text document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF using the open source document analysis and OCR system (OCRopus), based on geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structure elements such as section headings, line breaks, paragraphs, captions, sidebars, foot-bars, running headers, embedded images, graphics, tables and mathematical expressions.
first_indexed 2025-11-14T06:19:23Z
format Journal Article
id curtin-20.500.11937-8128
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T06:19:23Z
publishDate 2014
publisher Canadian Center of Science and Education
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-8128 2017-09-13T14:36:23Z 2014 Journal Article http://hdl.handle.net/20.500.11937/8128 10.5539/cis.v7n1p162 Canadian Center of Science and Education fulltext
spellingShingle document layout analysis
assistive technology
optical character recognition
Nazemi, Azadeh
Murray, Iain
McMeekin, David
Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title_full Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title_fullStr Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title_full_unstemmed Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title_short Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
title_sort layout analysis for scanned pdf and transformation to the structured pdf suitable for vocalization and navigation
topic document layout analysis
assistive technology
optical character recognition
url http://hdl.handle.net/20.500.11937/8128