Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus

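The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.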
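The comparison summarised in this record's description (keywords from the original corpus and from a spelling-corrected version, each measured against a reference corpus, then compared by rank) can be illustrated with a minimal sketch. This record does not say which keyness statistic or rank comparison the authors used, so the log-likelihood score and Spearman's rho below are assumptions chosen because they are common in corpus linguistics. The function names and the toy token lists are hypothetical and are not data from the Teenage Health Freak Corpus or the BNC.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word occurring a times in a study
    corpus of c tokens and b times in a reference corpus of d tokens.
    (Assumed statistic; the record does not name the one used.)"""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keyword_ranks(study_tokens, reference_tokens):
    """Return {word: rank} for words over-represented in the study
    corpus, rank 1 being the most key. Ties are broken alphabetically."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scores = {}
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep positive keywords only
            scores[word] = log_likelihood(a, b, c, d)
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}


def spearman(ranks_a, ranks_b):
    """Spearman's rho over the keywords present in both rankings.
    Words are re-ranked within the shared set, so ranks are distinct."""
    shared = sorted(set(ranks_a) & set(ranks_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared keywords")
    ra = {w: i + 1 for i, w in enumerate(sorted(shared, key=ranks_a.get))}
    rb = {w: i + 1 for i, w in enumerate(sorted(shared, key=ranks_b.get))}
    d2 = sum((ra[w] - rb[w]) ** 2 for w in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical toy data: one message with and without spelling errors.
original = "i am worried about my wieght and my periods".split()
corrected = "i am worried about my weight and my periods".split()
reference = "the cat sat on the mat and i am about to go home".split()

rho = spearman(keyword_ranks(original, reference),
               keyword_ranks(corrected, reference))
print(f"rank correlation between the two keyword lists: {rho:.2f}")
```

A rho close to 1 would indicate that the two keyword lists are ordered similarly, which is the kind of evidence the description appeals to when it says the ranks are very similar. Only keywords present in both lists are compared here, so a misspelt form that appears in just one list is ignored; that is a simplification in this sketch and is not taken from the paper.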

Bibliographic Details
Main Authors: Smith, Catherine; Adolphs, Svenja; Harvey, Kevin; Mullany, Louise
Format: Article
Published: Edinburgh University Press 2014
Subjects: Computer mediated communication; Keyword analysis; Spelling variation
Online Access: https://eprints.nottingham.ac.uk/35782/
_version_ 1848795160484249600
author Smith, Catherine
Adolphs, Svenja
Harvey, Kevin
Mullany, Louise
building Nottingham Research Data Repository
collection Online Access
description The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.
first_indexed 2025-11-14T19:27:40Z
format Article
id nottingham-35782
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:27:40Z
publishDate 2014
publisher Edinburgh University Press
recordtype eprints
repository_type Digital Repository
spelling nottingham-35782 2020-05-04T16:54:56Z https://eprints.nottingham.ac.uk/35782/ Edinburgh University Press 2014-11-01 Article PeerReviewed Smith, Catherine, Adolphs, Svenja, Harvey, Kevin and Mullany, Louise (2014) Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus. Corpora, 9 (2). pp. 137-154. ISSN 1755-1676 http://dx.doi.org/10.3366/cor.2014.0055 doi:10.3366/cor.2014.0055
title Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus
topic Computer mediated communication
Keyword analysis
Spelling variation
url https://eprints.nottingham.ac.uk/35782/