TH E COUMTY BRESS AR C HIVE
When the County Press launched an online searchable archive of its past issues, it promised to provide a rich source of historical data. Unfortunately, a fair amount of the text has been rendered as gibberish. Can it be repaired?
The OCR process of converting printed text to digital text has made a massive contribution to research, enabling keyword searching of a huge range of books, documents and newspapers. The process has developed to the point where even relatively poor printing quality in old typefaces can be converted, although errors still arise, even in sophisticated operations like Google Books.
The problem with the County Press Archive is the sheer volume of errors. Incorrect characters and erratic spacing have made it difficult for searches to return reliable results. There are examples of people searching for known incidents and phrases and failing to find the article.
The newspaper says the archive is a beta version and under development. However, some experts believe the problem will be difficult to overcome because it is inherent in the source to which the OCR was applied. The normal practice would be to photograph the original pages. The County Press instead decided to use the existing microfilm records, which were processed 30 years ago. These have been laced up many times in the reader and are now the worse for wear. In the past, researchers have also noted that some sections were overexposed and difficult to read.
The County Press has briefly referred to the problem, although without acknowledging its extent. Promoting the archive in its columns, it said "Improvements are also being made to archive pages to make them easier to find." It is attempting to overcome the errors by passing the worst pages through photo editing software to try to improve the text before OCR reprocessing. This should give some improvement, but it is a time-consuming operation and it may be difficult to address the full extent of the errors.
The County Press is to be congratulated on seeking to provide an extremely important research resource, covering over a century of highly detailed material. Even with its limitations, it will prove of some value. Nevertheless, if it cannot be brought up to standard, it may be seen as something of a missed opportunity.