This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".
In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a CSV with the prompt parameters as columns and as individual text files.
The files in the repository are as follows:
ncse_hf_dataset: A Hugging Face dataset dictionary (DatasetDict) containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper (see the loading sketch below the file list).
synth_gt.zip: A zip file containing 5 parquet files of training data derived from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed token length, for a total of 2 million tokens; the observation lengths are 200, 100, 50, 25, and 10 tokens (see the parquet-reading sketch below the file list).
synthetic_articles.zip: A zip file containing a CSV of all the synthetic articles and the prompts used to generate them.
synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names encode the prompt parameters and the ID reference from the synthetic articles CSV.
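If the NCSE test set is stored in the format produced by `Dataset.save_to_disk` (an assumption based on it being described as a Hugging Face dataset dictionary), it can be loaded with `datasets.load_from_disk`. The following is a minimal sketch; the local path and the split and column names are assumptions and should be checked against the downloaded files.

```python
# Minimal sketch for loading the NCSE test set (91 articles).
# The path below is an assumption: point it at the unpacked ncse_hf_dataset folder.
from datasets import load_from_disk

ncse = load_from_disk("ncse_hf_dataset")
print(ncse)  # shows the splits (if a DatasetDict), column names, and row counts

# The names of the OCR and ground-truth columns are not assumed here;
# check them against the printout above before using the data.
```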
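The parquet training files can be read straight from synth_gt.zip without unpacking it, for example with pandas. This sketch assumes pandas plus a parquet engine (pyarrow or fastparquet) is installed; the member names inside the zip are not assumed and are listed first.

```python
# Minimal sketch for reading one of the fixed-length training files in synth_gt.zip.
import io
import zipfile

import pandas as pd

with zipfile.ZipFile("synth_gt.zip") as zf:
    names = [n for n in zf.namelist() if n.endswith(".parquet")]
    print(names)  # the 5 parquet files (200, 100, 50, 25 and 10 token observations)

    # Read the first parquet file into a DataFrame via an in-memory buffer.
    df = pd.read_parquet(io.BytesIO(zf.read(names[0])))

print(df.shape)
print(df.head())
```

The column layout of the parquet files is not documented here; inspect `df.columns` after loading to see which fields hold the corrupted input and the ground-truth text.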
The data in this repository is used by the code repositories associated with the project.