tedunderwoodillinois, Open-source AI requires open data. There's a lot out there, but one of the obstacles is that older public-domain books have terrible OCR transcription. To that end, Pleias is releasing a billion words of public-domain text with experimental LLM-based OCR correction. https://huggingface.co/datasets/PleIAs/Post-OCR-Correction