Héritage Canadiana, a large repository of digitized microfilm reels from Library and Archives Canada, has quietly begun rolling out full-text-searchable transcriptions (for now mostly OCR on typed archival docs, but #Transkribus-generated transcriptions of manuscript sources are on their way). The search engine is awful - you can't restrict a search to transcribed sources - but the collection is already impressive and promises to get better fast. @histodons
Making large French-language source corpora from the Middle Ages machine-accessible?
In the next #DigitalHistoryOFK, Pauline Spychala (DHI Paris) takes a close look at the text recognition platforms #eScriptorium & #Transkribus. The aim of her project is to develop a workflow that combines both tools effectively in order to, among other things, do justice to the characteristics of the sources under study.
OK, I've finally got round to transcribing enough pages of my own handwriting to train up a model with #Transkribus, and the results are surprisingly good! I expected to need more than the minimal 25 pages to get a decent level of accuracy but it's already noticeably better than the generic recognition on my reMarkable tablet or OneNote.
Since #PyLaia is open source, it should be possible now to recreate this training on my own desktop with the same parameters, and apply the model to recognise new pages, and from there figure out a workflow to simplify getting handwritten notes into plain text for reference or publication.
I hear a frequent complaint about applying quantitative methods to texts that have been through #HTR tools such as #Transkribus: the expected error rate means that you will miss too many occurrences of the word you are looking for. (1/n)
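
To put rough numbers on that complaint, here is a minimal Python sketch. It is not taken from the thread. It assumes character errors are independent (a simplification), uses made-up sample tokens, and picks an arbitrary similarity threshold of 0.8. It estimates how often a word survives recognition intact at a given character error rate, then compares exact search with a tolerant search using the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def p_word_intact(cer: float, word_length: int) -> float:
    """Chance that every character of a word is recognised correctly,
    ASSUMING independent per-character errors (a simplification)."""
    return (1.0 - cer) ** word_length


def fuzzy_count(tokens, target, threshold=0.8):
    """Count tokens whose difflib similarity to `target` meets `threshold`.
    The 0.8 default is an arbitrary illustrative value, not a recommendation."""
    target = target.lower()
    return sum(
        1
        for tok in tokens
        if SequenceMatcher(None, tok.lower(), target).ratio() >= threshold
    )


# Hypothetical HTR output with two misread occurrences of "parish".
tokens = "the parifh regifter and the parish register of the parlsh".split()

exact = tokens.count("parish")        # exact search finds only the clean form
fuzzy = fuzzy_count(tokens, "parish")  # tolerant search also catches misreads
intact = p_word_intact(0.05, 6)       # 5% CER, six-letter word

print(f"exact={exact} fuzzy={fuzzy} p_intact={intact:.3f}")
```

On this toy sample, exact search finds 1 occurrence and the tolerant search finds 3. Under the independence assumption, a six-letter word comes through unscathed about 74% of the time at a 5% character error rate. Real error patterns are not independent, and a looser threshold will also pull in false positives.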