Using Machine Learning for Text File Format Identification

EasyChair Preprint 4698
12 pages • Date: December 3, 2020

Abstract

File format identification is a necessary step for effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, each file type. While binary files contain header information (metadata) about the file type that can aid identification, text files have none, so methods applied for binary file format identification are ineffective for text files. Most text formats can be opened as plain text files; however, file type information is often needed to understand a file's full use and context. When huge volumes of files need to be checked, automated methods are necessary for text file format identification. A project was initiated at The National Archives to identify file types from the contents of text files using computational intelligence methods. A machine-learning-based methodology was tested and implemented using test data. The prototype, developed as a proof of concept, achieved reasonably good accuracy in detecting five file formats.

Keyphrases: Text file formats, digital preservation, supervised learning