2004/09/07
A Repetition Based Measure for Verification of Text Collections and for Text Categorization
Dmitry V. Khmelev and William J Teahan
- In SIGIR '03
- highlighted, printed and filed
- related to plagiarism detection, webpage similarity, corpus verification, PARCELS.
Uses simple repetition of text substrings for plagiarism and duplicate detection. The measure is computed from a single suffix array built over the concatenation of the entire document set. The idea is to use not only the single longest common substring but a normalized sum, taken over all suffixes of a target document, of the length of the longest prefix of each suffix that is repeated elsewhere in the collection.
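A minimal sketch of the idea, to make the summation concrete: for each starting position in the target, take the length of the longest substring beginning there that also occurs in some other document, sum those lengths, and divide by the largest possible sum n(n+1)/2. The normalization is my reading of the paper's definition, and the function names are my own. The paper computes this with a suffix array; the naive substring search below is a stand-in that is only practical for small inputs.

```python
def longest_repeated_prefix(suffix, collection):
    """Length of the longest prefix of `suffix` found in any document.

    Naive stand-in for the suffix-array lookup (assumption: not the
    paper's implementation). If a prefix of length k occurs in a document,
    so does every shorter prefix, so binary search on k is valid.
    """
    lo, hi = 0, len(suffix)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if any(suffix[:mid] in doc for doc in collection):
            lo = mid
        else:
            hi = mid - 1
    return lo


def r_measure(target, collection):
    """Normalized sum of repeated-prefix lengths over all suffixes of target.

    Returns a value in [0, 1]; 1.0 means the whole target occurs
    verbatim in some document of the collection.
    """
    n = len(target)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(target[i:], collection)
                for i in range(n))
    # Maximum possible total is n + (n-1) + ... + 1 = n(n+1)/2.
    return 2.0 * total / (n * (n + 1))


if __name__ == "__main__":
    docs = ["the quick brown fox", "jumps over the lazy dog"]
    print(r_measure("the lazy fox", docs))
    print(r_measure("abc", ["abc"]))  # exact duplicate scores 1.0
```

The collection passed in should exclude the target itself, otherwise every document trivially scores 1.0.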
The R measure is apparently good not just for duplicate detection but also for authorship attribution, at least on the test corpora used in their paper.
To think about: how to adapt this measure into an effective (and speedy) tool for web page fragment classification.
