Plagiarism detection with winnowing

Hashing, n-grams, and bag of words — why is it always you three? — 31 Dec 2022

A while back I was building a project where I wanted to find StackOverflow snippets used in GitHub projects without attributing the source. I got the project to a decent working state and thought others might be interested in a good algorithm for plagiarism detection that works equally well for code and text.

My goals for the project were to find an approach that:

Worked even when content was rearranged, inserted, deleted, or linted.
Could accurately find small snippets in large documents.
Could be indexed by an off-the-shelf inverted index like Lucene or Bleve.
Allowed for fast ingestion, even on my laptop.
Used linearly proportional disk space to the original documents.

Unsurprisingly, details are scant about most commercial plagiarism detection software. But, I remembered my college used a system out of Stanford called MOSS to look for copying in our assignments and it turned out that they published a paper that I could “borrow” from.

The “winnowing” approach they describe creates fingerprints of documents that can be fed into a normal inverted index for retrieval later on.

Winnowing algorithm

Winnowing works by looking at windows of text, for example a window of length 10 over the text Hello, world! produces 4 windows. Then each window is hashed to create a stream of fingerprints.

Text:     Hello, world!
Windows: [Hello, wor]    --hash()-> 10
          [ello, worl]   --hash()-> 5
           [llo, world]  --hash()-> 23
            [lo, world!] --hash()-> 0

In this example, we end up with the stream [10, 5, 23, 0] which has two interesting properties:

The index in the stream is also the index in the original document.
The size of the hash stream is proportional to the size of the input.

We could stop here and throw all of these into an inverted index, but that’s going to lead to a very bloated index and won’t be much better than the trigram index Google uses for its Code Search. We can do better by picking a few items to make up the fingerprint.

We do this through a second windowing function that picks the minimum valued hash in the range. If a hash overlaps many ranges it’s only recorded once. In our example we’ll choose a window of size 2:

Original: [10, 5, 23, 0]
Windows:  [10, 5]        -> 5
              [5, 23]    -> 5
                 [23, 0] -> 0

This results in a document with a fingerprint of [5, 0].

Algorithm considerations

Both window sizes are variable. The first window size over the text can’t change because the produced hashes will vary. However, the second window size can be adjusted dynamically to improve search speed or reduce indexing size at the expense of quality.

The authors of the paper recommended a rolling hash algorithm, but I found FNV to work fine because indexing was the primary overhead. I also found it useful to send the output of the fingerprints back through the hash to produce a more uniform distribution of values to avoid some hot-spotting.

The inputs may also need to be standardized prior to ingestion – or indexed multiple times with different filters applied, similar to what one might do with spelling corrections in traditional information retrieval. Standardization for code might include tokenizing it, removing white space, or replacing variable names.

Tying it together

Once I had the algorithm down, I was able to plug it into Bleve – a Go search engine library and get decent results. Bleve is/was missing the ability to search by converting a document into a term vector so I had to build that myself to better explore similar documents.

I found TF-IDF useful as a mechanism to avoid ranking things like Java’s infamous public static void main(String args) { highly.

There are a few improvements I’d like to try in the future:

Trying Microsoft’s BitFunnel indexing method.
Improving search operators, like adding the ability to subtract out all the terms found in a boilerplate document or corpus.
Improving the way corpora are handled e.g. two git repositories where all the files have been moved or when a Java class was refactored into two files.
Compiling it to work in the browser with an uploaded directory as an alternative to MOSS.