Project: A plagiarism detection tool based on the LCS algorithm

This project accompanies Lecture of Week 10 of the course Algorithm Design and Analysis.

This project is optional.

Submitting a complete and good project in time brings one award point! The hard deadline for this is 20.05.2019, by e-mail to ioana.sora@cs.upt.ro. You must submit the source code of your implementation and a short text description of how you implemented and adapted the LCS algorithm to fit the given problem, as well as any other design decisions, optimizations, or heuristics that you applied.

If you have further questions related to this project, you may send them by e-mail to ioana.sora@cs.upt.ro.


Requirements:

Implement a plagiarism detection tool based on the LCS algorithm

The LCS algorithm has been discussed in class (see slides). More details on the LCS algorithm are given in [CLRS], chapter 15.4.

The tool takes arguments on the command line and, depending on these arguments, functions in one of the following two modes:

Pair comparison mode: -p file1 file2

In pair comparison mode, the tool takes as arguments the names of two text files and displays the content found to be identical in the two files.

In pair comparison mode, the tool must be able to analyze texts such as those from test sets set1 and set2.

Tabular mode: -t dirname

In tabular mode, the tool takes as argument the name of a directory (for example D:\\Essays\\) and produces a table containing, for each pair of distinct files (file1, file2), the percentage of the contents of file1 that can also be found in file2.

The analysis of set2 must not take longer than a couple of minutes.

In tabular mode, the tool must also be able to analyze, without crashing, text files of the size of those in test set set3.
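The tabular driver itself is independent of the similarity measure. A minimal sketch in Python, assuming a `similarity(text_a, text_b)` function returning a fraction in [0, 1] (a hypothetical placeholder for your own LCS-based measure):

```python
import itertools
import os

def tabular_report(dirname, similarity):
    """For every ordered pair of distinct files in `dirname`, compute the
    percentage of file1's contents also found in file2, using the supplied
    similarity(text_a, text_b) -> float in [0, 1] measure (your LCS-based
    function would be plugged in here). Returns {(file1, file2): percent}."""
    files = sorted(f for f in os.listdir(dirname)
                   if os.path.isfile(os.path.join(dirname, f)))
    texts = {}
    for f in files:
        with open(os.path.join(dirname, f), encoding="utf-8",
                  errors="replace") as fh:
            texts[f] = fh.read()
    return {(f1, f2): 100.0 * similarity(texts[f1], texts[f2])
            for f1, f2 in itertools.permutations(files, 2)}
```

Note that the pairs are ordered: (file1, file2) and (file2, file1) generally give different percentages, since each is normalized by the length of the first file.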

The tool must produce good detection results, detecting similar text even if:

If the chapters of the compared essays have been reordered (chapter 1 becomes chapter 2, while chapter 2 becomes chapter 1 in the plagiarized file), it is an acceptable limitation of this tool that it will not detect the global plagiarism, but only per longest chapter.

Comments

The principle (or theory) is easy ...

Consider as an example the following two essays presented by Alice and Bob:

Alice:

I have a pet dog. His name is Bruno.
His body is covered with bushy white fur.
He has four legs and two beautiful eyes. 
My dog is the best dog one can ever have. 

Bob:

I have a cat. His 
name is Paw. His body 
is covered with 
shiny black fur. He has 
four legs and two yellow eyes. 
My cat is the best 
cat one can ever have. 
By applying the LCS algorithm straightforwardly to the strings resulting from Alice's and Bob's essays, the following longest common subsequence results:
I have a pet . His name is . His body is covered with shy  fur. He has four legs and two el eyes. 
My is the best  one can ever have. 
The length of the LCS divided by the length of Alice's essay gives the percentage of text in Alice's essay that is also found in Bob's. If this percentage is high enough (over a threshold such as 70%), then we have an indication of possible plagiarism.
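This character-level measure can be sketched directly from the classic dynamic-programming LCS of [CLRS] 15.4 (illustration only; the names `lcs_length` and `overlap_percentage` are our own):

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming LCS length ([CLRS] 15.4)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def overlap_percentage(text_a, text_b):
    """Percentage of text_a's characters that also appear, in order, in text_b."""
    return 100.0 * lcs_length(text_a, text_b) / len(text_a)
```

As the next sections show, this direct form is both too permissive (false positives) and too expensive (quadratic memory) for the real inputs.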

But the practice is hard

Problem1: the detection quality

Let us suppose that Greg was not really inspired and just handed in the following very short text:
It rains.
In this case, the LCS of Alice and Greg is
It rain.
This leads to the wrong conclusion that Greg copied 7/8 = 87% of his essay from Alice. This problem of false positives will appear every time one of the essays is much shorter than the other.

Since it seems that taking characters as the elements of the sequences is too fine-grained, the comparison could be done at the granularity of lines instead of characters. This is how the diff utility works. However, diff was developed as a versioning tool, to detect changes between versions of the same document (lines added, changed, deleted), not as a plagiarism detection tool. Using diff for plagiarism detection would lead to a number of false negatives, leaving undetected clever students such as Alice and Bob, who modify the formatting (line length) of the text.

It is your task to find a way to apply the LCS algorithm so that it reduces the number of both false positives and false negatives in plagiarism detection.
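One possible middle ground between characters and lines (an assumption, not the required solution) is to compare sequences of words: words are coarse enough to avoid Greg-style accidental matches, yet insensitive to the line reformatting that fools diff. A sketch of such a tokenizer:

```python
import re

def tokenize(text):
    """Split text into lowercased word tokens. Punctuation, line breaks, and
    extra whitespace are discarded, so changing only the line layout of a
    text (as Bob did) leaves the token sequence unchanged."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

The LCS algorithm is then run on the token sequences instead of the character strings; this also shrinks the problem size, since an essay has far fewer words than characters.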

Problem 2: the scalability of the tool

The essays are expected to have up to 20000 words, which leads to file sizes (number of characters) of up to 150 KB.

In this case, if we apply the LCS algorithm directly to the strings of characters resulting from the files, we would need a dynamic programming table of about 20 GB of memory, and we would have a slow running time due to the same number of iterations.
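The memory blow-up, at least, has a well-known remedy when only the LCS *length* is needed: row i of the table depends only on row i-1, so two rows suffice. A sketch (note this yields the length only, not the subsequence itself):

```python
def lcs_length_two_rows(a, b):
    """LCS length in O(min(len(a), len(b))) memory: each DP row depends only
    on the previous one, so two rows suffice."""
    if len(b) > len(a):            # keep the shorter sequence along the row
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]
```

The running time is still quadratic, which is why coarsening the sequence elements (e.g. words instead of characters) also matters for the time budget.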

It is your task to implement a tool that scales up to the requirements described in the first section.

An Extra-Optional Requirement

You can gain a second award point for implementing a version of the LCS algorithm with linear memory complexity that is still able to recover the contents of the LCS sequence!

Hint: this goes beyond the algorithm presented in class, and you will have to research it by yourself. You may look up Hirschberg's algorithm.
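To give a feel for the divide-and-conquer idea (a compact sketch of Hirschberg's 1975 scheme, not a reference implementation): compute one DP row for the left half forwards and one for the reversed right half backwards, pick the split point of b that maximizes the combined length, and recurse.

```python
def _lcs_row(a, b):
    """Last row of the LCS length table for a vs b, in O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            curr[j] = prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1])
        prev = curr
    return prev

def hirschberg(a, b):
    """Recover an actual LCS of a and b using only linear memory."""
    if not a or not b:
        return ""
    if len(a) == 1:                      # base case: single element
        return a if a in b else ""
    mid = len(a) // 2
    # Score the left half forwards and the right half backwards, then split
    # b at the column that maximizes the combined LCS length.
    left = _lcs_row(a[:mid], b)
    right = _lcs_row(a[mid:][::-1], b[::-1])
    k = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])
```

The total work stays O(len(a) * len(b)) because the recursion halves a while the row computations shrink geometrically, but the memory is only O(len(a) + len(b)).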