This project is optional.
Submitting a complete and good project in time brings one award point !
The hard deadline for this is 20.05.2019
by e-mail to ioana.sora@cs.upt.ro
You must submit the source
code of your implementation and a short text description of how you
implemented and adapted the LCS algorithm to fit the given problem, as
well as other design decisions/optimizations/heuristics/etc that you
applied.
If you have further questions related to this project, you may send them by e-mail to ioana.sora@cs.upt.ro.
Implement a plagiarism detection tool based on the LCS algorithm
The LCS algorithm has been discussed in class (see slides). Also more details on the LCS algo are discussed in [CLRS] - chap 15.4)
The tools takes arguments in the command line, and depending on these arguments it can function in one of the following two modes:
Pair comparison mode: -p file1 file2
In pair comparison mode, the tool takes as arguments the names of two text files and displays the content found to be identical in the two files.
In pair comparison mode, the tool must be able to analyze texts such these from test sets set1 and set2
Tabular mode: -t dirname
In tabular mode, the tool takes as argument the name of a directory (for example D:\\Essays\\) and produces a table containing for each pair of distinct files (file1, file2) the percentage of the contents of file1 which can be found also in file2.
The analysis of set2 must not take longer than a couple of minutes.
In tabular mode, the tool must be also able to analyze without crashing text files of the size of these in test set set3
The tool must produce good detection results, detecting similar text even if:
If the chapters of the compared essays have been reordered (chapter 1 becomes chapter 2, while chapter 2 becomes chapter 1 in the plagiarized file), it is an acceptable limitation of this tool that it will not detect the global plagiarism, but only per longest chapter.
Alice:
I have a pet dog. His name is Bruno. His body is covered with bushy white fur. He has four legs and two beautiful eyes. My dog is the best dog one can ever have.
Bob:
I have a cat. His name is Paw. His body is covered with shiny black fur. He has four legs and two yellow eyes. My cat is the best cat one can ever have.By applying straightforward the LCS algorithm on the strings resulting from Alice's and Bob's esays, following longest common subsequence results:
I have a pet . His name is . His body is covered with shy fur. He has four legs and two el eyes. My is the best one can ever have.The length of the LCS divided to the length of Alice's essay gives the percentage of text in Alice that is also found in Bob. If this percentage is high enough(over a treshold such as 70%) then we have an indication of possible plagiarism.
It rains.In this case, the LCS of Alice and Greg is
It rain.This leads to the wrong conclusion that Greg copied 7/8 = 87% of his essay from Alice. This problem of false positives will appear every time when one of the essays is much shorter that the other.
Since it seams that taking characters as the elements of the sequences is too finegrained, the comparison could be done at the granularity level of lines instead of characters. It is how the diff utility works. Diff has been developed as a versioning tool, to detect changes between versions of the same document (lines added, changed, deleted), and not as a plagiarism detection tool. Using diff for plagiarism detection will lead to a number of false negatives, leaving undetected clever students such as Alice and Bob who modify the formatting (line length) of the text.
It is your task to find a way to apply the LCS algorithm so that it reduces the numbers of both false positives and false negatives in plagiarism detection.
In this case, if we apply the LCS algorithm on the strings of characters directly resulting from the files, we would need a dynamic programming table of about 20 GB of memory and the have a slow running time due to the same number of iterations.
It is your task to implement a tool that scales up to the requirements described in the first section.
Hint: this goes beyond the algorithm presented in class and you will have to research by yourself. You may look up for Hirschberg's algorithm.