Compare and Detect Similar Text Files in a Directory
by salmanahmed from LinuxQuestions.org on (#5H8YX)
Hi
I have around 100 text files in a directory. Some of these files are similar to each other (they contain similar text and sentences), but they are not identical:
1. Some are 20-25% similar to one or more other files
2. Some are 40-50% similar to one or more other files
and so on...
I want to know which of these files contain text similar to which other files. Checking all of them manually would be difficult and time-consuming, so my question is: is there a tool in Linux that can find this kind of similarity between files? Or do I have to write a script (or function) to do it? If so, how?
PS: I know about 'diff', but so far I have only used it to compare two files. I don't know whether 'diff' can be used in a situation like mine.
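To make the script option concrete, here is a minimal sketch of the kind of thing I have in mind, assuming Python 3 is available. It compares every pair of files line by line with the standard library's difflib and lists the pairs whose similarity ratio is at or above a cutoff. The function names, the "*.txt" pattern, and the 0.2 threshold are only placeholders I made up for illustration, and the ratio difflib reports may not match what a person would call "20% similar".

```python
import itertools
from difflib import SequenceMatcher
from pathlib import Path


def similar_pairs(directory, threshold=0.2, pattern="*.txt"):
    """Return (ratio, name_a, name_b) for each file pair with ratio >= threshold.

    The threshold and glob pattern are illustrative defaults, not tuned values.
    """
    # Read every matching file once and keep its lines in memory.
    lines = {
        path.name: path.read_text(encoding="utf-8", errors="replace").splitlines()
        for path in sorted(Path(directory).glob(pattern))
    }
    results = []
    # Compare each unordered pair exactly once (about 4950 pairs for 100 files).
    for (name_a, a), (name_b, b) in itertools.combinations(lines.items(), 2):
        # ratio() is 2 * matching_lines / total_lines, a value between 0 and 1.
        # autojunk=False keeps frequently repeated lines in the comparison.
        ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
        if ratio >= threshold:
            results.append((ratio, name_a, name_b))
    # Most similar pairs first.
    return sorted(results, reverse=True)


def print_report(directory, threshold=0.2):
    """Print one line per similar pair, e.g. for a quick look in a terminal."""
    for ratio, name_a, name_b in similar_pairs(directory, threshold):
        print(f"{ratio:6.1%}  {name_a}  {name_b}")
```

Calling print_report("/path/to/dir") would then print one line per similar pair. This only matches whole lines, so files that share sentences but wrap them differently would score low; splitting the text into words or sentences instead of lines might fit better, at the cost of slower comparisons.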

