"Clustering and similarity of the text documents"
Hello,
I have recently been working on methods for extracting keyphrases from text. Now I would like to solve another problem: clustering documents and measuring the similarity between them.
It goes like this: suppose we have some security documents from various sources. I would like to examine these documents and cluster them. Sometimes several sources publish a document about the same topic/device/problem. The goal is to find these 'overlapping' documents and put them in one cluster. Such documents have the following features: the structure may differ and some words may be added, but the key phrases are the same, mainly a number that identifies a report, or other key phrases that appear repeatedly. Any suggestions about the model? I've tried several clustering parameters and metrics, but the results are not good. An approach based on the frequency of common words would fail because of the specific structure of the documents. Thanks in advance for any suggestions.
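To make the idea of "same key phrases, different wording" concrete, here is a minimal sketch in Python that compares documents only by the identifier-like tokens they contain, using Jaccard similarity. The regular expression, the `REP-`/`ADV-` identifier formats and the sample texts are all made up for illustration; the real report numbers may look quite different.

```python
import re

# Hypothetical pattern: identifiers such as "REP-2021-1234" or "ADV-2021-77".
# The real report-number format is an assumption and must be adapted.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{2,}(?:-\d+)*\b")


def extract_ids(text):
    """Return the set of identifier-like tokens found in the text."""
    return set(ID_PATTERN.findall(text))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up sample documents, not real data.
doc_a = "Vendor advisory: REP-2021-1234 affects the router firmware."
doc_b = "Blog post. The router bug (REP-2021-1234) was also filed as ADV-2021-77."
doc_c = "Unrelated notice about REP-2020-9999."

sim_ab = jaccard(extract_ids(doc_a), extract_ids(doc_b))  # one shared ID out of two
sim_ac = jaccard(extract_ids(doc_a), extract_ids(doc_c))  # no shared ID
print(sim_ab, sim_ac)
```

Because ordinary words are ignored, changes in structure or added text do not affect the score; only the shared identifiers do.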
Answers
Dear Zacev,
as a first question: Is it possible to make this a supervised problem by having annotated data? That would make life way easier.
~Martin
Dortmund, Germany
Would you like me to provide samples of the documents I am working with, or the process? I'm not sure if I understood correctly.
I have uploaded the full process. So far I have taken 6 documents from three different sources. The clustering successfully put these documents into 3 different clusters, but in such a way that all the documents from one source belong to the same cluster. Now, as I wrote, I would like the documents to be clustered by keywords or ID numbers instead: if two documents concern the same device name, they should be put together (no matter which source they come from).
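The grouping I have in mind could be sketched roughly like this in Python (outside the uploaded process). It assumes the device names and ID numbers have already been extracted for each document; the document names and keys below are invented placeholders, and `cluster_by_shared_keys` is just an illustrative helper, not part of any library.

```python
from collections import defaultdict


def cluster_by_shared_keys(doc_keys):
    """Group documents that share at least one key, directly or through a chain.

    doc_keys maps a document name to the set of keys (device names,
    ID numbers) found in it. Returns a sorted list of sorted name lists.
    """
    parent = {doc: doc for doc in doc_keys}

    def find(doc):
        # Follow parent links to the group's representative (with path halving).
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Merge every document with the first document seen for each of its keys.
    first_seen = {}
    for doc, keys in doc_keys.items():
        for key in keys:
            if key in first_seen:
                union(first_seen[key], doc)
            else:
                first_seen[key] = doc

    groups = defaultdict(list)
    for doc in doc_keys:
        groups[find(doc)].append(doc)
    return sorted(sorted(members) for members in groups.values())


# Invented example: 6 documents from three sources, keys already extracted.
docs = {
    "source1_a": {"router-x100"},
    "source2_a": {"router-x100", "REP-2021-1234"},
    "source3_a": {"REP-2021-1234"},
    "source1_b": {"camera-z9"},
    "source2_b": {"camera-z9"},
    "source3_b": {"switch-q7"},
}
clusters = cluster_by_shared_keys(docs)
print(clusters)
```

Here the source of a document plays no role; only the shared keys do. Note that the merging is transitive: `source1_a` and `source3_a` share no key directly but end up together through `source2_a`, which may or may not be desirable if some keys are very common.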