Web Mining - Web Page Similarity
Hello,
I am a beginner with RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.
I want to find similarities between some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. I then use the "Data to Documents" and "Process Documents" operators. I then tokenize the web pages, filter stopwords, and transform cases. Finally, I use the "Data to Similarity" operator.
However, I notice that my results contain a lot of HTML tokens, which I do not want. I know that the "Extract Content" operator can strip away the HTML, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means that I am unable to strip the HTML if I want to fetch multiple pages at once using the "Get Pages" operator.
Could somebody advise on how to do this? I will be really thankful!
Have a good day!
- Prat
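For readers comparing results: the pipeline described above (strip HTML, tokenize, lowercase, drop stopwords, compute pairwise similarity) can be sketched outside RapidMiner in plain Python. This is only an illustration of the idea, not how the RapidMiner operators are implemented. The names `strip_html`, `tokenize`, `page_similarities`, and the tiny stopword list are hypothetical, and cosine similarity on raw term counts is an assumed choice of measure.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text content of one HTML page, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_similarities(pages):
    """pages maps a name to raw HTML; returns {(name_a, name_b): similarity}."""
    vectors = {
        name: Counter(tokenize(strip_html(html))) for name, html in pages.items()
    }
    names = sorted(vectors)
    return {
        (x, y): cosine_similarity(vectors[x], vectors[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical sample pages standing in for the fetched web pages.
    sample = {
        "page1": "<html><body><h1>Data mining</h1><p>Text mining tools</p></body></html>",
        "page2": "<html><body><p>Tools for text mining</p><script>var x = 1;</script></body></html>",
    }
    for pair, score in page_similarities(sample).items():
        print(pair, round(score, 3))
```

The key point is that the HTML is stripped once per page, before tokenizing, so tag names and attributes never reach the similarity step.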
Answers
Happy Mining!
~Marius