Classify webpages into 4 groups using a set of keywords
Hello,
I have used the following operators so far: Read Excel -> Get Pages -> Data to Documents -> Process Documents -> Select attributes
What I want is to classify around 450 webpages into 4 categories according to the words they use.
So, for example, if a website makes heavy use of the following group of words (not necessarily all of them): "a", "b", "c", "d", "e", etc., it will be classified as "Category ABC"; if it makes more use of the words "z", "x", "v", "u", etc., it will be classified as "Category ZXV", and so on. I want this to cover 4 categories. For each category I have a set of 14 to 16 related words.
Now, I would like to associate each word with a category AND have RM analyse the words of all documents (in this case, websites) and determine which website belongs to which category based on the occurrence and frequency of the words they use.
Is this possible to do with RM? And (assuming I did everything correctly in the process above) how can I proceed from here?
Many many thanks for your help.
Best,
Katia
Answers
Dear Katia,
Check out this article: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067 It shows how to build such a model in RM.
~Martin
Dortmund, Germany
Dear Martin,
Thank you so much for your reply. I am sure this is all very well explained, however, if you are a qualitative researcher and not familiar with RM it is still very hard to follow the steps.
For example:
1) How do I create a dictionary? In Word? In Excel? Should I create a table and list all the words per category? (e.g. data, tools, science, etc. as the "analytical" category; arts, creative, studio, etc. as the "symbolic" category) What is the format? How do I identify each word as part of a specific category?
2) How do I attribute weights to each word? Based on what? My goal is to find a way to assume that if a group of words occurs often in a website (more than another group of words) this website could be classified as "analytical" or "symbolic". I don't think they have particular weights. What is a negative weight here? Also, different categories have different number of words. For example, "analytical" has 16 words allocated to it, while "symbolic" has 12 words associated to it. How does this influence the weights?
Is there no operator that would allow me to do this automatically? Like an automated classification?
Apologies for my very amateur questions
Best,
Katia
Hi Katia,
Well, yes, there are operators that classify automatically, either from annotated examples (supervised learning) or by clustering. If you have a solid base of annotated examples, I would highly recommend the supervised route. These operators usually do something like building such a dictionary internally.
I thought you already had a list of keywords that you had assigned to specific classes. I've just realized that the linked article is not a 100% fit for your task, because it only distinguishes between two classes.
So the question is - What do you want to do?
1. Have a list of words where you say "if x is in Doc1, it's most likely class A", and use this list to group.
2. Do an automated grouping.
3. Do a classification based on annotated documents (you have ~100 documents where you can say it's group A, B, C or D).
If you choose #1, Excel is a solid option. You could also use the built-in data editor.
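To make option #1 concrete outside RapidMiner, here is a minimal Python sketch of the idea, not the linked article's method. It assumes the dictionary is a two-column table (word, category), as you might type it into Excel, and that the page text has already been fetched. The names `DICTIONARY_ROWS`, `score_text` and `classify` are made up for illustration. Each category's hit count is divided by the length of its word list, which is one possible way to keep a 16-word category from being favoured over a 12-word one.

```python
import re
from collections import Counter

# Hypothetical dictionary: one (word, category) row per keyword,
# mirroring a two-column Excel sheet.
DICTIONARY_ROWS = [
    ("data", "analytical"),
    ("tools", "analytical"),
    ("science", "analytical"),
    ("arts", "symbolic"),
    ("creative", "symbolic"),
    ("studio", "symbolic"),
]

def build_lookup(rows):
    """Map each lower-cased keyword to its category."""
    return {word.lower(): category for word, category in rows}

def category_sizes(rows):
    """Count how many keywords each category has."""
    return Counter(category for _, category in rows)

def score_text(text, lookup, sizes):
    """Keyword hits per category, divided by the size of that category's list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter(lookup[t] for t in tokens if t in lookup)
    return {cat: hits.get(cat, 0) / sizes[cat] for cat in sizes}

def classify(text, lookup, sizes):
    """Return the best-scoring category, or None if no keyword occurs at all."""
    scores = score_text(text, lookup, sizes)
    best = max(scores, key=scores.get)  # ties go to the first category listed
    return best if scores[best] > 0 else None

lookup = build_lookup(DICTIONARY_ROWS)
sizes = category_sizes(DICTIONARY_ROWS)

pages = {
    "site_1": "We build data tools for science and data teams",
    "site_2": "A creative studio for the arts",
}
for name, text in pages.items():
    print(name, classify(text, lookup, sizes))
```

No weights are needed in this variant: every keyword counts equally, and a page with no keyword at all is left unclassified.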
~Martin
Hi Martin,
Many thanks again.
The thing is, I don't want the classification to be random. I just want to look for particular words (not all the words present on the websites) which I have considered (based on literature and previous studies) to be associated to specific categories. So in that sense, I have already a "list" created. This works like having latent variables: the presence of a specific set of words, instead of another one, will be "symptomatic" of a specific category.
But ideally, I would like RM to do the classification by itself and just ask for those words. Something that would allow me to label the words into the right categories. Then, by looking into all 450 websites, considering the words I am providing and the information of related categories, I would have an automated classification of each website. And therefore, I would be able to say that websites 1, 2 and 3 are "analytical" types of website, while websites 6, 7 and 8 are "symbolic" types of website, for example. And so forth for the 4 categories and lists of words.
The TF-IDF functionality is quite useful. But from here, I was expecting something more automated to classify the websites for me.
Regarding the annotated examples - I can look into the websites and tell (through a subjective analysis) if they are analytical or symbolic, but I need a more objective way (through the exact words used) to classify them. Therefore, I don't have annotated examples I could use because I am not able to look at the website and through the words determine the categories.
I hope this makes sense
Thank you so much.
Best,
Katia
Hi Katia,
I would map this to a clustering on a defined set of words. Where and in which format do you have the data?
~Martin
Hi Martin,
Thank you.
Regarding my data, I have the following two options:
I could use two sources/ datasets:
To facilitate the analysis, I could use the second dataset and run a cluster analysis. But in this case, I still need to cluster the words to check whether they separate into the 4 categories I am assuming (each category should have around 20 words associated with it). Is it possible to cluster by the columns (words) instead of by the rows (firms’ websites)? I would like to see how the words cluster themselves, i.e. whether “analytical” words cluster together and “symbolic” words cluster together. If yes, then I could see which companies fall inside those clusters.
Is this possible?
Another question is: Can I analyse TF-IDF with this dataset? Or do I need textual information for that?
Many thanks.
Best,
Katia
Hi Katia,
You could use both data sets to do the clustering. For the raw data you would need to download the pages, prepare them, compute TF-IDF and then cluster. Doable, but more complex than the other option.
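As a rough illustration of the TF-IDF step on raw text, here is a minimal Python sketch restricted to a fixed keyword list. It assumes the page texts are already downloaded and uses the textbook formula (term frequency times log of N over document frequency); RapidMiner's own weighting and normalisation may differ. The function name `tfidf_for_keywords` is made up for this example.

```python
import math
import re
from collections import Counter

def tfidf_for_keywords(documents, keywords):
    """Return one TF-IDF vector per document, with values in keyword order.

    tf  = occurrences of the keyword / number of tokens in the document
    idf = log(number of documents / number of documents containing the keyword)
    """
    keywords = [kw.lower() for kw in keywords]
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    token_sets = [set(tokens) for tokens in tokenized]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each keyword appear?
    df = {kw: sum(1 for s in token_sets if kw in s) for kw in keywords}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        row = []
        for kw in keywords:
            tf = counts[kw] / total
            idf = math.log(n_docs / df[kw]) if df[kw] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors

docs = ["data data tools", "arts studio", "data arts"]
for vector in tfidf_for_keywords(docs, ["data", "arts", "tools"]):
    print(vector)
```

A keyword that appears in every document gets weight 0 under this formula, because it cannot help to tell the pages apart.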
The Excel table you have is quicker to use. You can apply a clustering directly to it (preferably k-medoids with cosine similarity). Afterwards you could analyse which words are important for which cluster. Hopefully your "ana" words occur in just one or two clusters.
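For readers who want to see what this does under the hood, here is a minimal Python sketch of k-medoids with cosine distance on a websites-by-words count table, followed by a per-cluster average of each word column as one simple way to see which words matter for which cluster. It is a simplified alternating variant with a deterministic farthest-first start, not RapidMiner's implementation, and the table values are invented.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; all-zero rows are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def k_medoids(rows, k, max_iter=100):
    """Alternate between assigning rows to the nearest medoid and re-picking medoids."""
    n = len(rows)
    dist = [[cosine_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]

    # Deterministic start: row 0, then repeatedly the row farthest from all medoids.
    medoids = [0]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))

    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

def cluster_word_means(rows, labels, words):
    """Average count of every word within each cluster."""
    summary = {}
    for c in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == c]
        summary[c] = {w: sum(r[j] for r in members) / len(members)
                      for j, w in enumerate(words)}
    return summary

# Invented example table: one row per website, one column per keyword.
words = ["data", "tools", "arts", "studio"]
table = [[5, 4, 0, 0], [6, 3, 0, 1], [0, 0, 4, 5], [1, 0, 3, 6]]

labels, medoids = k_medoids(table, k=2)
for cluster, means in cluster_word_means(table, labels, words).items():
    print(cluster, means)
```

Words with a high average in one cluster and a low average in the others are the ones that characterise that cluster.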
The other idea would be to transpose the table and cluster afterwards. That would mean that you cluster together words which occur together.
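A minimal Python sketch of the transposed view, again with an invented table: after transposing, each word becomes a row described by its counts across websites, and cosine similarity between those rows shows which words tend to occur together. The helper names are hypothetical.

```python
import math

def transpose(rows):
    """Turn a websites-by-words table into a words-by-websites table."""
    return [list(column) for column in zip(*rows)]

def cosine_similarity(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar_words(rows, words):
    """For each word, find the other word with the most similar usage profile."""
    columns = transpose(rows)
    result = {}
    for i, word in enumerate(words):
        others = [(cosine_similarity(columns[i], columns[j]), words[j])
                  for j in range(len(words)) if j != i]
        result[word] = max(others)[1]
    return result

# Invented example table: one row per website, one column per keyword.
words = ["data", "tools", "arts", "studio"]
table = [[5, 4, 0, 0], [6, 3, 0, 1], [0, 0, 4, 5], [1, 0, 3, 6]]

print(most_similar_words(table, words))
```

The transposed rows can be fed into the same clustering as before; if the "analytical" and "symbolic" word lists are meaningful, words from the same list should end up close to each other.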
~Martin
Hi Martin,
Thank you so much for all the help so far.
I am going for the second option, which is to use the Excel table. I have already done the clustering (k-medoids with cosine similarity) and it seems to work fine. I am just not sure how to analyse which words are important for each cluster OR how to transpose the table and cluster. Is there any tutorial on how to do it?
Many thanks.
Best,
Katia