Grab Meta-keywords, frequency lists
leptserkhan
Member Posts: 7 Contributor II
Hello. I am new to RapidMiner and I am wondering whether it is suitable for my project.
My project needs to analyze the similarity or dissimilarity between the meta keywords contained in web pages.
My basic questions for this type of analysis are:
- Can RapidMiner take a list of URLs and crawl those domains, grabbing ONLY the meta keywords? I am not interested in analyzing the entire content of those websites, only the categorization/analysis of the meta keywords they contain.
- Can RapidMiner do some standard categorization on the meta keywords, providing frequency lists and themes of words?
- Can it then produce a graph of that analysis?
- Can RapidMiner be configured to apply more weight to certain words? For example, the word "employment", if contained in a meta keyword on a web page, would "weigh" heavier in the results than any other word in this analysis. If so, how is that feature accomplished?
- What would be the general steps to take to import the data and provide this analysis?
Answers
2. "Can RapidMiner take a list of URLs and crawl those domains" — yes: Web Mining: Crawl Web (or Get Pages).
3. "grabbing ONLY the meta-keywords" — yes: Text Mining: Keep Document Parts (regex-based).
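As a sketch of what that regex-based extraction does, here is the equivalent in plain Python (assuming a standard double-quoted `<meta name="keywords" content="...">` tag; the real operator just takes the regex as a parameter):

```python
import re

# Case-insensitive pattern for a standard meta keywords tag; assumes
# double-quoted attributes with name before content (a simplification).
META_KEYWORDS = re.compile(
    r'<meta\s+name="keywords"\s+content="([^"]*)"',
    re.IGNORECASE,
)

def extract_meta_keywords(html):
    """Return the list of meta keywords from an HTML page, or [] if none."""
    match = META_KEYWORDS.search(html)
    if not match:
        return []
    return [kw.strip() for kw in match.group(1).split(",") if kw.strip()]
```

For example, `extract_meta_keywords('<meta name="keywords" content="employment, jobs">')` returns `['employment', 'jobs']`.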
4. "Can RapidMiner do some standard categorization on the meta keywords, providing frequency lists and themes of words?"
Yes, but it depends what you mean by theme. It can count occurrences, relative frequencies, frequencies relative to the other documents, or binary occurrence. You might be able to analyze synonyms using SVD.
5. "Can it then produce a graph of that analysis?"
What kind of graph?
6. "Can RapidMiner be configured to apply more weight to certain words, i.e., the word "employment", if contained in a meta keyword on a web page, would "weigh" heavier in results than any other words in this analysis? If so, how is that feature accomplished?"
Text Processing:Process Documents operator -> select attributes and weights
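The effect of that weighting step can be illustrated outside RapidMiner like this (a minimal sketch with a hypothetical weight table; the operator configures the same idea through its parameters):

```python
# Hypothetical per-term weights: 'employment' counts three times as much
# as any other token when building the term vector.
TERM_WEIGHTS = {"employment": 3.0}

def weighted_counts(tokens, weights=TERM_WEIGHTS, default=1.0):
    """Build a term -> weighted-count vector from a token list."""
    vector = {}
    for token in tokens:
        vector[token] = vector.get(token, 0.0) + weights.get(token, default)
    return vector
```

For example, `weighted_counts(["employment", "jobs", "employment"])` gives `{"employment": 6.0, "jobs": 1.0}`, so pages mentioning "employment" pull more strongly in any downstream similarity measure.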
7. "What would be the general steps to take to import the data and provide this analysis?"
crawl
remove all but meta
lower case
tokenize
[stem]
process documents
- vectorize
- weight attributes
then Modeling - Similarity - Similarity to Data - Cosine Distance
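The steps above can be sketched end to end in plain Python (static keyword strings stand in for the crawl; this approximates, not reproduces, what the RapidMiner operators do):

```python
import math
import re

def tokenize(text):
    """Lower-case and split on non-word characters (rough stand-in for
    the lower-case + tokenize steps)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def term_vector(tokens):
    """Term-count vector (the 'vectorize' step)."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Stand-ins for the meta keywords grabbed from two crawled pages.
page_a = "employment, jobs, careers, hiring"
page_b = "employment, careers, resume"

vec_a = term_vector(tokenize(page_a))
vec_b = term_vector(tokenize(page_b))
print(cosine_similarity(vec_a, vec_b))
```

The two pages share 2 of their keywords, so the similarity lands between 0 (nothing shared) and 1 (identical keyword lists).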
A question I left out that I think is important in all of this: I know there is a process in text mining (I forget the name) whereby low-frequency words can be given higher significance and high-frequency words lower significance. Let me illustrate:
Say we pull back a corpus of 1,000 meta keywords from maybe 50 websites. Suppose 40 of those websites contribute word types that mostly have to do with engineering, while the remaining 10 websites contribute only a small fraction of the total meta keywords, say 60 words, but those are the ones I need to stand out the most. How does one go about highlighting the words from those 10 websites (60 words) as more significant than the other 940 words in the corpus? And vice versa?
I don't know if text mining has a feature or algorithm for this.
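What is described here sounds like inverse document frequency (IDF) weighting, where terms appearing in few documents get boosted and terms appearing everywhere get damped (RapidMiner offers this as a TF-IDF vector-creation option in Process Documents). A minimal sketch of the idea, with made-up example documents:

```python
import math

def idf_weights(documents):
    """IDF per term: log(N / document-frequency). Terms appearing in
    every document get weight 0; rarer terms get higher weights."""
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}

# Hypothetical keyword lists from three sites: 'engineering' is everywhere,
# 'employment' appears on only one site and should stand out.
docs = [
    ["engineering", "design", "cad"],
    ["engineering", "mechanical"],
    ["employment", "engineering"],
]
weights = idf_weights(docs)
```

Here `weights["engineering"]` is 0 (it appears in all three documents) while `weights["employment"]` is log(3), the highest possible, which is exactly the "rare words stand out" behaviour asked about.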
Right now I could use some direction, help, tutorial on using the web crawler features.
Something step by step somewhere that shows me how to do basic web crawling: grab a page or several pages, download and extract and/or categorize text from those pages.
That I think would be the best use of my time so I can learn and not sound foolish with questions.
Thank you.
Any advice greatly appreciated on how to start learning the web crawl features.
Although I did not fully understand what you are going to accomplish, here are some directions:
The "Process Web" operator might suit your needs even better than the Crawl operator, because you can extract the information from a website during crawling and keep only that information. This can significantly lower the memory consumption of large crawling runs.
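The memory point can be illustrated in plain Python (an illustrative sketch, not RapidMiner internals; `fetch` is a hypothetical page-downloading function you would supply):

```python
import re

META_RE = re.compile(
    r'<meta\s+name="keywords"\s+content="([^"]*)"', re.IGNORECASE
)

def crawl_keywords(urls, fetch):
    """Yield (url, keywords) pairs, keeping only the extracted keywords
    in memory; each page body is discarded as soon as it is processed."""
    for url in urls:
        html = fetch(url)             # download one page
        match = META_RE.search(html)  # extract during the crawl
        keywords = match.group(1).split(", ") if match else []
        yield url, keywords           # the page body goes out of scope here
```

Because it is a generator that drops each page after extraction, memory use stays proportional to the keyword lists rather than to the full set of downloaded pages.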
There is documentation available for each operator if you open its help page. I think the most important parameters for crawling, such as the rule definition, are explained there. If you hold the mouse cursor over a parameter for a few seconds, a tooltip will explain what it does.
I think with this information it becomes understandable after some time of experimenting.
For more detailed information, please consider taking part in the text mining and web mining courses available in our shop. They give detailed descriptions and demonstrate the whole process of crawling, processing, extracting, learning, and applying.
Greetings,
Sebastian
Although I can see that this product is fantastic and head and shoulders above anything else on the market now, the cost of seminars excludes organizations like mine from participating. Even for the most basic understanding, a user needs a lot of patience and a grasp of regular expressions. The existing documentation gives a general overview of the product with examples only of the more sophisticated uses, which, again, requires one to attend not just one but several seminars/webinars to understand it fully.
Good luck with this product. I see that it is still evolving and holds great promise.
Yes, text mining is complicated. Try using GATE...now that is hard. RapidMiner makes it "easy".
There are plenty of videos on YouTube. Check out VancouverData (me), NeuralMarketTrends1, and DrMarkusHofmann channels.
In the meantime, here is an *example* process that does a simple similarity check:
The input data is an Excel sheet that looks like this; you can copy and paste the above into RapidMiner's XML view and run it.
It can create a pretty graph like this, that shows the similarity:
I know what you mean; some other data mining tools are extremely difficult to learn and use. I do think RapidMiner is a great tool, and far easier than most, but for a newbie data miner and text analyzer it does require a wee bit more learning than I expected. But I will put my faith in it and keep trying, as you suggest.
I haven't given up. I see that RapidMiner will do almost anything I need once I learn the skill set.
It would be very useful if someone could build a video on the use of the web crawler processes. The other videos are excellent in what they present, and in fact answer many of the questions a new user would have, but web crawling has been left out of the available videos. So I will examine each of the other videos to tease out useful information that I can then apply to web crawling.
Thank you.
We have a webinar on web mining:
http://rapid-i.com/component/page,shop.product_details/flypage,garden_flypage.tpl/product_id,29/category_id,17/option,com_virtuemart/Itemid,180/
Cheers,
Simon
Thanks for sharing the XML code for this keyword similarity process; it helped me look at things a bit differently as I'm learning text mining.
Best Regards,
Tom
www.neuralmarkettrends.com
PS: nice blog!