"Information Retrieval and weighting by HTML tags"
simon_knoll
Member Posts: 40 Contributor II
Hello,
Is there a possibility within RapidMiner to weight extracted terms by the HTML or XML tags in which they are contained?
Example:
"<h1>Stock Quotes</h1>"
is rated higher than
"<h4>Phone number</h4>"
regards
Simon Knoll
Answers
I added this today to the new Text Processing Extension of RapidMiner 5. The current plugin does not support it, so you may want to wait until we release RapidMiner 5...
Greetings,
Sebastian
Could you do me a favor and show me a short example of how I can apply weights for HTML tags, or tell me which operators I need?
regards,
Simon Knoll
The Text Processing Extension contains an operator for evaluating XPath queries. It's called Generate Extract. If you have stored the contents of a web page in an ExampleSet, you can use this operator to extract the content of an h4 tag as a new attribute. The current version of the Process Documents from Data operator lets you select the attributes from which the text should be taken. In this list, you can also assign a weight to each attribute. Combining these two things should suit your needs.
If this does not prove helpful, we could think about implementing some sort of weight applier that assigns a weight to a token if it fulfills some condition.
Greetings,
Sebastian
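For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "extract per tag, then weight per tag". It is not the Generate Extract or Process Documents implementation. The weights in TAG_WEIGHTS are made up (the thread only says h1 should count more than h4), and the input is assumed to be well-formed XHTML, because the standard-library parser rejects sloppy HTML.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical per-tag weights; the thread only says h1 should count more than h4.
TAG_WEIGHTS = {"h1": 3.0, "h4": 1.0}

def weighted_term_counts(xhtml, tag_weights=TAG_WEIGHTS):
    """Sum term counts over the selected tags, scaling each tag's terms by its weight."""
    root = ET.fromstring(xhtml)
    totals = Counter()
    for tag, weight in tag_weights.items():
        # ".//tag" is part of the limited XPath subset that ElementTree supports.
        for element in root.iterfind(f".//{tag}"):
            text = "".join(element.itertext()).lower()
            for token in re.findall(r"[a-z0-9]+", text):
                totals[token] += weight
    return dict(totals)

page = "<html><body><h1>Stock Quotes</h1><h4>Phone number</h4></body></html>"
counts = weighted_term_counts(page)
print(counts)
```

With these assumed weights, every term from the h1 heading ends up three times as heavy as a term from the h4 heading.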
But I've got some problems with the "Generate Extract" operator. More precisely, I'm not getting any results, or rather I'm getting empty results :-)
Maybe I'm using it the wrong way.
Regards,
simon
The problem with your setup is that the source attribute does not exist. My problem with that is that the operator does not complain about it, but instead simply doesn't deliver anything. I changed that behavior...
To get the text into an attribute, you can uncheck the create_word_vector parameter in the Process Documents operator and enable keep_text instead. Then a new attribute called text will be added containing the text. You can select this attribute in the Generate Extract operator, and then it works.
Greetings,
Sebastian
regards
simon
As the operator documentation tries to say, if a query results in an enumeration of items, for example "en,de,fr", then these values are separated using the given characters. In any case, to get more than one attribute you have to enter the search expression once per attribute name. Where should the operator store the second value if you enter only one attribute?
Greetings,
Sebastian
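The behavior described above (one query per target attribute, with several matches joined by a separator) can be illustrated with a small Python sketch. The function name, the sample document, and the path expressions are hypothetical and only serve the illustration; this is not how the RapidMiner operator is implemented.

```python
import xml.etree.ElementTree as ET

def extract_attributes(xml_text, queries, separator=","):
    """Run one path query per target attribute; join multiple matches with the separator."""
    root = ET.fromstring(xml_text)
    row = {}
    for attribute_name, path in queries.items():
        values = [(node.text or "").strip() for node in root.iterfind(path)]
        row[attribute_name] = separator.join(values)
    return row

# Hypothetical document: one title and an enumeration of three languages.
doc = "<doc><title>Quotes</title><lang>en</lang><lang>de</lang><lang>fr</lang></doc>"
row = extract_attributes(doc, {"title": ".//title", "languages": ".//lang"})
print(row)
```

Each entry in the queries mapping produces exactly one attribute, which is why a second value needs a second attribute name.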
Unfortunately I don't understand your suggestion. What I want to achieve is the following:
Given this "HTML" code, I want to extract all the href values (1,2,3,4,5,6,7,8,9,0).
If I use the following XPath expression, then from the XPath point of view the query returns all the hrefs.
To check this, you can simply test it at http://www.mizar.dk/XPath/Default.aspx
So my question is: how can I achieve that in RapidMiner?
How do you want to store the href values after retrieving them? Should each href be a single example, or do you want to have multiple attributes?
This is important because the two approaches differ completely.
Greetings,
Sebastian
Greetings,
Simon
Sorry for the late answer, but I simply didn't find the time to answer questions here in the forum in the meantime. Here's a process that shows how both ways work. Please keep in mind the restriction that every example of an example set must have the same attributes, so creating attributes depending on the content of a text cannot be done!
Greetings,
Sebastian
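The two storage layouts discussed above, and the fixed-attribute restriction, can be sketched in plain Python. This is not the RapidMiner process from the post. The markup in PAGE is invented, since the original snippet is not preserved, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the original snippet from the thread is not preserved.
PAGE = (
    "<ul>"
    '<li><a href="1">one</a></li>'
    '<li><a href="2">two</a></li>'
    '<li><a href="3">three</a></li>'
    "</ul>"
)

def hrefs(xml_text):
    """Return the href value of every <a> element that has one."""
    root = ET.fromstring(xml_text)
    return [a.get("href") for a in root.iterfind(".//a[@href]")]

def one_example_per_href(xml_text):
    """Way 1: each href becomes its own example (row)."""
    return [{"href": value} for value in hrefs(xml_text)]

def fixed_attributes(xml_text, width):
    """Way 2: one example with a fixed number of href attributes.

    Extra matches are dropped and missing ones become None, because every
    example must have the same attributes.
    """
    values = hrefs(xml_text)[:width]
    values += [None] * (width - len(values))
    return {f"href_{i + 1}": value for i, value in enumerate(values)}

rows = one_example_per_href(PAGE)
wide = fixed_attributes(PAGE, 4)
print(rows)
print(wide)
```

The width has to be chosen up front in the second layout, which mirrors the restriction that attributes cannot be created depending on the content of a text.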
Thank you, this really helped me.
Do you know where I can find the AttributeWeights and AttributeWeightsApplier operators in the RapidMiner GUI?
greetings,
simon
There are several weighting operators available in the Modeling / Attribute Weighting group. You can then use the Scale by Weights operator to apply these weights.
Greetings,
Sebastian
Thank you, but I could not figure out how to "create" weights for different attributes and pipe them to, for instance, the "Scale by Weights" operator.
best regards
simon
Take a look at the Data to Weights operator. With it you can convert an example set to a weight vector. You could create an example set containing these weights, for example with the logging functionality, and finally turn the log into an ExampleSet by using the Log to Data operator.
Greetings,
Sebastian
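Conceptually, the two steps are "turn a table of weights into a weight vector" and "multiply each attribute by its weight". A rough Python sketch of that idea follows. The function names only echo the operator names and are not their implementation. The column names "attribute" and "weight" are assumptions, as are the sample values.

```python
def data_to_weights(rows, name_key="attribute", weight_key="weight"):
    """Turn rows such as {"attribute": "a", "weight": 2} into a name -> weight map."""
    return {row[name_key]: float(row[weight_key]) for row in rows}

def scale_by_weights(examples, weights, default=1.0):
    """Multiply every numeric attribute by its weight; unknown attributes keep weight 1."""
    return [
        {name: value * weights.get(name, default) for name, value in example.items()}
        for example in examples
    ]

# Hypothetical weight table and a single example with two numeric attributes.
weight_rows = [
    {"attribute": "html_title", "weight": 2},
    {"attribute": "html_linktext", "weight": 1},
]
examples = [{"html_title": 1.0, "html_linktext": 3.0}]
scaled = scale_by_weights(examples, data_to_weights(weight_rows))
print(scaled)
```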
Thank you for your answer, but I don't get it.
So I have a process like this: in this process I extracted some features from an HTML document (for simplicity, generated in this process by the "Create Document" operator).
These extracted features result in the following example set. Now my question: how can I add weighting for the different features that I extracted (e.g. weight html_title with 2 and html_linktext with 1), which could then result in an example set like this (or however a weighting looks; I added a weight column just to get the point across)?
Thanks in advance
simon
If this weight should depend only on the query_key, this is no problem. Simply use the Generate Attributes operator with if(query_key="html_title",2,1) as the expression. Of course, you can nest the if(...,...,...) expressions as you like.
Kind regards,
Tobias
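For comparison, the same conditional logic written as a small Python sketch. The first function mirrors the expression from the post. The nested variant uses an "html_h1" key and a weight of 3 that are made up for illustration, and the sample rows are hypothetical.

```python
def weight_for(query_key):
    # Same logic as if(query_key="html_title",2,1).
    return 2 if query_key == "html_title" else 1

def nested_weight_for(query_key):
    # Nested form; the "html_h1" key and its weight of 3 are made up for illustration.
    return 3 if query_key == "html_h1" else (2 if query_key == "html_title" else 1)

# Hypothetical extraction results, one row per extracted feature.
examples = [
    {"query_key": "html_title", "text": "Stock Quotes"},
    {"query_key": "html_linktext", "text": "Phone number"},
]
# Add a weight column derived only from query_key.
weighted = [{**example, "weight": weight_for(example["query_key"])} for example in examples]
print(weighted)
```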
My question now is: how can I feed a k-means algorithm with this data if I want to cluster the documents by the extracted features? If I just give the resulting example set as input, it clusters every single example on its own. But I want to cluster the documents, not the extractions.
Any advice?
best regards
simon
Here I have an example set with several examples describing 2 different objects.
If I want to apply a clustering algorithm to this and cluster these 2 objects (in reality there are obviously more than just 2 objects) rather than every single example, how do I have to do that?
best regards
simon knoll
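One possible direction, sketched in Python rather than as a RapidMiner process, is to collapse the one-row-per-extraction table into one row per document before clustering. The sketch assumes each example carries a document identifier. The column names "doc_id" and "text" are hypothetical, as is the sample data, and the thread does not confirm that this is the recommended approach.

```python
import re
from collections import defaultdict

def document_vectors(examples, weights, default_weight=1.0):
    """Collapse one-row-per-extraction data into one weighted term vector per document."""
    per_doc = defaultdict(lambda: defaultdict(float))
    for example in examples:
        weight = weights.get(example["query_key"], default_weight)
        for token in re.findall(r"[a-z0-9]+", example["text"].lower()):
            per_doc[example["doc_id"]][token] += weight
    # A shared, sorted vocabulary gives every document the same attributes.
    vocabulary = sorted({t for terms in per_doc.values() for t in terms})
    vectors = {
        doc_id: [terms.get(t, 0.0) for t in vocabulary]
        for doc_id, terms in per_doc.items()
    }
    return vocabulary, vectors

# Hypothetical extraction results for two documents, "a" and "b".
examples = [
    {"doc_id": "a", "query_key": "html_title", "text": "Stock Quotes"},
    {"doc_id": "a", "query_key": "html_linktext", "text": "stock news"},
    {"doc_id": "b", "query_key": "html_title", "text": "Phone number"},
]
vocabulary, vectors = document_vectors(examples, {"html_title": 2.0, "html_linktext": 1.0})
print(vocabulary)
print(vectors)
```

Each document then corresponds to exactly one vector over the same vocabulary, which is the shape a k-means implementation expects as input.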