"Text-classification: Data from XML and multiple keywords"
Hi,
I'm new to RapidMiner and I want to use it for text classification. I have two questions:
1) All my data is stored in XML. Is it possible to import the data as XML (so I don't need to transform it into csv)?
2) I'm building a small csv-example for testing, something like:
title;abstract;keyword;keyword;keyword;.....
As you can see, I have multiple columns, each with one keyword. Is it possible to mark more than one column as a label? I tried, but when I change the next column, the previous one changes back.
I hope you understand my questions; my English is not the best.
Thanks and regards,
tron42
Answers
I just installed RapidMiner and wanted to start working with some test data that is also in XML. I haven't found a way to import my files into RapidMiner either. Is it really not possible to do so?
XML is hierarchical by nature, so it is hard to say how this would work.
You could try reading in the file as HTML and using XPath to get the attribute values, but it is probably easiest to convert to CSV or Excel first.
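For anyone taking the conversion route, here is a minimal Python sketch (run outside RapidMiner) that flattens XML records into a semicolon-separated CSV. The element names (`article`, `title`, `abstract`, `keyword`) and the sample data are assumptions for illustration, not the actual schema from this thread.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the element names are assumptions.
SAMPLE_XML = """<articles>
  <article>
    <title>Trade in Asia</title>
    <abstract>An overview of regional trade.</abstract>
    <keyword>china</keyword>
    <keyword>asia</keyword>
  </article>
  <article>
    <title>Learning XML</title>
    <abstract>A short introduction to XML.</abstract>
    <keyword>xml</keyword>
  </article>
</articles>"""


def xml_to_rows(xml_text):
    """Return one dict per <article>; repeated <keyword> elements are joined with '|'."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.findall("article"):
        rows.append({
            "title": article.findtext("title", default=""),
            "abstract": article.findtext("abstract", default=""),
            "keywords": "|".join(k.text or "" for k in article.findall("keyword")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as semicolon-separated CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["title", "abstract", "keywords"],
        delimiter=";",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

Joining the keywords into one column keeps the column count fixed even when articles have different numbers of keywords; splitting them into separate columns afterwards is a design choice, not a requirement.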
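As for the LABEL question, a LABEL in RapidMiner is like a Y variable in a regression. It is the thing you are trying to predict. An example set has a single label attribute, which is why the previous column changes back when you mark the next one.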
Neil
Transforming XML into CSV is not difficult, so that's OK. I just thought I could save a step.
Maybe the next version of RapidMiner could include a built-in plugin or something similar.
Regarding the label, I should rephrase the question:
Is it possible to do MULTI-LABEL CLASSIFICATION with RapidMiner?
I want to predict more than one keyword for each text.
Thanks a lot and Regards!
There is a multi label looping operator. Here's an example that uses it. Maybe this would help get you started.
regards
Andrew
I will check the example out and report my progress when I'm done with my mid-term exams this week.
Thanks and regards!
Worked perfectly in my case
I just wanted to add this to the discussion.
This is the basic setup to extract information from XML files. The Extract Information operator also lets you run XPath queries on XML files. The result is then stored in the attribute you define.
It's just a matter of taste whether you want to do this with RM or build a proper CSV/Excel/... file beforehand, although with RM you can easily add another feature.
Regarding the multi-label problem:
You can learn a model for each label. To do that, set the current label attribute to the role "label" and the other labels to the role "other" (anything but regular, because otherwise the other labels would be used as input attributes when learning the current label).
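The one-model-per-label idea can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the column names (`abstract`, `kw_china`, ...) are hypothetical, and `majority_trainer` is a toy stand-in for a real learner. Each pass produces its own model; nothing from an earlier pass is updated.

```python
from collections import Counter

# Hypothetical column names, for illustration only.
FEATURES = ["abstract"]
LABELS = ["kw_china", "kw_asia", "kw_hongkong"]


def role_assignments(label_columns, feature_columns):
    """Yield (current_label, roles): one role mapping per training pass."""
    for current in label_columns:
        roles = {name: "regular" for name in feature_columns}
        for name in label_columns:
            # Only the current label gets the role "label"; the rest are set aside.
            roles[name] = "label" if name == current else "other"
        yield current, roles


def majority_trainer(inputs, targets):
    """Toy stand-in for a real learner: always predicts the most common target."""
    most_common = Counter(targets).most_common(1)[0][0]
    return lambda row: most_common


def train_one_model_per_label(examples, label_columns, feature_columns, train):
    """Train a separate model for every label column."""
    models = {}
    for current, roles in role_assignments(label_columns, feature_columns):
        # Only "regular" columns are used as inputs, so other labels never leak in.
        regular = [name for name, role in roles.items() if role == "regular"]
        inputs = [[example[name] for name in regular] for example in examples]
        targets = [example[current] for example in examples]
        models[current] = train(inputs, targets)
    return models


examples = [
    {"abstract": "trade in beijing", "kw_china": True, "kw_asia": True, "kw_hongkong": False},
    {"abstract": "hong kong harbour", "kw_china": True, "kw_asia": True, "kw_hongkong": True},
    {"abstract": "seoul summit", "kw_china": False, "kw_asia": True, "kw_hongkong": False},
]
models = train_one_model_per_label(examples, LABELS, FEATURES, majority_trainer)
print(sorted(models))
```

To predict for a new text, you would apply every model in `models` and collect the labels whose model answers positively.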
Ciao Sebastian
ps. thanks Matthias! I modified the example above.
Just a small remark regarding the basic example process: the parameter "extract text only" of the "Process Documents" operator needs to be disabled if subsequent XPath queries are to return anything. This little detail is easy to miss and can lead to confusion about XPath statements that seem not to work. Shame on you, Sebastian
Regards,
Matthias
Thanks for your response!
I'm not sure I understand that correctly. Let's say I have three keywords, so I have three labels. I would set the first keyword as the label and the other two as "other". I start the training and get a model. After that I would set the second keyword as the label and the other two (the first and the third keyword) as "other". Then I would restart the training and get a new, second model? Or will the first model be updated?
Sorry, I'm new to RapidMiner and not really familiar with the tool yet.
Regards,
tron42
Can you explain your intention again, please? What I understood is that each keyword is an indicator/label. For example, keyword1 indicates a good/bad review for quality, keyword2 indicates a good/bad review for service, keyword3 for...
So you then learn on the attribute "abstract" (which you need to process with the Text Processing operators, of course: Process Documents, and inside it at least tokenization and possibly a stopword filter and Filter by Length) one classification model for "quality", one for "service", and so on.
However, you seem to have something different in mind.
Ciao Sebastian
Each keyword is not only an indicator; it describes the text. For example, I have a text about China, so the keywords are: china, asia, hongkong, north korea, ... and many more keywords that characterize the article. I want to learn those relationships between the text and its keywords, so that I can predict possible keywords for an unknown text.
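One way to frame this for a per-keyword learner is to turn each document's keyword list into one true/false column per keyword. The following is a minimal Python sketch of that preparation step; the field names (`text`, `keywords`), the `kw_` prefix, and the sample documents are assumptions for illustration.

```python
def binarize_keywords(documents):
    """Return (vocabulary, rows) with one boolean 'kw_<keyword>' column per keyword."""
    vocabulary = sorted({kw for doc in documents for kw in doc["keywords"]})
    rows = []
    for doc in documents:
        present = set(doc["keywords"])
        row = {"text": doc["text"]}
        for kw in vocabulary:
            row["kw_" + kw] = kw in present
        rows.append(row)
    return vocabulary, rows


def predicted_keywords(row, vocabulary):
    """Collect the keywords whose indicator column is true for this row."""
    return [kw for kw in vocabulary if row["kw_" + kw]]


# Hypothetical sample documents.
documents = [
    {"text": "An article about China.", "keywords": ["china", "asia", "hongkong"]},
    {"text": "An introduction to XML.", "keywords": ["xml"]},
]
vocabulary, rows = binarize_keywords(documents)
print(vocabulary)
print(predicted_keywords(rows[0], vocabulary))
```

Each `kw_...` column can then serve as the label for its own binary model, and the predicted keywords for a new text are the ones whose model answers true.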
Regards,
David
<title lang="en">Everyday Italian</title>
Desired results: (4 separate examples)
<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Star</title>
<title lang="en">Learning XML</title>
You will have to use the Cut Document operator together with the XPath query to get all matches as separate documents in the inner subprocess of Cut Document.
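To show the difference between taking the first match and taking every match, here is a small Python sketch using the standard library's ElementTree instead of RapidMiner. The surrounding `bookstore`/`book` structure is an assumption; only the `title` elements come from the example in this thread.

```python
import xml.etree.ElementTree as ET

# The <bookstore>/<book> wrapper is assumed; the titles are the ones quoted in the thread.
BOOKSTORE = """<bookstore>
  <book><title lang="en">Everyday Italian</title></book>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="en">XQuery Kick Star</title></book>
  <book><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)
query = ".//title[@lang='en']"

# find() stops at the first match, which gives a single result.
first_only = root.find(query)

# findall() returns every match, so each title becomes its own example.
all_titles = root.findall(query)
examples = [ET.tostring(title, encoding="unicode").strip() for title in all_titles]

print(first_only.text)
for example in examples:
    print(example)
```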
Greetings,
Sebastian