Scan index files of books for important terms
Legacy User
Member Posts: 0 Newbie
in Help
Hi there!
I'm not sure if this is the right forum to post this problem, but I hope you guys can help me.
The scenario is: We have a lot of index files in RTF format, like the glossaries at the end of an academic book.
We want to analyze which words and expressions occur the most and as such are the most important in this field of study.
I know that it is easy with RapidMiner to count all tokens in these files, but often the expressions are a combination of two or even more words, which you can only detect if you look at the text layout, like:
user 154-167
behaviour 178-190
goal 32-38
....
You get what I mean? I'm not sure whether this problem is solvable with RapidMiner, and in particular not HOW. Can you help me with some advice, either on RapidMiner or on another tool that can help me with that?
Thank you very much!
DaC
Answers
Thought this should be possible with RapidMiner... :-\
It is, but does this help you? We have done something very similar to this, and it involved a hefty load of information extraction from the structured file information, which can be a real pain if the files carry a lot of layout information. So if you want me to actually show you an out-of-the-box process doing this: I have a price tag stuck to my back somewhere
Seriously, this might turn out to be a hard task, depending on the set of files you are analyzing and how different they are. You can actually learn those dependencies (we had a master's thesis about that at my former department), but this can quickly become a multi-month project. So if you are interested (we certainly are), please contact Rapid-I directly.
Sorry for not having better news,
Ingo
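For the simplest case only, where every index follows the same layout, the idea can be sketched outside RapidMiner in a few lines of Python. This is a rough illustration, not the process mentioned above. It assumes the RTF files have already been converted to plain text with indentation preserved, that each entry looks like "term 12, 154-167", and that indented lines are sub-entries of the last unindented head term. The names `parse_index` and `page_count` are hypothetical, and weighting a term by the number of pages it covers is just one possible notion of "importance".

```python
import re
from collections import Counter

# Assumed entry layout: optional indent, term, then page refs such as "12, 154-167".
ENTRY = re.compile(
    r"^(?P<indent>\s*)(?P<term>.+?)[\s,]+"
    r"(?P<pages>\d+(?:\s*[-–]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-–]\s*\d+)?)*)\s*$"
)

def page_count(pages):
    """Number of pages covered by a reference list like '12, 154-167'."""
    total = 0
    for part in pages.split(","):
        bounds = [int(n) for n in re.split(r"[-–]", part)]
        total += bounds[-1] - bounds[0] + 1
    return total

def parse_index(text):
    """Map each (possibly multi-word) index expression to its page coverage."""
    counts = Counter()
    head = None  # last unindented term seen
    for line in text.splitlines():
        if not re.search(r"[^\W\d_]", line):
            continue  # skip blank lines and filler such as "...."
        m = ENTRY.match(line)
        term = (m.group("term") if m else line).strip()
        if line[0].isspace() and head:
            # Assumption: an indented line is a sub-entry of the head term.
            term = f"{head} {term}"
        else:
            head = term
        if m:
            counts[term.lower()] += page_count(m.group("pages"))
    return counts

sample = "user 154-167\n  behaviour 178-190\n  goal 32-38\ndesign 5, 9-10\n"
counts = parse_index(sample)
print(counts.most_common())
```

To rank expressions across many books, the per-file counters can simply be summed (`Counter` supports `+`). Real index files vary a lot in layout ("see also" cross-references, roman numerals, multi-level nesting), so the pattern above would need adapting for each publisher's style.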