Define established terms
Limegreenman900
Member Posts: 6 Contributor II
Hi everyone,
does anybody know whether RM has an operator, or a setting inside an operator, where I can define established terms? I am currently extracting text from HTML files with the "Cut Document" operator, and inside that I am using the "Extract Content" operator from the Web Mining extension. After that I do some routine steps like "Replace Tokens", "Tokenize" and "Extract Token Number". Since my text contains some expressions that are normally treated as established terms, I wondered whether this is possible in RM.
Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....
So far, tokenization turns every word into a separate token, but it would be great to have each of these expressions treated as one token.
I know I could use the "Replace Tokens" operator and replace every term with an abbreviation, like "International Standards on Auditing" = "ISA", but that is not what I want.
Any help appreciated!
Answers
So, before tokenizing, replace the spaces inside each established term with underscores:
Generally Accepted Accounting Practice = Generally_Accepted_Accounting_Practice
International Standards on Auditing = International_Standards_on_Auditing
At the end of your processing you can then run "Replace Tokens" again and swap the '_' back to a ' ', so each token returns to the original established term.
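For anyone who wants to see the idea outside of RM, here is a minimal Python sketch of the same underscore trick. It is not RapidMiner code and does not use any RM operator; the term list, the function names and the simple regex tokenizer are all assumptions made for illustration.

```python
import re

# Hypothetical list of established terms (taken from the examples in this thread).
ESTABLISHED_TERMS = [
    "Generally Accepted Accounting Practice",
    "International Standards on Auditing",
]

def protect_terms(text, terms):
    """Replace the spaces inside each known term with underscores."""
    # Handle longer terms first so a shorter term cannot split a longer one.
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)
    return text

def tokenize(text):
    """Very simple tokenizer: runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)

def restore_terms(tokens):
    """Swap the underscores back to spaces after all processing is done."""
    return [token.replace("_", " ") for token in tokens]

text = "The audit followed International Standards on Auditing closely."
tokens = restore_terms(tokenize(protect_terms(text, ESTABLISHED_TERMS)))
print(tokens)
# ['The', 'audit', 'followed', 'International Standards on Auditing', 'closely']
```

One caveat with this sketch: the restore step turns every underscore into a space, so it would also change tokens that contained an underscore in the original text. If that matters for your data, a rarer placeholder character may be a safer choice.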
Thanks for your hint on that!