
How to read SGML (e.g., Reuters21578) by TextInput?

gfyang Member Posts: 29 Maven
edited November 2018 in Help
Hi,

I'd like to test a text classification algorithm on Reuters21578. However, I find that TextInput in RM only accepts directories, so it cannot directly handle the SGML format used by Reuters21578.

Of course, I could write a new program to parse it myself. But is there an easier way to do this in RM?

Thank you.

Sincerely yours,
gfyang
Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    There are two options if you just want to extract the text from the data: you could discard all tags so that only the plain text remains, or you could try to build an XPath query that extracts what you need. The second solution will work with XML, but I don't know whether your documents contain any non-XML elements.

    Greetings,
      Sebastian
  • gfyang Member Posts: 29 Maven
    Hi,

    Thank you for the help.

    Sincerely yours,
    gfyang
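
For anyone who ends up pre-processing the files outside RM, the "discard the tags" option mentioned above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: it assumes each document is wrapped in a `<REUTERS>` element with `<TITLE>` and `<BODY>` children, as in the usual Reuters-21578 `.sgm` distribution, and the helper names (`field`, `clean`, `extract_documents`, `write_text_files`) and the sample string are hypothetical, made up for illustration. It uses regular expressions rather than an XML parser because the files may not be well-formed XML.

```python
import html
import re
from pathlib import Path

# Assumed tag names (<REUTERS>, <TITLE>, <BODY>); adjust if your files differ.
DOC_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def field(doc, tag):
    """Return the text inside the first <tag>...</tag> pair, or '' if absent."""
    m = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", doc, re.DOTALL | re.IGNORECASE)
    return m.group(1) if m else ""


def clean(text):
    """Discard any remaining tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return " ".join(text.split())


def extract_documents(sgml):
    """Split an SGML string into one dict per <REUTERS> element."""
    docs = []
    for m in DOC_RE.finditer(sgml):
        raw = m.group(1)
        docs.append({"title": clean(field(raw, "TITLE")),
                     "body": clean(field(raw, "BODY"))})
    return docs


def write_text_files(docs, out_dir):
    """Write each body to its own .txt file so a directory-based reader can load it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, doc in enumerate(docs):
        (out / f"doc_{i:05d}.txt").write_text(doc["body"], encoding="utf-8")


# Hypothetical sample input, not taken from the corpus.
sample = """<REUTERS NEWID="1">
<TEXT><TITLE>SAMPLE TITLE</TITLE>
<BODY>First line &amp;
second line.</BODY></TEXT>
</REUTERS>"""

docs = extract_documents(sample)
```

Calling `write_text_files(docs, "some_dir")` on the result gives one plain-text file per document, which is the directory layout TextInput expects according to the question. Sorting the files into per-class subdirectories would need the topic labels as well, which this sketch does not extract.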