Text Processing with Document type (how to use modified output?)
Hello everybody,
I want to do some text processing with the Document type. In a simple example, I use "Read Document" to load a previously crawled and stored web page (an HTML file). The content is to be filtered and inspected with regular expressions. To start, I added only the "Keep document parts" operator to discard everything except the <body>...</body> part. The Document output shows the desired modified content in the upper window. This is the part I need for further text processing, but some operators seem to always work on the original document. For example, a subsequent "Extract information" with the regex "<head>" still finds a match, although that part should have been discarded. Conversely, content that only becomes available through filtering and transformation (left out of the simple example above) is never found. "Write Document" also writes the original text, ignoring all changes made to the Document in my operator chain.
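To make the behaviour I expect concrete, here is a minimal sketch in plain Python. It is only an analogy and not RapidMiner code: the helper names `keep_body` and `extract` and the sample HTML are made up for illustration. The point is that every later step should see the filtered text, not the original.

```python
import re


def keep_body(html: str) -> str:
    """Return only the <body>...</body> part of an HTML string ("" if absent)."""
    match = re.search(r"<body\b[^>]*>.*?</body>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else ""


def extract(pattern: str, text: str) -> list:
    """Return all matches of a regular expression in the given text."""
    return re.findall(pattern, text, flags=re.IGNORECASE)


# Hypothetical stand-in for the stored web page.
original = ("<html><head><title>News</title></head>"
            "<body><h1>Headline</h1></body></html>")

# Step 1: filter the document down to its body.
modified = keep_body(original)

# Step 2: later steps operate on the modified text.
print(extract(r"<head>", original))          # the original still contains <head>
print(extract(r"<head>", modified))          # the filtered text should not
print(extract(r"<h1>(.*?)</h1>", modified))  # content inside the body is found
```

In my process, the second and third lookups behave as if they were run against `original` instead of `modified`.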
This leads to my simple but important question: how do I work with the modified document?
Thanks in advance!
Answers
We meet again! Sailing similar waters, I suspect. It comes down to whether Rapido is passing its normal Data input/output or Documents, which are a special type of I/O object. Sometimes you have an example set that needs handling by document handlers, in which case you use the 'Process Documents from Data' operator, and so on. Here's a Beeb news title grabber (not swift). PS: It's probably best to post code as above when it gets down to the detail.