Cut Document
CaptainChaos
Hi guys,
I think I have a fairly trivial problem, but I couldn't figure out how to solve it.
I am working with the Reuters dataset. I have a stemmed version consisting of one big document that contains all the other documents: a single large .txt file in which the beginning and end of each document is marked by the word "reuter". I tried to use the "Cut Document" operator to split them, with "reuters" as the query expression. The problem is that all resulting documents now have the same name (label), which makes them hard to work with.
Does anybody know how to give the documents different names, for example 1, 2, 3, 4, 5, and then write/export them to Excel or a database?
Thanks in advance
Cheers
Answers
Here is a little example of how you could write the single documents as files:
If you prefer a list-based output such as Excel or a database, this is the way to go:
Hope these examples help you a little. Feel free to ask if you have further questions.
Regards
Matthias
First of all, thank you very much for your help; my model now works a lot better than before.
But I would like to ask one more question. Below I have copied your code and marked one line that differs from mine. Could you explain that line to me?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="607" width="758">
<operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="marker Document 1 content marker marker Document 2 content marker marker Document 3 content marker marker Document 4 content marker marker Document 5 content marker"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries">
<parameter key="content" value="marker.marker"/>
</list>
<list key="regular_expression_queries">
<parameter key="content" value="marker\s*(.*?)\s*marker"/> <!-- marked line: \s*(.*?)\s* -> Plural -->
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true" height="607" width="758">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">
<parameter key="text_attribute" value="content"/>
<parameter key="add_meta_information" value="false"/>
</operator>
<operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
<operator activated="true" class="generate_attributes" compatibility="5.1.011" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="30">
<list key="function_descriptions">
<parameter key="document" value="&quot;document_&quot; + str(id)"/>
</list>
</operator>
<operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="581" y="165">
<parameter key="excel_file" value="C:\Dokumente und Einstellungen\mraeder\Desktop\output\documents.xls"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
<connect from_op="Write Excel" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="126"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
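For readers without RapidMiner at hand, the chain of operators in the process above (Cut Document, Documents to Data, Generate ID, Generate Attributes, Write Excel) can be approximated in plain Python. This is only a rough sketch under assumptions: it is not RapidMiner code, IDs are assumed to start at 1, the exact string that `str(id)` produces in RapidMiner is not verified here, and a CSV written to an in-memory buffer stands in for the Excel file.

```python
import csv
import io
import re

# Shortened version of the sample text from the "Create Document" operator.
text = "marker Document 1 content marker marker Document 2 content marker"

# Cut Document: one segment per pair of marker words.
segments = re.findall(r"marker\s*(.*?)\s*marker", text)

# Generate ID + Generate Attributes: number the rows (assumed to start at 1)
# and build a name of the form "document_<id>" for each one.
rows = [
    {"id": i, "document": f"document_{i}", "content": segment}
    for i, segment in enumerate(segments, start=1)
]

# Stand-in for Write Excel: CSV written to an in-memory buffer.
# Replace the buffer with open("documents.csv", "w", newline="")
# (a hypothetical file name) to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "document", "content"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

Each row then carries its own name in the `document` column, which is what makes the split documents distinguishable after export.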
Kind regards Roberto
You can use the code style by adding CODE tags around it. It's the third symbol from the right, just above the smileys.
The highlighted line is my split expression. You said there is a word marking the beginning and the end of each document. Since I typed some example document contents by hand, I simply used "marker" for this; I think it should be "reuters" in your case. The regular expression used to cut the text collects anything between two marker words (the first capturing group) and also uses \s* to cut off whitespace surrounding the content (for example, a newline between the marker word and the beginning of the actual document content).
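To see what that expression captures outside of RapidMiner, here is a small Python sketch using the same pattern on sample text in the style of the example process. The `re.DOTALL` flag is an addition for this sketch so that `.` also matches newlines inside a document; whether the Cut Document operator treats newlines the same way is not confirmed here.

```python
import re

# Sample text in the style of the "Create Document" operator.
text = (
    "marker Document 1 content marker "
    "marker Document 2 content marker "
    "marker\nDocument 3 content\nmarker"
)

# Lazily capture everything between two marker words (group 1);
# the \s* on both sides keeps surrounding whitespace out of the group.
# re.DOTALL is an assumption of this sketch, see the note above.
pattern = re.compile(r"marker\s*(.*?)\s*marker", re.DOTALL)

# With a single capturing group, findall returns the group contents.
segments = pattern.findall(text)
print(segments)
# ['Document 1 content', 'Document 2 content', 'Document 3 content']
```

The third sample document has newlines between the marker words and the content; the `\s*` parts absorb them, so the captured text comes out trimmed just like the others.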
Hope this clarifies things.
Regards
Matthias