"Split text into paragraphs"
Hi guys,
I have an Excel file which consists of articles from Wikipedia. I want to split the text into paragraphs. I tried the Tokenize operator, but there is no option to tokenize my text into paragraphs. I also tried the Cut Document operator with the XPath query type. I used the query expression //h:p, but it doesn't work. Is there any possibility to tokenize/split my text into paragraphs?
Thank you in advance.
Best Answers
sgenzer (Community Manager)
hello @hbuggled - welcome to the community. I think you were on the right track with Tokenize, but I would choose the regular expression option in the parameters pane and try using \n as the expression.
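In plain regex terms, the idea looks like this minimal sketch (Python here purely for illustration; RapidMiner's tokenizer uses Java-style regular expressions, and the same pattern applies):

import re

text = "First paragraph.\nSecond paragraph.\nThird paragraph."

# Each run of line breaks becomes a token boundary.
paragraphs = re.split(r"\n+", text)
print(paragraphs)
# ['First paragraph.', 'Second paragraph.', 'Third paragraph.']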
Scott
sgenzer (Community Manager)
hello @hbuggled - ok, I understand. This is likely not the most elegant solution, but it will do what you're looking for.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
<parameter key="text" value="RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures. According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one. Alternatively, the engine can be called from other programs or used as an API. Individual functions can be called from the command line. RapidMiner provides learning schemes, models and algorithms and can be extended using R and Python scripts. RapidMiner functionality can be extended with additional plugins which are made available via RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create data analysis algorithms and publish them to the community. With version 7.0, RapidMiner included updates to its getting started materials, an updated user interface, and improvements to its data preparation capabilities."/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
<parameter key="mode" value="regular expression"/>
<parameter key="expression" value="\n+"/>
</operator>
<operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="34"/>
<operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
<parameter key="text_attribute" value="text"/>
</operator>
<operator activated="true" class="split" compatibility="7.6.001" expanded="true" height="82" name="Split" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="text"/>
<parameter key="split_pattern" value="\n"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="715" y="34">
<parameter key="macro" value="tokenNumber"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="token_number"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="849" y="34">
<list key="function_descriptions">
<parameter key="att" value="concat("text_",%{tokenNumber})"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="%{att}"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="1117" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="%{att}" value="1.0"/>
</list>
</operator>
<operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="1251" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="1385" y="34">
<parameter key="expression" value="\n+"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Extract Token Number" to_port="document"/>
<connect from_op="Extract Token Number" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Generate Macro" from_port="through 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
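In plain terms, the process above splits the document on line breaks, counts the tokens, stores that count in the tokenNumber macro, selects the attribute text_<tokenNumber> (i.e. the last piece produced by Split), and finally tokenizes just that piece. It is roughly the same logic as this Python sketch (an illustration only, not RapidMiner code; it assumes the final Tokenize runs in its default non-letters mode):

import re

article = "First paragraph.\nSecond paragraph.\nRapidMiner is written in the Java programming language."

# Split on line breaks and keep the last piece - this mirrors
# the token_number / %{att} macro trick in the process above.
last_paragraph = re.split(r"\n+", article)[-1]

# Tokenize that piece on anything that is not a letter.
tokens = [t for t in re.split(r"[^a-zA-Z]+", last_paragraph) if t]
print(tokens)
# ['RapidMiner', 'is', 'written', 'in', 'the', 'Java', 'programming', 'language']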
Answers
Regular expressions are your friend indeed. It just depends on how your content is structured. The line break (\n) could work, but it will not really break the text up into paragraphs, rather into sentences.
Typically paragraphs are separated by a double (or more) line break, so if you split on \n{2,} you may get them nicely by paragraph (in theory...)
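For example, on a text where paragraphs are separated by blank lines (a quick Python illustration; the same patterns work in a regex tokenizer):

import re

text = "Sentence one.\nSentence two.\n\nA new paragraph."

print(re.split(r"\n", text))
# ['Sentence one.', 'Sentence two.', '', 'A new paragraph.']
print(re.split(r"\n{2,}", text))
# ['Sentence one.\nSentence two.', 'A new paragraph.']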
ah yes well said @kayman - good catch.
Scott
Thank you very much for your help. My text was structured with line breaks, so I could use the \n expression to tokenize it.
Now I only need the last paragraph, so that I can tokenize it on non-letter characters. Unfortunately, I don't know how to realize this. Is there maybe another expression to tokenize it, or can I filter out all paragraphs except the last one? Do you have an idea?
I am not sure if it's allowed to ask my next question after you solved my first problem. Please let me know if I should write it in a new post.
Thank you in advance.
hello @hbuggled - hmm, I am rather unclear on what you mean by needing the last paragraph "for tokenizing it on non-letter characters". Could you please explain?
Scott
hi sgenzer,
sorry for my unclear question. My Excel file has columns with articles from Wikipedia; there is one article in each column. For processing, I want to select only the last paragraph of each article and tokenize it.
For example, here is an article about RapidMiner from Wikipedia:
"RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures.
According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one. Alternatively, the engine can be called from other programs or used as an API. Individual functions can be called from the command line. RapidMiner provides learning schemes, models and algorithms and can be extended using R and Python scripts.
RapidMiner functionality can be extended with additional plugins which are made available via RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create data analysis algorithms and publish them to the community. With version 7.0, RapidMiner included updates to its getting started materials, an updated user interface, and improvements to its data preparation capabilities."
I want to tokenize only the last paragraph; the other parts can be ignored. I tried the Filter Examples operator with the expression finds(article,/n), but I get an error when typing "\".
Thank you in advance.
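(A side note on the backslash error: in the RapidMiner expression parser the regular expression is passed as a quoted string, and a literal backslash inside a string usually has to be escaped by doubling it, so the expression would presumably look like finds(article, "\\n") rather than finds(article,/n) - worth verifying in your version. The macro-based process in the accepted answer above sidesteps Filter Examples entirely.)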
Thank you very much for your help. That helps me a lot.