Read PDF Tables Extension - Need to
Hello - I am trying to use the "Read PDF Tables" Extension. I have successfully read my PDF, but it has been split into 21 different example sets. I would like to use the "Select" operator to choose the example sets that I need, but I am running into some issues. First, "Select" only lets you pick one example set, whereas I need to select 5. Second, not all of the example sets are the same: only 5 of the 21 sheets have the attribute headings that I actually need. Would anyone have any ideas on how I can pull what I need from this set? I have been trying to use Loops, but unsuccessfully. Thanks!
Best Answers
miked Member Posts: 21 Contributor II
Hi @sgenzer... Great, thank you. That definitely helps narrow down which example sets have the attributes that I need. Would I then just follow @varunm1's method and connect n "Select" operators to Append the sets together? Is there a way of using a macro to count the example sets and loop "Select" that many times? If not, this should work for now, and I thank you both for your help.
-Mike
sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
Hi @miked, if all the ExampleSets are the same (or similar), I'd just drop an Append (Superset) on the end. Like this:
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="-1"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="pdf_table_extraction:pdfs2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Read PDF Tables" width="90" x="45" y="34"> <parameter key="resource_type" value="file"/> <parameter key="attribute" value=""/> <parameter key="tune extraction criteria" value="false"/> <parameter key="discard tables with no rows" value="false"/> <parameter key="discard empty attributes" value="false"/> <parameter key="heuristic ratio for table content" value="0.65"/> <parameter key="tune edge detection criteria" value="false"/> <parameter key="grayscale intensity threshold" value="25"/> <parameter key="minimum width of horizontal edge" value="50"/> <parameter key="minimum height of vertical edge" value="10"/> <parameter key="maximum cell corner distance" value="10"/> <parameter key="required text lines for edge" value="4"/> <parameter key="required cells for table" value="4"/> <parameter key="point snap distance threshold" value="8.0"/> <parameter key="table padding amount" value="1.0"/> <parameter key="identical table overlap ratio" value="0.9"/> </operator> <operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34"> <parameter key="set_iteration_macro" value="false"/> <parameter key="macro_name" value="iteration"/> <parameter key="macro_start_value" value="1"/> <parameter key="unfold" value="false"/> <process expanded="true"> 
<operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <description align="center" color="transparent" colored="false" width="126">enter the attribute of example sets you want to keep</description> </operator> <operator activated="true" class="branch" compatibility="9.6.000" expanded="true" height="82" name="Branch" width="90" x="179" y="34"> <parameter key="condition_type" value="min_attributes"/> <parameter key="condition_value" value="1"/> <parameter key="expression" value=""/> <parameter key="io_object" value="ANOVAMatrix"/> <parameter key="return_inner_output" value="true"/> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="132" y="13">keep the ExampleSet</description> </process> <process expanded="true"> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <description 
align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="162" y="13">do not keep the ExampleSet</description> </process> <description align="center" color="transparent" colored="false" width="126">branch to some minimum # of attributes (1?)</description> </operator> <connect from_port="single" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Branch" to_port="condition"/> <connect from_op="Branch" from_port="input 1" to_port="output 1"/> <portSpacing port="source_single" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="operator_toolbox:advanced_append" compatibility="2.3.000" expanded="true" height="82" name="Append (Superset)" width="90" x="313" y="34"/> <connect from_op="Read PDF Tables" from_port="collection of pdf data tables as example sets" to_op="Loop Collection" to_port="collection"/> <connect from_op="Loop Collection" from_port="output 1" to_op="Append (Superset)" to_port="example set 1"/> <connect from_op="Append (Superset)" from_port="merged set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
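For readers who want to see the idea behind the final step spelled out, here is a minimal sketch in plain Python of what a "superset" append does conceptually. It is not the Operator Toolbox implementation. It assumes the operator keeps the union of all attributes and fills the gaps with missing values. The `Table` type, the function name `append_superset`, and the sample values are made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Table = List[Row]  # hypothetical stand-in for one ExampleSet


def append_superset(tables: List[Table]) -> Table:
    """Stack tables row-wise, keeping the union of all column names."""
    # Collect column names in first-seen order across every table.
    columns: List[str] = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Rebuild each row with the full column list; absent values become None.
    merged: Table = []
    for table in tables:
        for row in table:
            merged.append({name: row.get(name) for name in columns})
    return merged


# Made-up sample data: the two tables share only the "Region" column.
page_1 = [{"Region": "East", "CYTD 2019": 120}]
page_2 = [{"Region": "West", "Notes": "estimated"}]
merged = append_superset([page_1, page_2])
```

Because a table here is just a list of rows, a table with no rows contributes no columns in this sketch.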
ey1 Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 21 RM Research
Hi @miked
If you are still looking for a way to automate filtering the collection, consider the other condition types of the Branch operator in the process proposed by @sgenzer, such as the minimum or maximum number of attributes or examples. If you want to filter on attribute names, first check whether the Read PDF Tables operator gives you the attribute names you want in the output ExampleSet(s). That is not guaranteed, since it depends on the detection and extraction method, but if it does so once, it will always do so. In that case, you can put the attribute names in macros and use a complex expression in the Branch operator to keep only the ExampleSets with the desired attribute name(s). If they have exactly the same header structure, you can Append them as @varunm1 suggested.
I am attaching a test process for reference. It logs an error message as a hint if the condition is not fulfilled.
Cheers,
Edwin
Answers
Did you try using the "Multiply" operator after the collection and then connecting five "Select" operators, each picking one example set by its index in the collection? If all 5 have the same attribute names, you can use the "Append" operator to combine them into a single example set as well.
There may be some other solutions as well. @David_A or @mschmitz any ideas here?
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Thanks for the suggestion. That would definitely work for now. I think what I'm looking for is a bit more automation. My fear is that it won't always be the same 5 example sets. I was hoping for some way to identify which of those example sets has the attributes that I am looking for and pull those sets regardless of how many there are.
-Mike
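The selection described above ("pull every example set that has the attributes I need, however many there are") can be sketched in plain Python as follows. This only illustrates the logic and is not a RapidMiner process. The `Table` type, the function names, and the sample data (including the "Region" column) are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]
Table = List[Row]  # hypothetical stand-in for one extracted ExampleSet


def column_names(table: Table) -> Set[str]:
    """Collect every column name that appears in any row of the table."""
    names: Set[str] = set()
    for row in table:
        names.update(row.keys())
    return names


def keep_tables_with(collection: List[Table], required: Iterable[str]) -> List[Table]:
    """Return only the tables that contain all required column names."""
    wanted = set(required)
    return [table for table in collection if wanted <= column_names(table)]


# Made-up sample data: only the first and third tables carry both headers.
collection = [
    [{"Region": "East", "CYTD 2019": 120}],
    [{"Footnote": "see appendix"}],
    [{"Region": "West", "CYTD 2019": 95, "Notes": "estimated"}],
]
selected = keep_tables_with(collection, ["Region", "CYTD 2019"])
```

The number of matching tables never appears in the code, so the same filter works whether 5 or 15 of the extracted tables have the wanted headers.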
Yep, understood. Let's see if anyone responds.
Varun
That's fantastic thank you all!
Two supplemental questions but not vital to solving the issue.
1 - I had 3 attributes that did not come through in the Loop -> Select Attributes step, so I decided to just go with "all". Two of the column headers are labeled in the PDF as "CurrentMonth's Sale" and "CYTD 2019", so I assume there are some limits to what Read PDF Tables can do, as @ey stated above?
2 - If the example sets were not all the same, can I manipulate them in the collection, or is it better to use "Branch" and pull them out?
I'm a bit of a newbie especially with "Collections." I really appreciate the help of the group here.
-Mike
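On the first question: if the extracted attribute names differ from the printed headers only in spacing or punctuation (an assumption, since the thread does not show what Read PDF Tables actually produced), one option is to compare names after normalizing them. Below is a minimal plain-Python sketch of that idea. The function names and the "extracted" variants are hypothetical.

```python
import re
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]
Table = List[Row]  # hypothetical stand-in for one extracted ExampleSet


def normalize_header(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def rename_to_expected(table: Table, expected: Iterable[str]) -> Table:
    """Rename columns whose normalized name matches an expected header."""
    lookup = {normalize_header(name): name for name in expected}
    return [
        # Columns without a match keep their extracted name.
        {lookup.get(normalize_header(key), key): value for key, value in row.items()}
        for row in table
    ]


expected = ["CurrentMonth's Sale", "CYTD 2019"]
# Hypothetical extraction result with altered spacing in two headers.
extracted = [{"Current Month's  Sale": 10, "CYTD2019": 7, "Region": "East"}]
cleaned = rename_to_expected(extracted, expected)
```

After this step the headers match the expected names exactly, so a name-based selection no longer depends on how the spacing was extracted.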