Extract Sheet name from an Excel file
Hi everyone,
I want to extract the name of the Excel sheet I'm reading to add as a new attribute.
For example:
Data Sheet_name
A my_sheet_name
B my_sheet_name
C my_sheet_name
Is there a way to do this?
Thanks
Best Answers
-
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
There is a way to perform your task with the Execute Python operator (to download and install from marketplace),
but you have to install Python language on your computer.
The new attribute with the sheet name appears on the last column (new created column) :
You have to modify two parameters in the Python code according to the sheets that you want study, but nothing complicated, all is commented.
Here the process :
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="112" y="34">
<parameter key="script" value="import pandas as pd # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(): path = 'C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer'# path where the xls file is stored file_name = 'sheets_name_as_attribute.xlsx' # name #Read the sheet number 'x' of the Excel file data = pd.read_excel(path + '/' + file_name,sheetname = 0) #modify the value of parameter 'sheetname' #WARNING : for the "sheetname' parameter, the first sheet is 0, the second 1, the third 2 etc . # get the sheets name of the excel file xls = pd.ExcelFile(path + '/' + file_name , on_demand = True) sheets = xls.sheet_names data['Sheet_name'] = str(sheets[0])# modify [0] according to the value of parameter 'sheetname' above # connect 2 output ports to see the results return data"/>
</operator>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>In attached file, my fictive .xls file.
This is not the easier way, but maybe it will be helpful
Regards,
Lionel
0 -
sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
ok I hate it when people cop out and use python RapidMiner all the way. Here you go:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="/Users/genzerconsulting/Desktop/FOOO/FOOO.xlsx"/>
<parameter key="imported_cell_range" value="A1:C263"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="foo.true.polynominal.attribute"/>
<parameter key="1" value="foo2.true.polynominal.attribute"/>
<parameter key="2" value="C.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="179" y="136"/>
<operator activated="true" class="operator_toolbox:get_parameters" compatibility="0.7.000" expanded="true" height="103" name="Get Parameters" width="90" x="313" y="34">
<parameter key="Operator name" value="Read Excel"/>
</operator>
<operator activated="true" class="converters:parameter_set_2_example_set" compatibility="0.3.001" expanded="true" height="103" name="Parameter Set to ExampleSet" width="90" x="447" y="34"/>
<operator activated="true" class="extract_macro" compatibility="8.0.001" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="34">
<parameter key="macro" value="sheetName"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="Read Excel.sheet_number"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="annotations_to_data" compatibility="8.0.001" expanded="true" height="82" name="Annotations to Data" width="90" x="313" y="238"/>
<operator activated="true" class="extract_macro" compatibility="8.0.001" expanded="true" height="68" name="Extract Macro (2)" width="90" x="447" y="238">
<parameter key="macro" value="file_path"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="value"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="loop_zipfile_entries" compatibility="8.0.001" expanded="true" height="82" name="Loop Zip-File Entries" width="90" x="581" y="238">
<parameter key="filename" value="%{file_path}"/>
<parameter key="filter" value="workbook.xml"/>
<parameter key="recursive" value="true"/>
<process expanded="true">
<operator activated="true" class="advanced_file_connectors:read_xml" compatibility="8.0.001" expanded="true" height="68" name="Read XML" width="90" x="179" y="34">
<parameter key="file" value="/Users/genzerconsulting/Desktop/FOOO/xl/workbook.xml"/>
<parameter key="xpath_for_examples" value="//default:workbook/default:sheets"/>
<enumeration key="xpaths_for_attributes">
<parameter key="xpath_for_attribute" value="default:sheets[1]/text()"/>
<parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::name"/>
<parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::r:id"/>
<parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::sheetId"/>
<parameter key="xpath_for_attribute" value="default:sheet[1]/text()"/>
<parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::name"/>
<parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::r:id"/>
<parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::sheetId"/>
<parameter key="xpath_for_attribute" value="default:sheet[2]/text()"/>
</enumeration>
<list key="namespaces">
<parameter key="mx" value="http://schemas.microsoft.com/office/mac/excel/2008/main"/>
<parameter key="x14" value="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"/>
<parameter key="x15ac" value="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac"/>
<parameter key="x15" value="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"/>
<parameter key="mc" value="http://schemas.openxmlformats.org/markup-compatibility/2006"/>
<parameter key="r" value="http://schemas.openxmlformats.org/officeDocument/2006/relationships"/>
<parameter key="default" value="http://schemas.openxmlformats.org/spreadsheetml/2006/main"/>
</list>
<parameter key="default_namespace" value="http://schemas.openxmlformats.org/spreadsheetml/2006/main"/>
<list key="annotations"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="default:sheets[1]/text().true.attribute_value.attribute"/>
<parameter key="1" value="default:sheet[1]/attribute::name.true.attribute_value.attribute"/>
<parameter key="2" value="default:sheet[1]/attribute::r:id.true.attribute_value.attribute"/>
<parameter key="3" value="default:sheet[1]/attribute::sheetId.true.attribute_value.attribute"/>
<parameter key="4" value="default:sheet[1]/text().true.attribute_value.attribute"/>
<parameter key="5" value="default:sheet[2]/attribute::name.true.attribute_value.attribute"/>
<parameter key="6" value="default:sheet[2]/attribute::r:id.true.attribute_value.attribute"/>
<parameter key="7" value="default:sheet[2]/attribute::sheetId.true.attribute_value.attribute"/>
<parameter key="8" value="default:sheet[2]/text().true.attribute_value.attribute"/>
</list>
</operator>
<operator activated="true" class="de_pivot" compatibility="8.0.001" expanded="true" height="82" name="De-Pivot" width="90" x="380" y="34">
<list key="attribute_name">
<parameter key="Sheet Name" value=".*name"/>
<parameter key="Sheet Number" value=".*sheetId"/>
</list>
<parameter key="index_attribute" value="foo"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Sheet Name|Sheet Number"/>
</operator>
<connect from_port="file object" to_op="Read XML" to_port="file"/>
<connect from_op="Read XML" from_port="output" to_op="De-Pivot" to_port="example set input"/>
<connect from_op="De-Pivot" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="82" name="Append" width="90" x="715" y="238"/>
<operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="849" y="136">
<list key="function_descriptions">
<parameter key="Sheet_name" value="%{sheetName}"/>
</list>
</operator>
<operator activated="true" class="join" compatibility="8.0.001" expanded="true" height="82" name="Join" width="90" x="983" y="187">
<parameter key="join_type" value="left"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Sheet_name" value="Sheet Number"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Get Parameters" to_port="through 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 3" to_op="Annotations to Data" to_port="object"/>
<connect from_op="Get Parameters" from_port="parameters" to_op="Parameter Set to ExampleSet" to_port="parameters"/>
<connect from_op="Parameter Set to ExampleSet" from_port="exampleSet" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Annotations to Data" from_port="annotations" to_op="Extract Macro (2)" to_port="example set"/>
<connect from_op="Loop Zip-File Entries" from_port="out 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Join" to_port="right"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>Scott
3
Answers
hello @gabriela_sanche - absolutely if you use the "Operator Toolbox" extension.
Scott
Thank you @sgenzer for your answer. Great extension, I haven't used it before.
However I wish to extract the sheet NAME not the number.
For example, for the Excel in the image I wish to extract "Name 10".
Is there a way to do this?
@lionelderkrikor Thanks!
Now I know there's no direct (RM operator) to do it, but it's exactly the result I was looking for.
Hi @sgenzer
you have to be indulgent, i'm still in a learning phase......:smileyhappy:
OK, I recorded this method.
Best regards,
Lionel
@lionelderkrikor,
no worries. We are all learning. I am using the product for 7 years and @sgenzer just suprised me with this solution.
Best,
Martin
Dortmund, Germany
Hi,
reading the .xlsx file as Zip and processing XML is a brilliant idea, but quite involved and it won't work with old .xls files.
This should really be a setting in Read Excel (boolean: Return sheet name, text: Sheet name macro). Then you could even process it in loops that read all the sheets in a file and would get full metadata.
Regards,
Balázs
Hi,
@BalazsBarany i was thinking at some point to write a "Sheetname to ExampleSet" operator. It's easy to write. Input: either a file object or a path to a file. Output: Exampleset with sheetId and sheetname.
Would that do the trick for you? If yes - i would write it just for you I still owe you quite some favours. E.g. for the generic join script.
Cheers,
Martin
Dortmund, Germany
Hi Martin,
I'm not the original poster who requested the feature. I just described the idea for extending the operator.
Your solutions sounds interesting. Let's think about implications for the user and for the performance.
I think users would expect this functionality directly in Read Excel, not as a separate operator. If it's a special operator, you would need to join its results afterwards with the example set from the worksheet, or filter the example set and use Extract Macro to get the name of the one worksheet you're interested in. It's of course better if the user wants all worksheet numbers and names.
Performance-wise, I'm sure that reading worksheet names is a part of the backend library that already handles both .xls and .xlsx, and when the file is already open, the performance impact shouldn't be big. That would favor a solution integrated in Read Excel.
Another possible implementation would be a Loop Worksheets operator. It would be similar to Loop Zip-file entries. (And nobody would search for it under Process Control/Loops ;-))
Regards,
Balázs
I really like "loop worksheets" - that makes the most sense to me.
@lionelderkrikor - sorry didn't mean to put down your skills. Python is perfectly legitimate and what most people would do. I just take it as a personally challenge to do things purely in RapidMiner.
Scott
@BalazsBarany
i am not sure if i agree with you. The usual way to use "a table with sheetnames" would be a loop over them and read some of them (or all) with distinct Read Excel operators. Using a Read Excel to get this list seems odd in this case, right? So having a "Read Excel Sheet Names" sounds more reasonable to me.
But maybe our users can add a few thoughts. @gabriela_sanche @lionelderkrikor what would you prefer?
Side note: I can write a new operator but not extend the read excel. But i would of course file a feature request if read excel is the best way to do it.
Cheers,
martin
Dortmund, Germany
@mschmitz Indeed, I wish to read the sheet name to have control over multiple excel files (inside a Loop Files Operator and then a normal Loop to read the different sheets).
However, I often encounter important information as the sheet name, so having it as metadata directly from the Read Excel would be really helpful.
Thank you all for your interest in this problem
@sgenzer,
No problem : I understood of course that there was a personnal challenge and it was useful for me : It's a way for me to learn about the RM operators and to improve my skills on RapidMiner.
So for next topics, if I persist to use Python, don't hesitate to post a 100% RM process ........:smileyvery-happy:
Regards,
Lionel
@mschmitz,
On the substance of the subject, i have no preference between the suggested solutions BUT
for me, for a better user experience and more efficiency , a dedicated operator like "Read Excel Sheet Names" is more adapted :
In deed, based on my own experience, when I try to find an operator to perform a function,
it's easier to find and use an operator with an explicit name, with a simply input , a
simply output (in particular for non experiment users), few parameters. It's more difficult to discover that the function
can be performed by searching in the large numbers of parameters of an (existing) operator (which has sometimes no explicit name for the searched function).
I hope that it's understandable and it will help to improve the features of RapidMiner.
Best regards,
Lionel
Hi @lionelderkrikor,
the point here is that there is some friction between having more explicit operators for one task and having too many operators. To learn rapidminer you need to learn some kind of "dictionary" of operators and their purpose. We did some internal studies on how many operators you need to be good in RM and having more is not always better. I can see you argument for this but it's a two-sided one.
@gabriela_sanche thanks for your feedback! I will have a look if i can write this operator.
Cheers,
Martin
Dortmund, Germany