"Python's subprocess.run() not working inside RapidMiner"
Hello friends, I am having trouble with Python's subprocess.run() inside the Execute Python operator. I am using Xpdf's pdftotext to extract text from a PDF file. The subprocess seems to fail when I run the process, because I always get a blank text file.
System details:
Windows 10
RapidMiner Studio 8.0
Python 3.6
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="380" y="187">
<parameter key="script" value="import pandas&#10;import sys&#10;import subprocess&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;    def pdf_text(source, output, timeout=None):&#10;        if sys.platform == &quot;win32&quot;:&#10;            args = ['pdftotext', '-simple', source, output]&#10;        elif sys.platform == &quot;linux&quot; or sys.platform == &quot;linux2&quot;:&#10;            args = ['pdftotext', '-layout', source, output]&#10;        with open(output, &quot;w+&quot;):&#10;            process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout, shell=True)&#10;&#10;    input_file = &quot;D:/pdf-sample.pdf&quot;&#10;    output_file = &quot;D:/ouput.txt&quot;&#10;    pdf_text(input_file, output_file)&#10;    return"/>
</operator>
</process>
I am unable to find any reason for the wrong output. Please help!
Best Answer
lplenka Member Posts: 11 Contributor II
Hey @lionelderkrikor,
Thanks for trying to help.
Sorry, the previous XML file had an error. Here is the new XML file.
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="380" y="187">
<parameter key="script" value="import pandas&#10;import sys&#10;import subprocess&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;    def pdf_text(source, output, timeout=None):&#10;        if sys.platform == &quot;win32&quot;:&#10;            args = ['pdftotext', '-simple', source, output]&#10;        elif sys.platform == &quot;linux&quot; or sys.platform == &quot;linux2&quot;:&#10;            args = ['pdftotext', '-layout', source, output]&#10;        with open(output, &quot;w+&quot;):&#10;            process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout, shell=True)&#10;&#10;    input_file = &quot;D:/pdf-sample.pdf&quot;&#10;    output_file = &quot;D:/ouput.txt&quot;&#10;    pdf_text(input_file, output_file)&#10;    return"/>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Well, yes, the Python script works fine when I run it in a notebook or call it from cmd.
rm_main() takes no arguments because this script doesn't need any; I just want the text extracted to "output.txt" on my drive. For the same reason, it returns nothing.
Note:
Surprisingly, I am now getting the extracted text in the "output.txt" file. I don't know why I was not getting any output last night. Did the restart do the trick? Please cross-check on your system. Thank you.
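One possible explanation for the restart, offered as an assumption rather than a confirmed diagnosis: a running program only sees the PATH it was started with. If pdftotext was added to PATH while RapidMiner Studio was already open, the Execute Python operator would not find it until Studio (or the whole system) was restarted, even though a newly opened cmd window would. Below is a minimal sketch for checking what the operator's Python process can actually see. `describe_tool` is a hypothetical helper name, not part of RapidMiner or of the script above.

```python
import os
import shutil
import sys


def describe_tool(name):
    """Report how an executable name resolves inside the current process."""
    path_entries = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "python": sys.executable,        # interpreter the operator is using
        "resolved": shutil.which(name),  # None if the tool is not on this PATH
        "path_entries": [entry for entry in path_entries if entry],
    }


def rm_main():
    # Entry point for the Execute Python operator (no input ports assumed).
    info = describe_tool("pdftotext")
    print("Python interpreter:", info["python"])
    print("pdftotext resolves to:", info["resolved"])
    for entry in info["path_entries"]:
        print("PATH entry:", entry)
```

If "pdftotext resolves to: None" is printed when the process runs inside Studio but a path is shown when the same code runs from cmd, the two environments differ, and that would be consistent with the restart fixing it.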
Answers
Hi @lplenka,
First, it seems there is an error in the XML code you shared: it can't be loaded in RapidMiner. Maybe the code is incomplete.
To copy the whole process, click in the XML panel, press Ctrl + A, then Ctrl + C, and paste it here.
1. For the Python code to be executed, you have to use the function rm_main. In your case, rm_main takes no arguments (def rm_main()), and you define another function instead: def pdf_text().
2. I also see that the function rm_main() doesn't return any output: return........
3. Have you tried running your code in a notebook?
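Points 1 and 2 can be sketched as follows, assuming the goal is still to call pdftotext from the Execute Python operator. This is a minimal sketch under those assumptions, not a verified fix. `build_pdftotext_args` and `run_checked` are hypothetical helper names; the flags and file paths are copied from the original script. The sketch passes the argument list without shell=True and drops the `with open(output, "w+")` wrapper, since pdftotext writes the output file itself. That wrapper creates an empty file on its own, so a blank file is exactly what a failed pdftotext call would leave behind. The sketch also raises an error carrying the stderr text when the command fails.

```python
import subprocess
import sys


def build_pdftotext_args(source, output, platform=sys.platform):
    """Build the pdftotext command line used in the question."""
    if platform == "win32":
        flag = "-simple"
    elif platform.startswith("linux"):
        flag = "-layout"
    else:
        raise ValueError("unsupported platform: " + platform)
    return ["pdftotext", flag, source, output]


def run_checked(args, timeout=None):
    """Run a command without a shell; raise if it exits with a non-zero code."""
    # subprocess.run raises FileNotFoundError by itself if args[0] cannot be found.
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "{} exited with code {}: {}".format(
                args[0], result.returncode, result.stderr.strip()
            )
        )
    return result


def rm_main():
    # Entry point called by the Execute Python operator (no input ports here).
    input_file = "D:/pdf-sample.pdf"
    output_file = "D:/ouput.txt"  # spelled as in the original script
    run_checked(build_pdftotext_args(input_file, output_file))
```

With this structure a failure shows up as an exception in the operator instead of an empty file. Note also that the script writes to "D:/ouput.txt" while the posts mention "output.txt", so it may be worth checking which file is being inspected.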
Regards,
Lionel
Hi again @lplenka,
Just to mention: if you want to extract text from a .pdf file, you can use RapidMiner's "Text Processing" extension.
Maybe you can use the operators of this extension to do what you want.
Here is a useful link:
https://community.rapidminer.com/t5/Getting-Started-Knowledge-Base/Keyword-Frequency-in-Text-Mining/ta-p/31618
Regards,
Lionel
Hi @lplenka,
In my case, the output.txt file is empty after running the Execute Python operator with your process.
However, to complete my last post, you can perform this text extraction with the Read Document and Write Document operators of the Text Processing extension.
Here is the process:
Best regards,
Lionel
Thanks @lionelderkrikor for the help.
I will use the text mining operators next time.
By the way, you can try restarting your system; my process will probably start producing the correct result for you too. This is just a hypothesis, but it worked in my case.
Thanks for all the help.