The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
[SOLVED]How to get the computational time for different modules in a process
huaiyanggongzi
Member Posts: 39 Contributor II
I have a workflow to perform document processing for a collection of plain text files. This workflow consists of three sub-processes
1) Tokenize
2) TransformCase
3) Filter token by length
I am trying to log the following information
1) The total computational time for the whole document processing workflow
2) The computational time for each three sub-processes
Here is the process I created. For illustration purposes, I just use 4 text files as input. In this process, I capture “CPU-Time” and “Time” in the log.
Based on this result, I have several questions:
1) Why the logged information are so different for different runs? And the first experiment run even has zero value for all three subprocesses.
2) I have four files, why the log information for three sub-processes has eight rows?
3) In Rapidminer, the CPU-time is measured in millisecond; What is measures in “Time” from log?
Thanks a lot.
1) Tokenize
2) TransformCase
3) Filter token by length
I am trying to log the following information
1) The total computational time for the whole document processing workflow
2) The computational time for each three sub-processes
Here is the process I created. For illustration purposes, I just use 4 text files as input. In this process, I capture “CPU-Time” and “Time” in the log.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="341" width="681">
<operator activated="true" class="select_subprocess" compatibility="5.2.008" expanded="true" height="94" name="Select Subprocess" width="90" x="112" y="30">
<process expanded="true" height="406" width="346">
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files (3)" width="90" x="246" y="75">
<list key="text_directories">
<parameter key="Responsive" value="C:\Desktop\testdata"/>
</list>
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<process expanded="true" height="431" width="697">
<operator activated="true" class="select_subprocess" compatibility="5.2.008" expanded="true" height="94" name="subprocess1" width="90" x="45" y="30">
<process expanded="true" height="406" width="413">
<operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="246" y="165"/>
<connect from_port="input 1" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
<process expanded="true">
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_subprocess" compatibility="5.2.008" expanded="true" height="94" name="subprocess2" width="90" x="246" y="75">
<process expanded="true" height="388" width="314">
<operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="151" y="71"/>
<connect from_port="input 1" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
<process expanded="true">
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_subprocess" compatibility="5.2.008" expanded="true" height="94" name="subprocess3" width="90" x="447" y="30">
<process expanded="true" height="388" width="346">
<operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="239" y="58">
<parameter key="min_chars" value="1"/>
<parameter key="max_chars" value="200"/>
</operator>
<connect from_port="input 1" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
<process expanded="true">
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="112" name="Log" width="90" x="313" y="255">
<list key="log">
<parameter key="tokenCPU" value="operator.Tokenize.value.cpu-time"/>
<parameter key="transformCPU" value="operator.Transform Cases.value.cpu-time"/>
<parameter key="filterCPU" value="operator.Filter Tokens (by Length).value.cpu-time"/>
<parameter key="tokenTime" value="operator.Tokenize.value.time"/>
<parameter key="transformTime" value="operator.Transform Cases.value.time"/>
<parameter key="filterTime" value="operator.Filter Tokens (by Length).value.time"/>
</list>
</operator>
<connect from_port="document" to_op="subprocess1" to_port="input 1"/>
<connect from_op="subprocess1" from_port="output 1" to_op="subprocess2" to_port="input 1"/>
<connect from_op="subprocess1" from_port="output 2" to_op="Log" to_port="through 1"/>
<connect from_op="subprocess2" from_port="output 1" to_op="subprocess3" to_port="input 1"/>
<connect from_op="subprocess2" from_port="output 2" to_op="Log" to_port="through 2"/>
<connect from_op="subprocess3" from_port="output 1" to_port="document 1"/>
<connect from_op="subprocess3" from_port="output 2" to_op="Log" to_port="through 3"/>
<connect from_op="Log" from_port="through 1" to_port="document 2"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<portSpacing port="sink_document 3" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Process Documents from Files (3)" to_port="word list"/>
<connect from_op="Process Documents from Files (3)" from_port="example set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="76" name="Log-DP" width="90" x="380" y="210">
<list key="log">
<parameter key="pdCPUTime" value="operator.Process Documents from Files (3).value.cpu-time"/>
<parameter key="pdTime" value="operator.Process Documents from Files (3).value.time"/>
</list>
</operator>
<connect from_op="Select Subprocess" from_port="output 1" to_port="result 2"/>
<connect from_op="Select Subprocess" from_port="output 2" to_op="Log-DP" to_port="through 1"/>
<connect from_op="Log-DP" from_port="through 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Based on this result, I have several questions:
1) Why the logged information are so different for different runs? And the first experiment run even has zero value for all three subprocesses.
2) I have four files, why the log information for three sub-processes has eight rows?
3) In Rapidminer, the CPU-time is measured in millisecond; What is measures in “Time” from log?
Thanks a lot.
0
Answers
I think the text of your post does not match the process you posted (I only see one subprocess, and the log file contains only one row because it placed on the top level of the process).
About the times: if I remember correctly, the CPU time is measured in microseconds, the time in milliseconds.
Btw, in RapidMiner we call the modules "operators"
Best, Marius
Thanks for answering my question. Please make “text directories” option of operator “Process Documents From Files” to point to a local directory which contains some text files. It will generate two log outputs.
On my machine, I run this process four times. I recorded the two log outputs for each of these four experiment runs.
The first run:
log for the "process document" operator, the whole workflow
31200200 63
0 0 0 0 0 0
15600100 15600100 15600100 16 16 16
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The second run:
15600100 390
0 0 0 0 0 0
0 0 0 16 16 16
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The third run:
46800300 62
0 0 0 0 0 0
15600100 0 0 15 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The fourth run
31200200 47
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 16 16 16
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Secondly, we always get 8 rows for the log output of sub-processes. If I change the number of documents to 10, I will get 20 rows. Looks like the number of rows always double the number of input documents. I am not very clear where does this relationship come from?
In addition, the microsecond is 1/1000 of millisecond, But the result does not match this relationship. For instance, in the second run, the pdCPUTime = 15600100 microseconds, and pdTime=390 milliseconds It seems to me that pdTime should equal to 15600 milliseonds instead.
Thanks.
I looked it up in the code, the CPU time is in nanoseconds, not microseconds.
Still, the numbers don't make always sense. FYI, we are using ManagementFactory.getThreadMXBean().getThreadCpuTime(Thread.currentThread().getId()) to calculate the cpu time, and the documentation says: "Returns the total CPU time for a thread of the specified ID in nanoseconds. The returned value is of nanoseconds precision but not necessarily nanoseconds accuracy."
Maybe it has problems with these small times... on longer running operators it seems to deliver reasonable output - see the outer log operator.
Best, Marius
I still have one question:
As you mentioned, cpu time is in nanoseconds and time is in milliseconds. But one of the experiments shows that the CPU time and Time for a single process as follows
pdCPUTime pdTime
31200200 47
It looks like pdCPUTime does not equal to pdTime * 1000000
The CPU time shows how long the cpu has been used by that operator. But since you are running a multi-threading operating system, the cpu can also be used by other processes while the operator is executed, and thus the operator may take longer than it uses the cpu. In addition, if the operator performs network operations or accesses the disk, it needs to wait for responses from the respective devices, where it consumes time, but does not use the cpu.
Best,
Marius