"[SOLVED] java.lang.NullPointerException in simple text mining script"

wmarella Member Posts: 5 Contributor II
edited June 2019 in Help
Hello, I'm new to text mining and RapidMiner. I'm following a tutorial in "Practical Text Mining" but can't make a very simple script work. The process fails and returns a java.lang.NullPointerException error. I'm running Mac OS X 10.6.8, Java 13.7.2, RapidMiner 5.2.006.

I'm using the Read Excel operator to load a simple three-column spreadsheet. The columns are ID, Year, and Abstract. Abstract contains the text I'm trying to mine. In the import wizard I've flagged ID as the id field and Abstract as text. There are 901 examples in the example set, and the Read Excel operator appears to be working because I see my data when hovering over its output node. The data also looks correct going into the Process Documents from Data (PDFD) operator at the exa node.

On the PDFD operator, create word vector is checked (TF-IDF), as is keep text. PDFD contains a subprocess: Transform Cases and Tokenize. I've removed all other operators from the process in order to isolate PDFD as the problem. When I hover over the output node of PDFD, it says Examples=0 but still shows my 3 attribute names.

Here is the XML code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
   <parameter key="logverbosity" value="init"/>
   <parameter key="random_seed" value="2001"/>
   <parameter key="send_mail" value="never"/>
   <parameter key="notification_email" value=""/>
   <parameter key="process_duration_for_mail" value="30"/>
   <parameter key="encoding" value="SYSTEM"/>
   <process expanded="true" height="251" width="413">
     <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
       <parameter key="excel_file" value="/Users/Bill/Desktop/Literature_Datsset_1994-2005.xls"/>
       <parameter key="sheet_number" value="1"/>
       <parameter key="imported_cell_range" value="A1:E902"/>
       <parameter key="encoding" value="SYSTEM"/>
       <parameter key="first_row_as_names" value="true"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <parameter key="date_format" value=""/>
       <parameter key="time_zone" value="SYSTEM"/>
       <parameter key="locale" value="English (United States)"/>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="ID.true.nominal.attribute"/>
         <parameter key="1" value="YEAR.true.nominal.attribute"/>
         <parameter key="2" value="JOURNAL.true.nominal.attribute"/>
         <parameter key="3" value="ABSTRACT.true.text.attribute"/>
       </list>
       <parameter key="read_not_matching_values_as_missings" value="true"/>
       <parameter key="datamanagement" value="double_array"/>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.2.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="75">
       <parameter key="create_word_vector" value="true"/>
       <parameter key="vector_creation" value="TF-IDF"/>
       <parameter key="add_meta_information" value="true"/>
       <parameter key="keep_text" value="false"/>
       <parameter key="prune_method" value="absolute"/>
       <parameter key="prunde_below_percent" value="3.0"/>
       <parameter key="prune_above_percent" value="30.0"/>
       <parameter key="prune_below_absolute" value="3"/>
       <parameter key="prune_above_absolute" value="55"/>
       <parameter key="prune_below_rank" value="0.05"/>
       <parameter key="prune_above_rank" value="0.05"/>
       <parameter key="datamanagement" value="double_sparse_array"/>
       <parameter key="select_attributes_and_weights" value="false"/>
       <list key="specify_weights"/>
       <process expanded="true" height="340" width="634">
         <operator activated="true" class="text:transform_cases" compatibility="5.2.002" expanded="true" height="60" name="Transform Cases" width="90" x="59" y="109">
           <parameter key="transform_to" value="lower case"/>
         </operator>
         <operator activated="true" class="text:tokenize" compatibility="5.2.002" expanded="true" height="60" name="Tokenize" width="90" x="169" y="110">
           <parameter key="mode" value="non letters"/>
           <parameter key="characters" value=".:"/>
           <parameter key="language" value="English"/>
           <parameter key="max_token_length" value="3"/>
         </operator>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="297" y="111"/>
         <operator activated="true" class="text:filter_by_length" compatibility="5.2.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="456" y="104">
           <parameter key="min_chars" value="2"/>
           <parameter key="max_chars" value="55"/>
         </operator>
         <connect from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
         <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="45" y="165">
       <parameter key="add_cluster_attribute" value="true"/>
       <parameter key="add_as_label" value="false"/>
       <parameter key="remove_unlabeled" value="false"/>
       <parameter key="k" value="2"/>
       <parameter key="max_runs" value="10"/>
       <parameter key="determine_good_start_values" value="false"/>
       <parameter key="measure_types" value="BregmanDivergences"/>
       <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
       <parameter key="nominal_measure" value="NominalDistance"/>
       <parameter key="numerical_measure" value="EuclideanDistance"/>
       <parameter key="divergence" value="SquaredEuclideanDistance"/>
       <parameter key="kernel_type" value="radial"/>
       <parameter key="kernel_gamma" value="1.0"/>
       <parameter key="kernel_sigma1" value="1.0"/>
       <parameter key="kernel_sigma2" value="0.0"/>
       <parameter key="kernel_sigma3" value="2.0"/>
       <parameter key="kernel_degree" value="3.0"/>
       <parameter key="kernel_shift" value="1.0"/>
       <parameter key="kernel_a" value="1.0"/>
       <parameter key="kernel_b" value="0.0"/>
       <parameter key="max_optimization_steps" value="100"/>
       <parameter key="use_local_random_seed" value="false"/>
       <parameter key="local_random_seed" value="1992"/>
     </operator>
     <connect from_port="input 1" to_op="Read Excel" to_port="file"/>
     <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
     <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="source_input 2" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>
Here is the stack trace:

Stack trace:
------------

Exception: java.lang.NullPointerException
Message: null
Stack trace:
 com.rapidminer.operator.nio.model.ExcelResultSetConfiguration.makeDataResultSet(ExcelResultSetConfiguration.java:275)
 com.rapidminer.operator.nio.model.AbstractDataResultSetReader.createExampleSet(AbstractDataResultSetReader.java:127)
 com.rapidminer.operator.io.AbstractExampleSource.read(AbstractExampleSource.java:52)
 com.rapidminer.operator.io.AbstractExampleSource.read(AbstractExampleSource.java:36)
 com.rapidminer.operator.io.AbstractReader.doWork(AbstractReader.java:123)
 com.rapidminer.operator.Operator.execute(Operator.java:834)
 com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
 com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
 com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
 com.rapidminer.operator.Operator.execute(Operator.java:834)
 com.rapidminer.Process.run(Process.java:925)
 com.rapidminer.Process.run(Process.java:848)
 com.rapidminer.Process.run(Process.java:807)
 com.rapidminer.Process.run(Process.java:802)
 com.rapidminer.Process.run(Process.java:792)
 com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)
Thanks in advance for any help you can offer!

Bill

Answers

  • wmarella Member Posts: 5 Contributor II
    I don't know why this should matter, but I fixed the problem simply by deleting the header row in my source Excel file. I discovered this while trying to determine whether the problem was with my data or with the operator's code. I made a simple 4-example set in Excel and by chance didn't bother labeling the columns. Everything imported fine and the Process Documents operator produced the word vector. So I tried the same thing on my real 901-example file and it too worked fine. I'll leave it to the developers to see whether this is a bug in which the operator cannot handle labels in the first row.
  • Marco_Boeck Administrator, Moderator, Employee-RapidMiner, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    There was indeed a bug involved; it should be fixed in the next release.

    Regards,
    Marco
  • balmerhevi Member Posts: 2 Contributor I

    NullPointerException is a RuntimeException. Runtime exceptions are unchecked: the compiler does not require you to catch or declare them, so they are not detected at compile time. If they are not handled, they terminate the running thread, which usually crashes the program. When a class is instantiated, the resulting object is stored in memory. A NullPointerException occurs when you try to use a reference that points to no object (null) as though it referenced one. This includes:

    1. Calling the instance method of a null object.
    2. Accessing or modifying the field of a null object.
    3. Throwing null as if it were a Throwable value.
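
The three cases above can be reproduced with a minimal Java sketch. The class `NpeDemo`, the helper class `Record`, and all method names are hypothetical and exist only for this illustration; they are not part of RapidMiner. Each method triggers one of the listed cases and reports whether a NullPointerException was thrown, and `safeLength` shows one common way to guard against a null reference.

```java
class NpeDemo {
    /** Hypothetical holder class, used only for this illustration. */
    static class Record {
        String header;
    }

    /** Case 1: calling an instance method on a null reference. */
    static boolean callMethodOnNull() {
        String text = null;
        try {
            text.length(); // throws NullPointerException at run time
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Case 2: accessing a field through a null reference. */
    static boolean accessFieldOfNull() {
        Record record = null;
        try {
            String header = record.header; // throws NullPointerException
            return header != null && false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Case 3: throwing null as if it were a Throwable. */
    static boolean throwNull() {
        RuntimeException nothing = null;
        try {
            throw nothing; // the throw statement itself raises NullPointerException
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** A simple guard: check for null before dereferencing. */
    static int safeLength(String s) {
        return s == null ? 0 : s.length();
    }

    public static void main(String[] args) {
        System.out.println("Case 1 raised NPE: " + callMethodOnNull());
        System.out.println("Case 2 raised NPE: " + accessFieldOfNull());
        System.out.println("Case 3 raised NPE: " + throwNull());
        System.out.println("safeLength(null) = " + safeLength(null));
    }
}
```

In the stack trace posted above, the exception message is null and the topmost frame is ExcelResultSetConfiguration.makeDataResultSet, so the null reference was dereferenced inside RapidMiner's Excel reading code rather than in user-written code; a guard like `safeLength` is only something the library's developers could add there.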

    Balmer

  • magui_taillefer Member Posts: 1 Learner III

    I had the same problem (RapidMiner 6.5.2). Because I use several attributes and it is confusing to have no names, I tried importing the Excel sheets as CSV instead, and it works (slowly) but without errors.

    Cheers,

    ME. Taillefer