"text mining output"
I am trying to text mine 30K email excerpts collated into one file. I know something is wrong because the frequency counts for words that I would expect to be frequent are coming up as zero.
id id integer avg = 1 +/- 0 [1.000 ; 1.000] 0.0
label label nominal mode = bp (1), least = bp (1) bp (1) 0.0
regular Dear real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Wells real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Fargo real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular online real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular bill real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular transactions real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular National real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Benefit real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Life real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Insurance real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Company real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular another real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Both real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular were real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular deducted real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular checking real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
The log also references an issue with the example set at the end even though I have set it to overwrite.
P Aug 13, 2009 11:57:53 AM: Process:
Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[2] (StringTokenizer)
| +- StopwordFilterFile[2] (StopwordFilterFile)
| +- TokenLengthFilter[0] (TokenLengthFilter)
+- ExampleSetWriter[1] (ExampleSetWriter)
P Aug 13, 2009 11:57:53 AM: [Warning] TextInput: Warning: Encoding unknown. Using default.
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: The original example example set already contains an attribute named "label". This is likely to cause trouble. Please rename the attribute in the original example set.
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: There is a term that equals the class attribute, renaming it
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: Warning: Encoding unknown. Using default.
P Aug 13, 2009 11:57:59 AM: Process:
Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[2] (StringTokenizer)
| +- StopwordFilterFile[2] (StopwordFilterFile)
| +- TokenLengthFilter[2] (TokenLengthFilter)
+- ExampleSetWriter[1] (ExampleSetWriter)
P Aug 13, 2009 11:57:59 AM: Produced output:
IOContainer (1 objects):
SimpleExampleSet:
1 examples,
34729 regular attributes,
special attributes = {
id = #0: id (integer/single_value)
label = #34730: label (nominal/single_value)/values=[bp]
}
(created by TextInput)
P Aug 13, 2009 11:57:59 AM: [NOTE] Process finished successfully after 5 s
G Aug 13, 2009 11:57:59 AM: [NOTE] Cannot use plotter 'Scatter Matrix': Data table must have between 0 and 50 columns, was 34730.
G Aug 13, 2009 11:57:59 AM: [NOTE] Cannot use plotter 'Survey': Data table must have between 0 and 100 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'Andrews Curves': Data table must have between 0 and 1000 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'Quartile Color Matrix': Data table must have between 0 and 100 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'RadViz': Data table must have between 0 and 1000 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'GridViz': Data table must have between 0 and 10000 columns, was 34730.
Lastly, how can I use visualization to see frequent terms, words, etc.?
Answers
Could you please post the complete process here inside a code area? Press the # button to create one. Otherwise I cannot say anything about the problem with the zeros.
Unfortunately, direct visualization of term frequencies will only be available in the next version. In the meantime, you could switch from TFIDF to term occurrences and then aggregate the complete example set. That would give you the total number of occurrences for each word.
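To illustrate the idea outside RapidMiner, here is a minimal Python sketch of counting raw term occurrences per text and summing them over all texts. The function names, the simple letters-only tokenizer, and the sample emails are assumptions made for illustration; they are not what the text plugin does internally.

```python
from collections import Counter
import re


def term_occurrences(text):
    """Count raw occurrences of each lowercase word token in one text."""
    # Assumption: a token is a run of ASCII letters; real tokenizers differ.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def aggregate_occurrences(texts):
    """Sum the per-text occurrence counts into one total per term."""
    total = Counter()
    for text in texts:
        total.update(term_occurrences(text))
    return total


# Hypothetical sample data, one string per email excerpt.
emails = [
    "Dear Wells Fargo, my online bill pay failed",
    "Both transactions were deducted from checking",
    "Please cancel online bill pay",
]

totals = aggregate_occurrences(emails)

# List the most frequent terms first.
for term, count in totals.most_common(5):
    print(term, count)
```

Sorting the aggregated totals this way gives a simple ranked list of frequent terms, which can stand in for a plot until a dedicated visualization is available.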
Greetings,
Sebastian
The problem is really simple: you have loaded your complete data as ONE example. In TFIDF encoding, every frequency will then be zero, because a term that occurs in every document (here, the only document) gets an inverse document frequency of zero. The TextInput operator reads each file found in the specified directory as a single example.
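The effect can be reproduced with a small Python sketch. It assumes the textbook weighting tf * log(N / df); the exact formula and normalization used by the TextInput operator are not stated in this thread, so treat the numbers as illustrative only.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Everything collated into ONE document: N = 1 and df = 1 for every term,
# so log(N / df) = log(1) = 0 and every weight is zero.
single = tfidf(["dear wells fargo online bill pay bill"])

# The same words split into two documents: terms that do not occur in
# every document now receive a positive weight.
split = tfidf(["dear wells fargo", "online bill pay bill"])

print(single[0])
print(split[1])
```

With a single document the inverse document frequency is zero for every term, which matches the all-zero statistics shown in the question.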
Greetings,
Sebastian
If it's stored in something like CSV, you could load it as an example set, change the attribute type to string using the Nominal2String operator, and then use the StringTextInput operator. This one will treat each row of the example set as one text.
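As a rough analogue of the "one row, one text" idea, the following Python sketch reads a CSV source and returns each row's text field as a separate document. The column names `id` and `excerpt` and the sample rows are hypothetical; the actual layout of the file is not given in this thread.

```python
import csv
import io


def rows_as_documents(csv_file, text_column):
    """Return the text of each CSV row as its own document, skipping blanks."""
    reader = csv.DictReader(csv_file)
    documents = []
    for row in reader:
        # A missing field yields None, so fall back to an empty string.
        text = (row.get(text_column) or "").strip()
        if text:
            documents.append(text)
    return documents


# Hypothetical in-memory CSV standing in for the real file.
sample = io.StringIO(
    "id,excerpt\n"
    "1,no longer wanted bill pay\n"
    "2,Both transactions were deducted from checking\n"
)

documents = rows_as_documents(sample, "excerpt")
print(len(documents), "documents")
```

Each entry in `documents` then plays the role of one example, so term weights are computed across many texts instead of one.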
Greetings,
Sebastian
Aug 27, 2009 3:49:44 PM: [Warning] StringTextInput: File C:\Program Files\Rapid-I\RapidMiner\no longer wanted bill pay not found. Assuming the text is directly encoded as document source...
for each and every record.
Here is my xml
You could increase the log verbosity in the process root operator to avoid this. Unfortunately, nobody knows why the text plugin does this. RapidMiner 5 comes with a new TextPlugin version, redesigned from scratch, which does not show this behavior.
Greetings,
Sebastian