"text mining"
Hello everybody,
which operator should I use to load a series of text files (.txt or .xml)?
thank you,
laura
Answers
You can also use ExampleSource and then StringTextInput. I learned this from the examples.
I tried TextInput too, following the example, but the output of the operator is not what it should be:
instead of a table with documents as rows and terms as columns, I get a table with COLUMNS=DOCUMENTS
sure
My goal is to analyse a series of articles in .txt format.
To do this I have to load the .txt files, for example with TextInput.
According to this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:
-ROWS: articles
-COLUMNS: terms
(this is stated right after the second image on the page I linked).
This matrix, usually called a Document Term Matrix, tells you which words (columns) each document (row) contains. It is a sparse matrix of binary values, and it is used in the next steps of the analysis.
BUT...instead of this, I get a table like this:
-ROWS: progressive id of the article
-COLUMNS: article (i.e. all the text of each article is the label of an attribute!!!)
...and I don't know:
1) if this is correct...but I don't think so
2) how to solve the problem
I hope I have explained myself better this time. Thank you for the reply and for the help!
ciao,
laura
It should be:
-ROWS: article id
-COLUMNS: terms
You see -ROWS: id number because you defined the id_attribute_type as number.
If you change the id_attribute_type to short or long instead, you will get the filename or the filename+path of the article. The idea here is that you do not get the whole article, just a reference id to the article.
You should get
-COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput. Those operators act as filters to get a better output)
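
To make the expected layout concrete outside RapidMiner, here is a minimal Python sketch of a binary document-term matrix with one row per article id and one column per term. It is not the TextInput operator: the function names, the id_type argument (which only mimics the number / short / long choice of id_attribute_type described above) and the simple letters-only tokenizer are all assumptions made for illustration.

```python
import re
from pathlib import Path


def document_id(path, index, id_type="short"):
    """Return a document id: running number, file name, or full path.

    Hypothetical helper that imitates the number / short / long options.
    """
    if id_type == "number":
        return index
    if id_type == "short":
        return Path(path).name
    if id_type == "long":
        return str(path)
    raise ValueError(f"unknown id_type: {id_type!r}")


def load_text_files(directory, id_type="short"):
    """Read every .txt file in `directory` into a {document id: text} dict."""
    docs = {}
    paths = sorted(Path(directory).glob("*.txt"))
    for index, path in enumerate(paths, start=1):
        docs[document_id(path, index, id_type)] = path.read_text(encoding="utf-8")
    return docs


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (assumed filter)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Build a binary matrix: rows follow the document ids, columns the sorted terms."""
    tokens = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    terms = sorted(set().union(*tokens.values()))
    ids = list(docs)
    matrix = [[1 if term in tokens[doc_id] else 0 for term in terms] for doc_id in ids]
    return ids, terms, matrix
```

Calling document_term_matrix(load_text_files("articles")) on a folder of .txt files (the folder name is only an example) returns the row ids, the column terms and the 0/1 matrix. The rows carry only a reference id, never the article text itself, which matches the idea described in the reply above.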