Anyhow, I'm trying to do some text analysis: I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words.
My question is, in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percent of times a word appears in the total set of words or is it the percent of documents that a word appears in?
Also, what is the range column?
I didn't see the answer in the GUI tutorial.
Any help is appreciated. Thanks,
mj
It's quite simple: these columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, for every numerical attribute. If you have loaded your text in a TF-IDF representation, they show the mean and standard deviation of the TF-IDF values. The same goes for the range, whose name is quite self-explanatory, I think...
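To make that concrete, here is a minimal sketch in plain Python rather than RapidMiner. It assumes a common textbook TF-IDF variant (relative term frequency times log(N/df)), which is not necessarily the exact weighting RapidMiner applies, and it uses the population standard deviation, which is also an assumption. The documents and all names (`docs`, `tfidf`, `stats`) are made up for illustration. Each word becomes one real-valued column with one value per document, and the statistics are computed down that column.

```python
import math
from statistics import mean, pstdev

# Hypothetical documents, already lowercased, tokenized and stop-word filtered.
docs = [
    ["hello", "world", "hello"],
    ["hello", "text", "mining"],
    ["text", "mining", "world", "world"],
]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}

def tfidf(word, doc):
    """Textbook TF-IDF (an assumption, not necessarily RapidMiner's formula)."""
    tf = doc.count(word) / len(doc)      # relative term frequency in this document
    idf = math.log(n_docs / df[word])    # rarer words get a larger weight
    return tf * idf

# One row per document, one column (attribute) per word.
table = [[tfidf(w, doc) for w in vocab] for doc in docs]

# Column-wise statistics, analogous to what a statistics view would list.
stats = {}
for j, w in enumerate(vocab):
    column = [row[j] for row in table]
    stats[w] = {
        "mean": mean(column),
        "std": pstdev(column),
        "min": min(column),
        "max": max(column),
    }

for w in vocab:
    s = stats[w]
    print(f"{w:8s} mean={s['mean']:.3f} +/- {s['std']:.3f} "
          f"range=[{s['min']:.3f}, {s['max']:.3f}]")
```

Under these assumptions the range of a word is simply its smallest and largest weight over all documents, so a word that is missing from at least one document has a minimum of 0.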
Greetings,
Sebastian
I understand range mathematically, however what does it mean in the text mining domain? If I have a range of the word "Hello" from 0 to .003 and a mean of .002 (I'm making this up), the discrete nature of the word doesn't fit in the definition of the range in my head.
Forgive my empty head.
Again, thanks.
mj
I see that "html" has a value type of "real", average of 0.088 +/- 0.073, range of 0.003 to 0.0530.
mj
Why not? It's just the standard deviation of the values of this attribute, regardless of whether they are numbers of occurrences, a TF-IDF representation, or simply temperatures. Where's the problem in calculating a standard deviation from two values?
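As a rough illustration of that point, the sketch below applies one and the same summary function to columns with completely different meanings. The function name `describe` and all numbers are hypothetical, and the sample standard deviation is used here, which may differ from what the statistics view actually computes.

```python
from statistics import mean, stdev

def describe(values):
    """Summarize a numeric column without caring what the numbers mean."""
    return {
        "mean": mean(values),
        "std": stdev(values),              # sample standard deviation, needs >= 2 values
        "range": (min(values), max(values)),
    }

# Hypothetical columns: word counts per document and temperatures.
occurrences = [0, 2, 1, 0, 3]
temperatures = [18.5, 21.0, 19.5, 22.0, 20.0]

print(describe(occurrences))
print(describe(temperatures))

# Two values are already enough for a standard deviation.
print(describe([1, 3]))
```

The function never looks at where the numbers came from, which is why the same columns appear in the statistics view for text data as for any other numerical attribute.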
Greetings,
Sebastian