Select most important words
Hi to everybody,
After a standard Process Documents step that creates a word vector (TF-IDF), is it possible to select, for each document, only the terms (attributes) whose TF-IDF values add up to the upper half of that document's total TF-IDF sum, or to some other percentage? Maybe I have to use a weighting operator, but I don't know which one or how. I need this to reduce the number of attributes.
Thank you all!
Answers
Hi,
I am not 100% sure what you want to achieve, but it sounds like you could potentially use the pruning parameters for this. I would suggest checking them out and giving this a try.
Hope this helps,
Ingo
Hi Ingo,
Thank you for your answer. Pruning does not work for my use case because it can eliminate words that are important for a single document.
Here is a screenshot to show better what I need:
"Cumulated" is the sum of the TF-IDF values of a single text. I want to select only the highest-value terms that account for, for example, 50% of the cumulated value. I transposed the matrix only for better visualization. This way I hope to obtain only the important words for each single document. Sometimes a specific word has a high TF-IDF value for one document and a low value for another. The goal is to keep only the words with a strong weight in each document, measured against that document's "cumulated" value, or to set the lower-value words to 0, so that I can go on with my analysis.
Thank you, I hope someone can help me.
Here is the XML, sorry, I forgot!
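For readers outside RapidMiner, the selection described in the question can be sketched in Python with NumPy. This is only an illustration under assumptions, not the process from the missing XML: the function name `keep_top_share`, the `share` parameter, and the dense documents-by-terms matrix layout are all hypothetical.

```python
import numpy as np

def keep_top_share(tfidf, share=0.5):
    """For each document (row), keep the highest TF-IDF terms whose sum
    first reaches `share` of the row's total; set all other terms to 0.
    Hypothetical helper, assuming a dense documents-by-terms matrix."""
    tfidf = np.asarray(tfidf, dtype=float)
    result = np.zeros_like(tfidf)
    for i, row in enumerate(tfidf):
        total = row.sum()  # the "cumulated" value of this document
        if total <= 0:
            continue  # empty document: nothing to keep
        order = np.argsort(row)[::-1]      # term indices, highest value first
        cumulated = np.cumsum(row[order])  # running sum over the sorted terms
        # index of the first term at which the running sum reaches the target
        n_keep = int(np.searchsorted(cumulated, share * total)) + 1
        keep = order[:min(n_keep, row.size)]
        result[i, keep] = row[keep]
    return result

docs = np.array([[4.0, 2.0, 1.0, 1.0],
                 [1.0, 1.0, 3.0, 3.0]])
selected = keep_top_share(docs, share=0.5)
```

Because the cut is made per row, a word can survive in one document and be set to 0 in another, which is the behaviour asked for above.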
Hi,
Got it now, thanks :-)
If the threshold is exactly 50%, the easiest way to achieve this is to use a median aggregation. Here is the concept:
For percentages other than 50%, you will need to come up with a smarter threshold calculation. Or you simply keep the median and apply a "correction" factor, e.g. something like 0.9 * median to include more features or 1.1 * median to remove more.
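The median idea, including the correction factor, could look roughly like this in Python with NumPy. This is a sketch and not the RapidMiner process itself: the name `keep_above_median` and the `factor` parameter are hypothetical, and taking the median over the non-zero terms of each document only is an assumption (in a sparse word vector the median over all terms would often be 0).

```python
import numpy as np

def keep_above_median(tfidf, factor=1.0):
    """For each document (row), set to 0 every term whose TF-IDF value is
    below factor * median of that row's non-zero values.
    Hypothetical helper; the non-zero restriction is an assumption."""
    tfidf = np.asarray(tfidf, dtype=float)
    result = np.zeros_like(tfidf)
    for i, row in enumerate(tfidf):
        present = row[row > 0]  # terms that actually occur in the document
        if present.size == 0:
            continue  # empty document: nothing to keep
        threshold = factor * np.median(present)
        mask = row >= threshold
        result[i, mask] = row[mask]
    return result

docs = np.array([[0.0, 1.0, 2.0, 3.0, 4.0]])
upper_half = keep_above_median(docs)            # threshold 2.5
more_terms = keep_above_median(docs, factor=0.5)  # threshold 1.25
```

A factor below 1, such as 0.9, lowers the threshold and keeps more terms; a factor above 1 keeps fewer. Note that the median splits the terms of a document by count, so it only approximates a 50% share of the summed TF-IDF values.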
Below are the links explaining how to work with macros as well as a sample process (you will need to adapt the data sources).
Hope this helps,
Ingo
Information on macros:
Example process:
For all -
I have put @IngoRM's solution & data sets in the community repository:
Selecting 'Most Important Words' of a Document Corpus
Scott