Parsing attributes
Hello,
Perhaps this is a simple question with a simple answer.
I am building a predictive model. As input I have several attributes, two of which are actually lists of words. For example, one attribute is called "keywords", and it contains a variable number of key terms.
I'm wondering if this attribute, which is really a list of terms, is being treated as a single text string/blob, rather than being parsed into individual words/tokens. RapidMiner's Auto Model suggests that this attribute is NOT helpful to the predictive modeling process, but I think that is because it is treating this attribute - which is actually a list of terms - as a single text string.
Thus, my questions are:
1) I assume that most/all models will treat a field such as this quite differently depending on whether it is presented as a single text string or as a list of individual keywords?
2) I don't know how to parse/tokenize this attribute so that what the model sees is a list of individual keywords rather than a single text string/blob.
Thanks in advance for any assistance or clarification.
- Adam
Answers
Hi @adamf, have you tried text processing? https://community.rapidminer.com/t5/RapidMiner-Text-Analytics-Web/Text-Mining-Use-Cases-and-Capabilities-with-RapidMiner/ta-p/48592
You can leverage the term frequencies from TF-IDF for the predictive model.
Best,
YY
Hello YY,
I am familiar with the text processing techniques that are described in your linked PDF file. However, I don't think that fully answers the question.
The text fields/attributes in question add information about each item/row in the example set. For example, one of the text field columns contains a list of "categories" (classifications) into which each example in the example set falls. Based on the class label of my training data, it appears to me that many of the examples labeled as "Fraudulent" (vs. "Legitimate") mention "Extreme Graphic/Explicit Language" in the Categories column. However, additional categories may also appear in the example's Categories list, such as "Non-Standard Content". So, the field is a list of one or more categories and may look like this: "Extreme Graphic/Explicit Language Non-Standard Content".
Thus, my question is multi-part:
1) My hypothesis is that a predictive model might take advantage of this "Categories" column by, for example, learning that many examples that have "Extreme Graphic/Explicit Language" in the Categories column have a class label of "Fraudulent".
2) However, since the Categories column is currently a concatenation of one or more categories, I am not sure that the data is parsed and processed as I intended.
3) I am also not sure which (if any) predictive models can take advantage of textual attributes such as my "Categories" attribute.
Regards,
Adam
If the categories in the text column are neat and separated by some delimiter, you can use "split" to parse them into separate columns, one per category. Otherwise, you can still manually define binary codes (1/0, true/false) for each separate category.
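To make the two options above concrete, here is a minimal Python sketch (outside RapidMiner). It assumes a hypothetical list of known category names, taken from the example earlier in this thread, and a hypothetical ";" delimiter for the "neat" case. When there is no delimiter, it simply checks whether each known category name appears in the cell and emits a 1/0 flag per category.

```python
# Hypothetical category vocabulary; the two names come from the example in this thread.
KNOWN_CATEGORIES = [
    "Extreme Graphic/Explicit Language",
    "Non-Standard Content",
]


def split_delimited(cell, delimiter=";"):
    """Option 1: split a delimited cell into trimmed, non-empty category names.

    The ";" delimiter is an assumption; use whatever your data actually contains.
    """
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


def encode_categories(cell, known=KNOWN_CATEGORIES):
    """Option 2: build 1/0 flags by searching the cell for each known category."""
    return {name: int(name in cell) for name in known}


# Example cells, shaped like the Categories column described above.
rows = [
    "Extreme Graphic/Explicit Language Non-Standard Content",
    "Non-Standard Content",
    "",
]

encoded = [encode_categories(cell) for cell in rows]
for flags in encoded:
    print(flags)
```

Each resulting flag can then be used as its own binary attribute. Note that plain substring matching will misfire if one category name is contained inside another, so a real category list should be checked for that.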
After doing some reading/researching, I see that in order to be interpreted by most/all predictive models, my textual attributes will need to be converted/mapped into numeric values, possibly using a mapping function for my Categories attribute and some other function (word2vec?) for the Keywords column. Please let me know if you have specific suggestions or recommendations.
Thanks,
Adam
Thank you. Your suggestions for the Categories field conversion/mapping are very helpful.
I have one other textual attribute that is called Keywords. It consists of a variable number of keywords (as calculated by an NLTK method). Is there a function (word2vec?) that would be appropriate to convert each keyword list into a "numeric" value, or do I need to separate the list into individual words first and then think about converting each?
- Adam
Word2vec is available from Marketplace. https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_word2vec
But I do not think word2vec is necessary. TF-IDF may be enough for phrase recognition. Just define a list of strings for the target categories, then use it as the word list input of the Process Documents operator when computing TF-IDF.
The key value of the unstructured text data lies in the term frequencies of the keywords/phrases linked with each category.
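As a rough illustration of TF-IDF restricted to a fixed word list, here is a small Python sketch. It is not the RapidMiner operator, and the exact weighting RapidMiner applies may differ. The keyword strings, the word list, and the function names are made up for the example, and it assumes that keywords are single tokens separated by whitespace (multi-word phrases would need a different tokenizer).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumes single-token keywords)."""
    return text.lower().split()


def tfidf_with_wordlist(documents, wordlist):
    """Return one {term: weight} dict per document, limited to terms in wordlist."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word-list term occur?
    df = {term: sum(1 for toks in tokenized if term in toks) for term in wordlist}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty cells
        vec = {}
        for term in wordlist:
            tf = counts[term] / total
            # Smoothed inverse document frequency (one common variant).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


# Hypothetical Keywords column and word list.
keywords_column = [
    "refund urgent wire",
    "invoice refund",
    "newsletter update",
]
wordlist = ["refund", "wire", "newsletter"]

vectors = tfidf_with_wordlist(keywords_column, wordlist)
for vec in vectors:
    print(vec)
```

Every row ends up with the same fixed set of numeric columns (one per word-list term), which is the form a predictive model can consume. Terms that occur in fewer rows get a higher weight than terms that occur in many.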
Hi @yyhuang,
Would you please provide a short RM process/example? I'm still unclear about how TF-IDF helps in this scenario. I've used TF-IDF primarily for identifying important terms across a corpus of documents. I'm also uncertain how to combine/include the output of the TF-IDF operator with the other attributes that will be input into the model for training/predicting.
Thanks,
Adam
Hi Adam @adamf,
Please refer to the process here for predicting the category of on-sale items with text mining.
My input data has text descriptions of the purchased items (attached is an example input), and also some meta-attributes such as the channel and merchant names. Of course, you can create a customized word list and use it as the input for text processing (word list input).
Regards,
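
For readers wondering how text-derived weights and the other attributes end up in one training table, here is a minimal Python sketch of the idea. It is not the attached process: the TF-IDF values, the attribute names "channel" and "merchant", and the function name are hypothetical. It places the per-row TF-IDF weights next to one-hot encoded nominal meta-attributes so that each example becomes a single numeric row.

```python
def build_feature_rows(text_vectors, meta_rows, wordlist):
    """Combine per-row TF-IDF weights with one-hot encoded nominal attributes."""
    # Collect the distinct values seen for each nominal meta-attribute.
    levels = {}
    for row in meta_rows:
        for attr, value in row.items():
            levels.setdefault(attr, set()).add(value)
    onehot_cols = [(attr, v) for attr in sorted(levels) for v in sorted(levels[attr])]

    header = [f"tfidf_{t}" for t in wordlist] + [f"{a}={v}" for a, v in onehot_cols]
    table = []
    for vec, meta in zip(text_vectors, meta_rows):
        text_part = [vec.get(t, 0.0) for t in wordlist]
        meta_part = [1.0 if meta.get(a) == v else 0.0 for a, v in onehot_cols]
        table.append(text_part + meta_part)
    return header, table


# Hypothetical inputs: TF-IDF weights per row plus two nominal meta-attributes.
wordlist = ["refund", "wire"]
text_vectors = [
    {"refund": 0.43, "wire": 0.56},
    {"refund": 0.64, "wire": 0.0},
]
meta_rows = [
    {"channel": "web", "merchant": "acme"},
    {"channel": "store", "merchant": "acme"},
]

header, table = build_feature_rows(text_vectors, meta_rows, wordlist)
print(header)
for row in table:
    print(row)
```

The label column stays separate from this table; the combined rows are what would be passed to the learner for training and scoring.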