Automatic Text Signal Finder for Binary Response
I have 2 datasets:
Dataset 1 - this has the response variable and some potential categorical predictors (the response is 1 or 0). Each entity has a unique record (let's call them entities A to Z)
Dataset 2 - this has thousands of records with lots of text for each entity. So each entity could have thousands of rows, each with paragraphs of information
I want to predict the response in Dataset 1 based on the text information in Dataset 2. So here is what I think should happen next:
1) Concatenating the thousands of rows for each entity in Dataset 2 such that the resulting table is one row per entity (with a ton of text information per record).
2) Join Dataset 1 with Dataset 2 based on entity ID
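For illustration, the two steps above could look roughly like this in plain Python. This is only a sketch: the field names `entity_id`, `text` and `response`, and the toy records, are assumptions made up for the example and not taken from the actual datasets.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the two datasets.
# Dataset 1: one record per entity, with the binary response.
dataset1 = [
    {"entity_id": "A", "response": 1},
    {"entity_id": "B", "response": 0},
]
# Dataset 2: many text rows per entity.
dataset2 = [
    {"entity_id": "A", "text": "late payment reported"},
    {"entity_id": "B", "text": "account in good standing"},
    {"entity_id": "A", "text": "second late payment"},
]

def concat_text_by_entity(rows):
    """Step 1: collapse many text rows into one string per entity."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["entity_id"]].append(row["text"])
    return {entity: " ".join(texts) for entity, texts in parts.items()}

def join_on_entity(records, text_by_entity):
    """Step 2: left join that keeps every Dataset 1 record."""
    return [
        {**rec, "text": text_by_entity.get(rec["entity_id"], "")}
        for rec in records
    ]

text_by_entity = concat_text_by_entity(dataset2)
joined = join_on_entity(dataset1, text_by_entity)
```

Entities without any text rows end up with an empty string here rather than being dropped, which keeps the response table intact; whether that is the right choice depends on the data.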
Assuming above is correct so far (please correct if better way as I haven't done this yet), I am wondering if there's a ML algorithm that could find me all the words/phrases/fuzzy combos that are predictive of the response variable in dataset 1. Please advise!
Thanks!
Answers
A typical approach would be to use the Process Documents from Data operator: split your sentences into tokens, strip all stop words, and create a TF-IDF vector set. Be sure to prune enough; if you have plenty of data you can set the pruning boundaries fairly wide, but experiment a bit with them.
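To make the tokenize / stop words / TF-IDF / prune sequence a bit more concrete, here is a minimal sketch in plain Python. It is not what the operator does internally, just the same idea written out by hand. The stop word list, the `min_df` and `max_df_ratio` pruning parameters, and the sample documents are all assumptions for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to", "was"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs, min_df=2, max_df_ratio=0.9):
    """Return (vocabulary, one {term: weight} dict per document).

    Pruning: a term is kept only if it occurs in at least min_df documents
    and in at most max_df_ratio of all documents.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(
        t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio
    )
    vocab_set = set(vocab)
    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab_set)
        total = sum(counts.values())
        # Relative term frequency times log inverse document frequency.
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vocab, vectors

docs = [
    "The payment was late and the fee was charged",
    "Late payment of the fee",
    "Account is in good standing",
    "The account was closed in good standing",
]
vocab, vectors = tfidf_vectors(docs)
```

With these four documents, words that appear only once across the set ("charged", "closed") are pruned away, which is the effect the pruning boundaries are meant to have.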
This should give you the most meaningful words for your record set, and this reduced content set is what you can then use to set up a predictive model, where your entities will become your label. Which model works best depends on several variables, but SVM or Naive Bayes are typically good starting points for this type of challenge.
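As a rough idea of what the Naive Bayes option looks like, here is a small hand-written multinomial Naive Bayes in plain Python. It is a sketch under assumptions: it works on raw token counts rather than TF-IDF weights, it uses the binary response from the question (1 or 0) as the label, and the training tokens are invented for the example. The function names are hypothetical, not from any library.

```python
import math
from collections import Counter

def train_naive_bayes(token_lists, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing (alpha)."""
    classes = sorted(set(labels))
    vocab = sorted({t for tokens in token_lists for t in tokens})
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    log_prior = {c: math.log(doc_counts[c] / len(labels)) for c in classes}
    log_likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((word_counts[c][t] + alpha) / total) for t in vocab
        }
    return classes, log_prior, log_likelihood

def predict(model, tokens):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in classes
    }
    return max(scores, key=scores.get)

def top_signal_words(model, positive=1, negative=0, k=3):
    """Rank words by how much more likely they are under the positive class."""
    _, _, ll = model
    ratio = {t: ll[positive][t] - ll[negative][t] for t in ll[positive]}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]

# Invented training data: one token list per entity, plus its response.
train_tokens = [
    ["late", "payment", "fee"],
    ["late", "payment", "late"],
    ["account", "good", "standing"],
    ["good", "standing", "payment"],
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_tokens, train_labels)
signals = top_signal_words(model, k=2)
```

The log-likelihood ratio in `top_signal_words` is one simple way to get at the "which words are predictive" part of the question, since it ranks words by how strongly they lean towards response 1 over response 0.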
It's all a bit dry and technical, but there are quite a few examples floating around, so hopefully this gets you started.
In terms of the split you speak of, does this allow the flexibility of phrases? Also, sometimes some words/phrases are not entered in the same order or spelled consistently; is there a way to find predictors that are approximately the same text/phrase?
Thanks for the response
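One possible way to think about the phrase and spelling question, sketched in plain Python: word n-grams capture phrases, sorting the words inside each n-gram makes the phrase order-insensitive, and the standard library's `difflib.get_close_matches` can map misspelled tokens onto a list of known terms. The helper names, the `known_terms` list and the `cutoff` value are assumptions for the example, not something the thread prescribes.

```python
import difflib

def word_ngrams(tokens, n=2, ordered=True):
    """Adjacent n-word phrases; ordered=False treats 'fee late' like 'late fee'."""
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if not ordered:
            window = sorted(window)
        grams.append(" ".join(window))
    return grams

def canonicalize(tokens, known_terms, cutoff=0.8):
    """Map each token to its closest known term if it is similar enough."""
    out = []
    for tok in tokens:
        match = difflib.get_close_matches(tok, known_terms, n=1, cutoff=cutoff)
        out.append(match[0] if match else tok)
    return out

known_terms = ["payment", "late", "fee"]  # hypothetical reference vocabulary
cleaned = canonicalize(["payement", "late", "fee"], known_terms)
phrases = word_ngrams(cleaned)
```

The resulting phrases can then be fed into the same TF-IDF and modelling steps as single words. The cutoff needs some trial and error: too low and unrelated words get merged, too high and real misspellings slip through.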
As for the label, it indeed doesn't make a difference whether you have two possible options or more. It just changes which models you can use as you go from binary to multiclass, but Naive Bayes also handles multiple classes pretty well.