Need help automating large demographic datasets into more manageable variables for SPSS
James_julian
Member Posts: 5 Learner I
in Help
Hi,
Is there a way to shorten the titles of long demographic variables in an Excel spreadsheet to make them more concise? I need to analyze them in SPSS. Could I automate this process, or at least reduce the manual labour it requires? Unfortunately, I'm not able to share the data here.
Thank you
Answers
Do you have a desired way of shortening the names? The simplest option would be to rename with generic names (att1, att2, and so on), but more complex rules are also possible, such as taking the first n characters. If you’re interested, I can also comment on whether the subsequent analysis you’d like to do is possible in RapidMiner.
Best,
Roland
Well, the names should capture the basic meaning of the variable but be kept as short as possible. For example, the variable "Number of persons in semi-detached households" can be shortened to "persons_semi_HH". I'm very new to using RapidMiner. Could you please elaborate on what you mean by the simple version and the steps it involves? I'd appreciate it if you could also explain a bit about the complex rules.
Thanks
An example of a simple implementation is attached at the bottom here. It renames each attribute to the first 12 characters of its name. It also handles the cases where an attribute name is shorter than 10 characters, or where the shortened names end up duplicated (which isn't allowed in RapidMiner).
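The attached process is not reproduced in this thread. For anyone who wants the same idea outside RapidMiner, here is a minimal Python sketch of truncate-and-deduplicate renaming. It is not the attached process: the function name `truncate_names`, the default length of 12, and the numeric suffix scheme are all assumptions made for illustration.

```python
import re


def truncate_names(names, max_len=12):
    """Shorten each name to max_len characters and keep the results unique.

    Hypothetical helper for illustration; not the RapidMiner process
    mentioned in the thread.
    """
    used = set()  # lowercased names already handed out
    result = []
    for name in names:
        # Collapse spaces, hyphens and other punctuation into underscores.
        base = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
        # Slicing is safe even when the name is shorter than max_len.
        short = base[:max_len] or "var"
        candidate = short
        suffix = 2
        # On a clash, append _2, _3, ... (the suffix may exceed max_len).
        while candidate.lower() in used:
            candidate = f"{short}_{suffix}"
            suffix += 1
        used.add(candidate.lower())
        result.append(candidate)
    return result


if __name__ == "__main__":
    titles = [
        "Number of persons in semi-detached households",
        "Number of persons in detached households",
        "Age",
    ]
    print(truncate_names(titles))
```

The first two titles share their first 12 characters, which is why the duplicate handling matters for demographic tables with many similarly worded columns. The sketch does not check which character a name starts with, which SPSS also restricts.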
For a more complex approach that takes the meaning of the names into account (which it sounds like you may need, based on your example), I would think about leveraging the new Generative Models extension. You could extract the variable names as a column, transform the names into something shorter but still meaningful, and then rename based on this transformed list. However, this would be quite a complex solution, and it may be quicker to rename them manually. Could you give an idea of how many variables you're looking to transform?
Best,
Roland
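A lighter-weight middle ground between plain truncation and a generative model is a hand-maintained abbreviation list. The sketch below is not the Generative Models approach described above; it is a plain rule-based substitute, and the word lists `ABBREVIATIONS` and `STOP_WORDS` are hypothetical entries chosen only so that the example from earlier in the thread comes out as "persons_semi_HH".

```python
# Hypothetical abbreviation table; extend it with your own domain terms.
ABBREVIATIONS = {
    "households": "HH",
    "household": "HH",
    "semi-detached": "semi",
}

# Hypothetical filler words that carry little meaning in a variable title.
STOP_WORDS = {"number", "of", "in", "the", "a", "an"}


def abbreviate(title):
    """Drop filler words, abbreviate known terms, join the rest with underscores."""
    parts = []
    for word in title.split():
        key = word.lower().strip(".,;:()")
        if not key or key in STOP_WORDS:
            continue
        # Unknown words are kept as they are, with hyphens turned into underscores.
        parts.append(ABBREVIATIONS.get(key, key.replace("-", "_")))
    return "_".join(parts)


if __name__ == "__main__":
    print(abbreviate("Number of persons in semi-detached households"))
```

With a few hundred variables, the table usually needs only a few dozen entries because demographic titles reuse the same terms. Different titles can still collapse to the same short name, so a uniqueness check is worth running on the output.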
Thanks for clarifying the approaches you mentioned. Hmm, that's what I feared with the more complex way, as I'm guessing it would include more complex programming as well.
The number of variables depends on each group I'm looking at. Some groups have around 50-100 variables, where others have several hundred.
Thanks
Would you mind elaborating on the complex solution please? I may be able to get some help with the programming it requires.
Thanks!
I’ll look to build a small example. It might take a couple of days if that’s okay?
Best,
Roland
Yes, that is fine thanks.
If you need me to PM you with some abbreviations to consider I can do that if it will help.
Would the following process work for you? You'll need to install the Generative Models extension - see the steps here. Then replace the Titanic dataset with your dataset. Of course, further processing can be done to modify the output once it has been abbreviated. I used a Summarization model, which seemed to give reasonable results, but I was also considering Text2Text Generation - have a play around here, as I may have missed a more suitable model on Hugging Face.
Let me know how this works for you. Always happy to explore if you'd like to perform subsequent analysis in RapidMiner.
Best,
Roland
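One piece of further processing that may help once the names are abbreviated is keeping each original long title as an SPSS variable label, so the meaning is not lost. The Python sketch below writes SPSS `VARIABLE LABELS` commands from a mapping of short name to original title. The function name `variable_labels_syntax` and the example mapping are assumptions for illustration, and the sketch assumes the short names are already valid SPSS variable names.

```python
def variable_labels_syntax(mapping):
    """Build one SPSS VARIABLE LABELS command per {short_name: original_title} pair.

    Hypothetical helper; it only builds text and does not talk to SPSS.
    """
    lines = []
    for short, title in mapping.items():
        # An apostrophe inside a single-quoted SPSS string is written twice.
        escaped = title.replace("'", "''")
        lines.append(f"VARIABLE LABELS {short} '{escaped}'.")
    return "\n".join(lines)


if __name__ == "__main__":
    mapping = {
        "persons_semi_HH": "Number of persons in semi-detached households",
    }
    print(variable_labels_syntax(mapping))
```

The printed text can be pasted into an SPSS syntax window after the data has been imported with the short names.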