Aggregating Categorical Values - Music Genre
I'm working with a dataset that has multiple genres per entry. For example, one row might have g1 = rap, g2 = demotrack, g3 = polish trap. None of these genres can be said to be the "primary" genre, so all need to be retained. I am trying to train a model on this set to predict the genre value, but am having a hard time finding a way to make a single genre column with multiple values per row. Is there a way to do this? Any suggestions are appreciated and I am happy to clarify.
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,635; Unicorn)
As Rodrigo said, you need to transform your data so you have only a single genre column but the same song can appear multiple times. This will allow you to build a single model to predict genre.
To accomplish this in RapidMiner, you need to De-Pivot. See the attached example process which works with your sample data (change the path to your data file in the Read CSV operator first).
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="120"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="9.2.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
        <parameter key="csv_file" value="C:\Users\brian\Downloads\sample.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="skip_comments" value="true"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="starting_row" value="1"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="infinity_representation" value=""/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="windows-1252"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="artist.true.polynominal.attribute"/>
          <parameter key="1" value="genre_1.true.polynominal.attribute"/>
          <parameter key="2" value="genre_2.true.polynominal.attribute"/>
          <parameter key="3" value="genre_3.true.polynominal.attribute"/>
          <parameter key="4" value="genre_4.true.polynominal.attribute"/>
          <parameter key="5" value="genre_5.true.polynominal.attribute"/>
          <parameter key="6" value="genre_6.true.polynominal.attribute"/>
          <parameter key="7" value="genre_7.true.polynominal.attribute"/>
          <parameter key="8" value="genre_8.true.polynominal.attribute"/>
          <parameter key="9" value="genre_9.true.polynominal.attribute"/>
          <parameter key="10" value="genre_10.true.polynominal.attribute"/>
          <parameter key="11" value="genre_11.true.polynominal.attribute"/>
          <parameter key="12" value="genre_12.true.polynominal.attribute"/>
          <parameter key="13" value="genre_13.true.polynominal.attribute"/>
          <parameter key="14" value="genre_14.true.polynominal.attribute"/>
          <parameter key="15" value="genre_15.true.polynominal.attribute"/>
          <parameter key="16" value="genre_16.true.polynominal.attribute"/>
          <parameter key="17" value="genre_17.true.polynominal.attribute"/>
          <parameter key="18" value="genre_18.true.polynominal.attribute"/>
          <parameter key="19" value="genre_19.true.polynominal.attribute"/>
          <parameter key="20" value="genre_20.true.polynominal.attribute"/>
          <parameter key="21" value="genre_21.true.polynominal.attribute"/>
          <parameter key="22" value="genre_22.true.polynominal.attribute"/>
          <parameter key="23" value="genre_23.true.polynominal.attribute"/>
          <parameter key="24" value="genre_24.true.polynominal.attribute"/>
          <parameter key="25" value="genre_25.true.polynominal.attribute"/>
          <parameter key="26" value="title.true.polynominal.attribute"/>
          <parameter key="27" value="energy.true.real.attribute"/>
          <parameter key="28" value="liveness.true.real.attribute"/>
          <parameter key="29" value="speechiness.true.real.attribute"/>
          <parameter key="30" value="valence.true.real.attribute"/>
          <parameter key="31" value="acousticness.true.real.attribute"/>
          <parameter key="32" value="instrumentalness.true.real.attribute"/>
          <parameter key="33" value="danceability.true.real.attribute"/>
          <parameter key="34" value="time_signature.true.real.attribute"/>
          <parameter key="35" value="key.true.real.attribute"/>
          <parameter key="36" value="duration_ms.true.real.attribute"/>
          <parameter key="37" value="loudness.true.real.attribute"/>
          <parameter key="38" value="tempo.true.real.attribute"/>
          <parameter key="39" value="mode.true.real.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="de_pivot" compatibility="9.2.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34">
        <list key="attribute_name">
          <parameter key="genre" value="genre.+"/>
        </list>
        <parameter key="index_attribute" value="index"/>
        <parameter key="create_nominal_index" value="false"/>
        <parameter key="keep_missings" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="genre"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="De-Pivot" to_port="example set input"/>
      <connect from_op="De-Pivot" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
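For readers who want to see what the de-pivot step does to the data outside RapidMiner, here is a minimal Python sketch. It is not the De-Pivot operator itself: the function name `de_pivot`, the sample row, and its values are made up for illustration, and unlike the operator it does not add an index attribute. It only shows the wide-to-long reshaping, with empty genre slots dropped in the spirit of keep_missings = false.

```python
import re

def de_pivot(rows, pattern=r"genre.+", target="genre"):
    """Turn one wide row per song into one long row per (song, genre) pair."""
    regex = re.compile(pattern)
    long_rows = []
    for row in rows:
        # Columns that are not genre slots are copied to every output row.
        fixed = {k: v for k, v in row.items() if not regex.fullmatch(k)}
        for key, value in row.items():
            # Skip empty genre slots so they do not become rows of their own.
            if regex.fullmatch(key) and value not in (None, ""):
                long_rows.append({**fixed, target: value})
    return long_rows

# Hypothetical sample row; the title and the energy value are placeholders.
songs = [
    {"artist": "Empire of the Sun", "title": "example title",
     "genre_1": "electropop", "genre_2": "indietronic", "genre_3": "new rave",
     "genre_4": "", "energy": 0.7},
]

long_rows = de_pivot(songs)
for r in long_rows:
    print(r["artist"], "|", r["genre"], "|", r["energy"])
```

Each song now appears once per genre with identical attribute values, which is the shape a single classifier with `genre` as the label expects.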
Answers
Can you share a sample of your dataset and, using that sample, give an example of what you want to obtain?
Thank you.
Regards,
Lionel
Be happy to. My goal is to find song attributes that can be used to predict song/artist genre. A single song/artist pair can be described by more than one genre at a time, with none being more "correct" than another. In the attached data sample for instance, Empire of the Sun can be categorized as electropop, indietronic, and new rave simultaneously. This is why they should be listed together in a single field and not as "genre_1", "genre_2", because there is no inherent order here.
I want to train a model to predict the genre of a song using all of an artist's genres as training target variables. However, if I were to combine them all into one "genre" column, the model would treat each combination, however similar, as a different target. For example, the model would treat the artist genre arrays [rock, grunge, nu-metal] and [nu-metal, grunge, indie-rock] as totally distinct responses, despite their sharing two of three genres.
I'm looking for a way that I can train a model using all of a song's genres, but to receive only a single genre as prediction output. So, is there a way to have distinct multiple genres in a single column that won't be treated as a single value?
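The concern about combined labels can be made concrete with a short Python sketch using the two genre arrays from the example above. It is only an illustration of the problem, not a proposed solution: joined into one string the two rows become unrelated class labels, while compared as sets they overlap substantially.

```python
a = ["rock", "grunge", "nu-metal"]
b = ["nu-metal", "grunge", "indie-rock"]

# Joined into a single column value, each combination is its own class label.
label_a = ", ".join(a)
label_b = ", ".join(b)
labels_equal = label_a == label_b

# Compared as sets, the two rows share two genres out of four distinct ones.
shared = sorted(set(a) & set(b))
jaccard = len(shared) / len(set(a) | set(b))

print(labels_equal)
print(shared, jaccard)
```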
- Create a list of genres (select attributes and filter duplicates might do the job).
- Use loops to train one or a few algorithms per genre (e.g., one for rock, one for pop, one for jazz...). You could use "Validate" and "Optimize" to get the best results for each. Naïve Bayes is probably a reasonable choice.
- Loop over the examples in your testing data and apply all the models to each of them. "Loop Examples" will let you collect, as a list, the genres each song is classified into.
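The three steps above can be sketched in plain Python as a one-model-per-genre setup. This is a rough illustration under stated assumptions, not the RapidMiner process: the toy data, the feature choice (energy, acousticness), and the helper names `fit_gaussian_nb`, `predicts`, and `genres_for` are all hypothetical, and the tiny Gaussian Naive Bayes stands in for whatever learner you actually pick. It also assumes every genre has at least one positive and one negative training row.

```python
import math

def fit_gaussian_nb(X, y):
    """Fit a tiny Gaussian Naive Bayes model for boolean labels y."""
    model = {}
    for cls in (True, False):
        rows = [x for x, label in zip(X, y) if label == cls]
        n = len(rows)  # assumed > 0 for both classes
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor on the variance avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-6)
                     for col, m in zip(zip(*rows), means)]
        model[cls] = (n / len(X), means, variances)
    return model

def log_score(model, x, cls):
    """Log prior plus the sum of per-feature Gaussian log densities."""
    prior, means, variances = model[cls]
    score = math.log(prior)
    for v, m, var in zip(x, means, variances):
        score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return score

def predicts(model, x):
    return log_score(model, x, True) > log_score(model, x, False)

# Hypothetical toy data: (energy, acousticness) and each song's set of genres.
train = [
    ((0.90, 0.10), {"rock", "grunge"}),
    ((0.80, 0.20), {"rock"}),
    ((0.85, 0.15), {"grunge", "rock"}),
    ((0.20, 0.90), {"jazz"}),
    ((0.30, 0.80), {"jazz"}),
    ((0.25, 0.85), {"jazz"}),
]

# Step 1: the list of distinct genres.
genres = sorted(set().union(*(gs for _, gs in train)))

# Step 2: one binary model per genre ("is this song in genre g or not?").
X = [features for features, _ in train]
models = {g: fit_gaussian_nb(X, [g in gs for _, gs in train]) for g in genres}

# Step 3: apply every model to an example and collect the genres that fire.
def genres_for(x):
    return [g for g in genres if predicts(models[g], x)]

print(genres_for((0.88, 0.12)))
print(genres_for((0.22, 0.88)))
```

If a single genre is needed as the final output, one option is to keep the genre whose model gives the largest margin between the two scores, though that is a design choice the thread does not settle.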