
"Filter Stopwords (Dictionary) -- Unicode support?"

pleonard Member Posts: 4 Contributor I
edited May 2019 in Help
Hi there, I'm having good luck using the Filter Stopwords (Dictionary) operator to build a stoplist for Danish, but I'm finding that a-ring (å) is not matched. I've confirmed that the file is in UTF-8, as are my source texts, and that the line endings are correct. Other stopwords, which contain no non-ASCII characters, are filtered correctly. Has anyone come across this before?

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,

    no, I haven't, but the number of Danish texts I have processed is close to zero :) Did you make sure that RapidMiner opens the text file with UTF-8 encoding?

    Anyway: if you have a good stopword file for Danish, would you like to contribute it? We could include it in the core...

    Greetings,
      Sebastian
  • pleonard Member Posts: 4 Contributor I
    OK, I think I've confirmed this is a bug. Let's switch to German, since it is a more common language:

    Set these two things:

    1) rapidminer.general.encoding to UTF-8
    2) Process Documents from Files to UTF-8

    Ensure both your text and stoplist are in UTF-8.

    Text: schloß means castle.
    Stoplist: schloß castle

    Result: schloß means

    This is with RapidMiner 5.1.001 on Mac OS X 10.6. Surely there must be people in Germany working with this who have noticed the problem before, or who know a trick to get around it?

    Thanks!
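
One plausible cause of the result above is that the stoplist file, although UTF-8 on disk, is decoded with a different charset when it is loaded. This is an assumption, not something confirmed in the post. The Python sketch below imitates that situation outside RapidMiner; MacRoman is picked purely for illustration, and `filter_stopwords` is a hypothetical helper, not RapidMiner's implementation.

```python
# Hypothetical reproduction outside RapidMiner: the stoplist is UTF-8 on disk
# but is decoded with another charset (MacRoman is an assumption here).
text = "schloß means castle"
stoplist_bytes = "schloß castle".encode("utf-8")


def filter_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in the stopword set."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)


# Decoding the UTF-8 bytes with the wrong charset garbles only the non-ASCII word.
wrong = set(stoplist_bytes.decode("mac_roman").split())
right = set(stoplist_bytes.decode("utf-8").split())

result_wrong = filter_stopwords(text, wrong)  # ASCII-only "castle" still matches
result_right = filter_stopwords(text, right)  # both stopwords match

print(result_wrong)
print(result_right)
```

With the wrong charset, the ASCII-only stopword is still removed while the one containing ß survives, which is the same pattern as the reported result.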
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I have added a parameter for choosing the encoding of the dictionary. It will be available in the next Text Extension release, but it is uncertain when that will be.

    Greetings,
      Sebastian
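
To illustrate the idea of an encoding parameter for the dictionary, here is a minimal Python sketch of a stopword loader that takes the charset explicitly instead of relying on a platform default. The names `load_stopwords` and `filter_stopwords`, the file name, and the Danish sample words are all hypothetical; this is not the Text Extension's code.

```python
import os
import tempfile


def load_stopwords(path, encoding="utf-8"):
    """Read whitespace-separated stopwords, decoding with an explicit charset."""
    with open(path, encoding=encoding) as handle:
        return set(handle.read().split())


def filter_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


# Write a small UTF-8 stoplist to a temporary file, then load it back
# with the encoding stated explicitly.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stoplist.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("på\nog\ni\n")
    stopwords = load_stopwords(path, encoding="utf-8")

kept = filter_stopwords("på landet og i byen".split(), stopwords)
print(kept)
```

Passing the encoding as an argument keeps the loader independent of the operating system's default charset, so the same stoplist file behaves the same way on every machine.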
  • pleonard Member Posts: 4 Contributor I
    Thanks! If you need a beta tester (I work with large Swedish, Danish, and Norwegian texts), please let me know and I'd be glad to help out...
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    we are currently working on a completely new Text Extension that will go beyond everything the old one could do. We will document our progress in our Special Interest Group for Text Mining. You are very welcome to participate; I just need your email in a PM to put you on the list.

    Greetings,
      Sebastian