preprocessing: remove email signature
Best Answer
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hi @Joos,
The problem is that a message has two parts: a header with a number of fields (From, To and so on) and a body containing the text. You want to parse the body.
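As a minimal sketch of that split, Python's standard `email` module can separate the header fields from the body. The helper name `split_message` and the sample message are assumptions for illustration. Real mail is often multipart or transfer-encoded, which this sketch only handles by taking the first text/plain part as-is.

```python
from email import message_from_string


def split_message(raw):
    """Return (headers, body) for a raw message given as a string."""
    msg = message_from_string(raw)
    headers = dict(msg.items())
    if msg.is_multipart():
        # Take the first text/plain part; other parts are ignored here,
        # and transfer encodings (base64, quoted-printable) are not decoded.
        body = ""
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                body = part.get_payload()
                break
    else:
        body = msg.get_payload()
    return headers, body


# Hypothetical sample message, modelled on the examples in this thread.
raw = (
    "From: Rodrigo <rodrigo@example.com>\n"
    "To: Joos <joos@example.org>\n"
    "\n"
    "Hello Joos,\n"
    "This is an example message.\n"
)
headers, body = split_message(raw)
print(headers["From"])
print(body)
```

Everything after this point (signature removal) would then operate on `body` only, so header lines never get mistaken for message text.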
The first approach (cutting at the -- line) can be demonstrated with the following two e-mails:

From: Rodrigo <rodrigo@example.com>
To: Joos <joos@example.org>

Hello Joos,
This is an example message.
--
Rodrigo Fuentealba
Chile

From: Joos <joos@example.org>
To: Rodrigo <rodrigo@example.com>

Hi Rodrigo,
I see. Even though the footers are different, there is something many users do, which is putting a -- before their signature. Not everyone follows this, but a large share do.
--
Joos
Netherlands
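A rough sketch of cutting at the -- line could look like the following. The function name `strip_signature` is made up for illustration, and the sketch assumes the delimiter sits alone on its own line (optionally followed by whitespace), so that a -- in the middle of a sentence is left untouched.

```python
import re

# "--" alone on a line, optionally followed by whitespace.
SIG_DELIMITER = re.compile(r"^--\s*$")


def strip_signature(body):
    """Drop the last delimiter line and everything after it, if present."""
    lines = body.splitlines()
    # Scan from the end so that only the last delimiter counts.
    for i in range(len(lines) - 1, -1, -1):
        if SIG_DELIMITER.match(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    return body.rstrip()  # no delimiter found: leave the text alone


body = "Hello Joos,\nThis is an example message.\n-- \nRodrigo Fuentealba\nChile"
print(strip_signature(body))
```

For the sample body this prints the two lines of the message without the signature. A message without any delimiter line comes back unchanged apart from trailing whitespace.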
Now, cutting at the last -- wouldn't work on all e-mails, because using -- to separate the signature from the rest is a convention, not a rule. Let's think of another solution. Say you have 1000 e-mails from me. If 300 of them end with the classic "Sent from my iPhone" as the last line, you can identify that line as boilerplate and delete it. But what about the e-mails that I sent with my own signature? You may find that 600 e-mails from rodrigo@example.com always end with the "Rodrigo Fuentealba / Chile" signature, so it can be removed as well.
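One way to sketch this counting idea in Python is shown below. The names `last_line` and `frequent_last_lines`, the `(sender, body)` input shape, and the thresholds `min_share` and `min_count` are all assumptions chosen for illustration, not tuned values.

```python
from collections import Counter, defaultdict


def last_line(body):
    """Return the last non-empty line of a message body, stripped."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def frequent_last_lines(messages, min_share=0.25, min_count=2):
    """Map each sender to the last lines that close many of their e-mails.

    messages is an iterable of (sender, body) pairs. A last line counts as
    boilerplate if it ends at least min_count e-mails and at least
    min_share of that sender's e-mails.
    """
    per_sender = defaultdict(Counter)
    for sender, body in messages:
        per_sender[sender][last_line(body)] += 1

    result = {}
    for sender, counts in per_sender.items():
        total = sum(counts.values())
        result[sender] = {
            line
            for line, n in counts.items()
            if line and n >= min_count and n / total >= min_share
        }
    return result


# Hypothetical data in the spirit of the example above.
messages = [
    ("rodrigo@example.com", "See you tomorrow.\nSent from my iPhone"),
    ("rodrigo@example.com", "Thanks!\nSent from my iPhone"),
    ("rodrigo@example.com", "Here is the report.\nChile"),
    ("rodrigo@example.com", "All good."),
]
print(frequent_last_lines(messages, min_share=0.5))
```

With this sample only "Sent from my iPhone" qualifies, since it closes two of the four e-mails. Lines found this way still deserve a manual look before deletion, because a frequent closing line can also be real content.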
Answering your other questions:
- Yes, you can use Python code inside RapidMiner with the Python Scripting extension. However, the mail parser extension probably won't help you: this is a natural language processing (or pattern recognition) issue.
- Yes, you can do the pattern recognition in Dutch. I don't speak it, but I have done similar things in German.
Rodrigo.
Answers
I can only recommend two ways. The first is to remove everything from the last -- line to the end of the message. The second, if you have the recipient of the e-mail, is to trim each message and compare the last line across e-mails, removing lines that repeat, until no two e-mails share the same last line.
Neither is battle-tested, and both involve processing that I wouldn't have done in RapidMiner but much earlier, while retrieving the e-mails. So I'm afraid you are better off trying your luck with loading your data in Python to remove the e-mail signatures.
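For the second approach, a rough Python sketch could look like this. It is one possible reading of "check the last line until no last lines are the same", not a tested recipe. The function name `trim_shared_endings` and the `min_count` threshold are assumptions, and the sketch expects to be given the bodies of one group of related e-mails at a time.

```python
from collections import Counter


def trim_shared_endings(bodies, min_count=2):
    """Repeatedly drop trailing lines shared by at least min_count messages.

    Note: blank lines are discarded and lines are stripped, so the
    returned bodies lose their original spacing.
    """
    msgs = [[ln.strip() for ln in b.splitlines() if ln.strip()] for b in bodies]
    while True:
        # Count the current last line of every non-empty message.
        counts = Counter(m[-1] for m in msgs if m)
        shared = {line for line, n in counts.items() if n >= min_count}
        if not shared:
            break  # no last line repeats any more
        for m in msgs:
            if m and m[-1] in shared:
                m.pop()
    return ["\n".join(m) for m in msgs]


# Hypothetical bodies with an identical signature block.
bodies = [
    "Hello Joos,\nThis is an example message.\n--\nRodrigo Fuentealba\nChile",
    "Hi again,\nOne more thing.\n--\nRodrigo Fuentealba\nChile",
]
for cleaned in trim_shared_endings(bodies):
    print(cleaned)
    print("=====")
```

The loop always ends because every pass removes at least one line. The main risk is over-trimming: if several e-mails happen to end with the same real sentence (for example "Thanks"), that sentence is removed too, so a higher `min_count` may be safer on larger collections.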
All the best,
Rodrigo.