[SOLVED] Deleting text noise from large corpus
Legacy User
Hi
I have a PDF file that contains several thousand pages of emails. The problem is that each email contains its own set of noise (unique, in that it never repeats). For example:
x-Mail: hbcFNvIWLDtFlpP.yxyP9bkreUY5ZzdUGPpkOhYIoR
This noise sometimes fills entire pages.
Can anyone point me in the right direction on how to minimize this noise, or somehow work around it?
Thanks.
Answers
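If you use the TF-IDF measure, the noise will have little influence on text classification: each noise token appears in only one document, so it brings no advantage, because a learner cannot generalize from it. Note that with the standard definition, idf = log(N/df), such a token is not weighted 0 (only a term that occurs in every document is), so it still takes up an attribute in the word vector unless it is filtered out.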
Furthermore, the Process Documents operator has parameters to filter out words that appear too seldom (or too often).
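To make the idea concrete outside of RapidMiner, here is a minimal sketch in plain Python of pruning by document frequency before computing TF-IDF. It is not the Process Documents operator's implementation: the function names, the `min_docs` and `max_fraction` thresholds, the whitespace tokenizer, and the three sample emails are all assumptions made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

def document_frequency(docs):
    # For each token, count the number of documents it occurs in.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

def prune_vocabulary(df, n_docs, min_docs=2, max_fraction=1.0):
    # Keep tokens found in at least min_docs documents and in at most
    # max_fraction of all documents (hypothetical thresholds).
    return {t for t, c in df.items()
            if c >= min_docs and c / n_docs <= max_fraction}

def tf_idf(docs, vocabulary, df):
    # Relative term frequency times log(N / df), over kept tokens only.
    n = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in vocabulary)
        total = sum(counts.values())
        vector = {}
        for token, count in counts.items():
            vector[token] = (count / total) * math.log(n / df[token])
        vectors.append(vector)
    return vectors

# Invented sample emails; each carries one noise token that never repeats.
emails = [
    "x-mail: hbcFNvIWLDtFlpP meeting moved to friday",
    "x-mail: yxyP9bkreUY5 please confirm the meeting",
    "x-mail: ZzdUGPpkOhYIoR invoice attached please confirm",
]

df = document_frequency(emails)
vocab = prune_vocabulary(df, len(emails), min_docs=2, max_fraction=0.9)
vectors = tf_idf(emails, vocab, df)
print(sorted(vocab))
```

With these sample emails, the one-off noise tokens fall below `min_docs` and the header token `x-mail:`, which occurs in every email, exceeds `max_fraction`, so neither ends up in the word vectors.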
Best,
Marius