Extracting date from textfiles
Hi everybody,
my name is Timo and I would be glad if you could please help me with my problem:
I have a lot of text files, mostly press releases from different firms, and I would like to extract the date from each press release.
The problem is that there is no standard format for the date, e.g. sometimes it's "14.08.2008" and sometimes "04 November 05" or "14 November 2005".
I know how to tokenize, generate n-grams, and so on, but I don't know how to extract the date information from these files.
My idea was to work with the "generate n-grams" operator, but I don't know which regex I have to insert.
Maybe you could help me.
Thank you very much!
Timo
Answers
It is very hard to work with different timestamp standards. I guess you need to go the complex way: filter out the dates with a separate regex for each format, then loop with Generate Attributes and parse them.
Something like [0-9][0-9]\.[0-9][0-9]\.[0-9]+ for the first format, or something similar. Maybe Keep Document Parts is the easiest operator for this.
Cheers,
Martin
Dortmund, Germany
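For readers who want to try the regex idea outside RapidMiner, here is a minimal Python sketch of the "filter with a regex, then parse" approach for the first format only ("14.08.2008"). The function name `find_dotted_dates` is made up for illustration, and the pattern is a slightly stricter variant of the one above (exactly four digits for the year); that is an assumption, not part of the original suggestion.

```python
import re
from datetime import datetime

# Matches dates like "14.08.2008": two digits, dot, two digits, dot, four digits.
# Assumption: the year is always written with four digits in this format.
DOTTED_DATE = re.compile(r"\b([0-9]{2}\.[0-9]{2}\.[0-9]{4})\b")


def find_dotted_dates(text):
    """Return all valid dd.mm.yyyy dates found in text as date objects."""
    dates = []
    for match in DOTTED_DATE.finditer(text):
        try:
            # The regex only filters candidates; strptime does the real parsing.
            dates.append(datetime.strptime(match.group(1), "%d.%m.%Y").date())
        except ValueError:
            # Looks like a date but is not a valid one (e.g. 99.99.2008); skip it.
            continue
    return dates


if __name__ == "__main__":
    sample = "Munich, 14.08.2008 - The company announced today ..."
    for found in find_dotted_dates(sample):
        print(found.isoformat())
```

The regex only narrows down candidates; the actual validation happens in the parsing step, which is why an impossible date that happens to match the pattern is dropped.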
Here's a very quick example of a couple of RegEx ways to extract the dates & format them.
It uses Cut Document & Select Subprocess to allow you to add more date formats as you write the RegEx expressions. In this example it only selects the first date it finds in the document (as with a press release that's likely to be at the top).
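The same idea (one branch per date format, keep only the first date in the document) can be sketched in Python for anyone who wants to compare. This is not the RapidMiner process itself, just a rough equivalent under some assumptions: all names (`FORMATS`, `first_date`, the helper functions) are hypothetical, month names are assumed to be English, and two-digit years are assumed to mean 20xx.

```python
import re
from datetime import date

# Assumption: month names are written out in English.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}


def _year(text):
    """Expand a two-digit year; assumes it belongs to the 2000s."""
    year = int(text)
    return year + 2000 if year < 100 else year


def _from_numeric(m):
    # Groups: day, month, year, as in "14.08.2008"
    return date(_year(m.group(3)), int(m.group(2)), int(m.group(1)))


def _from_textual(m):
    # Groups: day, month name, year, as in "04 November 05"
    month = MONTHS.get(m.group(2).lower())
    if month is None:
        return None  # the word between the numbers is not a month name
    return date(_year(m.group(3)), month, int(m.group(1)))


# One (pattern, builder) pair per date format; add more pairs as needed.
FORMATS = [
    (re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})\b"), _from_numeric),
    (re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4}|\d{2})\b"), _from_textual),
]


def first_date(text):
    """Return the date that appears earliest in text, or None if none is found."""
    candidates = []
    for pattern, build in FORMATS:
        for m in pattern.finditer(text):
            try:
                parsed = build(m)
            except ValueError:
                parsed = None  # e.g. day 31 in a 30-day month
            if parsed is not None:
                candidates.append((m.start(), parsed))
                break  # only the first valid hit per format matters
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    release = "Press release\nFrankfurt, 04 November 05\nSee also 14.08.2008"
    print(first_date(release))
```

Keeping the formats in a list mirrors the point about adding more date formats later: a new format only needs one more pattern and one small builder function.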
I got a new building block! Thanks!
Dortmund, Germany
Thank you very, very much for your help!
JEdward, your process is awesome, I couldn't have done this by myself.
It works perfectly!
Timo