How to create new examples by splitting at punctuation marks?
Hi all!
I wonder if it is possible to split an example containing text by punctuation marks. I have an exampleset containing some metadata for a text attribute. The text attribute contains many sentences. Here are 2 examples as demonstration:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."
What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04 Source1 Speaker1 Context1 "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06 Source2 Speaker2 Context2 "Others have been cutting their corn early to use for feed, a much less profitable venture."
Is there any way to do this? I tried to use tokenization, but it delivers only vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from "." being deleted in the text attribute.
Any help is much appreciated!
Thanks
Chris
Answers
You can use e.g. Cut Documents for this. You may have to tune the regular expression a bit, but the process below depicts the general idea.
Best,
~Marius
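For readers who want to see the split-and-keep-metadata idea outside RapidMiner, here is a minimal sketch in plain Python. It is not the Cut Documents process referred to above; the field names, the split_rows helper, and the regular expression are all assumptions made for illustration.

```python
import re

# Hypothetical record layout: the field names are assumptions, not RapidMiner attribute names.
rows = [
    {"date": "2012-05-04", "source": "Source1", "speaker": "Speaker1", "context": "Context1",
     "text": "The unsettling prospects come at a time of growing uncertainty for the "
             "country’s economy. With evidence mounting of a slowdown in the economic "
             "recovery, this new blow from the weather is particularly ill-timed."},
    {"date": "2012-05-06", "source": "Source2", "speaker": "Speaker2", "context": "Context2",
     "text": "Already some farmers are watching their cash crops burn to the point of no "
             "return. Others have been cutting their corn early to use for feed, a much "
             "less profitable venture."},
]

# Split on whitespace that follows '.', '!' or '?'; the lookbehind keeps the
# punctuation mark attached to the sentence it ends.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_rows(rows, text_key="text"):
    """Return one record per sentence, copying all other fields unchanged."""
    out = []
    for row in rows:
        for sentence in SENTENCE_END.split(row[text_key].strip()):
            if sentence:
                out.append({**row, text_key: sentence})
    return out

result = split_rows(rows)
for r in result:
    print(r["date"], r["source"], r["speaker"], r["context"], repr(r["text"]))
```

As with the regular expression in Cut Documents, this pattern is naive and will probably need tuning: it also splits after abbreviations such as "e.g." whenever they are followed by a space.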
great, that will do it!
Thanks a lot!
Chris