
How to use a regular expression to extract chemical formulas

LeiLei Member Posts: 12 Learner I
I have just started using the Text Mining extension to extract chemical formulas from PDF files. I use the Process Documents from Files operator and the Tokenize operator with a regular expression.
The PDF files contain many chemical formulas that I want to extract. Most of them look like LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4, K1/3Mn2/3Al2/9, H2(g), .... Can anyone tell me what kind of regular expression would extract them?

Thank you very much.

Best Answer

  • kaymankayman Member Posts: 662 Unicorn
    edited September 2021 Solution Accepted
    One way would be to look for an uppercase letter followed by a lowercase letter inside a word, but not (only) at the beginning of it. That combination rarely shows up in 'normal' words, so it could work.

    So something like \s[^ ]+[A-Z][a-z].+\s

    You'll probably need to tune the boundaries, as right now it just looks for combinations delimited by spaces.
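
As a rough illustration of the idea above, here is a minimal Python sketch of one possible way to tune the boundaries. It is not the pattern from the answer itself: the lookarounds `(?<!\S)` and `(?!\S)` replace the literal `\s` so that tokens at the start or end of the text can match too, and `\S*` replaces `.+`, which is greedy and would otherwise run on to the last whitespace in the line. The names `FORMULA_LIKE`, `find_formula_candidates` and `sample` are made up for this example. RapidMiner uses Java regular expressions, which also support these lookarounds, but whether the pattern behaves the same inside the Tokenize operator is an assumption to verify.

```python
import re

# Hypothetical tuned variant of the idea in the answer: an uppercase+lowercase
# pair that is NOT at the very start of a whitespace-delimited token.
# (?<!\S) / (?!\S) anchor the match to whole tokens; \S* keeps the match
# inside one token instead of the greedy ".+".
FORMULA_LIKE = re.compile(r"(?<!\S)\S+[A-Z][a-z]\S*(?!\S)")

def find_formula_candidates(text):
    """Return tokens containing an uppercase-lowercase pair after position 0."""
    # Trailing sentence punctuation clings to tokens, so strip it afterwards.
    return [m.group(0).rstrip(".,;:") for m in FORMULA_LIKE.finditer(text)]

sample = ("Samples of LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4 and K1/3Mn2/3Al2/9 "
          "released H2(g) on heating.")
print(find_formula_candidates(sample))
```

On this sample the sketch picks up LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4 and K1/3Mn2/3Al2/9, and skips ordinary capitalized words such as "Samples". It does not catch H2(g), because that token has no uppercase-lowercase pair at all, so formulas of that shape would need an additional pattern. Mixed-case words such as brand names could also show up as false positives.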