Discussion:
[BlueObelisk-discuss] Is there a set of chemical characters/codepoints for machine-processable chemistry?
Peter Murray-Rust
2012-09-13 10:21:05 UTC
Has anyone compiled a set of UTF-8 codepoints (characters) that are
essential for chemistry specifically in Anglophone countries and aimed at
machine processing?

[Note I am not including the glyphs in this mail in case of corruption]

For example, chemistry uses over half of the printing characters in the
range 32-127, probably most of the Greek characters (they can be used for
locants), some of the ISO-latin characters, plus-minus (for racemates), and
middot (e.g. Et2O.BF3) [http://en.wikipedia.org/wiki/Interpunct].
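As a rough illustration of what such a set might look like in practice, here
is a minimal Python sketch. The set contents and the names CHEM_CHARS and
unexpected_chars are assumptions made for this example, not an agreed list;
it only shows how a candidate set could be used to flag characters that fall
outside it.

```python
# Assumed starter set for illustration only, not an agreed standard.
# Codepoint 127 (DEL) is a control character, so the printing range stops at 126.
ASCII_PRINTING = {chr(c) for c in range(32, 127)}
GREEK_LOWER = {chr(c) for c in range(0x03B1, 0x03CA)}  # alpha .. omega
EXTRAS = {"\u00b1", "\u00b7"}  # PLUS-MINUS SIGN, MIDDLE DOT

CHEM_CHARS = ASCII_PRINTING | GREEK_LOWER | EXTRAS


def unexpected_chars(text):
    """Return the sorted distinct characters in text that are not in CHEM_CHARS."""
    return sorted({ch for ch in text if ch not in CHEM_CHARS})
```

Running unexpected_chars over a corpus of names and formulae would give a
list of candidates to either add to the set or map to a normalized value.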

I would exclude personal names (e.g. Hueckel) and units (e.g. Angstrom) as
they are used elsewhere.

Where possible it would be valuable to have a normalized value. Thus IMO
machine-processable chemical names should use '-' (char #45, hyphen-minus
http://en.wikipedia.org/wiki/Hyphen ) rather than true minus, or dashes.
Similarly minus should also use this character.
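
A minimal Python sketch of that normalization, assuming the dash-like
codepoints listed below are the ones worth folding (the table DASH_LIKE and
the function name normalize_dashes are hypothetical and the list is not
exhaustive):

```python
# Hypothetical table: dash-like codepoints folded to hyphen-minus (char #45).
DASH_LIKE = {
    0x2010: "-",  # HYPHEN
    0x2011: "-",  # NON-BREAKING HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def normalize_dashes(text):
    """Replace each dash-like character with ASCII hyphen-minus."""
    return text.translate(DASH_LIKE)
```

For example, normalize_dashes("2\u2013methylpropan\u20102\u2010ol") gives
"2-methylpropan-2-ol". Text that already uses char #45 passes through
unchanged.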
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Craig James
2012-09-13 15:26:18 UTC
Post by Peter Murray-Rust
Has anyone compiled a set of UTF-8 codepoints (characters) that are
essential for chemistry specifically in Anglophone countries and aimed at
machine processing?
We've found via the school of hard knocks that only ASCII characters work.
There are too many combinations of browsers, word processors, operating
systems and language settings out there. HTML character codes are OK, but
they're not valid anywhere but in HTML. UTF should work but doesn't. So
we just stick to ASCII. It's not that we don't want a richer character
set, just that they don't work in real life.

As the old joke says, "The great thing about standards is that there are so
many to choose from."

Your need may be different than ours, so my cynicism might not apply.

Craig
Peter Murray-Rust
2012-09-13 16:38:41 UTC
On Thu, Sep 13, 2012 at 4:26 PM, Craig James <***@emolecules.com> wrote:

Thanks - very helpful
Post by Craig James
Post by Peter Murray-Rust
Has anyone compiled a set of UTF-8 codepoints (characters) that are
essential for chemistry specifically in Anglophone countries and aimed at
machine processing?
We've found via the school of hard knocks that only ASCII characters work.
Agreed.
Post by Craig James
There are too many combinations of browsers, word processors, operating
systems and language settings out there. HTML character codes are OK, but
they're not valid anywhere but in HTML. UTF should work but doesn't. So
we just stick to ASCII. It's not that we don't want a richer character
set, just that they don't work in real life.
My current interest is transforming PDFs into chemistry and, not
surprisingly, they are full of garbled encodings. I normalize down to
ASCII wherever possible and use some sort of mnemonic when there isn't an
ASCII equivalent (e.g. "[beta]" rather than the actual codepoint #946). I
think Daniel does something similar in OPSIN. And when people use symbols
from Word sets it's even worse.
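
A minimal Python sketch of this kind of mnemonic fallback. It is an
illustration only, not the code used in my pipeline or in OPSIN; the function
name to_ascii_mnemonic and the "[U+XXXX]" fallback for unrecognized
characters are assumptions. It relies on the standard unicodedata module to
look up character names.

```python
import unicodedata


def to_ascii_mnemonic(text):
    """Keep ASCII as-is; turn plain Greek letters into "[name]" mnemonics."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # e.g. "GREEK SMALL LETTER BETA" -> ["GREEK", "SMALL", "LETTER", "BETA"]
        parts = unicodedata.name(ch, "").split()
        if len(parts) == 4 and parts[0] == "GREEK" and parts[2] == "LETTER":
            letter = parts[3].lower()
            if parts[1] == "CAPITAL":
                letter = letter.capitalize()
            out.append("[" + letter + "]")
        else:
            # Assumed fallback: make the unknown codepoint visible, don't drop it.
            out.append("[U+%04X]" % ord(ch))
    return "".join(out)
```

With this, to_ascii_mnemonic("\u03b2-alanine") gives "[beta]-alanine", and a
character with no rule, such as plus-minus (#177), comes out as "[U+00B1]" so
it can be reviewed later.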
Post by Craig James
As the old joke says, "The great thing about standards is that there are
so many to choose from."
Your need may be different than ours, so my cynicism might not apply.
I think if I can find out what the character *is* I am able to process
stuff OK. I was really trying to find an abstraction of the essential
symbols, which I suspect occur only in chemical names and chemical formulae.
I would not use subscript characters.