Peter Murray-Rust
2012-09-13 10:21:05 UTC
Has anyone compiled a set of Unicode codepoints (characters) that are
essential for chemistry, specifically in Anglophone countries and aimed at
machine processing?
[Note I am not including the glyphs in this mail in case of corruption]
For example, chemistry uses over half of the printing characters in the
range 32-126, probably most of the Greek characters (they can be used for
locants), some of the ISO Latin-1 characters, plus-minus (for racemates),
and middot (e.g. Et2O.BF3) [http://en.wikipedia.org/wiki/Interpunct].
I would exclude personal names (e.g. Hueckel) and units (e.g. Angstrom) as
they are used elsewhere.
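
One way to start such a list would be to inventory the non-ASCII codepoints
that actually occur in a corpus of chemical names. The following is only a
sketch under assumptions: the function names and the four sample strings are
made up for illustration, and the non-ASCII characters are written as escapes
so that no glyph needs to survive transmission.

```python
import unicodedata
from collections import Counter


def codepoint_inventory(names):
    """Count every non-ASCII codepoint across an iterable of strings."""
    return Counter(ch for name in names for ch in name if ord(ch) > 127)


def describe(counts):
    """Return (U+XXXX, Unicode name, count) tuples, most frequent first."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


# Hypothetical sample names; a real survey would read a much larger corpus.
sample = [
    "\u03b1-D-glucopyranose",  # Greek alpha used as a locant
    "(\u00b1)-limonene",       # plus-minus for a racemate
    "Et2O\u00b7BF3",           # middle dot in an adduct
    "\u03b1-pinene",
]

for code, name, n in describe(codepoint_inventory(sample)):
    print(code, name, n)
```

Run over real data, the resulting table would show which codepoints are
genuinely in use and how often, which could then be pruned by hand.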
Where possible it would be valuable to have a normalized value. Thus IMO
machine-processable chemical names should use '-' (char #45, hyphen-minus
http://en.wikipedia.org/wiki/Hyphen ) rather than a true minus sign or any
of the dashes. Similarly, a minus sign should also be written with this
character.
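
As an illustration of the normalization I have in mind, here is a minimal
sketch that folds a handful of dash-like characters into hyphen-minus
(U+002D). The selection of codepoints in the table is my own assumption and
certainly not complete, and normalize_dashes is a made-up name.

```python
# Dash-like codepoints to fold into hyphen-minus (U+002D).
# This selection is illustrative and not exhaustive.
DASH_LIKE = {
    "\u2010": "HYPHEN",
    "\u2011": "NON-BREAKING HYPHEN",
    "\u2012": "FIGURE DASH",
    "\u2013": "EN DASH",
    "\u2014": "EM DASH",
    "\u2212": "MINUS SIGN",
}

# Translation table mapping each dash-like character to '-'.
_TABLE = str.maketrans({ch: "-" for ch in DASH_LIKE})


def normalize_dashes(text):
    """Replace the dash-like characters listed above with hyphen-minus."""
    return text.translate(_TABLE)


# A name written with a typographic hyphen and an en dash.
print(normalize_dashes("2\u2010methylpropan\u20132\u2010ol"))
# A charge written with a true minus sign.
print(normalize_dashes("Cl\u2212"))
```

As far as I know, Unicode NFKC normalization does not map these characters
to U+002D, which is why the sketch uses an explicit table.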
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069