Andrew Dalke
2012-08-27 10:14:06 UTC
I recently published the "Structure Query Collection", at
https://bitbucket.org/dalke/sqc
This is a collection of different SMILES and SMARTS used as queries against a small molecule database. I include the original data in as raw a form as I can manage, and a processed form which extracts only the SMILES/SMARTS and may include some cleanup of the original data.
I have been struggling with how to define a license for the SQC data, or even if one is needed.
1) BindingDB
The largest data sets in the collection by far come from BindingDB. This contains almost a decade of user-submitted queries from BindingDB. They were extracted from the log files.
As best as I can tell, in the US there would be no legal protection for this data because there's no creative effort in its extraction. The SMILES are too short and non-notable to have individual protection by the submitter, and the US (see Feist) does not recognize database rights. BindingDB is from the US.
Therefore, I do not believe that those data sets are covered under copyright, patent, trademark or other sui generis rights. (Assuming that I understand the phrase 'sui generis' correctly.)
There is a niggling detail as I live in Sweden. However, I don't think my two days of work could be described as a "substantial investment in either the obtaining, verification or presentation of the contents", which is the text of "Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases". Should I include a disclaimer anyway, and if so, is there a suggested style?
2) SMARTS collections from RDKit
This is probably the easiest to deal with. Two SMARTS data sets derive from files found in RDKit, which is distributed under a BSD/MIT-style license. I trivially transformed the data into a simple list of SMARTS.
I don't believe I need to worry about these two data sets at all because I include the RDKit license in the distribution.
However, is the extracted data set even covered under copyright and/or database rights at all?
3) SMARTS from Ehrlich and Rarey's recent J. Cheminformatics paper
Ehrlich and Rarey published a list of 1235 SMARTS in
Systematic benchmark of substructure search in molecular graphs -
From Ullmann to VF2. Hans-Christian Ehrlich and Matthias Rarey.
Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13
These SMARTS were used as their benchmark. I am trying to figure out if these SMARTS are covered under copyright or database rights.
At first thought, one might conclude that the copyright is owned by Ehrlich and Rarey as the paper is published under the Creative Commons Attribution License 2.0. However, the SMARTS themselves were extracted from various other sources. Some of the sources include:
- the Daylight documentation
- works published in other journals, including:
J Chem Inf Comput Sci
Adv Drug Delivery Rev
J Comput Aided Mol Des
This implies that neither the authors nor the Journal of Cheminformatics regard those original SMARTS as being protected under copyright or database rights. I have no idea if the work they did to assemble the SMARTS patterns means that their 1235 SMARTS are protected in the EU under a database right.
I will presume that they are not.
My conclusion is that the actual SMARTS and SMILES patterns from these different sources are not covered under copyright or database rights, and hence I do not need permission from the source providers in order to distribute the SQC. Likewise, the SQC data itself does not need a statement which grants additional rights to others, because the data is not protected under any law.
Is my understanding correct?
I also believe that people will look at the SQC and expect that there is a license statement of some sort - even though one is not required for the data. What is the proper way to disclaim copyright and (more importantly) database rights? I was thinking:
Andrew Dalke and Andrew Dalke Scientific AB disclaim any additional
copyright, database right, or other legal protections to the SMILES
and SMARTS data sets included in the Structure Query Collection.
Perhaps the Blue Obelisk "Open Data" page could describe what one should do in order to make their datasets open, or to disclaim any legal protections to data sets?
Andrew
***@dalkescientific.com