Discussion:
[BlueObelisk-discuss] legal rights concerning SMILES and SMARTS collections
Andrew Dalke
2012-08-27 10:14:06 UTC
I recently published the "Structure Query Collection", at
https://bitbucket.org/dalke/sqc

This is a collection of different SMILES and SMARTS used as queries against a small molecule database. I include the original data in as raw a form as I can manage, and a processed form which extracts only the SMILES/SMARTS and may include some cleanup of the original data.

I have been struggling with how to define a license for the SQC data, or even if one is needed.

1) BindingDB

The largest data sets in the collection by far come from BindingDB. This contains almost a decade of user-submitted queries from BindingDB. They were extracted from the log files.

As best as I can tell, in the US there would be no legal protection for this data because there's no creative effort in its extraction. The SMILES are too short and non-notable to have individual protection by the submitter, and the US (see Feist) does not recognize database rights. BindingDB is from the US.

Therefore, I do not believe that those data sets are covered under copyright, patent, trademark or other sui generis rights. (Assuming that I understand the phrase 'sui generis' correctly.)

There is a niggling detail as I live in Sweden. However, I don't think my two days of work could be described as a "substantial investment in either the obtaining, verification or presentation of the contents", which is the text of "Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases". Should I include a disclaimer anyway, and if so, is there a suggested style?


2) SMARTS collections from RDKit

This is probably the easiest to deal with. Two SMARTS data sets derive from files found in RDKit, which is distributed under a BSD/MIT-style license. I trivially transformed the data into a simple list of SMARTS.

I don't believe I need to worry about these two data sets at all because I include the RDKit license in the distribution.

However, is the extracted data set even covered under copyright and/or database rights at all?


3) SMARTS from Ehrlich and Rarey's recent J. Cheminformatics paper

Ehrlich and Rarey published a list of 1235 SMARTS in

Systematic benchmark of substructure search in molecular graphs -
From Ullmann to VF2. Hans-Christian Ehrlich and Matthias Rarey.
Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13

These SMARTS were used as their benchmark. I am trying to figure out if these SMARTS are covered under copyright or database rights.

At first thought, one might conclude that the copyright is owned by Ehrlich and Rarey as the paper is published under the Creative Commons Attribution License 2.0. However, the SMARTS themselves were extracted from various other sources. Some of the sources include:

- the Daylight documentation
- works published in other journals, including:
J Chem Inf Comput Sci
Adv Drug Delivery Rev
J Comput Aided Mol Des

This implies that neither the authors nor the Journal of Cheminformatics regard those original SMARTS as being protected under copyright or database rights. I have no idea if the work they did to assemble the SMARTS patterns means that their 1235 SMARTS are protected in the EU under a database right.

I will presume that they are not.


My conclusion is that the actual SMARTS and SMILES patterns from these different sources are not covered under copyright or database rights, and hence I do not need permission from the source providers in order to distribute the SQC. Likewise, the SQC data itself does not need a statement which grants additional rights to others, because the data is not protected under any law.


Is my understanding correct?

I also believe that people will look at the SQC and expect that there is a license statement of some sort - even though one is not required for the data. What is the proper way to disclaim copyright and (more importantly) database rights? I was thinking:

Andrew Dalke and Andrew Dalke Scientific AB disclaim any additional
copyright, database right, or other legal protections to the SMILES
and SMARTS data sets included in the Structure Query Collection.


Perhaps the Blue Obelisk "Open Data" page could describe what one should do in order to make their datasets open, or to disclaim any legal protections to data sets?


Andrew
***@dalkescientific.com
Craig James
2012-08-27 15:37:12 UTC
Post by Andrew Dalke
I recently published the "Structure Query Collection", at
https://bitbucket.org/dalke/sqc
This is a collection of different SMILES and SMARTS used as queries against a small molecule database. I include the original data in as raw a form as I can manage, and a processed form which extracts only the SMILES/SMARTS and may include some cleanup of the original data.
I have been struggling with how to define a license for the SQC data, or even if one is needed.
1) BindingDB
The largest data sets in the collection by far come from BindingDB. This contains almost a decade of user-submitted queries from BindingDB. They were extracted from the log files.
As best as I can tell, in the US there would be no legal protection for this data because there's no creative effort in its extraction. The SMILES are too short and non-notable to have individual protection by the submitter, and the US (see Feist) does not recognize database rights. BindingDB is from the US.
Therefore, I do not believe that those data sets are covered under copyright, patent, trademark or other sui generis rights. (Assuming that I understand the phrase 'sui generis' correctly.)
I'm not a lawyer, but ... There is a difference between the BindingDB
data and what the users enter. The terms under which the data are
licensed have nothing to do with who owns the user-entered queries. I
looked around the BindingDB web site and couldn't find a privacy policy
anywhere. If I were a user, I'd assume that what I entered was
private unless the site's privacy policy explicitly said otherwise.
There may be a legal precedent somewhere: if a web site has no policy,
does everything the user types (or draws) automatically become public
domain? I doubt it.

The very fact that someone entered a particular structure can be
highly revealing, even if you don't know who submitted the structure.

Before you release this data (if you haven't already), you might want
to ensure that the users of BindingDB understood that their queries might
some day become public. At eMolecules, we have an absolute policy
that no query will ever be revealed. Without it, we would be blocked
at every major pharma and biotech company in the world. This isn't
speculation ... they've told us so (and in several cases, actually
blocked us until their legal department was assured of our policy and
reputation).

Craig
Andrew Dalke
2012-08-27 17:36:24 UTC
Post by Craig James
I'm not a lawyer, but ... There is a difference between the BindingDB
data and what the users enter. The terms under which the data are
licensed have nothing to do with who owns the user-entered queries. I
looked around the BindingDB web site and couldn't find a privacy policy
anywhere.
The hardest part about getting that data set was to find a place which
1) had the data and 2) which was completely willing to release it.

Michael Gilson specifically said it was no problem, that there was no
assurance of privacy on the site, and that they had no concerns in
releasing the data to me for this project.
Post by Craig James
If I were a user, I'd assume that what I entered was
private unless the site's privacy policy explicitly said otherwise.
Why would you assume that? Every company I've worked for or consulted
for has specifically said that internal structures are never to be
sent out of the organization, excepting where certain agreements, which
spell out what can be done with the data, are in place. This was true
even when I was doing bioinformatics work in 1998, so it's nothing new.

There are very few limits on what a US organization can do with your
data. Privacy policies exist because the organization is voluntarily
limiting what they will do with your information, in exchange for
more trust, information, or money from you.

The limits I know of apply to personal information. SMILES strings
are not personal, nor covered under copyright (that I can tell), so I
know of no legal restriction to prevent BindingDB from doing what they did.

There are of course non-legal reasons. eMolecules, as a search provider,
would not want to do this because you know it could cost you the trust
of your clients. After all, people will violate the corporate
guidelines against revealing internal data on public sites. So even if
it is legal, I can see why you would not want to do this.

But BindingDB is supported by NIH grant R01GM070064, and not financially
by the users of the site. Hence I believe they are more buffered from
the handful of people who might protest. Of course, then they would need
to reveal to their in-house people that they broke the policy...


There are two well-known examples of publicly released query sets
which ended up with troubles. Both were problematical because the
anonymized data could still be de-anonymized to reveal personal
information. These are:

http://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns
http://en.wikipedia.org/wiki/AOL_search_data_leak

Neither is relevant to the BindingDB data. I don't even know
the year when the queries were done, much less the source IP
address. (Although the data looks to be time ordered, so there
might be some hint of time information.)
Post by Craig James
There may be a legal precedent somewhere: if a web site has no policy,
does everything the user types (or draws) automatically become public
domain? I doubt it.
You are confusing privacy with copyright. If I sketch a structure
in Marvin, which Marvin converts to a SMILES string, then do I
own the copyright to that SMILES string?

No, I don't believe I do. Copyright doesn't cover that case.
Just like copyright doesn't cover trademarks, or personal names.

Do you think that the SMILES strings "exhibit the minimal
creativity required for copyright protection"?

They don't contain medical information restricted under HIPAA.

They don't disclose video tape rental or sale records or the like,
so aren't covered under the Video Privacy Protection Act.

And so on. Nothing I know of makes this private or restricted
information.


Oh, and here's another example. I published information about
some of the search queries people used to get to my web site:
http://www.dalkescientific.com/writings/diary/archive/2007/12/23/navel_gazing.html

I am not the only one who does this. Surely these queries are
not covered under copyright or privacy protection. I have no
privacy statement on my web site. Do you believe the contents
of the "referer" line, which your browser by default sends to
each and every server, are required under law to be treated as
private by the people who run the server?
Post by Craig James
The very fact that someone entered a particular structure can be
highly revealing, even if you don't know who submitted the structure.
Yes. Which is why you're not supposed to submit such structures to
untrusted sites. And by default, every site is untrusted.
Post by Craig James
Before you release this data (if you haven't already), you might want
to ensure that the users of BindingDB understood that their queries might
some day become public. At eMolecules, we have an absolute policy
that no query will ever be revealed. Without it, we would be blocked
at every major pharma and biotech company in the world. This isn't
speculation ... they've told us so (and in several cases, actually
blocked us until their legal department was assured of our policy and
reputation).
Yes, but these are very different circumstances. You want your pharma
clients to come to your site and pay you money. BindingDB wants to
collect and distribute binding data, and make more data publicly available.

In any case, if I read you correctly, since BindingDB doesn't have
an established policy, shouldn't the major pharmas and biotechs
already be blocking access to their site? So what's the problem? Who
is going to get mad? What are the possible consequences to me or to
BindingDB? What are the advantages to either of us for retracting
those data sets?

Cheers,


Andrew
***@dalkescientific.com
Peter Murray-Rust
2012-08-27 17:24:23 UTC
Post by Andrew Dalke
I recently published the "Structure Query Collection", at
https://bitbucket.org/dalke/sqc
This is a collection of different SMILES and SMARTS used as queries
against a small molecule database. I include the original data in as raw a
form as I can manage, and a processed form which extracts only the
SMILES/SMARTS and may include some cleanup of the original data.
I have been struggling with how to define a license for the SQC data, or
even if one is needed.
We struggled with this for 2 years on the Panton Principles (
pantonprinciples.org) and believe that a licence is highly desirable as it
clarifies the position.
Post by Andrew Dalke
Therefore, I do not believe that those data sets are covered under
copyright, patent, trademark or other sui generis rights. (Assuming that I
understand the phrase 'sui generis' correctly.)
You may be right but I suspect that some people could claim "sui generis"
and that we would only find out in court if they had a case.
Post by Andrew Dalke
At first thought, one might conclude that the copyright is owned by
Ehrlich and Rarey as the paper is published under the Creative Commons
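Attribution License 2.0. However, the SMARTS themselves were extracted from
various other sources. Some of the sources include:
- the Daylight documentation
- works published in other journals, including:
J Chem Inf Comput Sci
Adv Drug Delivery Rev
J Comput Aided Mol Des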
This implies that neither the authors nor the Journal of Cheminformatics
regard those original SMARTS as being protected under copyright or database
rights. I have no idea if the work they did to assemble the SMARTS patterns
means that their 1235 SMARTS are protected in the EU under a database right.
I will presume that they are not.
This looks like attribution cascading and there are different views on
this. It is probably a question of estimating risk as I doubt it is
possible to unravel all the history exactly.


Post by Andrew Dalke
My conclusion is that the actual SMARTS and SMILES patterns from these
different sources are not covered under copyright or database rights, and
hence I do not need permission from the source providers in order to
distribute the SQC. Likewise, the SQC data itself does not need a statement
which grants additional rights to others, because the data is not protected
under any law.
Is my understanding correct?
I also believe that people will look at the SQC and expect that there is a
license statement of some sort - even though one is not required for the
data. What is the proper way to disclaim copyright and (more importantly)
database rights? I was thinking:
Andrew Dalke and Andrew Dalke Scientific AB disclaim any additional
copyright, database right, or other legal protections to the SMILES
and SMARTS data sets included in the Structure Query Collection.
I would recommend either CC0 or PDDL (
http://opendatacommons.org/licenses/pddl/) which was specifically developed
for this purpose. If you are still unsure I suggest you post this to OKFN
open-discuss and I'd expect that you get useful answers.
Post by Andrew Dalke
Perhaps the Blue Obelisk "Open Data" page could describe what one should
do in order to make their datasets open, or to disclaim any legal
protections to data sets?
It's difficult to give concrete generic advice - the BO does not have
legal expertise.

P.
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Andrew Dalke
2012-08-27 18:52:45 UTC
Post by Peter Murray-Rust
We struggled with this for 2 years on the Panton Principles (pantonprinciples.org) and believe that a licence is highly desirable as it clarifies the position.
In the first few days of summer I run errands outside and enjoy the warmth. But when I get up to leave, I look around with consternation before I remember that I didn't bring or need a jacket.

I agree. Even if a license isn't needed, people will still be confused about its lack.
Post by Peter Murray-Rust
You may be right but I suspect that some people could claim "sui generis" and that we would only find out in court if they had a case.
Luckily for me, the biggest and most important of the data sets is in the US, which doesn't recognize database rights.
Post by Peter Murray-Rust
This looks like attribution cascading and there are different views on this. It is probably a question of estimating risk as I doubt it is possible to unravel all the history exactly.
Yes, this is a cascade. Though I thought of it in different terms: the "obnoxious BSD advertising clause" and its "escalating advertising requirements."
Post by Peter Murray-Rust
I would recommend either CC0 or PDDL (http://opendatacommons.org/licenses/pddl/) which was specifically developed for this purpose. If you are still unsure I suggest you post this to OKFN open-discuss and I'd expect that you get useful answers.
The CC0 page is exactly what I want. Thank you. I found it easier to understand than the PDDL.
Post by Andrew Dalke
Perhaps the Blue Obelisk "Open Data" page could describe what one should do in order to make their datasets open, or to disclaim any legal protections to data sets?
Post by Peter Murray-Rust
It's difficult to give concrete generic advice - the BO does not have legal expertise.
There is a wide gap between providing legal advice and the current page. The only license I can find from the available links (following "good example in chemoinformatics is the NMRShiftDB") suggests using the GNU Free Documentation License.


Andrew
***@dalkescientific.com
Stefan Kuhn
2012-08-27 20:16:57 UTC
Post by Andrew Dalke
Post by Peter Murray-Rust
You may be right but I suspect that some people could claim "sui generis"
and that we would only find out in court if they had a case.
Luckily for me, the biggest and most important of the data sets is in the
US, which doesn't recognize database rights.
I am surprised to hear this. Database works are mentioned in Article 5 of the
WIPO Copyright Treaty, and that treaty has been incorporated into US law by the
WIPO Copyright and Performances and Phonograms Treaties Implementation Act
(part of the infamous DMCA). I am neither a lawyer nor an expert on US or
common law, but from this I reckoned that databases are protected in the US.
Can you give me some reason why you think this is not so? You may well be
right; I am interested in learning.

Btw, to my understanding the compilation of all queries entered on a website
never forms a database in the sense of the WIPO Copyright Treaty, because it
has not been created deliberately by somebody, but by accident and without
creative intention. So I think such a collection is not protected anywhere.

Last thing: if there is no declaration on the BindingDB site that the query
data are in fact public, I would be very careful about publishing the data.
Apart from the legal side (where the absence of a license does not mean that
anything goes, but the contrary, even if many people don't understand this),
I would consider it a gross misuse of the trust users put into a (scientific)
website.
Stefan
David García Aristegui
2012-08-27 19:42:22 UTC
Maybe the confusion is about the Non-Original Databases?
http://www.wipo.int/copyright/en/activities/databases.html
Andrew Dalke
2012-08-27 23:52:07 UTC
Post by Stefan Kuhn
Now I am neither a lawyer nor
an expert on US or common law, but from this, I reckoned that databases are
protected in the US. Can you give me some reason why you think it is not like
this? You may well be right, I am interested in learning.
http://en.wikipedia.org/wiki/Feist_v._Rural
...an important United States Supreme Court case establishing that
information alone without a minimum of original creativity cannot be
protected by copyright


Quoting from the actual decision:
As applied to a factual compilation, assuming the absence of
original written expression, only the compiler's selection
and arrangement may be protected; the raw facts may be copied
at will. This result is neither unfair nor unfortunate. It
is the means by which copyright advances the progress of
science and art.


Going back to Wikipedia:
The standard for such originality is fairly low; for example,
business listings have been found to meet this standard when
deciding which companies should be listed and categorizing
those companies required some kind of expert judgment.
Post by Stefan Kuhn
Btw, to my understanding the compilation of all queries entered on a
website never forms a database in the sense of the WIPO Copyright Treaty,
because it has not been created by somebody deliberately, but by accident
without a creative intention. So I think such a collection is not protected
anywhere.
Thank you.
Post by Stefan Kuhn
Last thing: If there is no declaration on the BindingDB site that the query
data are in fact public, I would be very careful about publishing the data. Apart
from the legal side (where the absence of some sort of license does not mean
that anything goes, but the contrary, even if many people don't understand
this), I would consider it a gross misuse of the trust users put into a
(scientific) website.
When is it appropriate to collect and publish user-submitted data?

Assume for now that this is non-copyrightable data, so the question
is only one of privacy.


Every guideline I know of says that the main issue is personal privacy.
For example, the EU Data Protection Directive "regulates the processing
of personal data regardless of whether such processing is automated or
not" and the US Fair Information Practice Principles say "Consumers
should be given notice of an entity's information practices before
any personal information is collected from them. Without notice, a
consumer cannot make an informed decision as to whether and to what
extent to disclose personal information."

But anonymized structure queries aren't personal information, so
those guidelines don't really apply... or do you think it is, or
should be, otherwise?


Who publishes user search terms?


We know that Google publishes their top trending searches:
http://www.google.com/trends/hottrends

Alexa publishes the top search terms which lead people to (say) ChemSpider:
http://www.alexa.com/siteinfo/chemspider.com
   1  chemspider                        6.68%
   2  search spider database programs   3.15%
   3  acetic acid                       1.14%
   4  water marbles                     0.96%
   5  5800                              0.95%
   6  vinylidine difluoride             0.88%
   7  c5h5o6 name                       0.75%
   8  benzyloxy structure               0.69%
   9  bfj 12                            0.49%
  10  h3aso3

I pointed out the Netflix data set, and the AOL one. There the
problem was the lack of full anonymization, but academic research
based on large-scale user-submitted queries is nothing new. Here's
one report from 2001 based on Excite data.
http://comminfo.rutgers.edu/~tefko/JASIST2001.pdf

A MEDLINEplus analysis for 2002-2003
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839623/

"the TRIP database—a meta-search engine covering 150 health
resources including MEDLINE, The Cochrane Library, and a
variety of guidelines"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852632/

I point these out because they all report a small number of highly
reduced user-submitted queries. In all cases, these papers are
used to improve the general understanding of how people do
real-world queries. That's sorely lacking in cheminformatics,
but a need should not drive one to break ethical considerations.

Since those are acceptable (even as with the PDB when the privacy
statement explicitly says "We do not share server log information
with third parties for marketing or other purposes.") then at
what point does something go from acceptable to a "gross
misuse of the trust users put into a (scientific) website"?

MEDLINEplus, BTW, says:
This information is used to measure the number of visitors
to the various sections of our site and improve organization,
coverage, system performance or problem areas. This information
is not used for associating search terms or patterns of site
navigation with individual users. When search features offer
suggested terms, these suggestions are based on aggregated data
only. NLM periodically deletes its Web logs. On occasion, NLM
may provide aggregated information to third party entities it
contracts with for the purposes of research analysis. Aggregated
data cannot be linked back to an individual user.

What does "aggregated information" mean? Why can a third-party
get access to the data but not the public? Is that access
acceptable so long as it's only for research analysis?


Okay, so perhaps the problem is the lack of a privacy statement?
Looking at the list of resources from
http://pipeline.corante.com/archives/2012/08/02/public_domain_databases_in_medicinal_chemistry.php#comments

BindingDB - no privacy statement
ChEMBL - they will keep your personal information private, but
  they say *nothing* about anonymous reports of your query data
PubChem - by law they can't reveal anything
Binding MOAD - no privacy statement
ChemSpider - they will keep your personal information private, but
  they say *nothing* about anonymous reports of your query data
DrugBank - no privacy policy
GRAC and IUPHAR-DB - they will keep your personal information private, but
  they say *nothing* about anonymous reports of your query data
PDBbind - no privacy policy
PDSP Ki - no privacy policy
Supertarget - no privacy policy
Therapeutic Targets Database - no privacy policy
Zinc - "Thus our bias is to be open", "Limited Privacy Option for DOCK Blaster jobs",
  "Results of ZINC upload and subset requests and DOCK Blaster protonate requests
  remain on our server for seven days and cannot be PIN protected",
  "Any way around these restrictions? / You may request a private copy of DOCK,
  DOCK Blaster, and ZINC, and run them on your own servers. This is subject to
  licensing from the Regents of the University of California, and may not be free."

I found no evidence that these chemistry databases have said that
they will keep your anonymous search data private. Not only that,
but ZINC has specifically said that they will make your data public
excepting that you can delete some of your search results before
the public gets access to it!

Who knows, perhaps if BindingDB had a privacy policy it would be
more like ZINC's.



Still, let's see if there's something which would make you [Stefan]
and others like Craig happier. What would make for a reasonable
compromise data set?

- Is the problem that the submitted SMILES in the BindingDB data
sets may contain proprietary structures? In that case, would
removing all structures which aren't also in PubChem be acceptable?
That's a bit harsh because someone tuning their search engine also
needs an appropriate number of negatives.

- Is there really a widespread problem of people submitting
proprietary compound information to public servers, with the
expectation that it will be private, or is it mostly an abstract worry?

- Is the problem that the data set is too fine-grained, which means
that someone very clever may detect patterns in what someone else
is doing? Would releasing a randomly selected sample of 1% of the
structures be acceptable? Is there some way to dirty up the structure
to make it be more acceptable? (Rather like what the US Census does
to improve confidentiality protection in their reports.)

- Suppose I were the only person who had access to the BindingDB data
set, and I wrote a report about the highly-aggregated results of my
study but did not release the data. Would that not be contrary to
the principles of "Open Data"?

- What is an acceptable level of complaints? No matter what the
final policy is, someone may say that it's unacceptable. Would
one complaint out of 1,000 users be acceptable? 1 out of 100?
What level of complaint would there be in releasing the full,
unfiltered and anonymous data set?

- Who will make up the institutional review board which decides
if a given report is sufficiently aggregated so as to no longer
be a gross misuse of trust? Do they have the right experience
to judge, guided by past incidents, what is and is not appropriate?
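
To make two of the options above concrete (keeping only structures that also appear in a public reference set, and releasing a small random sample), here is a minimal Python sketch. It is an illustration under stated assumptions, not what was done for the SQC: the function name `sample_queries` and its parameters are hypothetical, and the "allowed" filter is a plain exact string match, which is much weaker than a real structure lookup against PubChem on canonicalized SMILES.

```python
import random


def sample_queries(queries, fraction=0.01, seed=None, allowed=None):
    """Return a random sample of the unique query strings.

    queries  -- iterable of SMILES/SMARTS strings, possibly time ordered
    fraction -- portion of the unique queries to keep (0.01 means 1%)
    seed     -- optional seed, so that a release can be reproduced
    allowed  -- optional set of strings; if given, only queries found in
                it are eligible (hypothetical stand-in for "also in PubChem")
    """
    rng = random.Random(seed)

    # Drop blank lines and duplicates, keeping the first occurrence of each.
    unique = list(dict.fromkeys(q.strip() for q in queries if q.strip()))

    # Exact-string filter only; this does NOT compare chemical structures.
    if allowed is not None:
        unique = [q for q in unique if q in allowed]

    if not unique:
        return []

    # Keep at least one query so that tiny inputs still give a result.
    k = max(1, round(len(unique) * fraction))

    # random.sample returns the picks in random order, so the original
    # (possibly time-ordered) sequence is not preserved in the output.
    return rng.sample(unique, k)
```

Sampling hides most of the fine-grained patterns but also discards most of the data, and the exact-match filter would drop any query that is written differently from the reference string, so both knobs trade usefulness against confidentiality.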



Let's go beyond that. The consumer guidelines are well described at:

http://www.microsoft.com/security/online-privacy/prevent.aspx
Privacy policies should clearly explain what data the website
gathers about you, how it is used, shared, and secured, and
how you can edit or delete it. (For example, look at the
bottom of this and every page on Microsoft.com.) No privacy
statement? Take your business elsewhere.

and

• Do not post anything online that you would not want made public.

Surely the guidelines for research use should be no less strict than
those for someone looking for cat pictures.


I keep ending up with the conclusion that there is no specific
expectation that submissions sent to an arbitrary web site,
suitably anonymized and untraceable to the originating person
or organization, must be strictly private, and I find no
guidelines which suggest an appropriate intermediate level
of data privacy. And when I do find examples of large data set
releases, the only issue has been the lack of full anonymity,
which is not a problem with this data set.

Why then is the release of this data set a gross misuse of trust?

Cheers,

Andrew
***@dalkescientific.com
John P. Overington
2012-08-28 07:46:03 UTC
Hi,

I think there are some quite big issues with releasing query sets without explicit permission (informed consent) from the users. I think most users do not expect their structures to be disclosed when they use an on-line resource, especially if it is not made crystal clear that future exposure of these queries is likely, or allowed.

If the queries are published, it could present a major challenge to subsequent filing of patents of composition of matter, for novel compounds.

jpo

--
John P. Overington, PhD FRSC C.Chem.

Computational Chemical Biology
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton, Cambs. CB10 1SD, United Kingdom
--
mail: ***@ebi.ac.uk
office phone: +44-1223-492666
admin: ***@ebi.ac.uk
admin phone: +44-1223-494574
fax: +44-1223-494468
twitter: @chembl
skype: john.overington
Post by Andrew Dalke
Post by Stefan Kuhn
Now I am neither a lawyer nor
an exprt on us or common law, but from this, I reckoned that databases are
protected in the US. Can you give me some reason why you think it is not like
this? You may well be right, I am interested in learning.
http://en.wikipedia.org/wiki/Feist_v._Rural
...an important United States Supreme Court case establishing that
information alone without a minimum of original creativity cannot be
protected by copyright
As applied to a factual compilation, assuming the absence of
original written expression, only the compiler's selection
and arrangement may be protected; the raw facts may be copied
at will. This result is neither unfair nor unfortunate. It
is the means by which copyright advances the progress of
science and art.
The standard for such originality is fairly low; for example,
business listings have been found to meet this standard when
deciding which companies should be listed and categorizing
those companies required some kind of expert judgment.
Post by Stefan Kuhn
Btw, for my understanding the compilation of all queries entered into on a
website never forms a database in the sense of the wipo copyright treaty,
because it has not been created by somebody deliberately, but by accident
without a creative intention. So I think such a collection is not protected
anywhere, for my understanding.
Thank you.
Post by Stefan Kuhn
Last thing: If there is no declaration on the BindingDB site that the query
data are in fact public, I would be very carefull to publish the data. Apart
from the legal side (where the absence of some sort of license does not mean
that anything goes, but the contrary, even if many people don't understand
this), I would consider it a gross misuse of the trust users put into a
(scientific) website.
When is it appropriate to collect and publish user-submitted data?
Assume for now that this is non-copyrightable data, so the question
is only one of privacy.
Every guideline I know of says that the main issue is personal privacy.
For example, the EU Data Protection Directive "regulates the processing
of personal data regardless of whether such processing is automated or
not" and the US Fair Information Practice Principles say "Consumers
should be given notice of an entity's information practices before
any personal information is collected from them. Without notice, a
consumer cannot make an informed decision as to whether and to what
extent to disclose personal information."
But anonymized structure queries aren't personal information, so
those guidelines don't really apply.... or do you think/should it
be otherwise?
Who publishes user search terms?
http://www.google.com/trends/hottrends
http://www.alexa.com/siteinfo/chemspider.com
1 chemspider 6.68%
2 search spider database programs 3.15%
3 acetic acid 1.14%
4 water marbles 0.96%
5 5800 0.95%
6 vinylidine difluoride 0.88%
7 c5h5o6 name 0.75%
8 benzyloxy structure 0.69%
9 bfj 12 0.49%
10 h3aso3
I pointed out the NetFlix data set, and the AOL one. There the
problem was the lack of full anonymization, but academic research
based on large-scale user-submitted queries is nothing new. Here's
one report from 2001 based on Excite data.
http://comminfo.rutgers.edu/~tefko/JASIST2001.pdf
A MEDLINEplus analysis for 2002-2003
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839623/
"the TRIP database—a meta-search engine covering 150 health
resources including MEDLINE, The Cochrane Library, and a
variety of guidelines"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852632/
I point these out because they all report a small number of highly
reduced user-submitted queries. In all cases, these papers are
used to improve the general understanding of how people do
real-world queries. That's sorely lacking in cheminformatics,
but a need should not drive one to break ethical considerations.
Since those are acceptable (even as with the PDB when the privacy
statement explicitly says "We do not share server log information
with third parties for marketing or other purposes.") then at
what point does something go from acceptable to a "gross
misuse of the trust users put into a (scientific) website"?
This information is used to measure the number of visitors
to the various sections of our site and improve organization,
coverage, system performance or problem areas. This information
is not used for associating search terms or patterns of site
navigation with individual users. When search features offer
suggested terms, these suggestions are based on aggregated data
only. NLM periodically deletes its Web logs. On occasion, NLM
may provide aggregated information to third party entities it
contracts with for the purposes of research analysis. Aggregated
data cannot be linked back to an individual user.
What does "aggregated information" mean? Why can a third-party
get access to the data but not the public? Is that access
acceptable so long as it's only for research analysis?
Okay, so perhaps the problem is the lack of a privacy statement?
Looking at the list of resources from
http://pipeline.corante.com/archives/2012/08/02/public_domain_databases_in_medicinal_chemistry.php#comments
BindingDB - no privacy statement
ChEMBL - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
PubChem - by law they can't reveal anything
Binding MOAD - no privacy statement
ChemSpider - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
DrugBank - no privacy policy
GRAC and IUPHAR-DB - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
PDBbind - no privacy policy
PDSP Ki - no privacy policy
Supertarget - no privacy policy
Therapeutic Targets Database - no privacy policy
Zinc - "Thus our bias is to be open", "Limited Privacy Option for DOCK Blaster jobs",
"Results of ZINC upload and subset requests and DOCK Blaster protonate requests
remain on our server for seven days and cannot be PIN protected",
"Any way around these restrictions? / You may request a private copy of DOCK,
DOCK Blaster, and ZINC, and run them on your own servers. This is subject to
licensing from the Regents of the University of California, and may not be free."
I found no evidence that these chemistry databases have said that
they will keep your anonymous search data private. Not only that,
but ZINC has specifically said that they will make your data public
excepting that you can delete some of your search results before
the public gets access to it!
Who knows, perhaps had BindingDB a privacy policy it would be
more like ZINC's.
Still, let's see if there's something which would make you [Stefan]
and others like Craig happier. What would make for a reasonable
compromise data set?
- Is the problem that the submitted SMILES in the BindingDB data
sets may contain proprietary structures? In that case, would
removing all structures which aren't also in PubChem be acceptable?
That's a bit harsh because someone tuning their search engine also
needs an appropriate number of negatives.
- Is there really a widespread problem of people submitting
proprietary compound information to public servers, with the
expectation that it will be private, or is it mostly an abstract worry?
- Is the problem that the data set is too fine-grained, which means
that someone very clever may detect patterns in what someone else
is doing? Would releasing a randomly selected sample of 1% of the
structures be acceptable? Is there some way to dirty up the structure
to make it be more acceptable? (Rather like what the US Census does
to improve confidentiality protection in their reports.)
- Suppose I was the only person who had access to the BindingDB data
set. I wrote a report about the highly-aggregated results of my
study, but don't release the data. Would that not be contrary to
the principles of "Open Data"?
- What is an acceptable level of complaints? No matter what the
final policy is, someone may say that it's unacceptable. Would
one complaint out of 1,000 users be acceptable? 1 out of 100?
What level of complaint would there be in releasing the full,
unfiltered and anonymous data set?
- Who will make up the institutional review board which decides
if a given report is sufficiently aggregated so as to no longer
be a gross misuse of trust? Do they have the right experience
to judge, guided by past incidents, what is and is not appropriate?
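Two of the options above (keeping only structures that are already public, and releasing a small random sample) can be sketched in a few lines of Python. This is only an illustration, not how the SQC data was processed. The function names `public_only` and `random_sample` are made up for this sketch, and it assumes the query SMILES and the reference set (for example, structures exported from PubChem) have already been canonicalized by the same toolkit, so that plain string comparison is meaningful.

```python
import random


def public_only(queries, public_structures):
    """Keep only the queries whose string also appears in the public set.

    Assumes both inputs were canonicalized the same way beforehand;
    no chemistry-aware comparison is done here.
    """
    public = set(public_structures)
    return [q for q in queries if q in public]


def random_sample(queries, fraction=0.01, seed=None):
    """Return a uniformly random sample of about `fraction` of the queries."""
    if not queries:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(queries) * fraction))
    return rng.sample(queries, k)


if __name__ == "__main__":
    # Toy data, invented for the example.
    queries = ["CCO", "c1ccccc1", "CCO", "C1CC1N"]
    print(public_only(queries, {"CCO", "c1ccccc1"}))
    print(len(random_sample(list(range(1000)), fraction=0.01, seed=0)))
```

Note that the first filter also removes every query that is not an exact public structure, which is the "harsh" loss of negatives mentioned above, and that neither step by itself says anything about whether the remaining data is ethically fine to release.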
http://www.microsoft.com/security/online-privacy/prevent.aspx
Privacy policies should clearly explain what data the website
gathers about you, how it is used, shared, and secured, and
how you can edit or delete it. (For example, look at the
bottom of this and every page on Microsoft.com.) No privacy
statement? Take your business elsewhere.
and
• Do not post anything online that you would not want made public.
Surely the research guidelines should be no less strict than
those for someone looking for cat pictures.
I keep ending up with the conclusion that there is no specific
expectation that submissions sent to an arbitrary web site,
suitably anonymized and untraceable to the originating person
or organization, must be strictly private, and I find no
guidelines which suggest an appropriate intermediate level
of data privacy. And when I do find examples of large data set
releases, the only issue has been the lack of full anonymity,
which is not the problem with this data set.
Why then is the release of this data set a gross misuse of trust?
Cheers,
Andrew
_______________________________________________
Blueobelisk-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
Peter Murray-Rust
2012-08-28 09:12:15 UTC
Post by John P. Overington
Hi,
I think there are some quite big issues with releasing query sets without
explicit permission (informed consent) from the users. I think most users
do not think that their structures will be disclosed when they use an
on-line resource, especially if this is not made crystal clear that future
exposure of these queries is likely, or allowed.
If the queries are published, it could present a major challenge to
subsequent filing of patents of composition of matter, for novel compounds.
This is an important discussion.

We should distinguish between *factual data* and *queries*. The latter are
covered by more constraints (e.g. privacy) than intellectual property
(copyrights, patents, trademarks). I'm just commenting here on data.

The problems with data are:
* it is not clearly covered by or exempt from IP laws (copyright, sui
generis database). Therefore there is often an element of argument. "Facts
are not copyright, therefore XYZ is not copyright".
* scientific data is, by its nature, re-usable and re-used. Every re-use may
or may not carry some rights or licence. This leads to a cascade.

I think there are two positions:
* the absolute. Unless the whole history of the data cascade is known the
rights are unclear and so it cannot be re-used.
* there is a pragmatic limit after which attribution decays sufficiently to
be ignored. There is also the concept of acceptable risk.

As an example: I read a table in a closed access (but "public") paper
(e.g. J. Med. Chem.) which lists compounds and activities. The *information*
is "in the public domain" - i.e. everyone can know it in principle. If I
extract one compound with its melting point, I can claim that is a fact.
(There is no other way of expressing the information). Such extraction has
been going on for 150 years and is accepted and valuable practice.

Someone aggregates my extraction (and perhaps builds a melting point
predictor). They may acknowledge my extraction and they may acknowledge the
original paper. They publish a list of melting points. So far OK, I think.
Then someone else uses the predictor to publish a set of predicted MPts.
The original data is not acknowledged. This is because it's almost
impossible. We see the decay in the cascade.

Now someone else extracts all the tables in the original papers. The
journal screams "copyright!". The melting point predictor is potentially
contaminated. (For example the CCDC has refused people permission to
distribute force-fields derived from the CCDC collection). Whatever the
legal position, it is clearly messy and probably with no clear resolution.

We faced this with Crystaleye which extracts crystal structures from
supplemental data. These carry no licence and are usually outside the
paywall. We (and Chemspider) tried to get an answer out of the ACS - no
joy. So we have put the structures up anyway. (They are created by the
instruments and the authors, not the ACS). It's a risk, but it's a smaller
risk than cycling round Hyde Park Corner. And what's the worst? A
take-down? I don't think I shall go to jail.

Almost all data is like this - it depends on other data. We have to change
the culture so that publicly visible scientific data is regarded as Openly
re-usable. That's effectively what happens in bioinformatics - few
databases have explicit clear licences. Without this fluency bioinformatics
would collapse. Where we know we have problems - such as CAS registry
numbers (copyright CAS) we try to avoid them and use InChI or Wikipedia.

P.
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Craig James
2012-08-28 23:01:06 UTC
Post by John P. Overington
I think there are some quite big issues with releasing query sets without explicit
permission (informed consent) from the users. I think most users do not think
that their structures will be disclosed when they use an on-line resource,
especially if this is not made crystal clear that future exposure of these queries
is likely, or allowed.
This is exactly what I was trying to say, but you said it more concisely.

There's legal, and then there's what people expect from their fellow scientists.
Post by John P. Overington
Post by Craig James
If I were a user, I'd assume that what I entered was
private unless the site's privacy policy explicitly said otherwise.
Why would you assume that?
Because that's the proper way to operate a web site -- treat your
customers' data as confidential by default.
Post by John P. Overington
Every company I've worked for or consulted
for has specifically said that internal structures are never to be
sent out of the organization, excepting where certain agreements, which
spell out what can be done with the data, are in place. This was true
even when I was doing bioinformatics work in 1998, so it's nothing new.
Right, but that's to protect the organization from unscrupulous
web-site operators. That doesn't mean it's OK to be unscrupulous.
Just because it's legal to give out customer's data doesn't mean it's
ethical.

Craig
Andrew Dalke
2012-08-29 03:18:11 UTC
Post by Craig James
Post by John P. Overington
I think there are some quite big issues with releasing query sets without explicit
permission (informed consent) from the users. I think most users do not think
that their structures will be disclosed when they use an on-line resource,
especially if this is not made crystal clear that future exposure of these queries
is likely, or allowed.
This is exactly what I was trying to say, but you said it more concisely.
There's legal, and then there's what people expect from their fellow scientists.
I forwarded Craig and Stefan's comments to Mike and double-checked
that he's okay with releasing the BindingDB user-contributed data set.
I want it to be released, I want more Open Data, I'm not convinced that
there's an ethical problem, and I'm willing to take heat for
complaints. But I want to understand the alternate viewpoints here
and see if there's something which would help establish guidelines
for the future.


Has anyone anywhere ever elaborated upon those expectations?

The only thing I could find, after much searching, was
http://research.microsoft.com/en-us/um/people/sdumais/chi2011-logcourse-share.pdf
which suggests ways for "Using the Data Responsibly"
• Control access to the data
Externally: Risky (e.g., AOL, Netflix, Enron, Facebook public)
...
Transparency and user control
• Publicly available privacy policy
• Give users control to delete, opt-out, etc.
and also that there are:
Emerging industry standards and best practices

I can't find these standards or best practices.


Let's take it for granted that the best action is
• Publicly available privacy policy
• Give users control to delete, opt-out, etc.


My question is, "What part of this data set can be published?"


I already pointed out that several big query sets (from AOL and
NetFlix) were released to the public. The issues were that
the data wasn't anonymized enough; not that they were released.
So it is generally acceptable to release some data.

I pointed out that Google, Alexa, the PDB, MEDLINEplus, and
others have all published some information about the top
query terms, and without complaints (that I could find) that this
information is supposed to be kept confidential.

I have pointed out that the NLM specifically states that it will
give out aggregated and non-personal-identifying information
to "third party entities it contracts with for the purposes
of research analysis."


Here's a followup set of questions. Suppose BindingDB has a privacy
statement similar to MEDLINEplus:

BindingDB may provide aggregated information to
third party entities for the purposes of research
analysis. Aggregated data cannot be linked back
to an individual user, site, or IP address.

1) Would this statement be out of place? If so, why? And if
so, then what makes MEDLINEplus's existing statement acceptable?

2) Would you shy away from any research site which included
this clause in their privacy statement? Why? Do you expect
most researchers would be put off by it?

3) Which third parties may be granted access to the BindingDB
query data, in order to carry out research analysis?

- If the answer is 'anyone' or 'anyone who agrees to not
redistribute the data' then that's effectively saying
it's public.

- If the answer is 'academic researchers only' then that
makes it completely worthless for me, for RDKit and
Indigo (both developed by non-academics), and for
any other industry researchers.

- What does 'research analysis' mean? What analysis should
be excluded?


4) Is "scientific data" different than most other queries?
Is it because of patentability issues? Is it because of
the scientific tradition of recognition by priority?

Perhaps the Haumea Controversy is relevant:
http://en.wikipedia.org/wiki/Controversy_over_the_discovery_of_Haumea
because it's based in part on a Google indexed search of the
Caltech observation logs.

Jean-Claude Bradley (of the "Useful Chemistry" blog) gave it
an Open Science slant at:
http://usefulchem.blogspot.com/2010/07/secrecy-in-astronomy-and-open-science.html
and said:
"Secrecy only works if everyone competing follows the same rules."

What are the rules?

If a group of people who are proponents of Open Data can come up
with rules for when non-personal, non-legally protected data should
be kept private, then I think that would be remarkable.
Post by Craig James
Because that's the proper way to operate a web site -- treat your
customers' data as confidential by default.
Excepting that I can point to paper upon paper where
people have studied the internal logs of various general
search engines and reported some of the "confidential"
queries therein.

Here's another such paper:
http://www.sigir.org/forum/F2002/broder.pdf
which reports that some of the AltaVista search terms were:
"Greyhound Bus"
"compaq"
"haaretz"
"normocytic anemia"
"Scoville heat units"

Obviously there's some amount of data release which is
acceptable. And useful, since publication of this sort
of log analysis has been essential in helping improve the
general understanding of how the general public uses
large-scale query engines, compared to, say, library search.
Quoting from:

http://research.microsoft.com/en-us/um/people/teevan/publications/talks/jitp11.pptx
Surprises About Query Log Data
From early log analysis
Examples: Jansen et al. 2000, Broder 1998
- Queries are not 7 or 8 words long
- Advanced operators not used or “misused”
- Nobody used relevance feedback
- Lots of people search for sex
- Navigation behavior common
Post by Craig James
Post by John P. Overington
Every company I've worked for or consulted
for has specifically said that internal structures are never to be
sent out of the organization, excepting where certain agreements, which
spell out what can be done with the data, are in place. This was true
even when I was doing bioinformatics work in 1998, so it's nothing new.
Right, but that's to protect the organization from unscrupulous
web-site operators. That doesn't mean it's OK to be unscrupulous.
Just because it's legal to give out customer's data doesn't mean it's
ethical.
That is one of the reasons. You know it's not the only one.

It's also because there are sites which are poorly configured,
so that their log files are public. There are probably even some
which are deliberately configured that way, but that's hard to search for.

It's also because there are sites which are insecure, and
others might (illegally) get access to the data and release it.

It's also because there are sites like ZINC, which lean much
more towards openness than most companies would like. Again,
from their privacy policy:

DOCK Blaster, ZINC, DUD, and other docking.org services are completely
free public services, run on US-taxpayer-funded computers at a public
university. Thus our bias is to be open.
...
After 7 days, your PIN will be deleted and your [DOCK Blaster jobs]
data will be visible without limitation to anyone.

None of these are "unscrupulous web-site operators."


Andrew
***@dalkescientific.com
Craig James
2012-08-29 15:29:28 UTC
Post by Andrew Dalke
Let's take it for granted that the best action is
• Publicly available privacy policy
• Give users control to delete, opt-out, etc.
One final comment regarding BindingDB: I'm pretty sure that Mike is on
good legal ground in releasing the data (but IANAL). There was no
privacy statement that I could find, so legally speaking there was
probably no expectation of privacy. My comments about the
repercussions of releasing the data are all from a social and
scientific point of view.

Your example about the planetary discovery is very apt.
Post by Andrew Dalke
My question is, "What part of this data set can be published?"
I already pointed out that several big query sets (from AOL and
NetFlix) were released to the public. The issues were that
the data wasn't anonymized enough; not that they were released.
So it is generally acceptable to release some data.
NetFlix screwed up badly. They released enough information that
specific users who had watched particular genres of erotic material
could be individually identified. It's a lesson: large data sets
contain far more information than just the individual line items.
Post by Andrew Dalke
I pointed out that Google, Alexa, the PDB, MEDLINEplus, and
others have all published some information about the top
query terms, and without complaints (that I could find) that this
information is supposed to be kept confidential.
I'll tell you without hesitation that the top substructure query at
www.emolecules.com is for isothiocyanate ... because it's a clickable
example on our home page. No surprise there ... but we can't go
beyond that.
Post by Andrew Dalke
Here's a followup set of questions. Suppose BindingDB has a privacy
statement similar to MEDLINEplus:
BindingDB may provide aggregated information to
third party entities for the purposes of research
analysis. Aggregated data cannot be linked back
to an individual user, site, or IP address.
1) Would this statement be out of place? If so, why? And if
so, then what makes MEDLINEplus's existing statement acceptable?
A web site can publish any privacy policy it likes. If you don't like
it, don't use the site.
Post by Andrew Dalke
2) Would you shy away from any research site which included
this clause in their privacy statement? Why? Do you expect
most researchers would be put off by it?
As I mentioned earlier, this would be death to eMolecules. Although we
serve the academic community where privacy may not be so important,
our primary customer base is the pharmaceutical/biotech industry. Our
main web site (www.emolecules.com) is our public face, but in fact we
operate many dozens of private web sites that can only be seen by one
specific customer.

Privacy is paramount, and our policy is that queries will never be shared
with outside parties. Anything else would result in an industry-wide
ban on emolecules.com.

Your question was about research sites, and we are definitely a commercial site.
Post by Andrew Dalke
4) Is "scientific data" different than most other queries?
Is it because of patentability issues? Is it because of
the scientific tradition of recognition by priority?
Patents, patents, patents. The costs to bring a single drug to market
are expressed in billions of dollars. A single leak can invalidate a
patent application. End of story.
Post by Andrew Dalke
Post by Craig James
Because that's the proper way to operate a web site -- treat your
customers' data as confidential by default.
Excepting that I can point to paper upon paper where
people have studied the internal logs of various general
search engines and reported some of the "confidential"
queries therein.
http://www.sigir.org/forum/F2002/broder.pdf
"Greyhound Bus"
"compaq"
"haaretz"
"normocytic anemia"
"Scoville heat units"
You'd have to read their privacy statement to see whether they
violated their own rules.

OKCupid.com is an interesting example. They constantly publish
interesting analyses of which profiles get the most dates. You might
be surprised to find that atheists are at the top of the list!

http://blog.okcupid.com/index.php/online-dating-advice-exactly-what-to-say-in-a-first-message/
Post by Andrew Dalke
Post by Craig James
Post by Andrew Dalke
Every company I've worked for or consulted
for has specifically said that internal structures are never to be
sent out of the organization, excepting where certain agreements, which
spell out what can be done with the data, are in place. This was true
even when I was doing bioinformatics work in 1998, so it's nothing new.
Right, but that's to protect the organization from unscrupulous
web-site operators. That doesn't mean it's OK to be unscrupulous.
Just because it's legal to give out customer's data doesn't mean it's
ethical.
That is one of the reasons. You know it's not the only one.
It's also because there are sites which are poorly configured,
so that their log files are public. There are probably even some
which are deliberately configured that way, but that's hard to search for.
It's also because there are sites which are insecure, and
others might (illegally) get access to the data and release it.
It's also because there are sites like ZINC, which lean much
more towards openness than most companies would like. Again,
...
None of these are "unscrupulous web-site operators."
Right on all counts.

Craig
Andrew Dalke
2012-08-29 23:45:10 UTC
My comments about the repercussions of releasing the data
are all from a social and scientific point of view.
Understood. I'm thinking that this would be an
interesting topic for the free software track
at GCC this fall.
NetFlix screwed up badly. They released enough information that
specific users who had watched particular genres of erotic material
could be individually identified. It's a lesson: large data sets
contain far more information than just the individual line items.
Yes, the complaints are that personal information could be
extracted from the data set. The complaints were not (that I
can tell) that the information was published.

For example, in reading the lawsuit filing at
http://www.wired.com/images_blogs/threatlevel/2009/12/doe-v-netflix.pdf
I see over and over again that the problems are:
- the release was against the privacy policy
- the data could be de-anonymized
- the restrictions of the Video Privacy Protection Act also apply.

and with quoted comments like:

The only way I would ever be willing to participate in
any of these community features would be if I could
remain completely anonymous. This has been mentioned
by myself and others a few times on this blog but it
doesn't seem to be something Netflix wants to work on.

I don't see any evidence that specific people were named.
Doing so would be in violation of the Video Privacy Protection
Act. What I see are reports of how people could be identified,
and worries that people would be identified.
Post by Andrew Dalke
1) Would this statement be out of place? If so, why? And if
so, then what makes MEDLINEplus's existing statement acceptable?
A web site can publish any privacy policy it likes. If you don't like
it, don't use the site.
Ahh, I think I was trying to do two things with this question.
I was trying to see if that clause would be generally acceptable,
in that people wouldn't blink twice upon reading it.

And, if so, whether that clause would allow this sort of general
data release.
Your question was about research sites, and we are definitely a commercial site.
Indeed. Very different factors are at play.
Post by Andrew Dalke
4) Is "scientific data" different than most other queries?
Is it because of patentability issues? Is it because of
the scientific tradition of recognition by priority?
Patents, patents, patents. The costs to bring a single drug to market
are expressed in billions of dollars. A single leak can invalidate a
patent application. End of story.
No, not end of story. By this you are saying that only
potentially patentable data needs this sort of protection.

For example, would it be okay to release all of the query
structures where the query structure is also in PubChem?
If that's the case, then I'll gladly publish that data
dump from eMolecules.

And under this guideline, it would be acceptable for ChEMBL
to publish the keywords used in compound data searches,
because there's nothing patentable there.

I don't think you mean that. I think there's *also* another
principle at work. I just can't figure out what it is.

It's like, I have the legal right to take pictures in
public. You might coincidentally be in the scene, but you
can't prohibit my taking of those pictures. You might
consider it to be a violation of your privacy, but it's
not. Because you're in public, after all, and we've
decided, legally, that that's not considered private.

I don't have a name for that sense of false privacy betrayal.
OKCupid.com is an interesting example. They constantly publish
interesting analyses of which profiles get the most dates. You might
be surprised to find that atheists are at the top of the list!
Indeed, it's a good reference. Most of the privacy statements
I've read specifically say that they can collect aggregate data
which cannot be tracked back to specific people. I now believe
most of the publications exist because of this clause.



Andrew
***@dalkescientific.com
Andrew Dalke
2012-09-06 20:55:29 UTC
Post by Andrew Dalke
Understood. I'm thinking that this would be an
interesting topic for the free software track
at GCC this fall.
Here's the abstract I sent in a few days ago (the deadline
was the 1st) for the Goslar conference:

===========

Scientific openness meets the real world

The ideal of scientific openness doesn't come automatically
just because you're working with scientific software or data.
Copyright, patents, database rights, and other legal principles
by default inhibit the free exchange of knowledge. It isn't
hard to license or disclaim those legal protections, but you
have to know that they exist. It's best to choose from one of
the, sometimes confusing, diversity of existing licenses. In
my presentation I'll guide you through the basic issues behind
freeing your work, suggest specific licenses you should consider,
and point out some of the legal and ethical reasons why some
things should not be made free. I'll draw from historical
examples in chemistry, astronomy, and other fields to highlight
some of the issues.

============



Andrew
***@dalkescientific.com
Greg Landrum
2012-09-08 05:31:25 UTC
Post by Andrew Dalke
Here's the abstract I sent in a few days ago (the deadline
was the 1st) for the Goslar conference:
===========
Scientific openness meets the real world
The ideal of scientific openness doesn't come automatically
just because you're working with scientific software or data.
Copyright, patents, database rights, and other legal principles
by default inhibit the free exchange of knowledge. It isn't
hard to license or disclaim those legal protections, but you
have to know that they exist. It's best to choose from one of
the, sometimes confusing, diversity of existing licenses. In
my presentation I'll guide you through the basic issues behind
freeing your work, suggest specific licenses you should consider,
and point out some of the legal and ethical reasons why some
things should not be made free. I'll draw from historical
examples in chemistry, astronomy, and other fields to highlight
some of the issues.
Sounds like that will be a good one. I hope (a) you get a slot and (b)
people other than me actually show up for it.

-greg

John P. Overington
2012-08-29 13:01:59 UTC
Permalink
There are several examples of data services in the chemical arena where there is fairly explicit use of user-provided queries, and of the results of those queries.

Molplex have a service where you can calculate properties/do QSAR on a molecule - you can do that for free, or you can do it for payment/subscription. If you don't pay, you give away the molecule and the rights to any of the data (in the first instance to Molplex I guess, but they are free to do what they want with it). They are a biotech, so that sounds fair to me: if you want privacy, you pay for it. They are pretty explicit about this, or they were when I last looked.

Some of the other compound providers are also pretty clear that they will do what they like with your query.

I think the key thing is to be clear and specific in the terms of use, assign a license to the data and be clear about what you do with people's data - and I do think that the default position for most rational users is that their queries will be private.

We answer quite a few questions from users of ChEMBL (primarily from large pharma) about our app security, how the app works with the message passing from client to server, how long things are in the cache, etc. Bottom line is that we don't store, or have any capability to analyse, any queries. Conversely, we understand that there is interest in collecting this data, and we have been approached about four times now for this, and have declined.

There is a rather sad tale from the field of bioinformatics about 'surprising' use of queries, the end result being that the resource lost a huge amount of support, credibility and trust from the community.

jpo

--
John P. Overington, PhD FRSC C.Chem.

Computational Chemical Biology
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton, Cambs. CB10 1SD, United Kingdom
--
mail: ***@ebi.ac.uk
office phone: +44-1223-492666
admin: ***@ebi.ac.uk
admin phone: +44-1223-494574
fax: +44-1223-494468
twitter: @chembl
skype: john.overington
Post by Craig James
Post by John P. Overington
I think there are some quite big issues with releasing query sets without explicit
permission (informed consent) from the users. I think most users do not think
that their structures will be disclosed when they use an on-line resource,
especially if this is not made crystal clear that future exposure of these queries
is likely, or allowed.
This is exactly what I was trying to say, but you said it more concisely.
There's legal, and then there's what people expect from their fellow scientists.
Post by John P. Overington
Post by Craig James
If I were a user, I'd assume that what I entered was
private unless the site's privacy policy explicitly said otherwise.
Why would you assume that?
Because that's the proper way to operate a web site -- treat your
customers' data as confidential by default.
Post by John P. Overington
Every company I've worked for or consulted
for has specifically said that internal structures are never to be
sent out of the organization, excepting where certain agreements, which
spell out what can be done with the data, are in place. This was true
even when I was doing bioinformatics work in 1998, so it's nothing new.
Right, but that's to protect the organization from unscrupulous
web-site operators. That doesn't mean it's OK to be unscrupulous.
Just because it's legal to give out customer's data doesn't mean it's
ethical.
Craig
Craig James
2012-08-29 14:32:08 UTC
Permalink
Post by John P. Overington
There is a rather sad tale from the field of bioinformatics about 'surprising'
use of queries, the end result being that the resource lost a huge amount
of support, credibility and trust from the community.
Details? Surely it's no secret.

Thanks,
Craig
Andrew Dalke
2012-08-29 22:54:28 UTC
Permalink
Post by John P. Overington
There are several examples of data services on the chemical arena where there is a fairly explicit use of user provided queries, and the results of queries.
The Molplex one is interesting. I haven't considered pay-for-privacy
property prediction as a viable business model.
Post by John P. Overington
I think the key thing is to be clear and specific in a terms of use, assign a license to the data and be clear with what you do with people's data - and I do think that the default position for most rational users is that their queries will be private.
To get back to the original topic; I don't think any of this user-contributed
data is covered under a copyright, patent protection, or other legal
protection, so that the SMILES strings do not need a license.

But that's not your main point. My observation is that users must assume
a statistical model. They don't have a 100% assurance of privacy, for
reasons I outlined, so submitting proprietary structures to a server is
playing with fire.
Post by John P. Overington
We answer quite a few questions from users of chembl (primarily from large pharma) about our app security, and how the app works with the message passing from client to server, how long things are in the cache, etc. Bottom line is that we don't store/ or have any capability to analyse any queries
I'm surprised to hear that. Do you have ways to detect and prevent
abuse? Do you know where your queries are coming from? Do you not
have anyone who does performance tuning?

For example, the CACTVS substructure keys, used at PubChem, were
developed by Wolf-Dietrich Ihlenfeldt. He looked at user queries in
order to discern gaps in the fingerprint coverage. He was and is
bound by the privacy requirements for PubChem.

Or perhaps someone who analyzes the error logs? I just found that submitting
a "CC" through JME gives me an "Internal Server Error"; the query is

• radio:compound_ids
• searchInput:Please enter a list of Compound IDs, keywords, or SMILES separated by newlines
• sketch_selector2:ignore
• query_type:Substructure
• molfile:
CC
JME 2009.08 Thu Aug 30 00:47:38 CEST 2012

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.9324    1.0443    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END

• smiles:CC
• chime:

You're saying that no one can analyze that query data to figure out
what's wrong? And that you need people like me to send reproducibles
instead?




I do realize that you are talking about ChEMBL, but does that same
security policy apply to all of the EBI and EMBL provided services?
For example,

http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/valdar/scorecons_server.pl

has no privacy statement link. I can submit sequence data to it. Is
that sequence data stored in a log file? Who has access to the log
file? What are the restrictions on what people can do with those logs?


In any case, the general-purpose EBI-EMBL privacy statement says nothing
about what you all do with user-submitted search data. If the "default
position for most rational users is that their queries will be private"
then how come personal data is less private than query submissions? I
thought it would be the other way around.

That is, the EMBL-EBI privacy statement says that "Categories of recipients
to whom we may be disclosing your personal data include ... Service
providers processing your information on our behalf which are required
to keep your information confidential ... Scientific review committees",
while you're saying that these people can't get access to the search data.

That doesn't seem logical, so I would think that a rational user would
think that their query data is less confidential than their personal data
and can be seen by service providers and scientific review committees.
Post by John P. Overington
There is a rather sad tale from the field of bioinformatics about
'surprising' use of queries, the end result being that the resource
lost a huge amount of support, credibility and trust from the community.
As with Craig, I would like to know more about this example.

For example, I know that journal peer-reviewers have used knowledge
gleaned from the submitted paper, which they then turned around and
published on their own, even while rejecting the paper. That's
clearly outside the bounds of acceptable behavior.

So if the case you're thinking of is someone who received gene
sequences through their own service then turned around and patented
them, then I think that's beyond the pale, but also not the same
class as the BindingDB data.


Andrew
***@dalkescientific.com
Andrew Dalke
2012-08-29 04:00:47 UTC
Permalink
Post by John P. Overington
especially if this is not made crystal clear that future exposure of these queries is likely, or allowed
Would that be disclosed in the privacy statement?

What percentage of the people actually read those privacy statements?

Assuming that number is south of 1%, am I morally obligated to have
a check box next to the search functionality asking "may we collect
and distribute this query information?" I assume that it's supposed
to be disabled unless specifically enabled.

(It's rather like going to all of the UK sites now, which ask me
ever so politely "can we leave a cookie in your web browser"?)


Interestingly, I found an attempt at doing something like this for web search data.
http://lemurstudy.cs.umass.edu/

The Lemur community query log project was started over one year ago
with the aim of building up a query log that could be used by the
IR research community. Despite the privacy controls and assurances
that data would only be released after review and in a controlled
manner to researchers using a TREC-like protocol, the response from
the community has been underwhelming. Given that we have gathered
the equivalent of less than 6 seconds of Google traffic (assuming
500 million queries per day) in one year, we have decided to
terminate the project. The statistics of the query log data we
gathered are listed below. Due to the small amount of data, we feel
that we cannot do a general release without compromising privacy,
and there simply is not enough data for most techniques that use
query logs.
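
As a quick check of the "6 seconds" figure in the quote above, here is the arithmetic as a small Python sketch. The rate of 500 million queries per day is the assumption stated in the quote; the function name `queries_in` is only illustrative.

```python
# Back-of-the-envelope check of the "6 seconds of Google traffic" figure.
# The 500 million queries/day rate is the assumption stated in the quote.
QUERIES_PER_DAY = 500_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def queries_in(seconds, queries_per_day=QUERIES_PER_DAY):
    """Number of queries arriving in `seconds` at a constant daily rate."""
    return queries_per_day * seconds / SECONDS_PER_DAY


per_second = queries_in(1)
six_seconds = queries_in(6)
print(f"{per_second:.0f} queries/second, {six_seconds:.0f} queries in 6 seconds")
# prints "5787 queries/second, 34722 queries in 6 seconds"
```

At that assumed rate, a full year of collection corresponds to fewer than roughly 35,000 queries.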


Hence we have the unfortunate case that industry (Google, Microsoft,
Yahoo, Yandex, Baidu, and the like) have much better data about user searches
than the academic research field does.

Just like the large, proprietary chemistry search system providers
have a much better knowledge of how chemists search than the rest
of us - they actually have the data that most of us don't!

(Well, Craig probably has some of it. :)
Post by John P. Overington
If the queries are published, it could present a major challenge
to subsequent filing of patents of composition of matter, for novel compounds.
That's why you're not supposed to send proprietary structures
to public, unvetted web sites. Do you assume that they have the
same fondness of the patent system as you do? Do you assume that
their internal logs aren't accidentally searchable by the public?
Post by John P. Overington
Computational Chemical Biology
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton, Cambs. CB10 1SD, United Kingdom
Which brings up the observation I had yesterday - what is
EBI's stated policy on user-submitted query data? I can
find nothing about the:

• Categories of recipients to whom query data may be disclosed
• Period for which query data will be stored

Could you look it up and let us know? Else I can email them,
but I figured you're there and can do it more easily than I.


Andrew
***@dalkescientific.com
Craig James
2012-08-29 15:01:04 UTC
Permalink
Post by Andrew Dalke
What percentage of the people actually read those privacy statements?
Very few ... but that's not the point. The point is that it's *caveat
emptor* -- if there's a stated policy and we ignore it, then too bad. But
if there's no stated policy, who knows?

Here's an interesting article: it would take a month every year to read
every single privacy policy that the typical person encounters in a year:


http://www.techdirt.com/articles/20120420/10560418585/to-read-all-privacy-policies-you-encounter-youd-need-to-take-month-off-work-each-year.shtml

And the part that's relevant to this discussion is this:

"The reality is that the incentives of a privacy policy are to not use
it to keep your info private. In fact, the incentives are to make a
privacy policy as permissive as possible. Because the only time you get
in trouble is not if you fail to protect someone's privacy... but if you
violate your own privacy policy."

This suggests (remember, I'm talking social expectations, not law) that
people expect their activities to be private unless they're notified
otherwise.

"Privacy" policies are really just the opposite: they're designed to
*remove* privacy in a way that keeps the web-site operator out of trouble.
Post by Andrew Dalke
Hence we have the unfortunate case that industry (Google, Microsoft,
Yahoo, Yandex, Baidu, and the like) have much better data about user searches
than the academic research field does.
Just like the large, proprietary chemistry search system providers
have a much better knowledge of how chemists search than the rest
of us - they actually have the data that most of us don't!
(Well, Craig probably has some of it. :)
Indeed.

Craig
Stefan Kuhn
2012-08-28 19:09:29 UTC
Permalink
Hi Andrew,
I can't see how the decision you quote says that in the US database works are
not protected. It basically "repeats" the WIPO copyright convention, which
says: "Compilations of data or other material, in any form, which by reason
of the selection or arrangement of their contents constitute intellectual
creations, are protected as such. This protection does not extend to the data
or the material itself and is without prejudice to any copyright subsisting
in the data or material contained in the compilation." So databases are
protected, provided they are "intellectual creations", i.e. are original, and
the protection covers only the database, not its contents. This is the case in
Europe as well, so I can't see a difference here. Btw, your decision is from
1991; the WIPO Copyright Treaty is from 1996 and was subsequently
incorporated into US law. Not being an expert on US law, I think this might
have altered the situation.
With respect to the publication of the search queries: I don't think this is a
matter of privacy; privacy is really not involved as long as the data is anonymous.
Still, the actual structures submitted contain information. There may not be
a legal reason not to publish them (I can't find a convincing one right now),
but I still feel that most people do not expect their structures to show up
in public (except when clearly stated at submission, of course). So I
think you pointed to an important issue. I will put a notice about this in
the privacy statements in the future and I think everybody should do so; the many
sites you list that don't say a word about it show that this is an open issue.
Stefan
Post by Andrew Dalke
Post by Stefan Kuhn
Now I am neither a lawyer nor
an expert on US or common law, but from this, I reckoned that databases
are protected in the US. Can you give me some reason why you think it is
not like this? You may well be right, I am interested in learning.
http://en.wikipedia.org/wiki/Feist_v._Rural
...an important United States Supreme Court case establishing that
information alone without a minimum of original creativity cannot be
protected by copyright
As applied to a factual compilation, assuming the absence of
original written expression, only the compiler's selection
and arrangement may be protected; the raw facts may be copied
at will. This result is neither unfair nor unfortunate. It
is the means by which copyright advances the progress of
science and art.
The standard for such originality is fairly low; for example,
business listings have been found to meet this standard when
deciding which companies should be listed and categorizing
those companies required some kind of expert judgment.
Post by Stefan Kuhn
Btw, to my understanding the compilation of all queries entered into
a website never forms a database in the sense of the WIPO copyright
treaty, because it has not been created by somebody deliberately, but by
accident without a creative intention. So I think such a collection is
not protected anywhere.
Thank you.
Post by Stefan Kuhn
Last thing: If there is no declaration on the BindingDB site that the
query data are in fact public, I would be very careful about publishing the
data. Apart from the legal side (where the absence of some sort of
license does not mean that anything goes, but the contrary, even if many
people don't understand this), I would consider it a gross misuse of the
trust users put into a (scientific) website.
When is it appropriate to collect and publish user-submitted data?
Assume for now that this is non-copyrightable data, so the question
is only one of privacy.
Every guideline I know of says that the main issue is personal privacy.
For example, the EU Data Protection Directive "regulates the processing
of personal data regardless of whether such processing is automated or
not" and the US Fair Information Practice Principles says "Consumers
should be given notice of an entity's information practices before
any personal information is collected from them. Without notice, a
consumer cannot make an informed decision as to whether and to what
extent to disclose personal information."
But anonymized structure queries aren't personal information, so
those guidelines don't really apply.... or do you think/should it
be otherwise?
Who publishes user search terms?
http://www.google.com/trends/hottrends
http://www.alexa.com/siteinfo/chemspider.com
1 chemspider 6.68%
2 search spider database programs 3.15%
3 acetic acid 1.14%
4 water marbles 0.96%
5 5800 0.95%
6 vinylidine difluoride 0.88%
7 c5h5o6 name 0.75%
8 benzyloxy structure 0.69%
9 bfj 12 0.49%
10 h3aso3
I pointed out the NetFlix data set, and the AOL one. There the
problem was lack of full anonymization, but academic research
based on large-scale user-submitted queries is nothing new. Here's
one report from 2001 based on Excite data.
http://comminfo.rutgers.edu/~tefko/JASIST2001.pdf
A MEDLINEplus analysis for 2002-2003
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839623/
"the TRIP database—a meta-search engine covering 150 health
resources including MEDLINE, The Cochrane Library, and a
variety of guidelines"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852632/
I point these out because they all report a small number of highly
reduced user-submitted queries. In all cases, these papers are
used to improve the general understanding of how people do
real-world queries. That's sorely lacking in cheminformatics,
but a need should not drive one to break ethical considerations.
Since those are acceptable (even as with the PDB when the privacy
statement explicitly says "We do not share server log information
with third parties for marketing or other purposes.") then at
what point does something go from acceptable to a "gross
misuse of the trust users put into a (scientific) website"?
This information is used to measure the number of visitors
to the various sections of our site and improve organization,
coverage, system performance or problem areas. This information
is not used for associating search terms or patterns of site
navigation with individual users. When search features offer
suggested terms, these suggestions are based on aggregated data
only. NLM periodically deletes its Web logs. On occasion, NLM
may provide aggregated information to third party entities it
contracts with for the purposes of research analysis. Aggregated
data cannot be linked back to an individual user.
What does "aggregated information" mean? Why can a third-party
get access to the data but not the public? Is that access
acceptable so long as it's only for research analysis?
Okay, so perhaps the problem is the lack of a privacy statement?
Looking at the list of resources from
http://pipeline.corante.com/archives/2012/08/02/public_domain_databases_in_medicinal_chemistry.php#comments
BindingDB - no privacy statement
ChEMBL - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
PubChem - by law they can't reveal anything
Binding MOAD - no privacy statement
ChemSpider - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
DrugBank - no privacy policy
GRAC and IUPHAR-DB - they will keep your personal information private, but
they say *nothing* about anonymous reports of your query data
PDBbind - no privacy policy
PDSP Ki - no privacy policy
Supertarget - no privacy policy
Therapeutic Targets Database - no privacy policy
Zinc - "Thus our bias is to be open", "Limited Privacy Option for DOCK
Blaster jobs", "Results of ZINC upload and subset requests and DOCK Blaster
protonate requests remain on our server for seven days and cannot be PIN
protected", "Any way around these restrictions? / You may request a private
copy of DOCK, DOCK Blaster, and ZINC, and run them on your own servers.
This is subject to licensing from the Regents of the University of
California, and may not be free."
I found no evidence that these chemistry databases have said that
they will keep your anonymous search data private. Not only that,
but ZINC has specifically said that they will make your data public
excepting that you can delete some of your search results before
the public gets access to it!
Who knows, perhaps had BindingDB a privacy policy it would be
more like ZINC's.
Still, let's see if there's something which would make you [Stefan]
and others like Craig happier. What would make for a reasonable
compromise data set?
- Is the problem that the submitted SMILES in the BindingDB data
sets may contain proprietary structures? In that case, would
removing all structures which aren't also in PubChem be acceptable?
That's a bit harsh because someone tuning their search engine also
needs an appropriate number of negatives.
- Is there really a widespread problem of people submitting
proprietary compound information to public servers, with the
expectation that it will be private, or is it mostly an abstract worry?
- Is the problem that the data set is too fine-grained, which means
that someone very clever may detect patterns in what someone else
is doing? Would releasing a randomly selected sample of 1% of the
structures be acceptable? Is there some way to dirty up the structure
to make it be more acceptable? (Rather like what the US Census does
to improve confidentiality protection in their reports.)
- Suppose I was the only person who had access to the BindingDB data
set. I wrote a report about the highly-aggregated results of my
study, but don't release the data. Would that not be contrary to
the principles of "Open Data"?
- What is an acceptable level of complaints? No matter what the
final policy is, someone may say that it's unacceptable. Would
one complaint out of 1,000 users be acceptable? 1 out of 100?
What level of complaint would there be in releasing the full,
unfiltered and anonymous data set?
- Who will make up the institutional review board which decides
if a given report is sufficiently aggregated so as to no longer
be a gross misuse of trust? Do they have the right experience
to judge, guided by past incidents, what is and is not appropriate?
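
A minimal sketch of two of the options in the list above, keeping only already-public structures and releasing a random 1% sample. It assumes the queries are plain SMILES strings compared as exact text (a real filter would need canonical SMILES from a cheminformatics toolkit), and `known_structures` is a hypothetical set of structures already public elsewhere, for example exported from PubChem. The function names are illustrative only.

```python
import random


def keep_known(queries, known_structures):
    """Keep only queries that appear in a set of already-public structures.

    Exact string matching is a simplification; two different SMILES strings
    can describe the same molecule, so real use needs canonicalization.
    """
    return [q for q in queries if q in known_structures]


def sample_fraction(queries, fraction=0.01, seed=None):
    """Return a random sample holding about `fraction` of the queries."""
    if not queries:
        return []
    rng = random.Random(seed)  # a seed makes the sample reproducible
    k = max(1, round(len(queries) * fraction))
    return rng.sample(queries, k)


# Toy data, not taken from the BindingDB logs.
queries = ["CC", "c1ccccc1", "C[N+](C)(C)C"]
known_structures = {"CC", "c1ccccc1"}
public_only = keep_known(queries, known_structures)
one_percent = sample_fraction([f"query-{i}" for i in range(1000)], 0.01, seed=0)
```

Note that the first filter also throws away every query that matches nothing public, which is exactly the loss of negatives mentioned above.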
http://www.microsoft.com/security/online-privacy/prevent.aspx
Privacy policies should clearly explain what data the website
gathers about you, how it is used, shared, and secured, and
how you can edit or delete it. (For example, look at the
bottom of this and every page on Microsoft.com.) No privacy
statement? Take your business elsewhere.
and
• Do not post anything online that you would not want made public.
Surely the research guidelines should be no less strict than
someone looking for cat pictures.
I keep ending up with the conclusion that there is no specific
expectation that submissions sent to an arbitrary web site,
suitably anonymized and untraceable to the originating person
or organization, must be strictly private, and I find no
guidelines which suggest an appropriate intermediate level
of data privacy. And when I do find examples of large data set
releases, the only issues have been the lack of full anonymity;
which is not the problem with this data set.
Why then is the release of this data set a gross misuse of trust?
Cheers,
Andrew
_______________________________________________
Blueobelisk-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
Andrew Dalke
2012-08-28 23:20:41 UTC
Permalink
Post by Stefan Kuhn
I can't see how the decision you quote says that in the US database works are
not protected. It basically "repeats" the wipo copyright convention, which
says: "..." So databases are protected, given they are "intellectual creations",
i. e. are original, and protection only is for the database, not its contents.
Feist was what we studied in my CS ethics class. It was a while
ago, and as you pointed out, before the WIPO treaty.

Perhaps this document, from the US Copyright Office, explains better
the consequences of the Feist decision:
http://www.copyright.gov/reports/db4.pdf

The Supreme Court sounded the death knell for the sweat of
the brow doctrine in Feist Publications v. Rural Telephone Service Co.
In finding a white pages telephone directory to be uncopyrightable,
the Court held that the sole basis for protection under U.S. copyright
law is creative originality.


The Supreme Court decided that:

The Court did not limit its holding to statutory interpretation,
however. It held that “[o]riginality is a constitutional requirement.”

Hence, that requirement cannot be changed by a treaty.

What is protected under copyright is the selection and presentation
of a database, not the contents. For example, see:

http://www.weintraub.com/Publications/The_11th_Circuit_Reminds_All_That_Copyright_Protection_For_Databases_Is_Alive_And_Well


In BUC International, the 11th Circuit considered the way
in which the plaintiff selected, categorized, and presented
certain factual information about yachts listed for sale,
and determined that this was entitled to protection under
the Copyright Act. And while the defendant may have been
entitled to the underlying information, it could not arrange,
organize or display this information in a manner that was
substantially similar to the way in which plaintiff arranged,
organized and displayed the same information. Because the
defendant listed the data in the exact same way, and in the
exact same manner as plaintiff, the 11th Circuit upheld the
trial court's finding of infringement and a damage award in
excess of one million dollars.

While this case does not offer a new twist on Feist, it does
serve as a good reminder of the extent of protectability of
factual compilations. Under Feist a third party may copy
and freely use any factual information contained in a database,
as long as the third party does not use the same selection
and arrangement.

Compare this to the UK database right, where

A property right (“database right”) subsists, in accordance
with this Part, in a database if there has been a substantial
investment in obtaining, verifying or presenting the contents
of the database.
http://www.legislation.gov.uk/uksi/1997/3032/regulation/13/made

I believe that "verifying" would fall under what the US calls
"sweat of the brow", and would not in the US be allowed under copyright.


Here's a 2008 summary on the differences between Feist and WIPO:


Feist struck down the "sweat of the brow" doctrine, which some
courts had used to find copyright protection in databases created
by the industrious efforts of their authors. Under Feist, no
amount of effort or expense incurred in creating a database will
bring about copyright protection unless the database is original
in its selection, coordination, or arrangement.

WIPO and the European countries have always disagreed with Feist.
In 1993, the European Community implemented a directive to
recognize the copyright protection in databases excluded by Feist.
The 1996 WIPO Copyright Treaty obligates its members, including
the U.S., to recognize such copyright protection.

A provision to restore database protection was dropped at the last
minute over objections by members of the scientific and educational
community complaining about abridgements to fair use. The issue
will be taken up by the new Congress.

I have heard of no changes to this in the last 4 years, and I'm
pretty certain that a change which would affect Feist would have been
important enough that I would have heard mention of it.


Andrew
***@dalkescientific.com
Craig James
2012-08-28 23:32:22 UTC
Permalink
Andrew's explanation is exactly right and his citations are great.
The way I've heard it explained on a legal site uses the phone book,
the very case decided by the U.S. Supreme Court:

Merely collecting all phone numbers in a town, no matter how much
work, does not produce a copyrightable database. Arranging it in an
"obvious" way, like sorting alphabetically, also fails the copyright
test.

Collecting all the phone numbers of eligible bachelors in town who are
good looking and honest, then arranging them handsomest to plainest,
requires a great deal of creativity and artistic judgement, and
produces a copyrightable phone book.

Craig
Post by Andrew Dalke
Post by Stefan Kuhn
I can't see how the decision you quote says that in the US database works are
not protected. It basically "repeats" the wipo copyright convention, which
says: "..." So databases are protected, given they are "intellectual creations",
i. e. are original, and protection only is for the database, not its contents.
Feist was what we studied in my CS ethics class. It was a while
ago, and as you pointed out, before the WIPO treaty.
Perhaps this document, from the US Copyright Office, explains better
http://www.copyright.gov/reports/db4.pdf
The Supreme Court sounded the death knell for the sweat of
the brow doctrine in Feist Publications v. Rural Telephone Service Co.
In finding a white pages telephone directory to be uncopyrightable,
the Court held that the sole basis for protection under U.S. copyright
law is creative originality.
The Court did not limit its holding to statutory interpretation,
however. It held that “[o]riginality is a constitutional requirement.”
Hence, that requirement cannot be changed by a treaty.
What is protected under copyright is the selection and presentation
http://www.weintraub.com/Publications/The_11th_Circuit_Reminds_All_That_Copyright_Protection_For_Databases_Is_Alive_And_Well
In BUC International, the 11th Circuit considered the way
in which the plaintiff selected, categorized, and presented
certain factual information about yachts listed for sale,
and determined that this was entitled to protection under
the Copyright Act. And while the defendant may have been
entitled to the underlying information, it could not arrange,
organize or display this information in a manner that was
substantially similar to the way in which plaintiff arranged,
organized and displayed the same information. Because the
defendant listed the data in the exact same way, and in the
exact same manner as plaintiff, the 11th Circuit upheld the
trial courts finding of infringement and a damage award in
excess of one million dollars.
While this case does not offer a new twist on Feist, it does
serve as a good reminder of the extent of protectability of
factual compilations. Under Feist a third party may copy
and freely use any factual information contained in a database,
as long as the third party does not use the same selection
and arrangement.
Compare this to the UK database right, where
A property right (“database right”) subsists, in accordance
with this Part, in a database if there has been a substantial
investment in obtaining, verifying or presenting the contents
of the database.
http://www.legislation.gov.uk/uksi/1997/3032/regulation/13/made
I believe that "verifying" would fall under what the US calls
"sweat of the brow", and would not in the US be allowed under copyright.
Feist struck down the "sweat of the brow" doctrine, which some
courts had used to find copyright protection in databases created
by the industrious efforts of their authors. Under Feist, no
amount of effort or expense incurred in creating a database will
bring about copyright protection unless the database is original
in its selection, coordination, or arrangement.
WIPO and the European countries have always disagreed with Feist.
In 1993, the European Community implemented a directive to
recognize the copyright protection in databases excluded by Feist.
The 1996 WIPO Copyright Treaty obligates its members, including
the U.S., to recognize such copyright protection.
A provision to restore database protection was dropped at the last
minute over objections by members of the scientific and educational
community complaining about abridgements to fair use. The issue
will be taken up by the new Congress.
I have heard of no changes to this in the last 4 years, and I'm
pretty certain that a change which would affect Feist would have been
important enough that I would have heard mention of it.
Andrew
Noel O'Boyle
2012-08-27 19:56:34 UTC
IANAL but I grew up watching Matlock. I would just ask the respective
sources whether they are happy for you to license your dataset under
CC0 (rather than puzzle through the legal ramifications).

- Noel
Post by Andrew Dalke
I recently published the "Structure Query Collection", at
https://bitbucket.org/dalke/sqc
This is a collection of different SMILES and SMARTS used as queries against a small molecule database. I include the original data in as raw a form as I can manage, and a processed form which extracts only the SMILES/SMARTS and may include some cleanup of the original data.
I have been struggling with how to define a license for the SQC data, or even if one is needed.
1) BindingDB
The largest data sets in the collection by far come from BindingDB. This contains almost a decade of user-submitted queries from BindingDB. They were extracted from the log files.
As best as I can tell, in the US there would be no legal protection for this data because there's no creative effort in its extraction. The SMILES are too short and non-notable to have individual protection by the submitter, and the US (see Feist) does not recognize database rights. BindingDB is from the US.
Therefore, I do not believe that those dataset are covered under copyright, patent, trademark or other sui generis rights. (Assuming that I understand the phrase 'sui generic' correctly.)
There is a niggling detail as I live in Sweden. However, I don't think my two days of work could be described as a "substantial investment in either the obtaining, verification or presentation of the contents", which is the text of "Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases". Should I include a disclaimer anyway, and if so, is there a suggested style?
2) SMARTS collections from RDKit
This is probably the easiest to deal with. Two SMARTS data sets derive from files found in RDKit, which is distributed under a BSD/MIT-style license. I trivially transformed the data into a simple list of SMARTS.
I don't believe I need to worry about these two data sets at all because I include the RDKit license in the distribution.
However, is the extracted data set even covered under copyright and/or database rights at all?
3) SMARTS from Ehrlich and Rarey's recent J. Cheminformatics paper
Ehrlich and Rarey published a list of 1235 SMARTS in
Systematic benchmark of substructure search in molecular graphs -
Andrew Dalke
2012-08-27 23:47:01 UTC
Post by Noel O'Boyle
IANAL but I grew up watching Matlock. I would just ask the respective
sources whether they are happy for you to license your dataset under
CC0 (rather than puzzle through the legal ramifications).
That's the $10,000 question, isn't it? Who are the sources?

For BindingDB, is it the people who sketched the Marvin
structures which got converted to SMILES?

Now that I've decided on CC0, I've gone back to Mike Gilson and
asked about using that license for their contribution.


It's been 10 days and I still haven't gotten a reply about
the Ehrlich and Rarey data set ... and in any case, they are
not the original sources of those SMARTS.

Cheers,

Andrew
***@dalkescientific.com
Egon Willighagen
2012-08-28 09:36:55 UTC
Dear Andrew,

On Mon, Aug 27, 2012 at 12:14 PM, Andrew Dalke
Post by Andrew Dalke
3) SMARTS from Ehrlich and Rarey's recent J. Cheminformatics paper
Ehrlich and Rarey published a list of 1235 SMARTS in
Systematic benchmark of substructure search in molecular graphs -