This FAQ document is complementary to the practical use-orientated Help documentation and the About page. This focuses more on the "how, why and wherefore" of our database content.
Our open access paper in the 2014 NAR Annual database issue summarises the database as of Sept 2013. More recently technical background to some of the curatorial issues below is provided in our blog posts, the Newsletter, and recent presentations by team members on SlideShare.
Content expansion and feature enhancements are announced on the front page. Releases are approximately quarterly, designated as increments (e.g. 2014.1 in April, 2014.2 in June and 2014.3 in November).
Yes, "GtoPdb". Note in PubChem our source name is "IUPHAR/BPS Guide to PHARMACOLOGY" (for Substances, Compounds and BioAssays). When you inspect our substance records the ligand ID number is prefixed with "GTPL". The Concise Guide to PHARMACOLOGY snapshot publication in the British Journal of Pharmacology can be abbreviated as "CGTP".
It grew out of the IUPHAR Database (IUPHAR-DB) from 2011. This was described in a series of publications and a Wikipedia entry but now has a re-direct from the original website. The pre-existing website was integrated into GtoPdb and expanded with information from an established series of journal articles called the ‘Guide to Receptors and Channels’ (GRAC). GRAC is now superseded by the "Concise Guide to PHARMACOLOGY" (CGTP) series.
While the earlier publications pre-date the database they were conceived as adjuncts to facilitate reciprocal navigation. All ligand and protein entries from articles designated as NC-IUPHAR reviews now have records in the database. There is obviously more contextual detail in the stand-alone review articles than we could feasibly index but summary information (e.g. for protein families) is not only captured in the database but we also link-out to the relevant articles.
Yes, the British Journal of Pharmacology and its publisher Wiley have been piloting this for some time. You can see examples both in the CGTP 2013/14 series and recent NC-IUPHAR reviews (e.g. this 2014 one on epigenetic pathways).
As an open resource, we encourage re-distribution and value-added integration of our content. We also maintain an expanding collaboration network for reciprocal cross-linking from other high-utility sources (these are listed on our website and we know most of the teams personally). However, they may not always refresh our updated links on their side (e.g. you may still come across the superseded IUPHAR-DB links). This link-refreshing is a major issue in the global database ecosystem (see this blogpost) but we are addressing it by alerting those we know to new GtoPdb releases. Of more concern is where our content has been downloaded or even web-scraped and then re-surfaced into additional databases without contacting us. The problem is certainly not confined to our resource, but we obviously have no control over this. Links can consequently include deprecated records (if you are on the team of such a resource we would be pleased to discuss the technicalities of refreshing our links).
The database is hosted by the University of Edinburgh which is registered as a Charity in Scotland (SC005336). Our major funders are the UK Wellcome Trust (via grant number 099156/Z/12/Z), the British Pharmacological Society (BPS) and International Union of Basic and Clinical Pharmacology (IUPHAR). We also have some unrestricted educational grants from companies. Note we are always open to new sponsors (please email us).
You will note that the downloadable contents of the database have generous licensing terms (share, copy, redistribute, adapt, remix, transform, build upon, including commercially). Notwithstanding, we would request that any parties incorporating the content of GtoPdb into their own work, including their own integrations, should contact us. This is not only for the courtesy of us knowing who-is-doing-what with our funded work but also because we can help with technical aspects. We are currently engaging with the Open PHACTS consortium about integration as well as some commercial information providers and major pharmaceutical companies.
Yes; as described, IUPHAR-DB originated from receptor and channel pharmacology. While GtoPdb continues to maintain this focus, over 2013/14, as a grant-directed objective, it has extended into molecular mechanisms of action (mmoa), with corresponding human protein annotation for new target classes. In addition, there have been selective expansions driven by user interests and specific collaborations. For example, we now have a broad capture of development candidates and research compounds for the treatment of Alzheimer’s Disease. One difference users may notice between old and new records is that receptor or channel ligands often have a range of activity values from different publications judged as pharmacologically relevant (this is not to be confused with +/- ranges reported in individual papers for the same determination). However, for our recent expansion into enzyme drug targets we typically only select one value. Also, most of these newer targets have only one ligand as opposed to many ligands for well-studied receptors and channels. We are remediating older relationship mappings, particularly those without referenced activity values. In addition, we will convert some of our historical peptide entries into more defined molecular representations.
Our first source is peer-reviewed primary literature. We exploit the different feature sets of both UK PubMed Central and NCBI Entrez (note this occasionally gives matches in PubMed Central that are not in PubMed/MeSH) and last but not least, the full range of public databases. We are fortunate to have extensive Journal access via the University of Edinburgh but there are cases where they do not have subscriptions to some we would like to check. The only commercial database we currently use is CAS SciFinder, also courtesy of a University of Edinburgh licence. It should be noted that CAS content is extracted from primary public sources but we occasionally use it to locate entities that are difficult to resolve elsewhere.
We apply selective criteria if there is a choice of either a) multiple references for established ligands and/or well characterised targets or, b) if we need to select more recent ligands for an emerging research target. In both cases, we first go for primary publications in top-ranking journals in the field with detailed SAR, unequivocal resolution of the ligand and target entities (i.e. molecular structures and species for the proteins) as well as defined assay conditions. Note there is a time-shift between primary medicinal chemistry papers from which we can capture in vitro kinetic parameters and later papers on in vivo pharmacology as well as eventual clinical trial results. These can span anywhere between 5-20 years but we select key references across the range. In terms of activity types, we prioritise the more standardised Ki over an IC50 but will include both if they are available. We are aware of the somewhat grey line for the target assignments of binding data for receptors expressed with coupled read-outs in cell-lines (as opposed to purified proteins typically used for soluble enzyme assays). Given the intrinsic value of the cell-binding data, we judge if the mapping to a protein identifier is sufficiently resolved (and the reference details the cell-based assay). For research targets, authors may indicate optimised lead compounds in the paper or, if not, the highest potency is selected. If a ligand is indexed in PDB as the correct target-ligand pair we will obviously prioritise this but also try to find the activity values from earlier papers. In terms of activity types, we generally defer to authors (i.e. if they call it a Kd or an EC50 then so do we) but our target subcommittees may decide to change the annotation in respect of the assay type (as detailed in the IUPHAR guidelines). 
Note we have also been accommodating new proposed ligand nomenclature expansions as exemplified by the IUPHAR recommendations for the nomenclature of receptor allosterism and allosteric ligands.
The vast majority of our ~20K references have PubMed IDs but we do include a small number of reference links for data we think is important to capture but is either in journals not indexed in PubMed, patents, book chapters, slide sets, or meeting abstracts. In the same way as we assess the quality of peer reviewed papers, we also judge these sources on a) credible provenance b) entity resolution and c) a stable URL or DOI. We also set up a literature alert for company code numbers where our initial extraction was only from abstracts or slides. More recently we have extracted data from (and thus linked to) pharmaceutical company open information sheets for repurposing candidates (in lieu of the expected papers). We note the increase in other types of non-PubMed data surfacing (e.g. Open ELNs and Figshare) so groups intending to surface pharmacological data sets via these new routes are welcome to contact us.
We generally select patent references in two cases: a) ligands for which the patent contains the only published SAR, or b) where the data is extensively complementary to that in papers from the same team (because many more analogues are included, along with quantitative data and synthesis descriptions). The big bang of patent chemistry extraction has resulted in the submission of over 15 million patent-extracted structures into PubChem and EBI has taken over the new SureChEMBL resource. Consequently, it is becoming easier for us to resolve structure and data links between papers, database entries and patents. We generally select only those from pharmaceutical companies and academic institutions with an established medicinal chemistry reputation. There are some ligands where we add more speculative mappings (e.g. as a curatorial comment pointer to the patent wherein the lead series can be identified with high probability). This is for cases where a company code name is blinded (i.e. no publicly declared name-to-structure) but pharmacologically important information (including quantitative target binding data) has been disclosed on company websites, in open repurposing lists, or in clinical trials entries. In a few cases we have also been able to exploit patents as a source of target binding data for monoclonal antibodies published studies.
Regardless of author primacy, we do not report significant figures that clearly exceed the variance of experimental assays (i.e. anywhere up to +/- 50%) and therefore we only maximally quote three (i.e. 1%). Note that our log-transformation to pAct also produces three figures. There is a caveat in regard to the surfacing of different figures (for rounded vs as-written) in other literature extraction sources for the same report (e.g. in PubChem BioAssay) but the consequences cannot be detailed here.
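The rounding and log-transformation described above can be illustrated with a short sketch. It is not the database's own code: it assumes that pAct is the negative base-10 logarithm of the activity value expressed in molar units, that input values are supplied in nM, and that "three figures" means three significant figures. The function names `round_sig` and `to_pact` are hypothetical.

```python
import math


def round_sig(value: float, figures: int = 3) -> float:
    """Round a number to the given count of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 3 for 1234.5 and -2 for 0.012.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - exponent)


def to_pact(activity_nm: float, figures: int = 3) -> float:
    """Convert an activity in nM (e.g. a Ki or IC50) to pAct.

    Assumption: pAct = -log10(activity in mol/L), rounded to `figures`
    significant figures.
    """
    if activity_nm <= 0:
        raise ValueError("activity must be a positive concentration")
    molar = activity_nm * 1e-9
    return round_sig(-math.log10(molar), figures)
```

For example, under these assumptions a Ki of 25 nM would be stored as a pAct of 7.6, and an as-written value of 1234.5 nM would be quoted as 1230 nM.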
Yes. This occurs principally via a) our target committees of ~650 global experts, b) the NC-IUPHAR and co-grant holders steering group and c) the University of Edinburgh database team (who are collectively authors on 129 PubMed papers). This manifests itself as an intense and continuously reviewed selection (also factoring-in user feedback) of what to include or leave out and sometimes remove. Populating the database, by definition, seeks to impose structure via entity relationship abstraction from a large unstructured document corpus. However, the complexities and nuances of pharmacology as well as (it must be said) variable publication scientific quality, all mean that brainless rule-parsing (and maximal coverage), are incompatible with our "utility biased" vision. Thus, our rules are implemented more as curatorial guidelines (i.e. we can bend or even break them where it is useful and the database consequences are not too problematic). The challenge is balancing speed and flexibility of capture against the necessary constraints of a formalised data model. We thus make extensive use of curators’ notes (a.k.a. nano-publications) to capture tacit facts as free-text and relationships via cross-pointers that are not indexed fields in the current schema. In the longer term we may accommodate new relationships as necessary. These aspects differentiate us from other valuable resources but with different capture scales and objectives. Some of our choices are pragmatic. For example we will add ligands from the earliest reports of chemical modulators for a novel target (possibly patent-only) even if these are of such low potency and/or specificity to be unpublishable for an established target (e.g. surrogate ligands for orphan receptors). Importantly our annotations are reversible as we remediate and make committee- or user-communicated corrections (note we will add improved ligands as they are published but do not typically remove older ligands with solid citations).
Only for kinases. Matrix data constitutes standardised result sets from large-scale n-ligands x n-proteins parallel assays (also called panel screening or profiling). The ones we surface from DiscoveRx, Reaction Biology and Millipore are valuable for users to access, via a separate table. However, if these results were target-mapped into the main database they would interpose a confounding set of relationships, compared to the stringent mappings we curate from the literature.
Yes, and note you are welcome to forward your own new papers (and we may get back to you for entity clarifications). Our committee members regularly alert us to new content during our target-family update cycles and in between for "hot topics" as announced on the web page (note you will have to permit us to judge the relevance before it goes in the curation queue). We hope you will find our entity resolution guide useful.
This refers to the practice of manually or automatically copying annotation and links between databases without evaluation. We always check the linkages we add and, crucially, also read the papers to check the activity data to enter in the database are correct. In many cases we have drawn chemical structures de novo from papers where no CID matches were available at curation time. This is why over 250 of our CIDs are novel in PubChem. For some quality control tasks we do use computational cross-checking (e.g. to ensure that our source links are concordant within PubChem) but only where we have established the operational consistency of the automated approach.
As an umbrella term, this has a range of meanings. One of them, as detailed in other sections, is associated with drugs to treat human diseases that typically have a data-supported protein target for their mechanism of action. Secondly, ligands are usually selected on the basis of their protein interactions. They can thus be termed "targets" in the wider sense even if they are not encompassed within drug discovery efforts for human diseases. Notably, our targets include those newer receptor-ligand pairings judged as credible by the committees (i.e. de-orphanisation, see PMID 23957221). Thirdly, we perform selected orthologue curation in that we either include non-human binding data, or annotate the rodent reference orthologues. The fourth category (discussed in PMID 21569515) arises from the drug development context of undesirable ligand interactions (sometimes termed "anti-targets"). A database example is that between the withdrawn drug terfenadine and the HERG channel (KCNH2) as a liability target for cardiac toxicity. As a fifth category our ligand mapping extends to functional orphans. By this we mean specific chemical modulation reports for proteins that do not yet have sufficient validation data to be considered bona fide therapeutic drug targets but are being investigated both to establish their normal function and to assess possible causative disease involvement (e.g. Cathepsin A). We have consequently curated ligands directed against kinases, proteases, chromatin modifying enzymes and GPCRs that can be classified as functional orphans. This is clearly a transient categorisation in the context of functional genomics and equally intense efforts to validate new therapeutic targets in both the academic and commercial sectors.
As well as an overview and background reading, target family pages include concise at-a-glance summaries for each target. These describe nomenclature, genes and key ligands including expert-recommended selective tool compounds, endogenous ligands and approved drugs. For the most important proteins (including targets of approved drugs) we are working to include more detailed subcommittee-directed curation of detailed pharmacology, physiology, molecular function, assays, human disease relevance, and clinically significant variants. This includes extended family introductions adapted from review articles.
One of the founding (and continuing) objectives for NC-IUPHAR is to oversee the nomenclature of human receptors and channels so these human protein classes are complete in the database (with the exception of the olfactory and opsin-type GPCRs). More recently, as part of our grant objectives, we have expanded into other families. This includes: transporters, the full complement of kinases, a subset of characterised proteases, other hydrolases as well as enzymes involved in histone modifications. You can find the current families breakdown on the About page which as of November 2014 includes 2708 UniProt identifiers.
Where possible, we resolve the literature reference to a UniProtKB/Swiss-Prot ID as our primary identifier. Note that for non-human species, such as rodents, we restrict these links to the sequences in Swiss-Prot, as these are curated and reviewed. There are many reasons for choosing UniProt primacy but they include a) the utility of the Swiss-Prot canonical philosophy of protein annotation, b) species specificity, c) global reciprocal cross-referencing, d) persistence as an EBI core resource, e) as of 2014, we have collaborative control over our own cross-references and f) we can correct entries via our own feedback to the UniProt team. It is important to note that this choice is protein-centric rather than gene-centric. This dichotomy can cause problems (e.g. Swiss-Prot protein names, HGNC gene names and NC-IUPHAR nomenclature are not completely harmonised) and the mappings between the core tetrad (UniProt, HGNC, Ensembl, and Entrez Gene) are concordant (i.e. have a 1:1:1:1 sequence cross-mapping) for only 18,787 human entries (as of Sept 2014). The other protein and gene resources we link to are listed in Help. Note, for us, these are secondary sources. What this means practically is that we ensure the fidelity of our curated "out" links, but we can neither control their correct reciprocity in pointing back "in" nor between themselves. While ambiguous cases in our database are few, it does affect nearly 1500 human proteins with discordances between the major protein and gene annotation pipelines. Users should also be aware that Swiss-Prot entries can have one-to-many mappings to RefSeq (since the latter are non-canonical). Where we cannot resolve members of a protein family to UniProt IDs from the authors' description (but the paper was judged pharmacologically important) we comment on this ambiguity.
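The idea of a concordant 1:1:1:1 cross-mapping can be sketched as follows. This is a simplified illustration, not how the figures above were derived: it assumes the cross-references are available as rows of four identifiers (UniProt, HGNC, Ensembl, Entrez Gene) and treats a row as concordant only if each of its identifiers occurs in no other row. The function name `concordant_rows` and the placeholder identifiers are hypothetical.

```python
from collections import Counter


def concordant_rows(rows):
    """Return the rows in which every identifier is unique within its column.

    Each row is a tuple (uniprot, hgnc, ensembl, entrez). An identifier that
    appears in more than one row signals a one-to-many mapping, so every row
    containing it is treated as discordant.
    """
    if not rows:
        return []
    # Count how often each identifier occurs in each of the four columns.
    counts = [Counter(column) for column in zip(*rows)]
    return [
        row
        for row in rows
        if all(counts[i][ident] == 1 for i, ident in enumerate(row))
    ]


# Placeholder identifiers, made up for illustration only.
example = [
    ("UP_A", "HG_A", "EN_A", "EG_A"),  # clean 1:1:1:1 mapping
    ("UP_B", "HG_B", "EN_B", "EG_B"),  # same protein mapped to two Ensembl IDs
    ("UP_B", "HG_B", "EN_C", "EG_B"),
]
print(concordant_rows(example))
```

In this toy input only the first row is reported as concordant, because the second protein maps to two different Ensembl identifiers.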
The UniProt entries which link out to GtoPdb entries were selected for "has ligand" relationships of any type. As of 2014 this represents over 2,000 proteins of which around two thirds are from human Swiss-Prot. Note that UniProt out-links to us are now in a new category of "Chemistry" cross-references which includes ourselves, BindingDB, DrugBank and ChEMBL (but note the synchronisation times between these sources are different). We are currently exploring extending selection options (e.g. for the primary targets of approved drugs).
This is a challenge for curation since many targets are heteromeric complexes in vivo (i.e. they consist of multiple UniProt IDs). We include their NC-IUPHAR designations as complexes and provide page links to pharmacology references (as specified in PMID 17329545). However, for ligands we annotate UniProt mappings as 1:1 wherever possible. The mechanistic justification is that, for most complexes at least, the data indicates only one or two proteins participate directly in ligand binding. This more stringent annotation enhances the precision of the database in three ways: a) taking a minimal rather than a maximal target mapping approach (see PMID 24533037), b) restricting targets to those with tractable binding sites and c) making putative ligand binding by homology extrapolation more reliable. For example, we have mapped the current gamma secretase inhibitors to PSEN1 rather than increasing the complexity of our target mapping matrix by adding an additional five UniProt IDs for which there is no evidence of significant inhibitor binding.
Our target pages include biologically significant alternative splicing variants and these have links out to the corresponding RefSeq nucleotide and protein entries. The increasing importance of splice variants in pharmacology is recognised in this recent IUPHAR review (PMID 24670145). Future updates may thus include splice-specific binding data. If the splice variant is clearly defined in the paper we should be able to match this to a Swiss-Prot feature line and a RefSeqNP ID.
We do capture selected pharmacologically important ligand interactions in these domains (e.g. high-affinity transporter binding for some drugs, metabolites with substrate binding data for selected drug targets and certain toxins used as pharmacological tools). However, we leave the broader matrix of molecular interaction capture for these domains to other specialist databases.
Yes. This arises from individual members of the protein families that do not yet have recorded ligand interactions in the database. Note that this absence of ligands always has the "so far" caveat and the numbers of proteins with curated interactions expands with every release.
No, for three reasons: a) while we generally curate human data, where this is unavailable we may include rodent data (and in fewer cases other orthologues such as dog); b) our collation of approved drugs has captured structures for a number of anti-infectives. These may be consolidated by adding molecular mechanisms of action (mmoa) mappings in due course but our current curatorial focus is human; c) there are cases for older drugs (i.e. before target expression-cloning was routine) where human in vitro data was never published so we can only find data for a test animal (e.g. ACE inhibitors with IC50s against the rabbit enzyme).
We could expand to cover well over 2,000 human proteins that have chemical modulation reports in papers or patents that would pass our curation criteria. However, this is a future funding issue.
As explained above we use citable activity data to define a pharmacologically significant molecular interaction. The concept of a primary target is where a ligand has been optimised in a drug discovery context with a distinct molecular mechanism of action (mmoa) (usually measurable in vitro with plausible potency and specificity), and is assumed to be causatively responsible for observed pharmacology in vivo (e.g. the effect of ACE inhibitors in lowering blood pressure is due to substrate-competitive binding). This should not be confused with "target validation" where translation of the primary mmoa into therapeutic efficacy has been clinically demonstrated (but, as we know, many drug candidates with a data-supported mmoa still fail to improve disease outcomes in clinical trials).
Ligand is used as an umbrella term for pharmacologically important small-molecule < > large molecule interactions, but there is no strict size cut-off (i.e. it extends to certain protein-protein interactions such as cytokine to receptor). Ligands are captured in the database because publications (or other data we judge as having adequate provenance) have experimentally characterised their interaction with a protein or macromolecular complex. These interactions are selected as a) being mediated by direct binding (i.e. thermodynamically driven), b) specific (i.e. limited cross-reactivity), c) experimentally measurable, d) result in activity modulation with biochemical consequences e) the mechanistic consequences are pharmacologically relevant and f) we can resolve the ligand identity to a molecular structure (see below). Our high-level classification is divided into endogenous (e.g. metabolites, hormones and cytokines) and exogenous ligands (e.g. drugs, toxins and tool compounds). The deeper categories are in the ligand list tabs.
The basic concept is that a ligand needs to have defined molecular structure (with some exceptions, such as heparin as a fractionated polymer extract). The majority are organic molecules described as chemical entities in a number of formal ways (detailed in Help). We have consequently resolved (and performed automated cross-checking on) over 70% of our ligands to the primary mapping of a PubChem Compound Identifier (CID). This means the structure is defined by the PubChem chemistry rules (documented in their own FAQ). There are many reasons for this choice (which we can detail) but the first is the detailed and transparent relationship mapping between over 60 million CIDs and 350 data sources. A second is our active collaboration with the PubChem team on aspects of our own ligand molecular resolution, BioAssay data mapping, and iterative quality control checking of entries (see our current SID and CID sets). Aside from a small number of inorganic entries, we use peptides and proteins as the two other levels of ligand structural description. These can be mixed, for example where a moderate-size peptide has a CID that defines the backbone with a chemical modification (e.g. C-terminal amidation). Note we also curate the primary sequence string into the record, as well as including the human UniProt ID within which that native sequence has an identical match (many of these correspond to Swiss-Prot cross-references for cleavage-excised peptides). Protein-only ligands are designated by UniProt IDs (e.g. cytokines). Note that mAbs are a special class of ligand since the sequences are usually defined but do not have UniProt IDs (for various reasons). From our collaboration with IMGT mAb-DB we provide pointers to their INN-derived sequence assignments.
The images of small molecules depicted on our site are generated by an online identifier resolver which takes the ligand SMILES as input. This free service from the NCI/CADD group of the National Cancer Institute is built upon the CACTVS software. Note this has an advantage over some other rendering styles in explicitly marking the stereo centres.
We provide a "Similar ligands" tab on ligand pages. These are pre-computed for each release by clustering via a modified sphere-exclusion approach. This is based on similarity of both the properties and structural fingerprints of the molecules. Users can also explore intra-database similarities via the provided substructure and SMARTS pattern-matching searching (see Help). For those ligands that do not display any neighbours in our collection (or even if they do, but you want to extend this into a larger chemical structure space) we recommend using the PubChem "Similar Compounds" link. This will show all CIDs with a pre-computed Tanimoto similarity above 90% and these can be displayed as 2D or 3D clustering (n.b. if these are very large because of many close analogues in PubChem, a higher stringency of related search can be executed, for example 95%).
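For readers unfamiliar with these terms, the sketch below shows a Tanimoto coefficient and a plain greedy sphere-exclusion clustering. It is a simplification under stated assumptions: the approach described above is a modified sphere-exclusion method that also uses molecular properties, whereas this example uses structural fingerprints only, represented as sets of "on" bit positions. The function names, the toy fingerprints and the thresholds are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either set."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fingerprints: dict, threshold: float = 0.9) -> dict:
    """Greedy sphere exclusion over a {ligand name: fingerprint} mapping.

    The first unassigned ligand becomes a cluster centre; every remaining
    ligand whose similarity to that centre is at least `threshold` joins
    the cluster and is excluded from further rounds.
    """
    unassigned = list(fingerprints)
    clusters = {}
    while unassigned:
        centre = unassigned.pop(0)
        members = [
            name
            for name in unassigned
            if tanimoto(fingerprints[centre], fingerprints[name]) >= threshold
        ]
        clusters[centre] = members
        unassigned = [name for name in unassigned if name not in members]
    return clusters


# Toy fingerprints: A and B share 9 of 11 bits, C shares none with either.
toy = {
    "A": frozenset(range(0, 10)),
    "B": frozenset(range(1, 11)),
    "C": frozenset(range(100, 110)),
}
print(sphere_exclusion(toy, threshold=0.8))
```

With the toy data, A and B have a Tanimoto similarity of 9/11 (about 0.82), so they cluster together at a threshold of 0.8 but are separated at 0.9. This mirrors the effect of raising the stringency from 90% to 95% in a larger structure space.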
Given their importance for pharmacology and medicine the problematic divergence in database molecular structures for approved drugs has been pointed out (PMID 20298516). For this reason, we have chosen to use consensus sets compiled from within PubChem as curatorial starting points (described in this poster). This is because an exact chemical structure match between multiple sources is more likely to be correct. However, at only ~900 CIDs this consensus is about 65% of the expected total. Most of these are now curated and include their drug-target relationship mappings. In addition we have front-filled to include new approvals from 2010 up to 2Q 2014. Some back-filling will be explored via the consensus approach. However, since the concept of drug "correctness" is complex and somewhat abstract we have developed stringency guidelines to maximise database utility. These reduce the internal consequences of different external structural representations of the same drugs and the associated splitting of activity mappings. By controlling relationship expansion these simplifications maintain the precision of queries. Our guidelines encompass a range of complexities but two can be illustrated. Since drugs can have many salt forms in PubChem we choose (i.e. normalise to) the parent CID for target and activity mapping since this usually corresponds to the INN name-to-structure mapping. However, records in PubChem BioAssay may map to salt forms whereas inspection of the assay details in a paper indicates the assumption of assigning, for example, an IC50 to the parent molecule is reasonable (e.g. if dilution and pH buffering are used). The other major naming ambiguity and data-splitting problem is stereochemistry. An example is where an approved drug INN is assigned to an enantiomeric mixture (that does not interconvert in vivo) but assay data is mapped to three different molecular representations (i.e. both the R and S isomers and the "flat" form).
In this case, we assign the drug tag to the mixture and map data to this. We then add cross-pointers to the CIDs for the R and S only if data has been reported and/or mapped to them. A well-known example is omeprazole as the mixture and esomeprazole as the S isomer, as separately approved drugs. It is important to note that we include both discontinued and withdrawn drugs (generally superseded by newer drugs) to maximise our capture and cheminformatic analysis of drug sets but these can be filtered out of queries if necessary.
No, because the database is focused on quantitative molecular pharmacology, captured as a ligand-target relationship matrix to facilitate data navigation and mining. It is thus neither a substitute for a British Pharmacopoeia as a national example, nor a Drugs.com type of patient-centric resource. Many substances approved for medicinal purposes would negatively impact the precision of our database if we mapped-in their molecular interactions as "drugs". These include nutraceuticals that are principally metabolites (e.g. the DrugBank "approved drug" entry for NADH lists 144 targets), endogenous hormone replacements and inorganic salts (with the important exception of Lithium). We still face the challenge of finding unique nationally approved drugs that are not FDA- or EMA-listed but we do have some Japan-only examples.
Cases where clinical efficacy is thought to be mediated by multiple mmoas (molecular mechanisms of action) are termed polypharmacology. The archetype for this is a dual inhibitor, such as fasidotrilat that acts on both ACE and NEP. For curation, we will map the most potent cross-reactivity but generally not large SAR result sets. If the author convincingly proposes polypharmacology on the basis of the data, we will assign the ligands as multiple primary targets. The challenge here can be the limited evidence that multiple reported mmoas in vitro are actually translated to synergistic efficacy in clinical trials. Kinases are a particularly difficult example. Since we include the three sets of matrix panel results, as well as selected activity data from individual papers if available, at least the cross-reactivity data is surfaced for users to make their own judgments.
We certainly capture approved drugs and some advanced clinical candidates that have clear evidence of therapeutic effects but whose complete mechanism of action is unknown or remains equivocal (e.g. Lithium). We also have some research compounds that have a phenotypic read-out and/or are pathway-mapped as a partial mechanism of action. These have curator comments indicating this (e.g. CCG-1423).
The majority of our ligand entries are small organic molecules, proteins, unmodified peptides, and smaller unmodified nucleotides or polysaccharides. However, we are well aware that increasing numbers of new therapeutic molecular entities in clinical development are covalently linked permutations of these basic forms. Consequently, we are currently looking at the options. These include assessing HELM, Sugar & Splice, InChI for large molecules and other formal ways of representing hybrid moieties. In addition, we are looking at how companies are adapting their registration systems to handle this. We are also following the new INN, FDA and USAN guidelines being developed, as well as PubChem engagement in this area. In general, we have not added large recombinant protein drugs to the database where these are effectively replacements for endogenous proteins.
There are many examples that do not fit into standard rules for ligand-target relationship mapping. One of these is drug-to-prodrug, for which we specifically introduced a new relationship. Complications arise where we cannot activity-map the drug to the target where the prodrug is inactive. As you can see from the ACE inhibitor examples, the challenge is compounded since both forms are assigned an INN. We make another "rule-bend" where we map both prodrug and drug to the primary target (otherwise, it would become complicated since some prodrugs are active against the target at lower potency). The consequence is thus a slight ligand over-count. However (specifically for ACE inhibitors) this is balanced by some "missing" human target activity mappings (e.g. where only rabbit data was published). We use curators' notes to cross-reference the prodrug > drug ligand relationships. Note we also do this for drug > metabolite relationships where these metabolites were reported as significantly bioactive in their own right. Another important exceptional relationship is our recording of ligand-to-ligand binding interactions in the form of therapeutic monoclonal antibodies (mAbs) and their target interactions with cytokines or receptors.
Experimentalists find it valuable for us to point them to isotopically labelled ligand derivatives reported in the literature as probes. However, if the radiolabel positions are not explicit we can neither represent the molecular structure nor match it to a PubChem CID. We therefore introduced a pragmatic solution for unspecified label positions by duplicating the record of the unlabelled structure in order to link to the reference for the results from using the (unspecified) labelled version. Some of these are being remediated as more radiochemical vendors are submitting PubChem entries.
We do not link to any specific supplier because the 2014 increase in PubChem vendor submissions (currently over 55 million CIDs) means we can no longer maintain curated links. The good news is that ~80% of our CIDs have a vendor match. These are accessible via the "Chemical Vendors" link on the right-hand side of a CID entry.
We use two types of call-outs as on-the-fly queries (as opposed to manually curated out-links). For drug names the searches are executed against PubMed. The main advantage is the rapid return of multiple results that would be impossible to keep up with manually (e.g. 100s of publications on each approved drug). For a standardised INN we offer three levels of PubMed search specificity: titles, titles or abstracts, and clinical. Since INNs are specifically chosen to be "clean", false-positives are rare. While call-outs can produce null-returns (i.e. no hits), this is rare in our implementations. The specificity of returns obviously depends on the fidelity of PubMed indexing (e.g. "clinical" can include reviews of trials as well as primary reports, but this is also useful). Note the returned result shows the syntax of the PubMed query. You can then sort, adapt or extend this with your own edits, such as adding a date cut-off. Another useful example is substituting a company code name for the INN (particularly if the latter is a new one). This extends the literature recall range to publications before the INN was assigned. For chemical structures, simply clicking the InChIKey runs the call-out search against Google, which will instantly return an extensive list of database matches (see PMID 23399051). These will not only include ChemSpider (if there are matches) but you should also see a self-hit (i.e. to us) fairly high in the rankings. Note that searching with the backbone connectivity layer (the first part of the InChIKey) also brings back alternative stereo forms. Note also that if the second part of the Key is UHFFFAOYSA-N, it indicates the absence of stereo information (i.e. the structure is "flat").
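As an illustration of the two call-out types, here is a minimal Python sketch. It is not our site's implementation. The function names, the PubMed base URL and the choice of the field tags [ti], [tiab], [pt] and [dp] are assumptions made for the example, so the actual query syntax shown in a returned PubMed result may differ. The InChIKey helpers rely only on the standard key layout of three hyphen-separated blocks of 14, 10 and 1 characters.

```python
from urllib.parse import urlencode

# Assumed base URL for PubMed web searches (not taken from this FAQ).
PUBMED_SEARCH = "https://pubmed.ncbi.nlm.nih.gov/"

# Second InChIKey block seen when no stereo (or other further) layers are present.
FLAT_BLOCK = "UHFFFAOYSA"


def pubmed_term(name, level):
    """Build a PubMed term for a drug name at one of three specificity levels."""
    quoted = f'"{name}"'
    if level == "title":
        return f"{quoted}[ti]"
    if level == "title_abstract":
        return f"{quoted}[tiab]"
    if level == "clinical":
        # Publication-type filter; what is returned depends on PubMed indexing.
        return f"{quoted}[tiab] AND Clinical Trial[pt]"
    raise ValueError(f"unknown level: {level}")


def pubmed_url(term, year_from=None):
    """Return a search URL, optionally with a publication-date cut-off."""
    if year_from is not None:
        term = f"({term}) AND {year_from}:3000[dp]"
    return PUBMED_SEARCH + "?" + urlencode({"term": term})


def split_inchikey(key):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().upper().split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a 14-10-1 InChIKey: {key!r}")
    return tuple(parts)


def is_flat(key):
    """True if the second block signals that no stereo information is encoded."""
    return split_inchikey(key)[1] == FLAT_BLOCK


def same_connectivity(key_a, key_b):
    """True if two keys share the first (backbone connectivity) block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


if __name__ == "__main__":
    print(pubmed_url(pubmed_term("esomeprazole", "clinical"), year_from=2010))
    # Aspirin has no stereocentres, so its key carries the flat second block.
    print(is_flat("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Swapping a company code name for the INN is then just a matter of passing a different name to pubmed_term, and same_connectivity can be used to group a mixture with its separately registered stereoisomers.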