How to add markup to IDP resources
In this how-to, we guide you through the steps needed to produce JSON-LD markup describing your own IDP resource using a Bioschemas profile.
1. Overview
This tutorial introduces the steps needed to add Bioschemas markup to an Intrinsically Disordered Protein (IDP) Community resource, with a detailed description of Bioschemas profiles, their format and their deployment on web pages. Adding a sitemap to the web site and registering persistent identifiers for the resource's data records complete the markup of a resource.
By following the instructions below, you will learn how to implement Bioschemas markup, which aids the findability and interoperability of resource data.
IDP resources with Bioschemas markup will benefit from being included in the IDPcentral registry, which will act as a domain search engine covering all community resources.
2. Bioschemas profiles for IDP resources
Main IDP resources are primary or aggregating databases describing aspects of IDPs. As such, they are marked up with Bioschemas as a typical database using three profiles:
- DataCatalog, a profile informing on the site providing the data
- Dataset, a profile describing the data releases from the site
- Protein, a profile describing data records, which can be supplemented, among others, with:
  - SequenceAnnotation and SequenceRange to denote annotations on the protein sequence
  - ScholarlyArticle to denote publications describing protein annotations
Every resource is primarily described with a DataCatalog profile, specifying the provider of the resource, its version, license, keywords, description, format and so on.
In the DataCatalog profile you will find one or more Dataset profiles with their own version, license, keywords and description.
For example, in the resource UniProt, the DataCatalog is the UniProt knowledgebase, while the Datasets are Swiss-Prot and TrEMBL.
3. Format and placement of Bioschemas profiles
Format
Bioschemas is an extension of schema.org and therefore uses the same formats for embedding markup in web pages: Microdata, RDFa and JSON-LD. The Bioschemas community mainly uses JSON-LD, so you are recommended to use this format as well.
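For markup consumers, embedded JSON-LD can be read back from a page with the Python standard library alone. The following is a minimal sketch rather than a reference implementation; the class name JsonLdExtractor is hypothetical, and it assumes every JSON-LD script on the page contains valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Only scripts typed as JSON-LD are of interest.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False
```

After calling feed() with the HTML of a page, the documents attribute holds one parsed object per JSON-LD script found.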
Placement
When placing profiles in your resource, follow these best practices:
- DataCatalog and Dataset profiles must be included on the resource home page exclusively
- the Protein profile must be included only on entry pages, i.e. pages holding data records
- all other pages (About, Help and others) should be void of markup.
While developing your resource, find a way to add and remove profiles from individual web pages, especially in single page applications. Within a page, you should place the markup as a JSON-LD script in the document head with an id attribute (one for the DataCatalog and Dataset schema and one for the data record schema). With id attributes in place, it is relatively simple to add or remove the element containing the markup depending on the web page.
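The placement rules above could be applied on the server side roughly as follows. This is a sketch under stated assumptions: the function names and the element ids bioschemas-catalog and bioschemas-record are hypothetical, and the page kinds "home" and "entry" stand in for however your application distinguishes its pages.

```python
import json
from typing import Optional


def render_jsonld_script(markup: dict, element_id: str) -> str:
    """Serialise a profile as a JSON-LD <script> element for the document head."""
    body = json.dumps(markup, indent=2, ensure_ascii=False)
    # Escape "</" so that a string value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json" id="{element_id}">\n{body}\n</script>'


def head_markup(page_kind: str, catalog: dict, record: Optional[dict] = None) -> str:
    """Return the markup for one page, following the placement rules."""
    if page_kind == "home":
        # DataCatalog (with its Dataset profiles) goes on the home page only.
        return render_jsonld_script(catalog, "bioschemas-catalog")
    if page_kind == "entry" and record is not None:
        # Data record profiles go on entry pages only.
        return render_jsonld_script(record, "bioschemas-record")
    # About, Help and all other pages carry no markup.
    return ""
```

In a single page application the same ids can be used on the client side to locate and replace the script element when the route changes.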
4. Example profiles for IDP resources
DataCatalog
{
"@context": "https://schema.org/",
"@type": "DataCatalog",
"@id": "https://disprot.org/#DataCatalog",
"http://purl.org/dc/terms/conformsTo": {
"@type": "CreativeWork",
"@id": "https://bioschemas.org/profiles/DataCatalog/0.3-RELEASE-2019_07_01"
},
"sameAs": "https://registry.identifiers.org/registry/disprot",
"url": "https://disprot.org/",
"identifier": "https://registry.identifiers.org/registry/disprot",
"name": "DisProt, The database of intrinsically disordered proteins",
"description": "DisProt is a database of…",
"datePublished": "2019-09",
"dateModified": "2021-08",
"citation": {
"@type": "ScholarlyArticle",
"@id": "https://doi.org/10.1093/nar/gkz975",
"name": "DisProt: intrinsic protein disorder annotation in 2020",
"url": "https://doi.org/10.1093/nar/gkz975",
"sameAs": [
"https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz975/5622715",
"https://pubmed.ncbi.nlm.nih.gov/31713636/"
]
},
"keywords": [
"IDP",
"IDPs",
…
],
"sourceOrganization": [
{
"@type": "Organization",
"@id": "https://biocomputingup.it/#Organization",
"http://purl.org/dc/terms/conformsTo": {
"@id": "https://bioschemas.org/profiles/Organization/0.2-DRAFT-2019_07_19",
"@type": "CreativeWork"
},
"description": "University of Padua, Department of …",
"name": "BioComputing UP, Department of …",
"legalName": "University of Padua",
"sameAs": "https://biocomputingup.it"
}
],
"provider": [
{
"@type": "Person",
"givenName": "Silvio",
"familyName": "Tosatto",
"identifier": "https://orcid.org/0000-0003-4525-7793",
"name": "Silvio Tosatto",
"email": "user@domain.org",
"url": "https://biocomputingup.it/people/silvio"
}
],
"encodingFormat": [
"text/html",
"application/json"
],
"license": {
"@type": "CreativeWork",
"@id": "https://creativecommons.org/licenses/by/4.0/",
"name": "Creative Commons CC4 Attribution",
"url": "https://creativecommons.org/licenses/by/4.0/"
},
"dataset": {
"@type": "Dataset",
…
}
}
Dataset
{
"@type": "Dataset",
"@id": "https://disprot.org/#2021-08",
"http://purl.org/dc/terms/conformsTo": {
"@id": "https://bioschemas.org/profiles/Dataset/0.3-RELEASE-2019_06_14",
"@type": "CreativeWork"
},
"includedInDataCatalog": {
"@id": "https://disprot.org/#DataCatalog"
},
"url": "https://disprot.org/",
"dateModified": "2021-08",
"version": "8.3",
"name": "DisProt",
"description": "DisProt is …",
"identifier": "https://disprot.org/#2020-12",
"keywords": [
"IDP",
"IDPs",
…
],
"creator": {
"@id": "https://biocomputingup.it/#Organization"
},
"license": {
"@type": "CreativeWork",
"@id": "https://creativecommons.org/licenses/by/4.0/",
"name": "Creative Commons CC4 Attribution",
"url": "https://creativecommons.org/licenses/by/4.0/"
}
}
You can include the whole Dataset profile in the DataCatalog JSON object under the dataset key (lines 66-69 of the DataCatalog example).
Deployed DataCatalog and Dataset markup can be seen on the MobiDB, DisProt and PED resource landing pages.
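Embedding Dataset profiles under the dataset key can be done when the markup is generated. The sketch below is illustrative only: the function name embed_datasets is hypothetical, and filling in includedInDataCatalog when it is missing is an assumption modelled on the Dataset example above.

```python
import copy
from typing import List


def embed_datasets(catalog: dict, datasets: List[dict]) -> dict:
    """Return a copy of the DataCatalog with its Dataset profiles embedded."""
    merged = copy.deepcopy(catalog)
    embedded = []
    for dataset in datasets:
        item = copy.deepcopy(dataset)
        # Point each Dataset back at the catalog unless it already does so.
        item.setdefault("includedInDataCatalog", {"@id": catalog["@id"]})
        embedded.append(item)
    # A single Dataset is written as an object, several as a list.
    merged["dataset"] = embedded[0] if len(embedded) == 1 else embedded
    return merged
```

The inputs are left unmodified, so the same Dataset dictionaries can be reused when rendering other pages.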
Data record profile: Protein
When data records describe a single protein entity, you can use a Bioschemas Protein 0.11 profile, as in the following example:
{
"@context": "https://schema.org",
"includedInDataset": "https://disprot.org/#2021-08",
"@type": "Protein",
"@id": "https://disprot.org/DP00003",
"http://purl.org/dc/terms/conformsTo": {
"@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE",
"@type": "CreativeWork"
},
"identifier": "https://identifiers.org/disprot:DP00003",
"sameAs": "http://purl.uniprot.org/uniprot/P03265",
"name": "DNA-binding protein",
"taxonomicRange": {
"@type": "DefinedTerm",
"termCode": "28285",
"url": "http://purl.bioontology.org/ontology/NCBITAXON/28285",
"sameAs": [
"http://purl.uniprot.org/taxonomy/28285",
"https://identifiers.org/taxonomy:28285",
"http://purl.obolibrary.org/obo/NCBITaxon_28285"
],
"inDefinedTermSet": {
"@type": "DefinedTermSet",
"name": "NCBI taxon",
"url": "https://bioportal.bioontology.org/ontologies/NCBITAXON"
}
},
"hasBioPolymerSequence": "MASREEEQRET…",
"hasSequenceAnnotation": [
{
"@type": "SequenceAnnotation",
"@id": "https://disprot.org/DP00003#disorder-content",
"http://purl.org/dc/terms/conformsTo": {
"@id": "https://bioschemas.org/profiles/SequenceAnnotation/0.7-DRAFT",
"@type": "CreativeWork"
},
"sequenceLocation": {
"@type": "SequenceRange",
"rangeStart": 1,
"rangeEnd": 529
},
"additionalProperty": {
"@type": "PropertyValue",
"name": "Protein disorder content",
"propertyID": {
"@id": "https://disprot.org/assets/data/IDPO_v0.2.owl#IDPO:00499"
},
"value": 0.09829867674858223
}
},
{
"@type": "SequenceAnnotation",
"@id": "https://disprot.org/DP00003r002",
"http://purl.org/dc/terms/conformsTo": {
"@id": "https://bioschemas.org/profiles/SequenceAnnotation/0.7-DRAFT",
"@type": "CreativeWork"
},
"sequenceLocation": {
"@type": "SequenceRange",
"rangeStart": 294,
"rangeEnd": 334
},
"additionalProperty": [
{
"@type": "PropertyValue",
"name": "Term",
"value": {
"@type": "DefinedTerm",
"@id": "https://disprot.org/assets/data/IDPO_v0.2.owl#IDPO:00076",
"inDefinedTermSet": {
"@type": "DefinedTermSet",
"@id": "https://disprot.org/assets/data/IDPO_v0.2.owl",
"name": "IDP ontology"
},
"termCode": "IDPO:00076",
"name": "disorder"
}
}
],
"subjectOf": {
"@type": "ScholarlyArticle",
"@id": "https://identifiers.org/pubmed:8632448"
}
}
]
}
Note the presence of two SequenceAnnotation profiles:
- the first one (lines 30-50) shows how to annotate a protein region (in this case the whole protein) with an ontology term which has a numerical value
- the second one (lines 51-84) shows how to apply an ontology term to a part of a protein (defined by SequenceRange).
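An annotation of the second kind can be assembled from a region and an ontology term. This is a minimal sketch whose property layout follows the Protein example above; the function name region_annotation is hypothetical, and the check that positions are 1-based with start not after end is an assumption.

```python
def region_annotation(annotation_id, start, end, term_id, term_code, term_name):
    """Build a SequenceAnnotation that applies one ontology term to a region."""
    # Assumption: residue positions are 1-based and inclusive.
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    return {
        "@type": "SequenceAnnotation",
        "@id": annotation_id,
        "sequenceLocation": {
            "@type": "SequenceRange",
            "rangeStart": start,
            "rangeEnd": end,
        },
        "additionalProperty": [
            {
                "@type": "PropertyValue",
                "name": "Term",
                "value": {
                    "@type": "DefinedTerm",
                    "@id": term_id,
                    "termCode": term_code,
                    "name": term_name,
                },
            }
        ],
    }
```

The conformsTo, inDefinedTermSet and subjectOf properties shown in the full example are omitted here for brevity and can be added to the returned dictionary.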
All data records (like the Protein record) must link back to a Dataset profile through the includedInDataset property, as shown in line 3. This will ensure the correct assignment of data records to one or more datasets.
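The link between data records and datasets can be checked before markup is published. The sketch below assumes that includedInDataset holds either a URL string, an object with an @id, or a list of these; the function names are hypothetical.

```python
def dataset_ids(value):
    """Normalise an includedInDataset value to a list of @id strings."""
    items = value if isinstance(value, list) else [value]
    ids = []
    for item in items:
        if isinstance(item, str):
            ids.append(item)
        elif isinstance(item, dict) and "@id" in item:
            ids.append(item["@id"])
    return ids


def links_to_known_dataset(record: dict, known_ids: set) -> bool:
    """True if the record references at least one dataset, all of them known."""
    linked = dataset_ids(record.get("includedInDataset", []))
    return bool(linked) and all(dataset_id in known_ids for dataset_id in linked)
```

Here known_ids would be the set of @id values of the Dataset profiles published on the resource home page.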
When data records describe protein complexes, ensembles or other biochemical entities, you can use a list of multiple Protein profiles as in the following example:
{
"mainEntity": {
"@type": "ItemList",
"numberOfItems": 3,
"itemListElement": [
{
"@type": "Protein",
…
},
…
]
}
}
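The wrapper shown above can be generated from a list of Protein profiles so that numberOfItems always matches the list. This is a sketch only; the function name protein_item_list is hypothetical, and rejecting items that are not typed as Protein is an assumption.

```python
from typing import List


def protein_item_list(proteins: List[dict]) -> dict:
    """Wrap several Protein profiles in an ItemList under mainEntity."""
    for protein in proteins:
        if protein.get("@type") != "Protein":
            raise ValueError("every list element must be a Protein profile")
    return {
        "mainEntity": {
            "@type": "ItemList",
            # Derived from the list, so the count cannot drift out of sync.
            "numberOfItems": len(proteins),
            "itemListElement": list(proteins),
        }
    }
```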
You are encouraged to credit a publication for all protein annotations described in data records by using a ScholarlyArticle schema.org type. An example can be seen in the Protein profile, lines 80-83.
5. Web resource site map
For each of your resources, you are required to generate a sitemap file: a listing of all the pages of your web resource that are accessible to crawlers, validators or scrapers. This file enables them to crawl your resource efficiently.
To create your sitemap, place a file named sitemap.xml at the root of the domain. A simple XML sitemap looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.net/</loc>
<lastmod>2009-09-22</lastmod>
</url>
<url>
<loc>http://www.example.net/help</loc>
<lastmod>2009-09-22</lastmod>
</url>
<url>
<loc>http://www.example.net/data/1</loc>
<lastmod>2009-09-22</lastmod>
</url>
</urlset>
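A sitemap of this shape can be generated rather than written by hand. The following sketch uses Python's standard xml.etree.ElementTree module (the xml_declaration argument requires Python 3.8 or later); the function name build_sitemap and the (URL, date) input format are assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Serialise the sitemap namespace as the default namespace (no prefix).
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(pages):
    """Serialise (url, last_modified) pairs as a sitemap.xml document (bytes)."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # lastmod is expected as a date string such as "2009-09-22".
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

The returned bytes can be written directly to a sitemap.xml file at the root of the domain.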
Sitemaps are limited to 50,000 URLs each. If your resource has more web pages or data records than that, you should use a sitemap index file, which can reference up to 50,000 individual sitemaps. A sitemap index file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://website.org/sitemap1.xml.gz</loc>
</sitemap>
<sitemap>
<loc>https://website.org/sitemap2.xml.gz</loc>
</sitemap>
</sitemapindex>
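For large resources, the list of page URLs can be split into chunks of at most 50,000 and an index generated for the resulting files. This is a sketch under assumptions: the function names are hypothetical, and the sitemap file URLs passed to the index are whatever naming scheme you choose for your own site.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # URLs per sitemap, and sitemaps per index

ET.register_namespace("", SITEMAP_NS)


def split_into_sitemaps(urls, size=MAX_ENTRIES):
    """Split a list of page URLs into chunks that each fit in one sitemap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Serialise a sitemap index pointing at the individual sitemap files."""
    if len(sitemap_urls) > MAX_ENTRIES:
        raise ValueError("a sitemap index can reference at most 50,000 sitemaps")
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```

Each chunk returned by split_into_sitemaps would be written out as its own sitemap file, and the URLs of those files passed to build_sitemap_index.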
6. Persistent identifiers for data records
Bioschemas promotes the use of persistent identifiers for data records in life sciences. You are encouraged to register a unique URI for every data record within your resources. You can register your persistent identifiers by requesting a namespace in Identifiers.org.
Each persistent identifier has:
- a namespace which uniquely identifies the data collection
- a namespace suffix, identifying the data record within the collection.
Persistent identifiers are used throughout Bioschemas markup whenever possible (e.g. lines 10 and 19 in the data record profile example).
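The two parts of an identifier map directly onto the URLs used in the examples, such as https://identifiers.org/disprot:DP00003. The sketch below composes and splits such URLs; the function names are hypothetical, and the check that a namespace is lowercase and contains no colon is an assumption rather than the official Identifiers.org rule.

```python
RESOLVER = "https://identifiers.org/"


def compact_identifier_url(namespace: str, suffix: str) -> str:
    """Join a namespace and a record suffix into a resolvable URL."""
    # Assumption: namespaces are lowercase and contain no colon.
    if not namespace or namespace != namespace.lower() or ":" in namespace:
        raise ValueError(f"unexpected namespace: {namespace!r}")
    if not suffix:
        raise ValueError("the record suffix must not be empty")
    return f"{RESOLVER}{namespace}:{suffix}"


def split_identifier_url(url: str):
    """Split a resolver URL back into (namespace, suffix)."""
    if not url.startswith(RESOLVER):
        raise ValueError(f"not an identifiers.org URL: {url!r}")
    namespace, _, suffix = url[len(RESOLVER):].partition(":")
    if not namespace or not suffix:
        raise ValueError(f"missing namespace or suffix: {url!r}")
    return namespace, suffix
```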
Keywords: schemaorg, markup, structured data, bioschemas, ELIXIR IDP Community
Topics:
Audience:
- (Markup provider, Markup consumer) People interested in adding Bioschemas markup to their own IDP resource
Authors:
License: CC-BY 4.0
Version: 1.0
Last Modified: 25 January 2022