Alasdair J G Gray

Connecting the dots in the World's data

New Paper: BioHackathon 2020 report on using Bioschemas to populate a community registry

At last year’s virtual BioHackathon, my main focus was scraping Bioschemas markup from web pages about proteins known to be intrinsically disordered and using that data to populate a registry (IDPcentral). We have refined the process and the markup quite a bit since the hackathon. The result is one notebook that transforms the scraped data files into a consolidated knowledge graph, and a second notebook that runs some simple analysis queries, including the HCLS Dataset Description metadata statistics queries.

  1. Exploiting Bioschemas Markup to Populate IDPcentral

    Abstract: One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community’s specialist data sources, such as DisProt, MobiDB, and Protein Ensemble Database (PED), so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data. At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose. The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for the proteins. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated. As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.

    Gray, Alasdair J G and Papadopoulos, Petros and Mičetić, Ivan and Hatos, András

    Technical Report. BioHackrXiv. June 2021
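
To make the harvest-then-consolidate idea from the abstract a little more concrete, here is a minimal Python sketch. It is not BMUSE and it is not the code in our notebooks; it only illustrates the general shape of the problem. The assumptions are mine: the markup is embedded as JSON-LD in a `script` element, each page describes one `Protein`, and the sources agree on a shared cross-reference carried in a single-valued `sameAs` property. The `example.org` URLs, the `page` helper, and the `consolidate` function are all hypothetical.

```python
import json
from collections import defaultdict
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def consolidate(pages):
    """Merge protein markup from several sources into one entry per protein.

    `pages` is an iterable of (page_url, html) pairs. Entries are keyed on the
    `sameAs` value (ASSUMPTION: a single string shared across sources), and
    each entry records which page and local identifier the data came from.
    """
    proteins = defaultdict(lambda: {"names": set(), "sources": []})
    for page_url, html in pages:
        parser = JsonLdExtractor()
        parser.feed(html)
        for doc in parser.documents:
            if doc.get("@type") != "Protein":
                continue
            shared_id = doc.get("sameAs")
            if not isinstance(shared_id, str):
                continue  # nothing to align on in this simplified sketch
            entry = proteins[shared_id]
            if "name" in doc:
                entry["names"].add(doc["name"])
            # Keep provenance: a link back to where this piece of data originated.
            entry["sources"].append({"page": page_url, "local_id": doc.get("@id")})
    return dict(proteins)


def page(local_id):
    """Build a hypothetical protein page with embedded markup (illustration only)."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Protein",
        "@id": local_id,
        "name": "Example protein",
        "sameAs": "https://example.org/shared/X1",
    }
    return (
        '<html><head><script type="application/ld+json">'
        + json.dumps(markup)
        + "</script></head><body></body></html>"
    )


# Two made-up sources describing the same protein under different local identifiers.
PAGES = [
    ("https://example.org/source-a/A1.html", page("https://example.org/source-a/A1")),
    ("https://example.org/source-b/B7.html", page("https://example.org/source-b/B7")),
]
registry = consolidate(PAGES)
```

In this toy run both pages collapse into a single `registry` entry that lists two sources. Real markup is messier: `sameAs` may hold several values or be missing altogether, which is exactly why the consolidation step needs care.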

About Me


I'm an Associate Professor in Computer Science at Heriot-Watt University. My research focuses on linking datasets.
