New Workshop Paper: SPaR.txt
SPaR.txt is the result of a recent pilot project that I conducted with my colleague Ioannis Konstas, together with Bimal Kumar and Richard Watson from Northumbria University. We are investigating how NLP techniques can be combined with Knowledge Graphs to extract requirements from the UK’s building regulations. SPaR.txt, developed by our RA Ruben Kruiper, extracts terms while requiring little training. We demonstrated this on the Scottish building regulations, since they are openly available in a machine-processable format. It is a first stepping stone towards automating the extraction of requirements from regulatory texts.
SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts
Abstract: Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such training data has led to research that focuses on small sub-tasks, such as shallow parsing or the extraction of a limited subset of rules. This study introduces a shallow parsing task for which training data is relatively cheap to create, with the aim of learning a lexicon for ACC. We annotate a small domain-specific dataset of 200 sentences, SPaR.txt, and train a sequence tagger that achieves 79.93 F1-score on the test set. We then show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents, and that both contiguous and discontiguous Multi-Word Expressions (MWE) are discovered with reasonable accuracy (70.3%).
Kruiper, Ruben and Konstas, Ioannis and Gray, Alasdair J G and Sadeghineko, Farhad and Watson, Richard and Kumar, Bimal
In Natural Legal Language Processing workshop, EMNLP, Punta Cana, Dominican Republic, pages 129-143, Association for Computational Linguistics, 2021
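For readers unfamiliar with how a sequence tagger is scored, the sketch below shows one common way to compute a span-level F1-score from BIO tags. It is an illustration only, not the evaluation code used in the paper: the `TERM` label, the example sentence, and the function names `bio_to_spans` and `span_f1` are assumptions made for this post, and the sketch only handles contiguous spans, so it does not cover the discontiguous Multi-Word Expressions mentioned in the abstract.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, label); hypothetical representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect labelled contiguous spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if inside:
            continue  # the current span keeps growing
        if label is not None:
            spans.add((start, i, label))  # close the open span
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray "I-" without a matching "B-" is treated as a new span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1: a predicted span counts only if its
    boundaries and label both agree with a gold span."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example sentence and tags, for illustration only.
    tokens = ["The", "external", "wall", "must", "resist", "fire"]
    gold = ["O", "B-TERM", "I-TERM", "O", "O", "B-TERM"]
    pred = ["O", "B-TERM", "I-TERM", "O", "O", "O"]
    print(round(span_f1(gold, pred), 3))
```

In this made-up example the tagger finds one of the two gold spans and predicts nothing spurious, so precision is 1.0, recall is 0.5, and the F1-score is about 0.667.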
About Me
I'm an Associate Professor in Computer Science at Heriot-Watt University. My research focuses on linking datasets.