Electronic Thesis and Dissertation Repository

Automatic extraction of requirements-related information from regulatory documents cited in the project contract

Sara Fotouhi, The University of Western Ontario

Abstract

[Context and motivation] Project contracts for building a system contain a large number of cross-references to regulatory documents such as environmental regulations, quality standards, and regulatory “codes”. The system being developed must comply with the regulatory requirements in these documents, so a domain expert needs to read and interpret the relevant regulatory documents. [Problem] In large projects this can be an arduous and time-consuming task, because the relevant regulatory requirements may be scattered across numerous regulatory documents. [Principal idea and novelty] The text before or after an external cross-reference in a contract contains information that can assist in automatically locating relevant information in the target regulatory documents. This study used dependency parsing, part-of-speech tagging, and regular expressions to extract the Target Phrase, which is the contract text that refers to more elaborate content in the cited external document, and the target position, which is the location of the referenced text within the external document. The study then ran a search using Elasticsearch and its Query DSL to retrieve relevant information from the cited legal documents and standards. [Research Contribution] This thesis describes a software solution that, to our knowledge, is the first to automatically extract requirements-related information from external documents cross-referenced in the contract. [Conclusion] For each regulatory requirement, the final output displays the relevant text, the content of the relevant pages, and the page numbers, ordered by relevance score. For Target Phrase extraction, we obtained Precision = 0.81, Recall = 0.98 and F-measure = 0.89. For target position extraction, we obtained Precision = 1 and Recall = 1. Automatically extracting the relevant information from disparate sources can save considerable time and reduce the workload of requirements analysts and domain experts.
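The two steps named in the abstract, locating a target position with a regular expression and then searching the cited document with an Elasticsearch query, can be illustrated with a minimal sketch. It is not the thesis implementation: the locator keywords in the pattern, the field name `content`, the sample sentence, and the function names are all assumptions made for illustration, and the dependency-parsing and part-of-speech steps used for Target Phrase extraction are omitted. The last function only recomputes the reported F-measure from the reported precision and recall.

```python
import re

# Assumed pattern: a locator keyword followed by a dotted number, e.g. "Section 4.2.1".
# The keyword list is illustrative, not the one used in the thesis.
TARGET_POSITION = re.compile(
    r"\b(Section|Clause|Article|Part|Chapter|Table|Annex|Appendix)\s+(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def extract_target_position(sentence):
    """Return the first locator found in a contract sentence, or None."""
    match = TARGET_POSITION.search(sentence)
    if match is None:
        return None
    return f"{match.group(1).title()} {match.group(2)}"


def build_query(target_phrase, target_position=None):
    """Build an Elasticsearch Query DSL request body as a plain dict.

    The field name "content" is a hypothetical mapping for the page text.
    Hits come back ordered by relevance score, which is the default sort.
    """
    bool_query = {"must": [{"match": {"content": target_phrase}}]}
    if target_position:
        # A matching locator raises the score but is not required.
        bool_query["should"] = [{"match_phrase": {"content": target_position}}]
    return {"query": {"bool": bool_query}}


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical contract sentence, used only to exercise the functions.
sentence = "Noise levels shall comply with Section 4.2.1 of the Environmental Protection Act."
position = extract_target_position(sentence)
body = build_query("noise levels", position)
```

The resulting `body` dict could be passed to an Elasticsearch search request against an index of the cited documents. Putting the locator in a `should` clause is one possible design choice: pages that mention the cited section rank higher, while pages that match only the Target Phrase are still returned.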