
Identifying External Cross-references using Natural Language Processing (NLP)
Abstract
[Context and motivation] Software engineers build systems that must comply with relevant regulations. These regulations are stated in authoritative documents from which regulatory requirements need to be elicited. Project contracts contain cross-references to these regulatory requirements in external documents. [Problem] Exploring and identifying the regulatory requirements in voluminous textual data is time-consuming, and hence costly, and error-prone in sizable software projects. [Principal idea and novelty] We use Natural Language Processing (NLP), pattern recognition, and web scraping techniques to automatically extract external cross-references from contractual requirements and to prepare a map that relates each contractual requirement to its external cross-references. This map is also automatically extended to the World Wide Web by following previously identified references that are not located in local resources. The novel aspects of our approach are: (i) a taxonomy of semantic cues for identifying cross-references, (ii) a taxonomy of grammatical structures for supporting various combinations of word roles in a sentence, (iii) APA standards for validating cross-references, and (iv) third-party access for unavailable resources. [Research contribution] The key research contribution is a tool that implements these techniques to identify cross-references in contractual documents, related regulatory documents, and the web. The tool produces high-level and detailed views of cross-references among documents that various stakeholders can use for project management, requirements elicitation, testing, and other purposes. We anticipate that this would save much of the time and effort needed to perform this task manually in contractual projects. [Conclusion] The cross-references produced by the tool suggest a precision of 99% and a recall of 87% on contractual requirements. Further work is identified.
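As a rough illustration of the cue-based extraction idea summarized above, the following minimal Python sketch maps each contractual requirement to the external documents it cites. It is not the tool described in this paper: the cue list, the document-identifier pattern, the function name `extract_cross_references`, and the sample requirement texts are all assumptions made for illustration, and a plain regular expression stands in for the NLP and grammatical-structure analysis.

```python
import re

# Hypothetical semantic cues (illustrative only; not the paper's taxonomy).
# The scoped (?i:...) flag makes only the cue phrases case-insensitive.
CUES = r"(?i:in accordance with|as specified in|as defined in|comply with|pursuant to)"

# Hypothetical identifier shape: uppercase acronym(s) followed by a number,
# optionally with parts such as ":2011" or "-3".
DOC_ID = r"[A-Z]{2,}(?:/[A-Z]{2,})*[ -]?\d+(?:[-:.]\d+)*"

# A cue, optional "the", then the cited document identifier (captured).
PATTERN = re.compile(rf"{CUES}\s+(?:the\s+)?({DOC_ID})")


def extract_cross_references(requirements):
    """Map each requirement ID to the list of external document IDs it cites."""
    ref_map = {}
    for req_id, text in requirements.items():
        ref_map[req_id] = [m.group(1) for m in PATTERN.finditer(text)]
    return ref_map


# Made-up requirement texts used only to exercise the sketch.
requirements = {
    "REQ-1": "The software shall be developed in accordance with EN 50128:2011.",
    "REQ-2": "Quality management shall comply with ISO 9001 and be audited yearly.",
    "REQ-3": "The operator console shall display the train position.",
}

for req_id, refs in extract_cross_references(requirements).items():
    print(req_id, refs)
```

A requirement with no recognized cue, such as `REQ-3` here, maps to an empty list, so the resulting map still covers every contractual requirement.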