Thesis Format

Monograph

Identifying External Cross-references using Natural Language Processing (NLP)

Elham Rahmani, The University of Western Ontario

Degree

Master of Science

Program

Computer Science

Collaborative Specialization

Artificial Intelligence

Supervisor

Nazim H. Madhavji

Abstract

[Context and motivation] Software engineers build systems that need to be compliant with relevant regulations. These regulations are stated in authoritative documents from which regulatory requirements need to be elicited. Project contracts contain cross-references to these regulatory requirements in external documents. [Problem] Exploring and identifying the regulatory requirements in voluminous textual data is enormously time-consuming, and hence costly, and error-prone in sizable software projects. [Principal idea and novelty] We use Natural Language Processing (NLP), Pattern Recognition and Web Scraping techniques to automatically extract external cross-references from contractual requirements and to prepare a map that relates each contractual requirement to its external cross-references. This map is also automatically extended to the world-wide web using previously identified references that are not located in local resources. The novel aspects of our approach involve: (i) a taxonomy of semantic cues for identifying cross-references, (ii) a taxonomy of grammatical structures for supporting various combinations of word roles in a sentence, (iii) APA standards for validating cross-references, and (iv) third party access for unavailable resources. [Research Contribution] The key research contribution is a tool implementing the mentioned techniques for identifying cross-references in contractual documents, related regulatory documents, and the web. The tool produces high-level and detailed views of cross-references amongst documents that can be used by various stakeholders for project management, requirements elicitation, testing, and other purposes. We anticipate that this would save an enormous amount of the time and effort needed to do this task manually in contractual projects. [Conclusion] The output cross-references produced by the tool suggest a precision of 99% and a recall of 87% on contractual requirements. Further work is identified.
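For readers unfamiliar with the two measures reported in the conclusion, the standard definitions are given here. The symbols TP, FP, and FN are not used in the original and are introduced only for illustration: TP is the number of cross-references the tool reports that are genuine, FP the number it reports that are not, and FN the number of genuine cross-references it misses.

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

Read this way, a precision of 99% means that nearly every reported cross-reference is genuine, and a recall of 87% means that roughly one in eight genuine cross-references is not found.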

Summary for Lay Audience

In this thesis, we implemented an approach for automatically identifying external cross-references (references to existing external documents) in a contract document, which is an official agreement between supplier and customer organizations. We categorized external references into three groups based on their differing formats: Direct Cue (DC), Indirect Cue (IC) and No Cue (NC) references. In the case study contract, with 683 pages and 10,345 paragraphs, we identified 667 DC references (83% of the total external references). Therefore, we focused on identifying DC references in this thesis.

As data preparation, we created two taxonomies: (i) the “whitelist” taxonomy, which consists of a number of “reporting phrases” that precede cross-references in the contract, and (ii) the “Hasleaf_Pattern” taxonomy, which consists of patterns that aid in finding the boundaries of references.
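The role of the two taxonomies can be illustrated with a minimal sketch. It is not the thesis's implementation: the reporting phrases in `WHITELIST`, the token rule in `TOKEN`, and the function name `find_direct_cue_references` are all hypothetical, chosen only to show how a cue phrase can locate a reference and how a pattern can delimit its boundaries.

```python
import re

# Hypothetical reporting phrases; the thesis's actual "whitelist"
# taxonomy is not reproduced in this summary.
WHITELIST = [
    "in accordance with",
    "as specified in",
    "as defined in",
    "in compliance with",
]

# Hypothetical boundary rule (a stand-in for the "Hasleaf_Pattern" idea):
# a reference is a run of tokens that each start with a capital letter
# or a digit, e.g. "ISO 9001" or "MIL-STD-498".
TOKEN = r"[A-Z0-9][\w:/-]*"

# Longest phrases first, so a longer cue wins over a shorter overlapping one.
_CUES = "|".join(re.escape(p) for p in sorted(WHITELIST, key=len, reverse=True))

# (?i:...) makes only the cue case-insensitive; the reference tokens
# remain case-sensitive so that lowercase words end the reference.
DC_PATTERN = re.compile(rf"\b(?i:({_CUES}))\s+({TOKEN}(?: {TOKEN})*)")


def find_direct_cue_references(text):
    """Return (cue, reference) pairs for each whitelisted cue found in text."""
    results = []
    for match in DC_PATTERN.finditer(text):
        cue = match.group(1).lower()
        # Drop punctuation that the token rule may have swallowed at the end.
        reference = match.group(2).rstrip(":/-")
        results.append((cue, reference))
    return results


if __name__ == "__main__":
    sample = (
        "The supplier shall test the system in accordance with MIL-STD-498. "
        "Data shall be stored as specified in ISO 9001 and reviewed."
    )
    for cue, ref in find_direct_cue_references(sample):
        print(f"{cue!r} -> {ref!r}")
```

A plain regular expression is used here for brevity; a grammar over part-of-speech tags, as the summary describes, would handle references whose boundaries cannot be decided from capitalization alone.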

By utilizing Natural Language Processing (an artificial intelligence discipline that provides a set of techniques for making human natural language understandable to machines), RegexParser (a mini programming language for describing and parsing text) and the mentioned taxonomies, we have created a tool that can identify DC references in contracts with 99% average accuracy. For cross-references whose target documents are not available locally, the tool searches the world wide web using Web Scraping techniques (an automated approach for extracting data from HTML web pages). With the target resource determined, the tool attempts to find second-level references. Currently, the tool is limited to two levels of reference identification. The tool's tabulated output shows the relations between the references in the contract and the target resources, together with domain information.

This tabulated information can be used by different stakeholders, including project managers for scoping the effort and time for compliance analysis, analysts for eliciting project requirements, testers for creating test cases, and others. The tool processed the case study contract for cross-references in approximately 17 seconds; identifying these references manually would take a number of days, so the tool saves an enormous amount of time and effort, not to mention improving the quality of the work.

Recommended Citation

Rahmani, Elham, "Identifying External Cross-references using Natural Language Processing (NLP)" (2020). Electronic Thesis and Dissertation Repository. 6938.
https://ir.lib.uwo.ca/etd/6938