Electronic Thesis and Dissertation Repository


Doctor of Philosophy


Computer Science


Charles X. Ling


We study active learning with generalized queries in the thesis.

In contrast to supervised learning, active learning can usually achieve the same predictive accuracy with much fewer labeled training examples, thus significantly reducing the labeling cost. However, previous studies of active learning mostly assume that the learner can only ask specific queries (i.e., require labels for specific examples by providing all feature values). For instance, if the task is to predict osteoarthritis based on a patient data set with 30 features, the previous active learners could only ask the specific queries as: does this patient have osteoarthritis, if ID is 32765, name is Jane, age is 35, gender is female, weight is 85 kg, blood pressure is 160/90, temperature is 98F, no pain in knees, no history of diabetes, and so on (for all 30 features). However, amongst all these 30 features, many of them may be irrelevant to osteoarthritis diagnosis (such as, ID, name, history of diabetes, etc.). More importantly, for such specific queries, the answers provided by the oracle are also specific. That is, each responded label is only applicable to one specific query (i.e., one specific example). In real-world situations, the oracles (usually human experts) are often more ready to answer generalized queries, such as “are people over age 50 with knee pain likely to have osteoarthritis?” Here only two relevant features (age and type of pain) are mentioned, and the other 28 are considered as don’t-care. Real-world human oracles usually regard such queries as more intuitive and easy to comprehend. More importantly, as one such generalized query can represent a set of specific ones, the corresponding answer provided by the oracle is also applicable to this whole set of specific queries. For instance, in our previous example, the answer for the proposed query is applicable for all people over age 50 with knee pain. Therefore, the active learner can obtain more information from each generalized query (together with the corresponding answer), and furthermore improve the learning more effectively and efficiently. In this thesis, we assume that the oracle is capable of answering such generalized queries, and develop different algorithms to implement such active learning with generalized queries, according to different real-world scenarios (i.e., under different assumptions). As far as we know, no previous work on active learning can deal with such generalized queries.

More specifically, we study active learning with generalized queries from the following four perspectives:

  • We theoretically study why and when such generalized queries can help in active learning, and demonstrate the superiority of generalized queries over specific ones through toy examples and learning theories. (See Chapter 2 for details.)
  • We assume that the oracle can answer generalized queries as easily as specific ones (i.e., with the same effort or cost). Thus we develop two novel active learning algorithms to ask as general as possible queries, and simultaneously keep the answers from the oracle as certain as possible. (See Chapter 3 for details.)
  • We make a more realistic assumption that, the more general a query is, the higher cost (effort) it causes to request the label. We therefore study the generalized queries in a cost-sensitive framework, and discuss two scenarios to, either balance the trade-off of the predictive accuracy and the query cost, or minimize the total cost of misclassification and query. (See Chapter 4 for details.)
  • We consider a more relaxed scenario that the oracle could only provide ambiguous answers to generalized queries. That is, the oracle would only respond with either “positive” (“yes”) or “negative” (“no”), where “positive” indicates that at least one of the examples represented by the generalized query can be labeled positive, and “negative” indicates that all such examples would be labeled negative. We then develop another new algorithm to implement active learning with generalized queries under this condition. (See Chapter 5 for details.)

Our study in this thesis has thoroughly addressed the advantages and difficulties of active learning with generalized queries. The theoretical study has proved that the query complexity of active learning with generalized queries is significantly lower than active learning with specific ones. The empirical study for a variety scenarios has also demonstrated that, to achieve certain predictive accuracy, active learning with generalized queries requires us to ask significantly fewer queries (or requires us to spend significantly lower labeling cost), compared with active learning with specific ones.