Automatic idea generation and analysis using NLP and ML techniques

Liu, Haixia (2019) Automatic idea generation and analysis using NLP and ML techniques. PhD thesis, University of Nottingham.


Ideas are the fundamental way in which information is conveyed in written text. This research investigates the discovery and extraction of ideas from corpora of scientific literature. The work has five elements: (1) the functional definition of ideas; (2) the computation of novel ideas; (3) the representation of ideas; (4) the construction of a ground truth dataset; and (5) the use of citations as an idea container.

An idea is defined as a <problem, solution> pair, in which the problem and the solution are each represented by a noun phrase or a sequence of words. Idea detection therefore reduces to problem and solution extraction. This task is similar to Named Entity Recognition (NER), with problems and solutions treated as special entities. These techniques worked well, although the results contained a lot of noise that needs to be removed.
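
To illustrate the NER-style framing, the following is a minimal sketch that turns per-token tags into problem and solution phrases. It assumes a BIO tagging scheme with the hypothetical labels PROBLEM and SOLUTION; the abstract does not state which tag set the thesis uses, and the function name `decode_spans` is illustrative only.

```python
from typing import List, Tuple


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, phrase) spans from BIO tags.

    Assumption: tags look like "B-PROBLEM", "I-PROBLEM", "B-SOLUTION",
    "I-SOLUTION" or "O". These label names are hypothetical.
    """
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if label:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continue the open span of the same label.
            words.append(token)
        else:
            # "O" or a stray "I-" tag: close the open span, if any.
            if label:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["We", "use", "dropout", "to", "reduce", "overfitting", "."]
tags = ["O", "O", "B-SOLUTION", "O", "O", "B-PROBLEM", "O"]
print(decode_spans(tokens, tags))
```

Pairing each extracted problem with each extracted solution from the same abstract would then give candidate <problem, solution> pairs; how the thesis filters the resulting noise is not described here.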

Automatic idea generation was conducted using a dataset from the Journal of Science. Old ideas were defined as the <problem, solution> pairs that already occur in the same abstract, and new ideas were generated by predicting new links between problems and solutions that do not occur together in a single abstract. Evaluation was performed using metrics that are widely used in information retrieval. The F1 scores (higher than 0.90) provide good evidence that the proposed method is capable of generating useful ideas.
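
The idea of treating generation as link prediction can be sketched as follows. This is not the method used in the thesis, which the abstract does not specify; it swaps in a simple shared-neighbour heuristic on a bipartite problem/solution graph purely for illustration. The function name `candidate_links` and the example phrases are hypothetical.

```python
from collections import defaultdict
from itertools import product


def candidate_links(abstracts):
    """Score unseen <problem, solution> pairs.

    abstracts: list of (problems, solutions) pairs of sets, one per abstract.
    Returns {(problem, solution): score} for pairs that never co-occur.
    Assumption: the score is a shared-neighbour count, chosen for
    illustration only.
    """
    solutions_of = defaultdict(set)  # problem -> solutions seen with it
    problems_of = defaultdict(set)   # solution -> problems seen with it
    for problems, solutions in abstracts:
        for p, s in product(problems, solutions):
            solutions_of[p].add(s)
            problems_of[s].add(p)

    scores = {}
    for p, s in product(list(solutions_of), list(problems_of)):
        if s in solutions_of[p]:
            continue  # an old idea: the pair already co-occurs
        # Count problems already linked to s that share a solution with p.
        score = sum(
            1 for q in problems_of[s] if solutions_of[q] & solutions_of[p]
        )
        if score:
            scores[(p, s)] = score
    return scores


abstracts = [
    ({"overfitting"}, {"dropout"}),
    ({"overfitting"}, {"weight decay"}),
    ({"vanishing gradients"}, {"weight decay"}),
]
print(candidate_links(abstracts))
```

In this toy input, "vanishing gradients" and "dropout" never appear in the same abstract, but both are connected through "overfitting" and "weight decay", so the pair is proposed as a new candidate idea.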

A ground truth dataset containing <problem, solution> pairs was constructed from the publications of the International Conference on Neural Information Processing Systems and the Journal of Machine Learning Research. The data were annotated by human volunteers and used to train idea detection models based on Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) networks. Precision and recall were computed to evaluate the performance of the models.
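
For reference, precision, recall and F1 over extracted pairs can be computed as in the sketch below. It assumes exact-match comparison between predicted and annotated <problem, solution> pairs; the abstract does not say whether the thesis scores exact or partial matches, and the function name and example pairs are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of pairs.

    Assumption: a predicted pair counts as correct only if it is identical
    to an annotated pair.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("overfitting", "dropout"), ("slow training", "batch normalisation")}
predicted = {("overfitting", "dropout"), ("overfitting", "batch normalisation")}
print(precision_recall_f1(predicted, gold))
```

Here one of the two predicted pairs matches one of the two annotated pairs, so precision, recall and F1 are all 0.5.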

Idea analysis was studied by analysing citations, which are considered to be containers for ideas. Word vectors were used to represent the citations for the purpose of classifying citation sentiment, and a method was developed to measure the sequence of citation sentiment. This method for analysing the internal citation sentiment sequence worked well (with an F1 measure of 0.86).

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Brailsford, Timothy
Goulding, James
Maul, Tomas
Houghton, Tessa
Keywords: ideas generation, natural language processing, computational linguistics
Subjects: Q Science > QA Mathematics
Faculties/Schools: University of Nottingham, Malaysia > Faculty of Science and Engineering — Science > School of Computer Science
Item ID: 56448
Depositing User: LIU, HAIXIA
Date Deposited: 04 Apr 2019 07:06
Last Modified: 07 May 2020 13:00
