Research in extractive Text Summarization (TS) covers a wide range of features used to determine the most salient text segments to include in the final summary. Different approaches select different features and methods. These range from basic ones, such as term frequency, the position of the sentence within the original document, higher weights for sentences containing title terms, and inverse sentence frequency, to more complex ones, including word sense disambiguation, latent semantic analysis, anaphora resolution, and textual entailment. However, each of the above-mentioned approaches focuses on only a few distinct features, usually two or three. The aim of the present research is to assess the relative importance of a set of different features and their impact on extractive summary generation. The inspected set of features and methods includes term frequency, inverse term and sentence frequencies, word sense disambiguation, anaphora resolution, textual entailment recognition, and corpus-tailored stopwords.
The initial work focused on the impact of corpus-tailored stopwords on TS and their integration with the above-mentioned features. It showed that some methods, for example anaphora resolution implemented with JavaRAP, need improvement. The present paper reports further results of the ongoing research. The selected features were combined in a slightly different manner with a list of 350 common English stopwords. System performance was also tested without stopword filtering. The BART coreference resolution tool was integrated to compare its results with those of JavaRAP.
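To make the simplest of these features concrete, the following is a minimal sketch of sentence scoring by term frequency and inverse sentence frequency, run with and without stopword filtering. It is not the system described in the paper: the function names, the tiny stopword set, the log(N / sf) weighting, and the length normalization are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword set for illustration only; the 350-word list
# used in the experiments is not reproduced here.
STOPWORDS = {"the", "a", "of", "is", "on", "and", "in", "to"}


def tokenize(sentence, stopwords=None):
    """Lowercase the sentence, keep alphabetic tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


def score_sentences(sentences, stopwords=None):
    """Score each sentence by the mean TF * ISF weight of its tokens.

    TF is the term count over the whole document; ISF is assumed here to be
    log(N / sf), where sf is the number of sentences containing the term.
    """
    tokenized = [tokenize(s, stopwords) for s in sentences]
    n = len(sentences)
    tf = Counter(t for toks in tokenized for t in toks)
    sf = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        total = sum(tf[t] * math.log(n / sf[t]) for t in toks)
        scores.append(total / len(toks))
    return scores


def summarize(sentences, k=1, stopwords=None):
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences, stopwords)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    doc = [
        "The cat sat on the mat.",
        "The cat chased the mouse.",
        "The weather is nice.",
    ]
    print(summarize(doc, k=2))                       # no stopword filtering
    print(summarize(doc, k=2, stopwords=STOPWORDS))  # with stopword filtering
```

Passing `stopwords=None` corresponds to the run without stopword filtering; a corpus-tailored list could be substituted for the hypothetical `STOPWORDS` set without changing the rest of the sketch.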
The final goal of the current research is to identify the features and tools that benefit TS the most, with the further objective of using them for abstractive TS. Abstractive TS would involve transforming the text data into an internal semantic representation. The present paper describes the data representation used in the experiments, which was designed to simplify the transition from the current term-based representation to a concept-based one.