ArXiv [hep-th] text analysis

Version 1.3
Daniele Gregori
Text analysis of all 163,000+ theoretical high-energy physics papers on arXiv (with hep-th as primary or cross-list category), from 1986 to 2023. We explore the following possible tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarising; 6) recommending papers / research directions. The results are the following: 1) interesting temporal trends appear in the popularity of title words; 2) two-word combinations of title words turn out to correspond to hep-th concepts and allow effective feature extraction and CONCEPT embedding of abstracts; 3) classifiers of article categories are built as Neural Networks (NNs) based on either the CONCEPT or the SciBERT embedding; 4) through a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category; 5) effective question answering and summarization of article introductions is achieved through high-level AI WL functionality; 6) a first basic recommendation algorithm is given, based on distance in feature space. In perspective, it looks sensible to relate papers in feature space and thus inspire new discoveries.

0. ArXiv data wrangling

We start by importing and cleaning the main data of the arXiv articles, from which we extract the relevant sub-datasets.
We also set up a basic API call to directly import new data.
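As a hedged sketch of such an API call, one can query the public arXiv API (an Atom feed) directly; the URL parameters follow the arXiv API manual, while the XML parsing pattern is a simplifying assumption (namespaces are ignored):

```wolfram
(* Query the arXiv API for the most recent hep-th submissions *)
url = "http://export.arxiv.org/api/query?search_query=cat:hep-th&start=0&max_results=5&sortBy=submittedDate&sortOrder=descending";
feed = Import[url, "XML"];
(* Extract the entry titles from the symbolic XML of the Atom feed *)
newTitles = Cases[feed, XMLElement["title", _, {t_String}] :> t, Infinity]
```

The same request with larger `max_results` and increasing `start` can be iterated to page through new submissions.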

Full ArXiv dataset


Dataset for [hep-th] category

Dataset


Categories and IDs


Titles, abstracts, authors


Structuring data temporally


API for new submissions

ArXiv API


ArXiv Service Connect


1. Article title words counting

The first computational task we want to address is simply counting title words and observing their evolution over time.
This involves only relatively basic WL programming; certainly no Machine Learning (ML) is required. However, since the dataset is very large, it already delivers important insights, for example on the temporal trends of word popularity.
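The counting step can be sketched as follows, where `titles` stands for the list of hep-th title strings extracted above:

```wolfram
(* Lowercase, drop stopwords, split into words, and tally *)
words = Flatten[TextWords[DeleteStopwords[ToLowerCase[#]]] & /@ titles];
wordCounts = ReverseSort[Counts[words]];
Take[wordCounts, UpTo[10]] (* the most popular title words *)
```

Grouping the same counts by submission month then feeds directly into `DateListPlot` for the temporal trends below.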

Title words cleaning


Total words

Most popular words ever


Creation of 1-word vocabulary


Time evolution of popular words

DateListPlot of words shares per month


Table of total words per year


WordCloud video


Neighbour title words as CONCEPTS

Counting neighbour words


Creation of 2-word vocabulary


2. Article abstract CONCEPT features

A first non-trivial programming task is to create an embedding for each abstract and title, through an encoding in terms of a special vocabulary given by two-word CONCEPTS. These are the most popular two-word combinations in the titles, which through positional embedding should also account for actual concepts of three or more words.
Here we give a first demonstration of the effectiveness of this embedding, by showing that it correctly clusters a sample of articles (well known to the programmer’s natural intelligence). We also compare it to other kinds of embeddings and highlight possible improvements.
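A minimal sketch of the CONCEPT vocabulary and of a simple count-vector embedding is the following; the names (`conceptVocab`, `conceptEmbed`) and the vocabulary size are illustrative choices, not the notebook's actual ones:

```wolfram
(* Adjacent-word pairs of a cleaned text, joined into "w1 w2" strings *)
bigrams[s_String] := StringRiffle /@ Partition[TextWords[DeleteStopwords[ToLowerCase[s]]], 2, 1];

(* The CONCEPT vocabulary: the most popular two-word combinations in titles *)
conceptCounts = Counts[Join @@ (bigrams /@ titles)];
conceptVocab = Keys[Take[ReverseSort[conceptCounts], UpTo[1000]]];

(* Embed a text as the vector of its CONCEPT counts *)
conceptEmbed[text_String] := Lookup[Counts[bigrams[text]], conceptVocab, 0];
```

An abstract is then represented by `conceptEmbed[abstract]`, a vector that `FeatureSpacePlot` and the classifiers of the following sections can consume directly.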

Creation of CONCEPT embeddings

Concepts abstract embeddings


Concepts title embeddings


Feature extraction through CONCEPT embeddings

Known articles examples


Clusters and FeatureSpacePlot


Comparison with other methods


Stability issues


3. Article classification for distinct categories

A second ML task is to build a classifier that identifies each article’s category.
This problem is well posed as long as the categories are actually distinct. Therefore we import the datasets of other categories, analyse them and create an encoding vocabulary for them, in the same way as we did for the hep-th category.
The results show the remarkable effectiveness of the classifier based on the CONCEPT embedding, which we benchmark against another classifier built using the SciBERT embedding.
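The classifier can be sketched as a small feed-forward net on the precomputed embeddings; here `trainData` is assumed to be a list of rules `embeddingVector -> categoryLabel` and `classes` the list of category names (both illustrative names):

```wolfram
(* A simple dense classifier on CONCEPT embedding vectors *)
net = NetChain[
   {LinearLayer[128], Ramp, LinearLayer[Length[classes]], SoftmaxLayer[]},
   "Input" -> Length[conceptVocab],
   "Output" -> NetDecoder[{"Class", classes}]];
trained = NetTrain[net, trainData, ValidationSet -> Scaled[0.1]];
```

Swapping the input dimension and the training rules is all that changes between the CONCEPT and the SciBERT version.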

Other distinct categories datasets

Other categories for classifier testing


Other databases


Proper classifier through CONCEPT embeddings

Creation of title and abstract CONCEPT embedding


Computation of abstract CONCEPT embedding


Training proper classifier via CONCEPT


Proper classifier through SciBERT embeddings

SciBERT embeddings


Training proper classifier via SciBERT


4. Article classification for hep-th mixed categories

We also show that it is possible to identify even mixed categories (“cross-lists” in arXiv’s jargon).
However, this is harder, not least because the data themselves are not really distinct in this respect. In fact, the articles always share the hep-th category, simply because some researchers do not bother to distinguish their papers between slightly different main or secondary categories.
To improve accuracy, we set up the CONCEPT classifier as a more sophisticated NetGraph, which can handle two inputs: both the abstract and the title embeddings.
Differently from the case of distinct categories, we find that the classifier based on the SciBERT embedding performs slightly better on mixed categories. However, it puts much strain on an average laptop’s RAM. Our classifier based on CONCEPT embeddings, instead, manages the computer’s resources very well and allows a cheap extensive analysis of the whole dataset (here our training and test sets add up to 70,000 articles).
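The two-input architecture can be sketched as a NetGraph in which abstract and title embeddings are processed separately, concatenated, and then classified; layer sizes and port names here are assumptions:

```wolfram
(* Two-input NetGraph for mixed-category classification *)
mixedNet = NetGraph[
   <|"abs" -> LinearLayer[64], "tit" -> LinearLayer[64],
     "cat" -> CatenateLayer[],
     "out" -> NetChain[{Ramp, LinearLayer[Length[mixedClasses]], SoftmaxLayer[]}]|>,
   {NetPort["Abstract"] -> "abs", NetPort["Title"] -> "tit",
    {"abs", "tit"} -> "cat", "cat" -> "out"},
   "Abstract" -> vocabSize, "Title" -> vocabSize,
   "Output" -> NetDecoder[{"Class", mixedClasses}]];
```

Training examples for such a net are associations of the form `<|"Abstract" -> absVec, "Title" -> titVec, "Output" -> class|>`.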

Mixed classification generalities

Mixed categories classes definition


Classification task


Mixed classifier through CONCEPT embeddings

Computation of CONCEPT abstract embeddings


Train mixed classifier via CONCEPT (abstracts)


Computation of CONCEPT title and complete embeddings


Train mixed classifier via CONCEPT (complete with abstracts and titles)


Mixed classifier through SciBERT embeddings

Computation of SciBERT embeddings


Train mixed classifier via SciBERT


5. Question answering and summarizing article introductions

We also test simple question answering on abstracts, showing that LLMFunction is much more effective than FindTextualAnswer.
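A side-by-side sketch of the two approaches, where `abstract` stands for an abstract string and the prompt template is an assumption:

```wolfram
q = "Which duality is studied in this paper?";

(* Built-in extractive question answering *)
a1 = FindTextualAnswer[abstract, q];

(* LLM-based question answering via a templated prompt *)
qaLLM = LLMFunction[
   "Answer the question using only the abstract below.\nQuestion: `1`\nAbstract: `2`"];
a2 = qaLLM[q, abstract];
```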

Full source files and introductions

FindTextualAnswer for introductions

One should think about improving accuracy in difficult cases (involving mathematics too), like the last one.
Besides, a possible further application is the definition of specific concepts by inspection of multiple introductions.

Questions on abstracts

Examples through TextSummarize (modest)

We try to summarize the introductions of known papers through the TextSummarize function, new in version 14.0.
To improve the accuracy of TextSummarize (as of FindTextualAnswer), LLMFunction and some prompt engineering could probably be leveraged.
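A sketch of plain TextSummarize versus a prompt-engineered LLMFunction, where `intro` stands for an introduction string and the prompt is an assumption:

```wolfram
(* Built-in extractive summarization (since version 14.0) *)
s1 = TextSummarize[intro];

(* LLM-based summarization with an explicit instruction *)
summarize = LLMFunction[
   "Summarize the following hep-th paper introduction in three sentences, preserving the key formulas:\n`1`"];
s2 = summarize[intro];
```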

6. Recommendation (state of the art) and perspectives

Given the net trained to classify the category / topic, we can use it iteratively to hierarchically cluster the hep-th main category into increasingly smaller subcategories, down to each (overlapping) small field of expertise. Here we show and explain just the first iteration of this process.
Finally, we can use the trained net to set up a recommendation algorithm, with the lowest distance from a given article in feature space as criterion. However, it turns out that a net trained as a classifier is not ideal for this task. One can expect that, by including citation data and training a new net on them, a better recommendation application could be obtained.
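The nearest-article criterion can be sketched in a few lines, assuming `embeddings` is an association `arXivID -> feature vector` (e.g. the CONCEPT embedding):

```wolfram
(* Nearest-neighbour function over feature space, returning article IDs *)
nf = Nearest[Values[embeddings] -> Keys[embeddings],
   DistanceFunction -> CosineDistance];

(* Recommend the n articles closest to a given one, excluding itself *)
recommend[id_, n_ : 5] := DeleteCases[nf[embeddings[id], n + 1], id]
```

Cosine distance is an illustrative choice; replacing `embeddings` with features from a citation-trained net would implement the improvement suggested above.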

Recommendation by nearest article in feature space