ArXiv [hep-th] text analysis
Version 1.3
Daniele Gregori
Text analysis of all 163,000+ theoretical high-energy physics papers on arXiv (with hep-th as primary or cross-list category), from 1986 to 2023. We explore the following tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarizing; 6) recommending papers / research directions. The results are the following: 1) interesting temporal trends appear in the popularity of title words; 2) two-word combinations of title words turn out to correspond to hep-th concepts and allow effective feature extraction and CONCEPT embedding of abstracts; 3) classifiers of article categories are built as Neural Networks (NNs) based on either the CONCEPT or the SciBERT embedding; 4) through a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category; 5) effective question answering and summarization of article introductions are obtained through high-level AI Wolfram Language (WL) functionality; 6) a first basic recommendation algorithm is built, based on distance in feature space. In perspective, it looks sensible to relate papers in feature space and thus inspire new discoveries.
0. ArXiv data wrangling
We start by importing and cleaning the main data of arXiv articles, from which we extract the relevant sub-datasets.
We also set up a basic API call for directly importing new data.
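As a minimal sketch of such an API call (not necessarily the implementation used below), one can query the public arXiv export API and parse the returned Atom feed; the helper name fetchRecentHepTh and the returned fields are our own illustrative choices:

```wolfram
(* Fetch the n most recent hep-th submissions from the public arXiv export API.
   Namespace handling of the Atom feed may vary, so element names are matched loosely. *)
fetchRecentHepTh[n_Integer] := Module[{xml, entries, get},
  xml = Import[
    "http://export.arxiv.org/api/query?search_query=cat:hep-th&sortBy=submittedDate&sortOrder=descending&max_results="
      <> ToString[n], "XML"];
  entries = Cases[xml, XMLElement["entry" | {_, "entry"}, _, c_] :> c, Infinity];
  get[entry_, tag_] := First[
    Cases[entry, XMLElement[tag | {_, tag}, _, {t_String}] :> t, Infinity],
    Missing[tag]];
  <|"Title" -> get[#, "title"], "Abstract" -> get[#, "summary"]|> & /@ entries]
```

For example, fetchRecentHepTh[5] should return five associations of titles and abstracts (modulo whitespace normalization of the feed).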
Full ArXiv dataset
Dataset for [hep-th] category
Dataset
Categories and IDs
Titles, abstracts, authors
Structuring data temporally
API for new submissions
ArXiv API
ArXiv Service Connect
1. Article title words counting
The first computational task we address is simply counting title words and observing their evolution over time.
This involves only relatively basic WL programming; certainly no Machine Learning (ML) is required. However, since the dataset is very large, it already delivers important insights, for example on the temporal trends of word popularity.
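For orientation, the counting step can be sketched as follows; titles and dates are hypothetical names for the lists extracted in the previous section:

```wolfram
(* Most popular title words ever, after basic cleaning. *)
words = DeleteStopwords[ToLowerCase /@ Catenate[TextWords /@ titles]];
popular = Take[ReverseSort @ Counts[words], 20];

(* Monthly share of a given word among all title words. *)
monthlyShare[word_String] := KeyMap[DateObject,
  GroupBy[Transpose[{dates, titles}],
    DateValue[First[#], {"Year", "Month"}] &,
    N[Total[StringCount[ToLowerCase[Last /@ #], word]] /
        Total[WordCount /@ Last /@ #]] &]];

DateListPlot[monthlyShare["holography"], PlotLabel -> "\"holography\" share"]
```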
Title words cleaning
Total words
Most popular words ever
Creation of 1-word vocabulary
Time evolution of popular words
DateListPlot of word shares per month
Table of total words per year
WordCloud video
Neighbour title words as CONCEPTS
Counting neighbour words
Creation of 2-word vocabulary
2. Article abstract CONCEPT features
A first non-trivial programming task is to create an embedding for each abstract and title, by encoding them in terms of a special vocabulary given by two-word CONCEPTS. These are the most popular two-word combinations in the titles, which through positional embedding should also account for actual concepts of three or more words.
Here we give a first demonstration of the effectiveness of this embedding, by showing that it correctly clusters a sample of articles (well known to the programmer's natural intelligence). We also compare it to other kinds of embeddings and highlight possible improvements.
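A minimal sketch of this encoding, ignoring the positional refinement and with the vocabulary size and the names titles and abstracts chosen for illustration:

```wolfram
(* Two-word CONCEPT vocabulary from adjacent title-word pairs. *)
bigrams[s_String] := Partition[DeleteStopwords @ ToLowerCase @ TextWords[s], 2, 1];
vocab = Keys @ Take[ReverseSort @ Counts @ Catenate[bigrams /@ titles], 500];

(* Bag-of-concepts embedding: counts of each vocabulary pair in a text. *)
conceptEmbed[text_String] := With[{c = Counts @ bigrams[text]},
  N @ Lookup[c, vocab, 0]];

(* Quick qualitative check: similar abstracts should land close together. *)
FeatureSpacePlot[conceptEmbed /@ RandomSample[abstracts, 200]]
```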
Creation of CONCEPT embeddings
Concepts abstract embeddings
Concepts title embeddings
Feature extraction through CONCEPT embeddings
Known articles examples
Clusters and FeatureSpacePlot
Comparison with other methods
Stability issues
3. Article classification for distinct categories
A second ML task is to build a classifier to identify each article's category.
This problem is well posed as long as the categories are actually distinct. Therefore we import the datasets of other categories, analyse them and create an encoding vocabulary for them, in the same way as we did for the hep-th category.
The results show the remarkable effectiveness of the classifier based on the CONCEPT embedding, which we benchmark against another classifier built using the SciBERT embedding.
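Schematically, such a classifier could look like the following (layer sizes, category list and data names are illustrative; the SciBERT variant would feed in feature vectors from a SciBERT model, e.g. one from the Wolfram Neural Net Repository, instead of CONCEPT vectors):

```wolfram
(* Feed-forward classifier over CONCEPT embedding vectors. *)
classes = {"hep-th", "cond-mat", "math", "astro-ph"};
net = NetChain[
  {LinearLayer[128], Ramp, LinearLayer[Length @ classes], SoftmaxLayer[]},
  "Input" -> Length[vocab],
  "Output" -> NetDecoder[{"Class", classes}]];

(* trainPairs: a list of embeddingVector -> categoryLabel rules. *)
trained = NetTrain[net, trainPairs, ValidationSet -> Scaled[0.1]];
NetMeasurements[trained, testPairs, "Accuracy"]
```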
Other distinct categories datasets
Other categories for classifier testing
Other databases
Proper classifier through CONCEPT embeddings
Creation of title and abstract CONCEPT embedding
Computation of abstract CONCEPT embedding
Training proper classifier via CONCEPT
Proper classifier through SciBERT embeddings
SciBERT embeddings
Training proper classifier via SciBERT
4. Article classification for hep-th mixed categories
We also show that it is possible to identify even mixed categories (“cross-lists” in arXiv’s jargon).
However, this task is harder, not least because the data themselves are not really distinct in this respect: the articles all share the hep-th category, simply because some researchers do not bother to distinguish their paper through slightly different main or secondary categories.
To improve accuracy, we set up the CONCEPT classifier as a more sophisticated NetGraph, able to handle two inputs: both the abstract and the title embeddings.
Differently from the case of distinct categories, we find that the classifier based on the SciBERT embedding performs slightly better on mixed categories. However, it puts a heavy strain on an average laptop’s RAM. Our classifier based on CONCEPT embeddings, instead, manages the computer’s resources very well and allows a cheap extensive analysis of the whole dataset (here our training and test sets add up to 70,000 articles).
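The two-input architecture can be sketched as a NetGraph like the following (layer sizes, class names and training-data names are illustrative, not the exact network used here):

```wolfram
(* Two-input classifier: separate encoders for abstract and title embeddings,
   catenated and passed to a softmax output. *)
mixedNet = NetGraph[
  <|"absEnc" -> {LinearLayer[64], Ramp},
    "titEnc" -> {LinearLayer[64], Ramp},
    "cat" -> CatenateLayer[],
    "out" -> {LinearLayer[3], SoftmaxLayer[]}|>,
  {NetPort["Abstract"] -> "absEnc", NetPort["Title"] -> "titEnc",
   {"absEnc", "titEnc"} -> "cat", "cat" -> "out"},
  "Abstract" -> Length[vocab], "Title" -> Length[vocab],
  "Output" -> NetDecoder[{"Class", {"hep-th", "hep-th gr-qc", "hep-th math-ph"}}]];

(* Training data given as an association of input ports and targets. *)
trainedMixed = NetTrain[mixedNet,
  <|"Abstract" -> absEmbeddings, "Title" -> titleEmbeddings, "Output" -> labels|>]
```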
Mixed classification generalities
Mixed categories classes definition
Classification task
Mixed classifier through CONCEPT embeddings
Computation of CONCEPT abstract embeddings
Train mixed classifier via CONCEPT (abstracts)
Computation of CONCEPT title and complete embeddings
Train mixed classifier via CONCEPT (complete with abstracts and titles)
Mixed classifier through SciBERT embeddings
Computation of SciBERT embeddings
Train mixed classifier via SciBERT
5. Question answering and summarizing article introductions
We also test simple question answering on abstracts, showing that LLMFunction is much more effective than FindTextualAnswer.
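The comparison can be reproduced schematically as follows; the prompt wording is our own, and LLMFunction requires an LLM service connection (e.g. an API key) to be configured:

```wolfram
abstract = "We study the spectrum of ...";  (* any hep-th abstract *)

(* Built-in extractive question answering. *)
FindTextualAnswer[abstract, "Which model is studied?"]

(* LLM-based answering via a prompt template with `1`, `2` slots. *)
askAbstract = LLMFunction[
  "Given this physics abstract:\n`1`\nAnswer concisely: `2`"];
askAbstract[abstract, "Which model is studied?"]
```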
Full source files and introductions
FindTextualAnswer for introductions
One should consider improving accuracy in difficult cases (involving mathematics too), like the last one.
Besides, one may envisage a further application: defining specific concepts by inspecting multiple introductions.
Questions on abstracts
Examples through TextSummarize (modest)
We try to summarize the introductions of known papers through the TextSummarize function (new in version 14.0).
To improve the accuracy of TextSummarize (as with FindTextualAnswer), some LLMFunction and prompt engineering could probably be leveraged.
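A possible sketch of both routes, where the prompt is our own and intro stands for a hypothetical introduction string:

```wolfram
(* Built-in summarization (Wolfram Language 14.0+). *)
TextSummarize[intro]

(* Prompt-engineered alternative through an LLM. *)
summarizeIntro = LLMFunction[
  "Summarize the following introduction of a theoretical physics paper in three sentences, describing any equations in words:\n`1`"];
summarizeIntro[intro]
```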
6. Recommendation (state of the art) and perspectives
Given the net trained to classify the category / topic, we can use it iteratively to hierarchically cluster the hep-th main category into increasingly smaller subcategories, down to each (overlapping) small field of expertise. Here we show and explain just the first iteration of this process.
Finally, we can use the trained net to set up a recommendation algorithm, taking as criterion the lowest distance from a given article in feature space. However, it turns out that a net trained as a classifier is not ideal for this task. One may expect that, by including citation data and training a new net on them, a better recommendation application could be obtained.
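This first-cut recommender can be sketched with Nearest on the article feature vectors; featureVectors (taken e.g. from the net's last hidden layer) and allTitles are hypothetical names:

```wolfram
(* Precompute a nearest-neighbour function returning positions in the list. *)
nf = Nearest[featureVectors -> "Index"];

(* Recommend the k articles closest to article i, excluding the article itself. *)
recommend[i_Integer, k_Integer : 5] :=
  allTitles[[ DeleteCases[nf[featureVectors[[i]], k + 1], i] ]];

recommend[1]   (* titles of the five papers nearest to article 1 *)
```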
Recommendation by nearest article in feature space