Basic Examples (5)
The first article ever on ArXiv:
A DateListPlot showing the trends of the most popular title words in the theoretical physics category ("hep-th", primary or cross-list):
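A minimal sketch of how such a plot might be built, assuming a Dataset `arxiv` with "date", "title", and "categories" keys (the dataset name, its keys, and the chosen words are all hypothetical):

```wl
(* Count title words per year in the "hep-th" category; arxiv is an assumed Dataset *)
hepth = Normal@arxiv[Select[StringContainsQ[#categories, "hep-th"] &]];
yearWords[y_] := Counts[Catenate[
    DeleteStopwords@TextWords@ToLowerCase@#title & /@
      Select[hepth, DateValue[#date, "Year"] == y &]]];
counts = AssociationMap[yearWords, Range[1991, 2024]];
trend[w_] := Table[{DateObject[{y}], Lookup[counts[y], w, 0]}, {y, 1991, 2024}];
DateListPlot[trend /@ {"string", "quantum", "holography"},
 PlotLegends -> {"string", "quantum", "holography"}]
```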
The 50 most common 2-neighbour title words across the whole ArXiv:
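One way these neighbouring word pairs might be computed, sketched under the assumption that `titles` is the list of all ArXiv title strings:

```wl
(* Neighbouring word pairs (2-neighbours) from each title; titles is an assumed input *)
pairs = Catenate[
   Partition[DeleteStopwords@TextWords@ToLowerCase@#, 2, 1] & /@ titles];
WordCloud[KeyMap[StringRiffle, TakeLargest[Counts[pairs], 50]]]
```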
Authors with more than one possible name (and their categories) are conveniently registered as "ArXivAuthor" entities. For example:
We can then easily create an author citation graph, with tooltips indicating the article IDs:
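A sketch of the graph construction, assuming `citations` is a list of {citingAuthor, citedAuthor, articleID} triples (a hypothetical intermediate, not part of the paclet API):

```wl
(* Each edge carries the citing article's ID as a tooltip *)
edges = DirectedEdge[#1, #2] & @@@ citations;
Graph[edges,
 EdgeLabels -> Thread[edges -> (Placed[#3, Tooltip] & @@@ citations)],
 VertexLabels -> Automatic]
```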
Scope (13)
The dimensions of the whole ArXiv main dataset (as of the end of September 2024):
Let us create a super-database of all computer science ("cs") categories (primary or cross-list):
and then visualize the most and least frequent title words:
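These word clouds might be produced along the following lines, where `csTitles` (the titles of the "cs" articles) is an assumed input:

```wl
(* Word frequencies over all "cs" titles, then the two extremes as word clouds *)
counts = Counts[Catenate[DeleteStopwords@TextWords@ToLowerCase@# & /@ csTitles]];
{WordCloud[TakeLargest[counts, 100]], WordCloud[TakeSmallest[counts, 100]]}
```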
Let us calculate the 10 most frequent categories, with their meanings and the number of articles in each:
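A possible computation, assuming `categoriesPerArticle` (a list of category lists, one per article) and `categoryMeaning` (an association such as "cs.LG" -> "Machine Learning") as hypothetical inputs:

```wl
(* Tally all category occurrences, keep the top 10, attach their meanings *)
Dataset[KeyValueMap[
  <|"Category" -> #1, "Meaning" -> Lookup[categoryMeaning, #1, Missing[]],
    "Articles" -> #2|> &,
  TakeLargest[Counts[Catenate[categoriesPerArticle]], 10]]]
```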
We can create training and test sets using only 5000 (= 4500 + 500) titles and abstracts per category:
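The split could be sketched as follows, assuming a hypothetical helper `samples[cat]` that yields the labeled "title abstract" -> category examples for a category:

```wl
(* 5000 random examples per category, split 4500/500 into train/test *)
split[cat_] := TakeDrop[RandomSample[samples[cat], 5000], 4500];
pairs = split /@ top10Categories;
train = Catenate[pairs[[All, 1]]];  (* 4500 examples per category *)
test  = Catenate[pairs[[All, 2]]];  (* 500 examples per category *)
```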
We can then train a neural network to classify these categories, with layer dimension 80 and a dropout level of 0.5:
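A network of this shape (layer dimension 80, dropout 0.5) might look like the following sketch; the encoder choice and the `top10Categories`, `train`, and `test` names are assumptions, not the paclet's actual definition:

```wl
(* Token-sequence classifier: embedding -> LSTM -> dropout -> softmax over 10 classes *)
net = NetChain[{EmbeddingLayer[80], LongShortTermMemoryLayer[80],
    SequenceLastLayer[], DropoutLayer[0.5], LinearLayer[10], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Tokens"}],
   "Output" -> NetDecoder[{"Class", top10Categories}]];
trained = NetTrain[net, train, ValidationSet -> test]
```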
Even with a basic 30-minute training run on a laptop CPU, we obtain 89% accuracy:
and a rather clean confusion matrix:
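Both the accuracy and the confusion matrix can be obtained with the built-in NetMeasurements, assuming `trained` and `test` from the previous steps:

```wl
(* Test-set accuracy and the confusion matrix plot *)
NetMeasurements[trained, test, "Accuracy"]
NetMeasurements[trained, test, "ConfusionMatrixPlot"]
```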
We could even classify authors within the same category, with ArXivClassifyAuthorNet.
Extracting the TeX introduction:
and also the TeX formulae:
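Both extractions can be sketched with StringCases, assuming `tex` holds an article's raw TeX source (a hypothetical variable; real TeX sources vary in sectioning and environment names):

```wl
(* Text between \section{Introduction} and the next \section *)
intro = First[StringCases[tex,
    Shortest["\\section{Introduction}" ~~ body__ ~~ "\\section"] :> body, 1],
   Missing["NotFound"]];
(* Contents of all equation environments *)
formulae = StringCases[tex,
  Shortest["\\begin{equation}" ~~ f__ ~~ "\\end{equation}"] :> f]
```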
Explain a technical concept using an article's introduction and LLMSynthesize:
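A minimal sketch of such a prompt; LLMSynthesize requires a configured LLM service, and `intro` and the concept name here are assumptions:

```wl
(* Feed the extracted introduction to the LLM as context *)
LLMSynthesize[StringTemplate[
   "Using the following article introduction, explain the concept of `` to a non-expert:\n\n``"][
  "renormalization", intro]]
```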
Let us visualize all authors with more than 7 papers in the primary category "cs.NA":
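One way to find and display these authors, assuming `authorsPerArticle` (a list of author lists for the "cs.NA" articles) as a hypothetical input:

```wl
(* Paper counts per author, restricted to authors with more than 7 papers *)
authorCounts = Counts[Catenate[authorsPerArticle]];
prolific = Select[authorCounts, # > 7 &];
WordCloud[prolific]
```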
Let us pick a random author among them and use LLM functionality to explain their overall work:
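A hedged sketch using LLMSynthesize, where `prolific` (the author set selected above) and `titlesFor` (an author's article titles) are hypothetical names:

```wl
(* Pick one prolific author and ask the LLM to summarize their output *)
author = RandomChoice[Keys[prolific]];
LLMSynthesize[
 "Summarize the overall research themes of an author with these article titles:\n" <>
  StringRiffle[titlesFor[author], "\n"]]
```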