DanieleGregori/ArXivExplore

Version 1.0.1 (current version: 1.0.3)

ArXivExplore enables deep data analysis of all research articles on ArXiv

Contributed by: Daniele Gregori

ArXivExplore enables deep data analysis of all 2.6M articles on ArXiv (physics, math, cs, etc.), providing functionality for, e.g., title/abstract word statistics; dissection of TeX sources, formulae, and citations; neural networks for classification or recommendation; and LLM-automated concept definitions and author reports.

Installation Instructions

To install this paclet in your Wolfram Language environment, evaluate this code:
PacletInstall["DanieleGregori/ArXivExplore"]


To load the code after installation, evaluate this code:
Needs["DanieleGregori`ArXivExplore`"]
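If the loaded version looks out of date (the repository currently lists 1.0.3), you can check and update the installed paclet with the standard Wolfram Language paclet-management functions:

```wolfram
(* check which version of the paclet is currently installed *)
PacletObject["DanieleGregori/ArXivExplore"]["Version"]

(* refresh the paclet sites and install the latest available version *)
PacletInstall["DanieleGregori/ArXivExplore", UpdatePacletSites -> True]
```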

Examples

Basic Examples (4) 

The first article ever on ArXiv:

In[1]:=
ArXivDatasetLookup["id"] // First
Out[1]=
In[2]:=
ArXivTitles["physics/9403001"]
Out[2]=
In[3]:=
ArXivAuthors["physics/9403001"]
Out[3]=

A DateListPlot showing the trends of the most popular title words in the theoretical physics category (hep-th):

In[4]:=
Block[{words = {"black", "gauge", "gravity", "string"}},
 ArXivPlot[words, {"hep-th", All}, "Titles", PlotRange -> Full, PlotLegends -> words]]
Out[4]=

The 50 most common 2-neighbour title words across the whole ArXiv, ever:

In[5]:=
ArXivTitlesWordNeighboursTop[All, 2, 50] // EchoTiming // Normal // Multicolumn[#, 3] &
1956.861427`
Out[5]=

Let us also show an author's citation graph, with tooltips indicating the article ids:

In[6]:=
ArXivCitationsAuthorGraph[{"E. Vescovi", "Edoardo Vescovi"}]
Out[6]=

Scope (12) 

The dimensions of the whole ArXiv dataset (at the end of July 2024):

In[7]:=
ArXivDataset[All] // Dimensions
Out[7]=

Let us create a super-dataset with all computer-science ("cs") primary or cross-list categories:

In[8]:=
ArXivDataset[{"cs", All}] = ArXivDatasetAggregate[{"cs", All}] // EchoFunction[Dimensions];

and then let us visualize the most and least frequent title words:

In[9]:=
Block[{cat = {"cs", All}, tabs, colrules, tabskey, compl, cut = 160, res = 16},
 colrules = {"learning" -> Style["learning", Purple, Bold], "using" -> Style["using", Purple, Bold], "theory" -> Style["theory", Red, Bold], "understanding" -> Style["understanding", Red, Bold]};
 tabs = MapAt[Apply[Sequence, #] &,
    MapIndexed[
     Partition[
       Riffle[Map[Style[#, Bold] &, Range[res*(First[#2] - 1) + 1, res*First[#2]]], #], 2] &, Partition[Normal@ArXivTitlesWordsTop[cat, cut], UpTo@res]], {All,
      All, 2}] /. colrules;
 tabskey = First@Cases[
    tabs, _List?(MemberQ[#[[All, 2]], Alternatives["theory", "understanding"] /. colrules] &)];
 compl = Text[Style[
    "... " <> ToString[Round[First@tabskey[[1, 1]] - 1, 10]] <>
     "+\nwords more\npopular than\n\"understanding\"\nor \"theory\"\nin CS!",
    Bold, 12, TextAlignment -> Center]];
 GraphicsRow[{TextGrid@tabs[[1]], compl, TextGrid@tabskey}]]
Out[9]=

Let us compute the 4 most frequent categories:

In[10]:=
ArXivCategoriesTop[4] // Normal // Column
Out[10]=

with their meanings:

In[11]:=
ArXivCategoriesLegend[#] & /@ Keys@ArXivCategoriesTop[4] // Column
Out[11]=

Using only titles and abstracts, we can train a neural network to classify articles into different categories:

In[12]:=
{train, test} = ArXivClassifyCategoryTrainTest[4, 4000];
In[13]:=
net = ArXivClassifyCategoryNet[4, 128, .9]
Out[13]=
In[14]:=
netTrained = NetTrain[net, train, All, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 25]
Out[14]=

Even with a basic 15-minute training on a laptop CPU, we obtain 95% accuracy:

In[15]:=
NetMeasurements[netTrained["TrainedNet"], test, "Accuracy"]
Out[15]=
In[16]:=
NetMeasurements[netTrained["TrainedNet"], test, "ConfusionMatrixPlot"]
Out[16]=

We could even classify authors within the same category, using ArXivClassifyAuthorNet.
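By analogy with the category workflow above, an author-classification experiment might look like the following; the arguments of ArXivClassifyAuthorNet are assumed here to mirror those of ArXivClassifyCategoryNet (number of classes, embedding size, train fraction), so treat this as a sketch rather than the documented signature:

```wolfram
(* sketch: build and train a net distinguishing 4 authors;
   the argument pattern is assumed, not documented *)
authorNet = ArXivClassifyAuthorNet[4, 128, .9];
authorTrained = NetTrain[authorNet, train, All,
  ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 25]
```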

Extracting a TeX introduction:

In[17]:=
ArXivTeXIntroduction[Echo@RandomChoice@ArXivIDs[All]] // Short[#, 10] &
Out[17]=

and also TeX formulae:

In[18]:=
Manipulate[Take[Lookup[#, i], UpTo[30]], {i, Keys[#]}, SaveDefinitions -> True] &@
 ArXivTeXFormulae[Echo@RandomChoice[ArXivIDs[All]]]

Explain a technical concept using an article introduction:

In[19]:=
ArXivExplainConcept["Viterbi algorithm", "2401.02314"]
Out[19]=

Let us visualize all authors with more than 7 papers in primary category "cs.NA":

In[20]:=
ArXivAuthorsTop[7, "cs.NA"] // Column
Out[20]=

Let us pick a random author among them and use the LLM functionality to explain their overall work:

In[21]:=
ArXivExplainAuthor["Kevin Carlberg", "cs.NA"]
Out[21]=

Publisher

Daniele Gregori

Disclosures

Compatibility

Wolfram Language Version 14

Version History

  • 1.0.3 – 12 October 2024
  • 1.0.2 – 25 September 2024
  • 1.0.1 – 06 August 2024

License Information

MIT License
