Wolfram Research

Function Repository Resource:

KeywordsGraph

A weighted graph that connects frequently used keywords of a text when they are sequential neighbors, thus visualizing the flow and clustering of ideas in the text

Contributed by: Vitaliy Kaurov

ResourceFunction["KeywordsGraph"][text,number]

finds the given number of most frequently used words in text (keywords) and builds a graph with these keywords as vertices, where two vertices are connected by an edge if one keyword directly follows the other in text.

ResourceFunction["KeywordsGraph"][text, number, blist]

builds a graph with blacklisted strings blist removed from the text.

Details and Options

The function returns a Graph expression.
All options of Graph can be applied. The only additional options beyond those of Graph are "StopWords" and "LowerCase".
The default setting "StopWords" -> True automatically applies DeleteStopwords, so no stop words can appear as keywords. Use "StopWords" -> False to keep the stop words.
The default setting "LowerCase" -> True automatically applies ToLowerCase to remove unwanted capitalization (for example, at the beginning of sentences) that might lead to incorrect graphs. Use "LowerCase" -> False to keep the capital letters in text, for example, to distinguish some abbreviations.
VertexWeight is set for every vertex to the number of times the corresponding keyword is encountered in text.
EdgeWeight is set for every edge to the number of times the corresponding pair of keywords occurs as direct neighbors. Among other applications, this helps to build a more meaningful CommunityGraphPlot, as some of its methods take EdgeWeight into account.
By default, an undirected Graph is returned. Use DirectedEdges -> True to get a directed graph that shows the sequential order of connected keywords in text.
By default, VertexLabels -> Automatic is set to show keywords on the graph. Use the option VertexLabels -> None to remove them.
Larger texts take longer to process.
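
The behavior described above can be approximated in a few lines. The following is only a sketch under stated assumptions, not the source of ResourceFunction["KeywordsGraph"]: the name keywordsGraphSketch is hypothetical, stop words are assumed to be removed before neighbors are determined, and repeated identical neighbors are assumed not to create self-loops.

```wolfram
(* Hypothetical sketch; NOT the actual implementation of KeywordsGraph *)
keywordsGraphSketch[text_String, n_Integer?Positive] :=
  Module[{words, counts, keys, pairs, edgeCounts},
    (* assumption: lowercase and drop stop words before splitting into words *)
    words = TextWords[DeleteStopwords[ToLowerCase[text]]];
    (* the n most frequent words, with their counts *)
    counts = TakeLargest[Counts[words], UpTo[n]];
    keys = Keys[counts];
    (* neighboring word pairs in which both words are distinct keywords *)
    pairs = Select[Partition[words, 2, 1],
      SubsetQ[keys, #] && UnsameQ @@ # &];
    (* count each pair regardless of order, for an undirected graph *)
    edgeCounts = Counts[Sort /@ pairs];
    Graph[keys, UndirectedEdge @@@ Keys[edgeCounts],
      VertexWeight -> Values[counts],
      EdgeWeight -> Values[edgeCounts],
      VertexLabels -> "Name"]]
```

For example, keywordsGraphSketch["cat dog cat bird dog cat", 2] keeps the two most frequent words and joins them with a single weighted edge.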

Examples

Basic Examples

Consider an English tongue twister:

In[1]:=
text = "Betty Botter bought some butter
  But she said the butter\[CloseCurlyQuote]s bitter
  If I put it in my batter, it will make my batter bitter
  But a bit of better butter will make my batter better
  So \[OpenCurlyQuote]twas better Betty Botter bought a bit of better \
butter";

Find the nine most frequently used words (not counting stop words) and see which words are directly next to each other in the text:

In[2]:=
ResourceFunction["KeywordsGraph"][text, 9]
Out[2]=

You can also find the order in which words follow each other:

In[3]:=
ResourceFunction["KeywordsGraph"][text, 9, DirectedEdges -> True]
Out[3]=

Scope

Get the text of the book Alice in Wonderland and build a keywords graph for the top eleven keywords:

In[4]:=
text = ExampleData[{"Text", "AliceInWonderland"}];
ResourceFunction["KeywordsGraph"][text, 11]
Out[5]=

Exclude the unwanted words by forming a blacklist. You can also apply any option of Graph. For instance, you can restyle your graph and resize vertices in accordance with their properties:

In[6]:=
blist = {"came", "said", "like", "just", "went"};
g = ResourceFunction["KeywordsGraph"][text, 11, blist, VertexSize -> "VertexWeight", GraphStyle -> "Prototype"]
Out[7]=

Because KeywordsGraph yields a Graph expression, you can apply to it any function that works on a Graph. For instance, you can find clusters by displaying the community structure (note that the edge weights may influence how the clustering is computed):

In[8]:=
CommunityGraphPlot[g]
Out[8]=

VertexWeight and EdgeWeight are set to the number of times the keywords and their next-neighbor pairs, respectively, occur in the text:

In[9]:=
PropertyValue[g, {VertexWeight, EdgeWeight}]
Out[9]=

The order of the VertexWeight values corresponds to the order of VertexList:

In[10]:=
ListPlot[AssociationThread[
  VertexList[g] -> PropertyValue[g, VertexWeight]],
 PlotTheme -> "Detailed", ScalingFunctions -> "Log"]
Out[10]=
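
The same pairing can be done for edges. The sketch below uses a small hand-built weighted graph (g0, with made-up vertex names and weights) as a stand-in for the output of KeywordsGraph, and assumes that the EdgeWeight values are returned in the order of EdgeList:

```wolfram
(* hand-built stand-in graph; names and weights are illustrative only *)
g0 = Graph[{"alice", "queen", "king"},
   {UndirectedEdge["alice", "queen"], UndirectedEdge["queen", "king"]},
   VertexWeight -> {5, 3, 2}, EdgeWeight -> {4, 1}];

(* keyword -> number of occurrences *)
vertexCounts = AssociationThread[VertexList[g0] -> PropertyValue[g0, VertexWeight]];

(* edge -> number of times the pair occurs as neighbors *)
edgeCounts = AssociationThread[EdgeList[g0] -> PropertyValue[g0, EdgeWeight]];

(* the most frequent neighbor pair *)
TakeLargest[edgeCounts, 1]
```

Replacing g0 with a graph returned by KeywordsGraph should give the corresponding lookup tables for a real text.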

Options

LowerCase

Consider a text where capitalization matters. For instance, here "us" and "US" are different terms:

In[11]:=
text = "A few of us have recently immigrated to the US and found \
spouses here. To us the US seems a great place to raise a family.";

By default ToLowerCase is applied and "us" is not distinguished from "US":

In[12]:=
blist = {"of", "a", "the", "to", "and"};
ResourceFunction["KeywordsGraph"][text, 6, blist, "StopWords" -> False, DirectedEdges -> True]
Out[13]=

Use the option "LowerCase" -> False to distinguish capitalized cases:

In[14]:=
ResourceFunction["KeywordsGraph"][text, 6, blist, "LowerCase" -> False, "StopWords" -> False, DirectedEdges -> True]
Out[14]=
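
To see why lowercasing merges the two terms, it can help to count words directly. This is a minimal sketch: the variable names are illustrative, and splitting the text with TextWords is an assumption about how the words are separated, not a statement about the function's internals.

```wolfram
sample = "To us the US seems a great place.";

(* case-sensitive counts keep "us" and "US" apart *)
caseSensitive = Lookup[Counts[TextWords[sample]], {"us", "US"}, 0]

(* after ToLowerCase both occurrences are counted under "us" *)
lowerCased = Lookup[Counts[TextWords[ToLowerCase[sample]]], {"us", "US"}, 0]
```

The first result is {1, 1} and the second is {2, 0}.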

StopWords

Sometimes you might need to keep some stop words. For example, consider "us" and "US" here:

In[15]:=
text = "A few of us have recently immigrated to the US and found \
spouses here. To us the US seems a great place to raise a family.";

By default "us" and "US" will be removed by DeleteStopwords:

In[16]:=
ResourceFunction["KeywordsGraph"][text, 6]
Out[16]=

Use the option "StopWords" -> False to retain stop words, and make your own blacklist of words to remove:

In[17]:=
blist = {"of", "a", "the", "to", "and"};
ResourceFunction["KeywordsGraph"][text, 6, blist, "StopWords" -> False]
Out[18]=

Applications

Get the dataset for presidential inaugural addresses from the Wolfram Data Repository and order it by time:

In[19]:=
inaugural = SortBy[ResourceData["Presidential Inaugural Addresses"], "Date"];

Extract the text of the inaugural addresses of the two most recent presidents as of 2019, Barack Obama and Donald Trump:

In[20]:=
obama = inaugural[-2]["Text"];
trump = inaugural[-1]["Text"];

Define graph styles:

In[21]:=
styles = {
   VertexLabelStyle -> Directive[GrayLevel[.8], 14],
   EdgeStyle -> Opacity[.5],
   VertexSize -> "VertexWeight",
   VertexStyle -> Opacity[.7],
   GraphStyle -> "Prototype",
   Background -> Black};

Build a KeywordsGraph for Barack Obama and for Donald Trump using 30 keywords. You can get a sense of the key ideas without actually reading the texts:

In[22]:=
CommunityGraphPlot[
 ResourceFunction["KeywordsGraph"][obama, 30, {}, styles],
 PlotLabel -> Style["Barack Obama", 30, Lighter@Blue, FontFamily -> "Phosphate"],
 CommunityBoundaryStyle -> Directive[Yellow, Dashed, Opacity[.5]]]
Out[22]=
In[23]:=
CommunityGraphPlot[
 ResourceFunction["KeywordsGraph"][trump, 30, {}, styles],
 PlotLabel -> Style["Donald Trump", 30, Lighter@Red, FontFamily -> "Phosphate"],
 CommunityBoundaryStyle -> Directive[Yellow, Dashed, Opacity[.5]]]
Out[23]=

Possible Issues

The second argument (number of keywords in graph) should not exceed the total number of keywords in the text:

In[24]:=
text = "How much wood would a woodchuck chuck if a woodchuck could \
chuck wood?
  He would chuck, he would, as much as he could, and chuck as much \
wood,
  as a woodchuck would if a woodchuck could chuck wood.";
In[25]:=
ResourceFunction["KeywordsGraph"][text, 6, VertexLabels -> Automatic]
Out[25]=
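
One way to guard against this is to cap the requested number at the number of distinct candidate keywords. The helpers availableKeywords and safeCount below are hypothetical and assume the default settings (lowercasing and stop-word removal); with other option settings the count would need to be computed accordingly.

```wolfram
(* hypothetical helper: number of distinct words left after lowercasing
   and stop-word removal, i.e. the default preprocessing assumed here *)
availableKeywords[text_String] :=
  Length[DeleteDuplicates[TextWords[DeleteStopwords[ToLowerCase[text]]]]]

(* hypothetical helper: never request more keywords than are available *)
safeCount[text_String, n_Integer?Positive] := Min[n, availableKeywords[text]]
```

The capped value can then be passed as the second argument, as in ResourceFunction["KeywordsGraph"][text, safeCount[text, 6]].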

Neat Examples

Get the dataset for presidential inaugural addresses from the Wolfram Data Repository and order it by time:

In[26]:=
inaugural = SortBy[ResourceData["Presidential Inaugural Addresses"], "Date"];
In[27]:=
allTEXT = Normal[inaugural[All, "Text"]];
allNAME = Normal[inaugural[All, DateString[#Date, "Year"] <> " " <> CommonName[#Name] &]];

Define graph styles:

In[28]:=
styles = {
   VertexLabels -> None,
   EdgeStyle -> Opacity[.5],
   VertexSize -> "VertexWeight",
   VertexStyle -> Opacity[.7],
   GraphStyle -> "Prototype",
   Background -> Black};

Build KeywordsGraph for each address using 30 keywords and arrange them in a grid:

In[29]:=
allGRAPH = MapThread[
   CommunityGraphPlot[
     ResourceFunction["KeywordsGraph"][#1, 30, {}, styles],
     PlotLabel -> Style[#2, White, 13],
     CommunityBoundaryStyle -> Directive[Yellow, Dashed, Opacity[.5]]] &,
   {allTEXT, allNAME}];
In[30]:=
Grid[Partition[allGRAPH, 4], Spacings -> {0, 0}, Background -> Black]
Out[30]=
