Function Repository Resource:

ConcordanceWords

Find words associated with a search term in a list, text file, PDF or URL

Contributed by: Aryan Deshpande and Faizon Zaman

ResourceFunction["ConcordanceWords"][source,searchterm,n]

finds the words that occur within n words of searchterm in source.

Details

The proximity n is an integer that defaults to 3.
The argument source supports the following forms:
"file"    a file or URL corresponding to a PDF file
{"string1","string2",…}    a list of strings of text content
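
To make the role of the proximity n concrete, the following is a minimal sketch of a windowed lookup over a single string. It is not the source of ConcordanceWords: the helper name nearbyWords is hypothetical, and the case-insensitive matching is an assumption made only for this illustration.

```wolfram
(* Hypothetical helper for illustration only; not the ConcordanceWords source. *)
nearbyWords[text_String, term_String, n_Integer : 3] :=
 Module[{words, hits},
  (* Assumption: matching is case-insensitive, so lowercase every word. *)
  words = ToLowerCase[TextWords[text]];
  (* Positions at which the search term occurs. *)
  hits = Flatten[Position[words, ToLowerCase[term]]];
  (* Collect up to n words on each side of every hit, then drop the term itself. *)
  DeleteCases[
   Flatten[words[[Max[1, # - n] ;; Min[Length[words], # + n]]] & /@ hits],
   ToLowerCase[term]]]

(* Usage: words within 2 words of "fox". *)
nearbyWords["The quick brown fox jumps over the lazy dog", "fox", 2]
```

The window is clipped at the start and end of the text, so a term near either boundary simply returns fewer neighbours.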

Examples

Basic Examples (3) 

Find words occurring next to or near "president" in the US Constitution:

In[1]:=
ResourceFunction["ConcordanceWords"][
 List[ExampleData[{"Text", "USConstitution"}]], "president"]
Out[1]=

Find words occurring next to or near "Earth" on Wikipedia's page on the Moon:

In[2]:=
ResourceFunction["ConcordanceWords"][
 List[WikipediaData["Moon"]], "Earth"]
Out[2]=

Find words occurring next to or near "analytics" in a PDF published online:

In[3]:=
ResourceFunction[
 "ConcordanceWords"]["http://exampledata.wolfram.com/article.pdf", "analytics"]
Out[3]=

Scope (3) 

Specify a distance of 5 to find words within five words of "Sheet" on Wikipedia's page on "Paper":

In[4]:=
ResourceFunction["ConcordanceWords"][
 List[WikipediaData["Paper"]], "Sheet", 5]
Out[4]=

Find words occurring next to or near "circle" on a webpage using its URL:

In[5]:=
arXivAPI = "http://export.arxiv.org/api/query?search_query=all:circle&start=0&max_results=2";
In[6]:=
ResourceFunction["ConcordanceWords"][arXivAPI, "circle"]
Out[6]=

Specify a distance of 5:

In[7]:=
ResourceFunction["ConcordanceWords"][arXivAPI, "Circle", 5]
Out[7]=

Possible Issues (2) 

Web scraping only works if the page at the given URL matches the XML element condition:

In[8]:=
ResourceFunction[
 "ConcordanceWords"]["https://arxiv.org/abs/1906.00068v1", "circles"]
Out[8]=

Instead, the following code can be used to extract the PDF links from the page's hyperlinks and process each of them:

In[9]:=
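(* positions of hyperlinks that contain "/pdf/" or ".pdf"; the dot is escaped so that it matches literally *)
positions = Position[StringCases[
   Import["https://arxiv.org/abs/1906.00068v1", "Hyperlinks"], RegularExpression["(/pdf/)|(\\.pdf)"]], Except@{}, 1, Heads -> False]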
Out[9]=
In[10]:=
links = DeleteDuplicates[
  Flatten[Import["https://arxiv.org/abs/1906.00068v1", "Hyperlinks"][[#]] & /@ positions]]
Out[10]=
In[11]:=
ResourceFunction["ConcordanceWords"][#, "Circles"] & /@ links
Out[11]=

Neat Examples (1) 

Find words occurring near "Force" in articles retrieved using ServiceConnect["ArXiv"]:

In[12]:=
arXiv = ServiceConnect["ArXiv"];
articles = arXiv["Search", {"Query" -> "Physics", "MaxItems" -> 5}];
urls = Normal@articles[All, {"URL"}];
urlist = Flatten[Values[urls]];
pdfurls = StringReplace[urlist, "http://arxiv.org/abs/" -> "http://arxiv.org/pdf/"];
datapdf = Quiet[Import[#, "Plaintext"] & /@ pdfurls];
In[13]:=
ResourceFunction["ConcordanceWords"][datapdf, "Force"]
Out[13]=

Publisher

Aryan Deshpande

Version History

  • 1.0.0 – 01 March 2021