Function Repository Resource:

WebpageWordCloud

Create a word cloud graphic from webpage text

Contributed by: Daniel de Souza Carvalho

ResourceFunction["WebpageWordCloud"][url]

creates a word cloud graphic from the text of the webpage at url.

Details and Options

ResourceFunction["WebpageWordCloud"] accepts all the options from the WordCloud function.
Some websites are built dynamically with JavaScript in the user's browser (on the client side). For such pages, this function can take some time while it waits for the network request and for the browser to render the page.
ResourceFunction["WebpageWordCloud"] uses the following procedure, which can be slow depending on network and web conditions: (1) open a browser in the background; (2) send a request to the web server; and (3) render the webpage (running JavaScript, API calls, etc.) to get its text.
ResourceFunction["WebpageWordCloud"] also supports the following options of its own:
"SessionBrowser"    "Chrome"    browser to use ("Firefox" is also supported)
"SessionVisible"    False       whether to show the browser
"SessionKill"       True        whether to kill the browser instance session before rendering the page

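The three-step procedure described above can be sketched with the built-in web-session functions. This is only an illustration under stated assumptions, not the source of the repository function: it assumes StartWebSession, WebExecute and DeleteObject are available with a working Chrome or Firefox driver, that the "JavascriptExecute" command returns the value of the script, and the helper names webpageTextSketch and webpageWordCloudSketch are hypothetical.

```wolfram
(* Hypothetical helper: a sketch of the procedure, not the repository implementation *)
webpageTextSketch[url_String, browser_String : "Chrome"] :=
 Module[{session, text},
  (* (1) open a browser in the background *)
  session = StartWebSession[browser, Visible -> False];
  (* (2) send the request and let the browser load the page *)
  WebExecute[session, "OpenPage" -> url];
  (* (3) read the rendered text once client-side JavaScript has run *)
  text = WebExecute[session,
    "JavascriptExecute" -> "return document.body.innerText;"];
  (* close the browser session *)
  DeleteObject[session];
  text
 ]

(* Hypothetical helper: build the word cloud, forwarding options to WordCloud *)
webpageWordCloudSketch[url_String, opts : OptionsPattern[WordCloud]] :=
 WordCloud[DeleteStopwords[webpageTextSketch[url]], opts]
```

With these definitions, webpageWordCloudSketch["https://www.wolfram.com", Background -> LightYellow] would be expected to behave roughly like the resource function with its default session options, at the cost of the same browser start-up and rendering time.
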
Examples

Basic Examples (4) 

A word cloud graphic provides a kind of visual summary of a webpage:

In[1]:=
ResourceFunction["WebpageWordCloud"]["www.wolfram.com"]
Out[1]=

Check out the trends at Amazon:

In[2]:=
ResourceFunction["WebpageWordCloud"]["www.amazon.com"]
Out[2]=

Highlight what a blog post is about:

In[3]:=
ResourceFunction[
 "WebpageWordCloud"]["https://blog.wolfram.com/2019/10/24/the-new-world-of-notebook-publishing/"]
Out[3]=

Get the highlights from a news site:

In[4]:=
ResourceFunction["WebpageWordCloud"]["www.nyt.com"]
Out[4]=

Options (5) 

All options from WordCloud can be used:

In[5]:=
ResourceFunction[
 "WebpageWordCloud"]["https://en.wikipedia.org/wiki/John_II_of_Portugal", ColorFunction -> ColorData["DeepSeaColors"], Background -> Orange]
Out[5]=

Set the font family:

In[6]:=
ResourceFunction[
 "WebpageWordCloud"]["https://www.wolframscience.com/nks/p444--irreversibility-and-the-second-law-of-thermodynamics--webview/",
  FontFamily -> "Courier"]
Out[6]=

Provide a shape to fill:

In[7]:=
ResourceFunction[
 "WebpageWordCloud"]["https://en.wikipedia.org/wiki/Box", ColorNegate[\!\(\*
GraphicsBox[
TagBox[RasterBox[CompressedData["
1:eJztnb1qImEUhv3FWKirjVoGwdpCcTtdxCJtliV1QkzYxoBZWPYKbPQStPAO
rKzUK1DsvAdtFATx92T8YALp/JnNvDN5H3g8aOHgPCIDA8frh5fbJ5fD4Xi9
0h5u7//+qFTu//38pj35VX79/VwuPd6U/5SeS5XvD27txTvNR02PphBCCCEG
st1uZbPZUAMl9qHdbkuj0ZBms6kmvcxWqyXL5fLkDvv9Xs1kMimHy2FqnJPJ
5MM5PqVHNpsVj8cjPp9PTXqeXq9XzUgkItPp9Owe6XRaNXW73aZ/r6ys0+lU
MxgMsgeA7IEle2DJHliyB5bsgSV7YGl2D5fL9SXUzzN6D2p+D/2Y4XBYOp2O
9Pt96fV6atrNbrerZiqVev89QO0Ri8WOPo7VKRQKR50fM3tEo1GZz+fv97IO
026u12s1c7mcJXosFouTj2kldrudmvl8nj0AYA8s2AML9sCCPbBgDyzYAwv2
wII9sGAPLNgDC/bAgj2wYA8s2AML9sCCPbBgDyzYAwv2wII9sGAPLNgDC/bA
gj2wYA8s2AML9sCCPbBgDyzYAwv2wII9sGAPLNgDC/bAgj2wYA8s2AML9sCC
PbCwWo/ZbKZ2ZRz2Spi9L/1/uFqt1LTCvox4PH70caxOsViE7xEIBKRarUq9
XrettVpNzUQioT4z4r4litvD7H3pnyX39VlT9sCSPbBkDyzZA0v2wJI9sDSq
RyaTUe+lX2vT89R3i4dCoYt66PuXqTH6/f6zeuiMx2MZDAYyHA7VpJc5Go34
H5024nDfhRorIYQQQgj5fN4AMlfMrw==
"], {{0, 65.25}, {75., 0}}, {0, 255},
ColorFunction->RGBColor,
ImageResolution->96],
BoxForm`ImageTag["Byte", ColorSpace -> "RGB", Interleaving -> True],
Selectable->False],
DefaultBaseStyle->"ImageGraphics",
ImageSize->Magnification[1],
ImageSizeRaw->{75., 65.25},
PlotRange->{{0, 75.}, {0, 65.25}}]\)]]
Out[7]=

Specify a background color together with a shape to fill:

In[8]:=
ResourceFunction[
 "WebpageWordCloud"]["https://en.wikipedia.org/wiki/Alpha", Binarize[Rasterize[\[Alpha], RasterSize -> 300, ImageSize -> Large]],
  Background -> LightYellow]
Out[8]=

Either Firefox or Chrome can be set as the automated (Marionette) browser. Show the browser while it is being driven with "SessionVisible" -> True, and keep it open afterward with "SessionKill" -> False:

In[9]:=
ResourceFunction["WebpageWordCloud"]["www.wolfram.com", "SessionBrowser" -> "Chrome", "SessionVisible" -> True, "SessionKill" -> False]
Out[9]=

Publisher

Daniel de Souza Carvalho

Version History

  • 2.0.0 – 01 May 2020
  • 1.0.0 – 11 November 2019

Author Notes

This function can be slow, as it opens a browser session, requests the URL, renders the page and reads the text back.
Getting a webpage programmatically is not trivial: many websites have security protections against "robots", and many pages need JavaScript code to run on the client side before their content is rendered.
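
For comparison, a static fetch with Import returns only the text present in the HTML that the server sends, so content built by client-side JavaScript can be missing from the result. A minimal sketch, assuming a page that is served statically and does not block automated requests; the helper name staticWordCloudSketch is hypothetical and is not part of the resource function:

```wolfram
(* Hypothetical helper: static fetch without a browser; JavaScript is not executed *)
staticWordCloudSketch[url_String, opts : OptionsPattern[WordCloud]] :=
 WordCloud[DeleteStopwords[Import[url, "Plaintext"]], opts]
```

This avoids the cost of starting a browser, but on JavaScript-heavy sites the resulting word cloud may reflect little of what a visitor actually sees.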
