Function Repository Resource:

ImportWikipediaTables

Import all available tables from a Wikipedia page

Contributed by: César Guerra

ResourceFunction["ImportWikipediaTables"][url]

imports all the available tables from the Wikipedia page located at url.

ResourceFunction["ImportWikipediaTables"]["title"]

imports the page titled "title".

ResourceFunction["ImportWikipediaTables"][…,format]

returns the tables in the format specified.

Details and Options

The format can be any of "ListOfTables" (default), "IconAssociation" or Automatic.

ResourceFunction["ImportWikipediaTables"] takes the following options:

"AvoidRowsOfUnequalLength"       False    whether to avoid rows of unequal length
"SemanticImportSpecification"    {}       lists of tables for which to try a SemanticImport
"ShowPreview"                    False    whether to show a preview of the imported tables

Tables imported with semantic interpretation are given as Dataset objects. Otherwise tables are nested lists.

ResourceFunction["ImportWikipediaTables"] calls Import with "FullData" as the second argument and then processes the result further to extract the tables.
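
As a rough illustration of the first step described above, the raw data that the function starts from can be inspected with a direct Import call. This is only a sketch: the URL is an example page chosen here, the variable names url and raw are hypothetical, and the post-processing that the resource function applies afterwards is not reproduced.

```wolfram
(* Example URL chosen for illustration; an assumption, not taken from the function's source *)
url = "https://en.wikipedia.org/wiki/List_of_countries_by_forest_area";

(* "FullData" is an Import element for HTML pages; it returns nested lists *)
raw = Import[url, "FullData"];

(* Show an abbreviated view of the nesting before trying to pick out tables *)
Short[raw, 5]
```

The nesting of the raw result varies from page to page, which is why a dedicated function that isolates the tables is convenient.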

Examples

Basic Examples (2)

Import the tables available on the Wikipedia page "List of countries by forest area". By default, the result is a list of the tables (nested lists) found on that page:

In[1]:=

Out[2]=

See the size of the tables:

In[3]:=

Out[3]=

Usually, the scraped values are strings, but well-formatted numbers are imported as numeric types. This shows a TextGrid of the first table:

In[4]:=

Out[4]=
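
The steps above might be reproduced along the following lines. This is a sketch under assumptions: the variable name tables is hypothetical, Dimensions is one possible way to measure table size, and the code presumes the page yields at least one table.

```wolfram
(* Import all tables from the page, using the default "ListOfTables" format *)
tables = ResourceFunction["ImportWikipediaTables"]["List of countries by forest area"];

(* Number of tables found on the page *)
Length[tables]

(* Size of each table; a ragged table reports only its row count *)
Dimensions /@ tables

(* Display the first table as a framed text grid *)
TextGrid[First[tables], Frame -> All]
```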

If the second argument is specified as "IconAssociation", the result is an Association of table data. The tables are iconized as the "Data" element to shorten long outputs:

In[5]:=

Out[5]=
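
A call of this kind could look as follows. The variable name assoc is hypothetical, and Keys is used only to inspect whatever top-level keys the returned Association has; no particular key layout is assumed.

```wolfram
(* Request the Association-based format through the second argument *)
assoc = ResourceFunction["ImportWikipediaTables"][
   "List of countries by forest area", "IconAssociation"];

(* Inspect the top-level keys of the result *)
Keys[assoc]
```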

Tables can be copy–pasted or extracted programmatically (note that the 1 in the third argument of Query is used to get data from the IconizedObject):

In[6]:=

Out[6]=

Scope (2)

The key "UnequalLengthRows" contains the indices of the rows whose length differs from the most common row length in the table. In the following example, the first two rows of the first table have a different length from the rest of the table:

In[7]:=

Out[7]=

This information can be useful for fixing or skipping these rows:

In[8]:=

Out[8]=
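
The idea of locating and skipping such rows can be sketched on a small hand-made table, without any network access. The table and the names common, bad and cleaned are hypothetical; this is not the function's internal implementation, only an illustration of the same bookkeeping.

```wolfram
(* Toy table standing in for an imported one: the second row is too short *)
table = {{"Rank", "Country", "Area"}, {"Note"}, {1, "A", 10}, {2, "B", 20}};

(* Most common row length; First picks one value if there is a tie *)
common = First[Commonest[Length /@ table]];

(* Indices of rows whose length differs from the most common one *)
bad = Select[Range[Length[table]], Length[table[[#]]] != common &]
(* {2} *)

(* Keep only the rows that have the most common length *)
cleaned = Select[table, Length[#] == common &]
(* {{"Rank", "Country", "Area"}, {1, "A", 10}, {2, "B", 20}} *)
```

Once the offending indices are known, the rows can either be dropped, as here, or repaired by hand before further processing.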

Import tables from a Spanish Wikipedia page:

In[9]:=

Out[9]=

Options (8)

AvoidRowsOfUnequalLength (2)

Rows whose length differs from the most common row length in the table can be skipped automatically:

In[10]:=

Out[10]=
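
A call with this option might look like the following sketch. The page title is an example chosen here, the variable name regular is hypothetical, and True is assumed to be the setting that enables skipping, given that the documented default is False.

```wolfram
(* Skip rows whose length differs from the most common length of their table *)
regular = ResourceFunction["ImportWikipediaTables"][
   "List of islands by area", "AvoidRowsOfUnequalLength" -> True];

(* With ragged rows removed, each table should report both dimensions *)
Dimensions /@ regular
```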

Using the default setting shows that the original tables had rows of unequal length:

In[11]:=

Out[11]=

SemanticImportSpecification (5)

The "SemanticImportSpecification" option can be used to try to semantically import all the tables:

In[12]:=

Out[12]=

Semantically import a specific list of tables:

In[13]:=

Out[13]=

The tables can have rows of unequal lengths:

In[14]:=

Out[14]=

An Association or a list of types can be passed for the interpretation of each table (see the second argument of SemanticImport):

In[15]:=

Normal[ResourceFunction["ImportWikipediaTables"][
"List of islands by area", "SemanticImportSpecification" -> {{1}, {3, <|1 -> Number, 2 -> String|>}}][[3]]]

Out[15]=

Additionally, SemanticImport options can be passed for each table:

In[16]:=

ResourceFunction[
"ImportWikipediaTables"]["List of islands by area", "IconAssociation", "SemanticImportSpecification" -> {{1}, {3, {Integer, String, Number, Number, None}, "Options" -> {HeaderLines -> 2}}}]

Out[16]=

ShowPreview (1)

Show a preview of each table in a "Print" cell:

In[17]:=

Out[17]=
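
A minimal sketch of such a call, assuming True is the setting that turns previews on (the documented default is False) and using an example page title chosen here:

```wolfram
(* Print a preview of every imported table while still returning the tables *)
previewed = ResourceFunction["ImportWikipediaTables"][
   "List of islands by area", "ShowPreview" -> True];
```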

Properties and Relations (2)

Extract the names of the 10 largest lakes by area from a table on Wikipedia:

In[18]:=

Out[19]=

In the Wolfram Language, the same data can be gathered directly using the Entity framework and knowledge representation:

In[20]:=

Out[20]=

Import a table:

In[21]:=

Out[21]=

For more complex data tables, the "Source" element of the page can be imported directly with Import, but parsing the result is complicated. For example, this shows the underlying structure of the same table:

In[22]:=

Partition[
  StringCases[
    StringCases[
      Import["https://en.wikipedia.org/wiki/List_of_river_systems_by_length", "Source"],
      Shortest["<table" ~~ __ ~~ "</table>"],
      \[Infinity]
    ][[6]],
    {
      "<td" ~~ Shortest[___] ~~ ">" ~~ Shortest[str__] ~~ "</td>" :> "data",
      "<th" ~~ Shortest[___] ~~ ">" ~~ Shortest[str__] ~~ "</th>" :> "header"
    }
  ],
  8
][[;; 5]] // TableForm

Out[22]=

Possible Issues (2)

Unequal length rows can appear at the end of tables, since Wikipedia tables can have headers at the end:

In[23]:=

Out[23]=

Wikipedia tables can also have spanned cells anywhere. Rows containing them are reported as rows of unequal length so that they can be fixed or deleted:

In[24]:=

Out[24]=

In some cases, rows of unequal length are due to missing data and can be handled incorrectly by SemanticImport. For example, extracting Cuba's GDP from the corresponding Wikipedia table gives an incorrect result because the values in that row are shifted:

In[25]:=