Function Repository Resource:

ImportWikipediaTables

Import all available tables from a Wikipedia page

Contributed by: César Guerra

ResourceFunction["ImportWikipediaTables"][url]

imports all the available tables from the Wikipedia page located at url.

ResourceFunction["ImportWikipediaTables"]["title"]

imports the page titled "title".

ResourceFunction["ImportWikipediaTables"][…,format]

returns the tables in the format specified.

Details and Options

The format can be any of "ListOfTables" (default), "IconAssociation" or Automatic.
ResourceFunction["ImportWikipediaTables"] takes the following options:
"AvoidRowsOfUnequalLength"       False    whether to avoid rows of unequal length
"SemanticImportSpecification"    {}       lists of tables for which to try a SemanticImport
"ShowPreview"                    False    whether to show a preview of the imported tables
Tables imported with semantic interpretation are given as Dataset objects. Otherwise tables are nested lists.
ResourceFunction["ImportWikipediaTables"] calls the Import function with "FullData" as the second argument and then processes the result further to extract the tables.

Examples

Basic Examples (2) 

Import the tables available on the Wikipedia page "List of countries by forest area". By default, the result is a list of the tables (as nested lists) found on that page:

In[1]:=
tables = ResourceFunction["ImportWikipediaTables"][
   "List of countries by forest area"];
Short /@ tables
Out[2]=

See the size of the tables:

In[3]:=
Dimensions /@ tables
Out[3]=

Usually, the scraped values are strings, but well-formatted numbers in the data are imported as numeric types. This shows a TextGrid of the first table:

In[4]:=
TextGrid[First[tables]]
Out[4]=

If the second argument is specified as "IconAssociation", the result is an Association of table data. The tables are iconized as the "Data" element to shorten long outputs:

In[5]:=
assoc = ResourceFunction["ImportWikipediaTables"][
  "List of countries by forest area", "IconAssociation"]
Out[5]=

Tables can be copy–pasted or extracted programmatically (note that the 1 in the third argument of Query is used to get data from the IconizedObject):

In[6]:=
TextGrid[Query[1, "Data", 1]@assoc]
Out[6]=

Scope (2) 

The key "UnequalLengthRows" contains the indices of the rows whose length differs from the most common row length in the table. In the following, the first two rows of the first table have a different length from the rest of the table:

In[7]:=
assoc = ResourceFunction["ImportWikipediaTables"][
  "List_of_U.S._states_and_territories_by_population", "IconAssociation"]
Out[7]=

This information can be useful for fixing or skipping these rows:

In[8]:=
assoc[[1, "Data", 1, 3 ;;]] // Dimensions
Out[8]=
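
The idea behind "UnequalLengthRows" can be sketched on a small nested list. This is only an illustration under stated assumptions, not the function's actual implementation: the table sample and the names lengths, common, unequalRows and cleaned are made up for this example.

```wolfram
(* Hypothetical sample table; not taken from Wikipedia *)
sample = {{"Header"}, {"a", "b"}, {"x", 1, 2}, {"y", 3, 4}, {"z", 5, 6}};

(* Length of every row and the most common row length *)
lengths = Length /@ sample;
common = First[Commonest[lengths]];

(* Indices of the rows whose length differs from the most common one *)
unequalRows = Select[Range[Length[sample]], lengths[[#]] != common &];

(* Keep only the rows of the most common length *)
cleaned = sample[[Complement[Range[Length[sample]], unequalRows]]];
Dimensions[cleaned]
```

Here the first two rows are flagged, mirroring the situation in the example above, and the remaining rows form a rectangular array.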

Import tables from a Spanish Wikipedia page:

In[9]:=
Short[ResourceFunction["ImportWikipediaTables"][
  URL["https://es.wikipedia.org/wiki/Organización_territorial_de_México"]], 20]
Out[9]=

Options (8) 

AvoidRowsOfUnequalLength (2) 

Rows whose length differs from the most common row length in the table can be skipped automatically:

In[10]:=
Dimensions /@ ResourceFunction["ImportWikipediaTables"][
  "List of U.S. states and territories by area", "AvoidRowsOfUnequalLength" -> True]
Out[10]=

Using the default setting shows that the original tables had rows of unequal length:

In[11]:=
Dimensions /@ ResourceFunction["ImportWikipediaTables"][
  "List of U.S. states and territories by area"]
Out[11]=
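
Dimensions is a convenient check here because it only reports the levels at which a nested list is rectangular. A minimal sketch with two made-up lists (ragged and rect are hypothetical names, not data from the page):

```wolfram
(* Hypothetical lists used only to illustrate Dimensions *)
ragged = {{"a", "b"}, {"c", "d", "e"}};  (* rows of unequal length *)
rect = {{"a", "b"}, {"c", "d"}};         (* all rows have the same length *)

(* A ragged list only has a well-defined first dimension *)
Dimensions[ragged]

(* A rectangular list reports both the row and the column count *)
Dimensions[rect]
```

A result with a single number therefore indicates that the table still contains rows of unequal length.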

SemanticImportSpecification (5) 

The "SemanticImportSpecification" option can be used to try to semantically import all the tables:

In[12]:=
ResourceFunction[
 "ImportWikipediaTables"]["List of islands by population", "IconAssociation", "SemanticImportSpecification" -> All]
Out[12]=

Semantically import a specific list of tables:

In[13]:=
ResourceFunction[
 "ImportWikipediaTables"]["List of islands by area", "IconAssociation", "SemanticImportSpecification" -> {3, 6}]
Out[13]=

The tables can have rows of unequal lengths:

In[14]:=
ResourceFunction[
 "ImportWikipediaTables"]["List of U.S. states and territories by area", "IconAssociation", "SemanticImportSpecification" -> All]
Out[14]=

An Association or a list of types can be passed for the interpretation of each table (see the second argument of SemanticImport):

In[15]:=
Normal[ResourceFunction["ImportWikipediaTables"][
   "List of islands by area", "SemanticImportSpecification" -> {{1}, {3, <|1 -> Number, 2 -> String|>}}][[3]]]
Out[15]=

Additionally, SemanticImport options can be passed for each table:

In[16]:=
ResourceFunction[
 "ImportWikipediaTables"]["List of islands by area", "IconAssociation", "SemanticImportSpecification" -> {{1}, {3, {Integer, String, Number, Number, None}, "Options" -> {HeaderLines -> 2}}}]
Out[16]=

ShowPreview (1) 

Show a preview of each table in a "Print" cell:

In[17]:=
Dimensions /@ ResourceFunction["ImportWikipediaTables"]["List of cities by GDP", "ShowPreview" -> True]
Out[17]=

Properties and Relations (2) 

Extract the names of the 10 largest lakes by area from a table on Wikipedia:

In[18]:=
tables = ResourceFunction["ImportWikipediaTables"]["List of lakes by area"];
tables[[1, 2 ;; 10, 3]]
Out[19]=

In the Wolfram Language, the same data can be gathered directly using the Entity framework and knowledge representation:

In[20]:=
EntityList[
 EntityClass[
  "Lake", {EntityProperty["Lake", "SurfaceArea"] -> TakeLargest[10]}]]
Out[20]=

Import a table:

In[21]:=
ResourceFunction["ImportWikipediaTables"][
  "List of river systems by length", "ShowPreview" -> True] // Short
Out[21]=

For more complex tables, Import can be called directly with the "Source" element, but parsing the result is complicated. For example, this shows the underlying structure of the same table:

In[22]:=
Partition[
   StringCases[
    StringCases[
      Import["https://en.wikipedia.org/wiki/List_of_river_systems_by_length", "Source"],
      Shortest["<table" ~~ __ ~~ "</table>"], \[Infinity]][[6]],
    {"<td" ~~ Shortest[___] ~~ ">" ~~ Shortest[str__] ~~ "</td>" :> "data",
     "<th" ~~ Shortest[___] ~~ ">" ~~ Shortest[str__] ~~ "</th>" :> "header"}],
   8][[;; 5]] // TableForm
Out[22]=
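
As a rough illustration of why parsing the raw source is laborious, the following sketch applies the same kind of string patterns to a small hand-written HTML fragment. The fragment and the names html and cells are made up for this example; real Wikipedia markup is considerably messier (attributes, nested tags, spanned cells), so this should not be read as a general parser:

```wolfram
(* Hypothetical, hand-written HTML fragment; not real Wikipedia source *)
html = "<table><tr><th>Name</th><th>Area</th></tr><tr><td>A</td><td>1</td></tr></table>";

(* Tag each cell as header or data and keep its text content *)
cells = StringCases[html, {
    "<th" ~~ Shortest[___] ~~ ">" ~~ Shortest[s__] ~~ "</th>" :> {"header", s},
    "<td" ~~ Shortest[___] ~~ ">" ~~ Shortest[s__] ~~ "</td>" :> {"data", s}}];

(* This fragment has two cells per row *)
Partition[cells, 2] // TableForm
```

Even in this tiny case the row width has to be known in advance to rebuild the table.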

Possible Issues (2) 

Rows of unequal length can appear at the end of a table, since Wikipedia tables can have headers at the end:

In[23]:=
ResourceFunction["ImportWikipediaTables"][
  "List of United States cities by population", "IconAssociation"][1]
Out[23]=

Wikipedia tables can also have spanned cells anywhere. In those cases, the affected rows are reported as rows of unequal length so that they can be fixed or deleted:

In[24]:=
ResourceFunction[
 "ImportWikipediaTables"]["List of longest bridges", "IconAssociation"]
Out[24]=

In some cases, rows of unequal length are due to missing data and can be handled incorrectly by SemanticImport. For example, extracting the GDP of Cuba from the corresponding Wikipedia table gives an incorrect result, since the values in this row are shifted:

In[25]:=
ds = Query[1, "Data", 1][
   ResourceFunction["ImportWikipediaTables"][
    "List of countries by GDP (nominal)", "IconAssociation", "SemanticImportSpecification" -> {{1, Automatic, "Options" -> {HeaderLines -> 1, ExcludedLines -> {2}}}}]];
ds = ds[All, <|"Country" -> "Country/Territory", "UNRegion" -> "UN Region", "Estimate (IMF)" -> "IMF [1] [13]", "Date (IMF)" -> "World Bank [14]", "Estimate (WorldBank)" -> "United Nations [15]", "Date (WorldBank)" -> "column6"|>];
ds[Select[#"Country" === Entity["Country", "Cuba"] &]]
Out[26]=

In other cases, rows of unequal length occur because a cell in the first column spans several rows to group them:

In[27]:=
tables = ResourceFunction["ImportWikipediaTables"][
   "List of tallest mountains in the Solar System", "IconAssociation"][1]
Out[27]=

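To show the spanned strings in the first column, prepend an empty string to every row that is not listed under "UnequalLengthRows", so that those rows line up with the rows that contain the spanned cell: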

In[28]:=
table1 = First@tables["Data"];
With[{idx = Complement[Range[First@tables["Dimensions"]], tables["UnequalLengthRows"]]}, table1[[idx]] = Prepend[#, ""] & /@ table1[[idx]]];
table1[[All, 1 ;; 3]] // TableForm
Out[29]=
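
The same padding idea can be tried on a small made-up table. This is a minimal sketch, assuming that the shorter rows are exactly the ones missing the spanned first-column cell; the data and the names rows, width and padded are hypothetical:

```wolfram
(* Hypothetical table: "Group A" and "Group B" stand for cells that span several rows *)
rows = {{"Group A", "x", 1}, {"y", 2}, {"z", 3}, {"Group B", "w", 4}};

(* Target width is the length of the longest row *)
width = Max[Length /@ rows];

(* Left-pad the shorter rows with empty strings so that the columns line up *)
padded = PadLeft[#, width, ""] & /@ rows;
padded // TableForm
```

PadLeft is used here instead of Prepend so that rows that already have the full width are left untouched.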

Applications (1) 

Plot a histogram from the table of the highest mountains on Earth:

In[30]:=
Histogram[
 First[ResourceFunction["ImportWikipediaTables"][
   "List of highest mountains on Earth", "SemanticImportSpecification" -> {{1, <|3 -> Real|>, "Options" -> {HeaderLines -> None, ExcludedLines -> {1, 2, 3}}}}]], 20]
Out[30]=

Publisher

Cesar Guerra

Version History

  • 1.0.0 – 15 March 2023
