Function Repository Resource:

ReadScan


Evaluate a function applied to each record in a file

Contributed by: Andrew Steinacher

ResourceFunction["ReadScan"][f,file]

evaluates f applied to each expression read from file in turn.

ResourceFunction["ReadScan"][f,file,type]

evaluates f applied to each object of the specified type read from file in turn.

ResourceFunction["ReadScan"][f,file,{type1,type2,…}]

evaluates f applied to each list of the specified types read from file in turn.

Details

ResourceFunction["ReadScan"] supports the same types and options as Read.
ResourceFunction["ReadScan"][f,File["file"]] and ResourceFunction["ReadScan"][f,InputStream["name",n]] are supported.
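For instance, a stream opened with OpenRead can be scanned and then closed explicitly. This is a minimal sketch; the example file name is one of the built-in ExampleData files used elsewhere on this page:

```wolfram
(* open an input stream, scan each line with Print, then close the stream *)
stream = OpenRead["ExampleData/sentences"];
ResourceFunction["ReadScan"][Print, stream, String];
Close[stream];
```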

Examples

Basic Examples (2) 

Print expressions from a file:

In[1]:=
file = CreateFile[];
Put[<|a -> 1, b -> 2|>, <|a -> 3, b -> 4|>, file];
ResourceFunction["ReadScan"][Print, file]
Out[3]=

Print lines from a file:

In[4]:=
ResourceFunction["ReadScan"][Print, "ExampleData/sentences", String]
Out[4]=

Scope (1) 

Process groups of records in bulk from a file:

In[5]:=
ResourceFunction["ReadScan"][Print, "ExampleData/numbers", ConstantArray[Number, 3]]
Out[5]=

Options (2) 

Print records from a file:

In[6]:=
ResourceFunction["ReadScan"][Print, "ExampleData/source", Record]
Out[6]=

Specify record separators:

In[7]:=
ResourceFunction["ReadScan"][Print, "ExampleData/source", Record, RecordSeparators -> {{"(: "}, {" :)"}}]
Out[7]=

Applications (1) 

End processing of a file early using Catch and Throw:

In[8]:=
Catch@ResourceFunction["ReadScan"][
  (
    Print[#];
    If[StringStartsQ[#, DigitCharacter],
     Throw[
      Success["FoundNumber",
       <| "MessageTemplate" -> "A line starting with a number was found!"
        |>
       ]
      ]
     ]
    ) &,
  "ExampleData/ExampleData.txt",
  String
  ]
Out[8]=

Properties and Relations (3) 

ReadScan[f,file,type] is comparable to Scan[f,ReadList[file,type]], but may use less memory:

In[9]:=
MaxMemoryUsed@
 Scan[StringLength, ReadList["ExampleData/TreesOwnedByTheCityOfChampaign.csv", String]]
Out[9]=
In[10]:=
MaxMemoryUsed@
 ResourceFunction["ReadScan"][StringLength, "ExampleData/TreesOwnedByTheCityOfChampaign.csv", String]
Out[10]=

ReadScan used with Reap and Sow can replicate the behavior of ReadList:

In[11]:=
Reap[ResourceFunction["ReadScan"][Sow, "ExampleData/source", String]][[2, 1]] === ReadList["ExampleData/source", String]
Out[11]=

Processing a large file with a simple structure can use a lot of memory if done with Import:

In[12]:=
MaxMemoryUsed[
 Import@FindFile["ExampleData/TreesOwnedByTheCityOfChampaign.csv"]]
Out[12]=

ReadScan can achieve similar results with significantly less memory:

In[13]:=
MaxMemoryUsed[
 Reap[ResourceFunction["ReadScan"][Sow, FindFile["ExampleData/TreesOwnedByTheCityOfChampaign.csv"], String]][[2, 1]]]
Out[13]=

If the lines of the file are to be processed but not stored in the current session, substantially less memory is required:

In[14]:=
MaxMemoryUsed[
 ResourceFunction["ReadScan"][Identity, FindFile["ExampleData/TreesOwnedByTheCityOfChampaign.csv"], String];]
Out[14]=

Possible Issues (2) 

When processing groups of lines, the final list may include EndOfFile:

In[15]:=
ResourceFunction["ReadScan"][Print[f[#]] &, "ExampleData/numbers", ConstantArray[Number, 4]]
Out[15]=

Using Import may be faster:

In[16]:=
AbsoluteTiming[
 Length@Import@"ExampleData/TreesOwnedByTheCityOfChampaign.csv"]
Out[16]=
In[17]:=
AbsoluteTiming[
 Length@Reap[
    ResourceFunction["ReadScan"][Sow, "ExampleData/TreesOwnedByTheCityOfChampaign.csv", String]][[2, 1]]]
Out[17]=

Neat Examples (4) 

Download and extract a somewhat large, "nearly-flat" JSON file from USDA FoodData Central:

In[18]:=
FileSize[
 extractedFile = First@ExtractArchive[
    URL["https://fdc.nal.usda.gov/fdc-datasets/FoodData_Central_survey_food_json_2021-10-28.zip"],
    $TemporaryDirectory
    ]
 ]
Out[18]=

Import requires a large amount of memory:

In[19]:=
MaxMemoryUsed@Import[extractedFile]
Out[19]=

Processing the file line-by-line requires much less memory and allows fine-grained control over how each line is processed:

In[20]:=
good = bad = skipped = 0;
{MaxMemoryUsed@Print@ResourceFunction["ReadScan"][
    If[StringStartsQ[#, "{"] && ! StringEndsQ[#, "["],
      Quiet@Check[
        ImportString[StringTrim[#, "," ~~ EndOfLine], "RawJSON"]; good++;,
        bad++;
        ],
      skipped++;
      ] &,
    extractedFile,
    String
    ], good, bad, skipped}
Out[21]=

Clean up by deleting the extracted file:

In[22]:=
DeleteFile[extractedFile];

Publisher

Andrew Steinacher

Version History

  • 1.0.0 – 13 December 2021

Author Notes

Although Import is convenient, it is infeasible for extremely large files. However, if a file’s structure is “flat”, with a consistent format on each line (e.g. lines of JSON or XML), it is possible to work around this by streaming the lines one by one. Such large, “flat” files occur often enough in practice (examples include USDA’s FoodData Central and the StackExchange archives) that I needed a general tool for processing them in a simple, memory-efficient fashion. Writing a custom one-off solution with While and ReadLine is tedious and difficult to do correctly, and it takes substantial development time away from what would otherwise be a straightforward data-processing task. ReadScan, by contrast, separates the file-handling code from the data-processing code, leading to more readable results.
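A hand-rolled version of the While/ReadLine pattern that ReadScan encapsulates might look roughly like this. This is a sketch only; path and process are hypothetical placeholders for a file path and a per-line function:

```wolfram
(* manual line-by-line processing of a large "flat" file *)
stream = OpenRead[path];
While[(line = ReadLine[stream]) =!= EndOfFile,
 process[line]  (* hypothetical per-line processing function *)
 ];
Close[stream];
```

ReadScan replaces this boilerplate with a single call, keeping the per-record logic in one place.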

This WFR was originally intended to process files line-by-line, but it was generalized to work with all of the types that Read supports with help from Richard Hennigan.

License Information