Wolfram Research

Function Repository Resource:

PandasObject (1.0.0) current version: 1.0.1 »

Source Notebook

Use the Python package pandas for data science in the Wolfram Language

Contributed by: Igor Bakshee, with examples adopted from the pandas documentation

ResourceFunction["PandasObject"][]

returns a configured PythonObject for the Python package pandas in a new Python session.

ResourceFunction["PandasObject"][session]

uses the specified running ExternalSessionObject session.

ResourceFunction["PandasObject"][,"func"[args,opts]]

executes the function func with the specified arguments and options.

Details and Options

The Python package pandas is a toolkit for data science. It offers, in particular, data structures and operations for the manipulation and analysis of numerical tables and time series.
ResourceFunction["PandasObject"] sets up a configuration of the resource function PythonObject that makes working with pandas more convenient and returns the resulting Python object.
ResourceFunction["PandasObject"] makes the Python-side functions and variables accessible by new names that are closer to the usual Wolfram Language conventions. For instance, for lookup properties:
"loc""ByLabel"selection by label
"iloc"“ByPosition"selection by integer position
"at"“AtLabel"label-based scalar lookup
"iat"“AtPosition"integer-based scalar lookup
For a Python object p, p["ToPythonName","wlname"] gives the native Python name corresponding to the Wolfram Language name wlname and p["FromPythonName","pname"] gives the respective Wolfram Language name for the Python-side name pname. In the object p, both wlname and pname can be used interchangeably.
p["RenamingRules"] gives a list of all renaming rules in the form {"wlname1""pname1",}.
p["FullInformation","Functions"] gives a list of the available functions and p["Information","func"] gives the signature of the specified function.
p["WebInformation"] gives a link to the pandas documentation that can be opened with SystemOpen.
Typically, the Wolfram Language signature of a pandas function closely resembles the Python-side signature in which Python-side objects are represented in the form of the resource function PythonObject with possible extensions suitable for the Wolfram Language.
Similar to the Python's pandas library, ResourceFunction["PandasObject"] uses the Python package Matplotlib, which is invoked via the resource function MatplotlibObject.
Additional utility functions are available in the form ResourceFunction["PandasObject"][obj,"func"[args,opts]], where obj can be an ExternalSessionObject session or any PythonObject p defined in the session. The utility functions include the utility functions of the resource function MatplotlibObject:
"Show"["fmt"]display the Python graphics
"Export"["file","fmt"]export the graphics
ResourceFunction["PandasObject"][p,"Show"[]] and ResourceFunction["PandasObject"][p,"Export"[]] provide basic plotting and exporting functionality, whereas MatplotlibObject, which works seamlessly with ResourceFunction["PandasObject"], gives more options and controls.
Spanning elements in objects created with ResourceFunction["PandasObject"][] use one-based indexes, include the end points, and otherwise follow the usual Wolfram Language conventions, as set up by the resource function PythonObject.

Examples

Basic Examples (2) 

Create a dataset of the purchases of some individuals:

In[1]:=
pd = ResourceFunction["PandasObject"][]
Out[1]=
In[2]:=
n = 10;
In[3]:=
purchases = <|"apples" -> RandomInteger[{0, 20}, n], "oranges" -> RandomInteger[{0, 20}, n]|>
Out[3]=
In[4]:=
pd["DataFrame"[purchases]]
Out[4]=

Transfer the dataset from Python:

In[5]:=
Normal[%]
Out[5]=

Attach customer names to the purchases, using the names as "index" (labels of the rows):

In[6]:=
customers = {"Jim", "Mary", "Bob", "Pat", "John", "Jen", "Joe", "Linda", "Dave", "Liz"};
In[7]:=
(df = pd["DataFrame"[purchases, "Index" -> customers]]) // Normal
Out[7]=

Get several rows from the beginning and the end of the dataset:

In[8]:=
df["Head"[]] // Normal
Out[8]=
In[9]:=
df["Tail"[]] // Normal
Out[9]=

Generate descriptive statistics of the purchases:

In[10]:=
df["Describe"[]] // Normal
Out[10]=

Create and display a plot of the purchases:

In[11]:=
df["Plot"[]]
Out[11]=
In[12]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[12]=

Histograms of the purchases:

In[13]:=
df["Histogram"[], True]
Out[13]=
In[14]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[14]=

Close the Python session to clean up:

In[15]:=
DeleteObject[pd["Session"]]

Load the Titanic disaster data:

In[16]:=
pd = ResourceFunction["PandasObject"][]
Out[16]=
In[17]:=
(titanic = pd["ReadCSV"[
     "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"]]) // Normal
Out[17]=

The descriptive statistics of the dataset:

In[18]:=
titanic["Describe"[]] // Normal
Out[18]=

Correlation between columns:

In[19]:=
titanic["Correlation"[]] // Normal
Out[19]=

The average fare paid:

In[20]:=
titanic["Fare"]["Mean"[]] // Normal
Out[20]=

The median age and fare of the passengers:

In[21]:=
titanic["Part"[{"Age", "Fare"}]]["Median"[]] // Normal
Out[21]=

Descriptive statistics of the specified columns:

In[22]:=
titanic["Part"[{"Age", "Fare"}]]["Describe"[]] // Normal
Out[22]=

Use the specific aggregating statistics for given columns instead of the predefined statistics:

In[23]:=
titanic["Aggregate"[<|"Age" -> {"min", "max", "median", "skew"}, "Fare" -> {"min", "max", "median", "mean"}|>]] // Normal
Out[23]=

Find the number of male and female passengers:

In[24]:=
titanic["Sex"]["ValueCounts"[]] // Normal
Out[24]=

The average age for male versus female passengers:

In[25]:=
 titanic["Part"[{"Sex", "Age"}]]["GroupBy"["Sex"]]["Mean"[]] // Normal
Out[25]=

Count the number of unique values in all the columns:

In[26]:=
titanic["NumberOfUnique"[]] // Normal
Out[26]=

Or in a specific column:

In[27]:=
titanic["Embarked"]["NumberOfUnique"[]] // Normal
Out[27]=

Pick passengers who embarked at the port "C" (Cherbourg):

In[28]:=
titanic[titanic["Embarked"] == "'C'"] // Normal
Out[28]=

Female passengers, whose fare was less than $10:

In[29]:=
fareMask = titanic["Fare"] < 10
Out[29]=
In[30]:=
sexMask = titanic["Sex"] == "'female'"
Out[30]=
In[31]:=
fareMask && sexMask
Out[31]=
In[32]:=
titanic[%] // Normal
Out[32]=

Find the correlation between the sex and survival by replacing string values in the "Sex" column with numbers using the String method "GetDummies" ("women first"):

In[33]:=
titanic["Sex"]["String"]["GetDummies"[]]["Head"[]] // Normal
Out[33]=
In[34]:=
titanic["Sex"]["String"]["GetDummies"[]][
  "corrwith"[titanic["Survived"]]] // Normal
Out[34]=

Summarize the survival rate by sex and cabin class using a spreadsheet-like pivot table (better survival rate for women and in the first class):

In[35]:=
titanic["PivotTable"["Survived", "Index" -> "Sex", "Columns" -> "Pclass"]] // Normal
Out[35]=

Find the missing values in the "Age" column:

In[36]:=
titanic["Age"]["IsNull"[]]
Out[36]=
In[37]:=
titanic[%] // Normal
Out[37]=

Statistics is calculated despite missing values:

In[38]:=
titanic["Age"]["Mean"[]] // Normal
Out[38]=

Drop the first few rows by giving the index values as labels and specifying "Axis"0 for a row-wise operation:

In[39]:=
titanic["Drop"[{0, 1, 2}, "Axis" -> 0]]["Head"[]] // Normal
Out[39]=

Prepare and show a histogram of age values:

In[40]:=
titanic["Age"]["Histogram"[]]
Out[40]=
In[41]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[41]=

Drop the specified column:

In[42]:=
titanic["Drop"[{"Sex"}, "Axis" -> 1]]["Head"[]] // Normal
Out[42]=

Create a copy of the object:

In[43]:=
tt = titanic["Copy"[]]
Out[43]=

Drop the "Sex" column in the object in-place:

In[44]:=
tt["Drop"[{"Sex"}, "Axis" -> 1, "InPlace" -> True]]
In[45]:=
tt["Head"[]] // Normal
Out[45]=

Sort the data by values of the specified column:

In[46]:=
titanic["SortValues"["Fare"]]["Head"[]] // Normal
Out[46]=

Prepare and show stacked filled plots of the numeric columns:

In[47]:=
titanic["Plot"]["Area"["Subplots" -> True], True]
Out[47]=
In[48]:=
ResourceFunction["PandasObject"][pd, "Show"[ImageSize -> Large]]
Out[48]=

Clean up the Python session:

In[49]:=
DeleteObject[pd["Session"]]

Scope (132) 

Object creation (2) 

In[50]:=
pd = ResourceFunction["PandasObject"][]
Out[50]=

Create a Series object from a list of values, letting pandas use a default integer index:

In[51]:=
pd["Series"[{1, 3, 5, None, 6, 8}]]
Out[51]=
In[52]:=
Normal[%]
Out[52]=
In[53]:=
DeleteObject[pd["Session"]]

Create a new panda object:

In[54]:=
pd = ResourceFunction["PandasObject"][]
Out[54]=

Create a date-time index array of six values by specifying the starting date in the form "yyyy-mm-dd":

In[55]:=
dates = pd["DateRange"["2013-01-01", "Periods" -> 6]]
Out[55]=
In[56]:=
dates // Normal
Out[56]=

Alternatively, specify the starting date without dashes:

In[57]:=
pd["DateRange"["20130101", "Periods" -> 6]] // Normal
Out[57]=

Construct a "DataFrame" object from the array of dates with labeled columns:

In[58]:=
pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> dates, "Columns" -> CharacterRange["A", "D"]]]
Out[58]=
In[59]:=
% // Normal
Out[59]=

Create a "DataFrame" from an association of objects that can be converted into a "Series"-like structure:

In[60]:=
pd["DataFrame"[<|
   "A" -> 1.0,
   "B" -> pd["Timestamp"["20130102"]],
   "C" -> pd["Series"[1, "index" -> Range[0, 3], "dtype" -> "float32"]],
   "D" -> ConstantArray[0, 4],
   "E" -> pd["Categorical"[{"test", "train", "test", "train"}]],
   "F" -> "foo"|>]]
Out[60]=
In[61]:=
% // Normal
Out[61]=
In[62]:=
DeleteObject[pd["Session"]]

Viewing data (16) 

In[63]:=
pd = ResourceFunction["PandasObject"][]
Out[63]=

Create a "DataFrame" object:

In[64]:=
df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]
Out[64]=

Get the preset number of top and bottom rows of the data frame:

In[65]:=
df["Head"[]] // Normal
Out[65]=
In[66]:=
df["Tail"[]] // Normal
Out[66]=

Get the specified number of top and bottom rows:

In[67]:=
df["Head"[2]] // Normal
Out[67]=
In[68]:=
df["Tail"[2]] // Normal
Out[68]=

Get the index:

In[69]:=
df["Index"]
Out[69]=
In[70]:=
% // Normal
Out[70]=

Alternatively:

In[71]:=
%%["ToList"[]]
Out[71]=

Get column names:

In[72]:=
df["Columns"]
Out[72]=
In[73]:=
Normal[%]
Out[73]=

Alternatively:

In[74]:=
%%["ToList"[]]
Out[74]=

Get a quick statistic summary of your data:

In[75]:=
df["Describe"[]] // Normal
Out[75]=

Sort by index values in descending order:

In[76]:=
df["SortIndex"["Ascending" -> False]]
Out[76]=
In[77]:=
Normal[%]
Out[77]=

Sort by column labels in descending order:

In[78]:=
df["SortIndex"["Axis" -> 1, "Ascending" -> False]] // Normal
Out[78]=

Sort by values in a column:

In[79]:=
 df["SortValues"["B"]] // Normal
Out[79]=

Transpose the data frame:

In[80]:=
df["T"]
Out[80]=

Get the dataset of the transposed values:

In[81]:=
% // Normal
Out[81]=
In[82]:=
Normal[%]
Out[82]=

Convert a "DataFrame" object to a NumPy array:

In[83]:=
df["ToNumpy"[]]
Out[83]=

Get the values in your Wolfram Language session:

In[84]:=
Normal[%] // Normal // MatrixForm
Out[84]=

Compare with the original dataset:

In[85]:=
df // Normal
Out[85]=
In[86]:=
DeleteObject[pd["Session"]]

Selection (26) 

Getting DataFrame Parts (9) 

Create a new pandas object:

In[87]:=
pd = ResourceFunction["PandasObject"][]
Out[87]=
In[88]:=
df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]
Out[88]=

Select a single column, treating a column label as property. Note that the returned object is a "Series" and not a "DataFrame":

In[89]:=
df["A"]
Out[89]=

Import the "Series" object as a TimeSeries:

In[90]:=
Normal[%]
Out[90]=

Plot the time series:

In[91]:=
DateListPlot[%]
Out[91]=

Select several columns using the "Part" syntax:

In[92]:=
df["Part"[{"A", "B"}]] // Normal
Out[92]=

Select several rows using the Span syntax:

In[93]:=
df["Part"[;; 3]] // Normal
Out[93]=

Use a Python object "Slice" to span rows by index values (rather than integer indices):

In[94]:=
slice = ResourceFunction["PythonObject"][pd["Session"], "slice"]
Out[94]=
In[95]:=
slice[All]["2013-01-02", "2013-01-04"]
Out[95]=
In[96]:=
df["Part"[%]] // Normal
Out[96]=

Alternatively, use the Python syntax to create a "Slice" object:

In[97]:=
ResourceFunction["PythonObject"][
 pd["Session"], "slice('2013-01-02','2013-01-04')"]
Out[97]=
In[98]:=
 df["Part"[%]] // Normal
Out[98]=

Delete rows and add a column by re-indexing:

In[99]:=
df["Reindex"["Index" -> df["Index"]["Part"[2 ;; 4]], "Columns" -> Flatten[{df["Columns"] // Normal, "E"}]]] // Normal
Out[99]=
In[100]:=
DeleteObject[pd["Session"]]
Selecting by label (8) 

Create a data frame object:

In[101]:=
pd = ResourceFunction["PandasObject"][]
Out[101]=
In[102]:=
(dates = pd["DateRange"["20130101", "Periods" -> 6]]) // Normal
Out[102]=
In[103]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> dates, "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[103]=

Use the "ByLabel" property of the "DataFrame" object to access values by label:

In[104]:=
loc = df["ByLabel"]
Out[104]=

Get a cross section of the data frame using an index label:

In[105]:=
dates["Part"[2]]
Out[105]=
In[106]:=
loc["Part"[%]] // Normal
Out[106]=

Alternatively, use the index value:

In[107]:=
loc["Part"["2013-01-02"]] // Normal
Out[107]=

Get a part corresponding to all rows of the specified columns (or, in the pandas parlance, select on a multi-axis by label):

In[108]:=
loc["Part"[;; , {"A", "B"}]] // Normal
Out[108]=

Get parts of a single row in a form of a "Series" object, rather than a data frame:

In[109]:=
loc["Part"["2013-01-02", {"A", "B"}]]
Out[109]=
In[110]:=
Normal[%]
Out[110]=

Get a scalar value:

In[111]:=
loc["Part"[dates["Part"[2]], "A"]]
Out[111]=

Equivalently, get fast access to a scalar using the "AtLabel" property of the "DataFrame" object:

In[112]:=
at = df["AtLabel"]
Out[112]=
In[113]:=
at["Part"[dates["Part"[2]], "A"]]
Out[113]=
In[114]:=
DeleteObject[pd["Session"]]
Selecting by position (9) 

Create a data frame object:

In[115]:=
pd = ResourceFunction["PandasObject"][]
Out[115]=
In[116]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[116]=

Use the "ByPosition" property of the "DataFrame" object to access values by position:

In[117]:=
iloc = df["ByPosition"]
Out[117]=
In[118]:=
iloc["Part"[2]] // Normal
Out[118]=

Use the Span syntax to access elements:

In[119]:=
iloc["Part"[3 ;; 5, 1 ;; 3]] // Normal
Out[119]=

Select by lists of integer positions:

In[120]:=
iloc["Part"[{2, 3, 5}, {1, 3}]] // Normal
Out[120]=

Slice rows explicitly:

In[121]:=
iloc["Part"[2 ;; 3]] // Normal
Out[121]=

Alternatively:

In[122]:=
iloc["Part"[2 ;; 3, ;;]] // Normal
Out[122]=

Slice columns explicitly:

In[123]:=
iloc["Part"[;; , 2 ;; 3]] // Normal
Out[123]=

Get a scalar value explicitly:

In[124]:=
iloc["Part"[2, 2]]
Out[124]=

Equivalently:

In[125]:=
iat = df["AtPosition"]
Out[125]=
In[126]:=
iat["Part"[2, 2]]
Out[126]=
In[127]:=
DeleteObject[pd["Session"]]

Hierarchical Indexing (4) 

Create a "DataFrame" object for an array with more than two dimensions:

In[128]:=
pd = ResourceFunction["PandasObject"][]
Out[128]=
In[129]:=
df = pd["DataFrame"[<|"vals" -> Range[4]|>, "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}]]
Out[129]=

The imported "DataFrame" object is represented as an Association with keys in the form of a list:

In[130]:=
df // Normal
Out[130]=
In[131]:=
% // Normal
Out[131]=

In Python, the index is represented as a "MultiIndex" object:

In[132]:=
df["Index"]
Out[132]=

Create a "MultiIndex" directly:

In[133]:=
tuples = Transpose[{{"bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"}, {"one", "two", "one", "two", "one", "two", "one", "two"}}]
Out[133]=
In[134]:=
index = pd["MultiIndex"]["FromTuples"][All][tuples]
Out[134]=
In[135]:=
(pd["DataFrame"[<|"vals" -> Range[8]|>, "Index" -> index]]) // Normal
Out[135]=
In[136]:=
DeleteObject[pd["Session"]]

Boolean indexing (2) 

In[137]:=
pd = ResourceFunction["PandasObject"][]
Out[137]=
In[138]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[138]=

Create an array of Boolean values for which values in the column "A" are positive:

In[139]:=
bool = df["A"] > 0
Out[139]=
In[140]:=
Normal[%]["Values"]
Out[140]=

Pick rows for which the condition holds:

In[141]:=
df["Part"[bool]] // Normal
Out[141]=

A Boolean array for positive "DataFrame" values:

In[142]:=
(positive = df > 0) // Normal
Out[142]=

Select values from a "DataFrame" where a Boolean condition is met:

In[143]:=
df["Part"[positive]] // Normal
Out[143]=
In[144]:=
DeleteObject[pd["Session"]]

Create a data frame:

In[145]:=
pd = ResourceFunction["PandasObject"][]
Out[145]=
In[146]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], List /@ {"one", "one", "two", "three", "four", "three"}, 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "E"]]]) // Normal
Out[146]=

Use the "IsIn" method for filtering:

In[147]:=
df["Part"[df["E"]["IsIn"[{"two", "four"}]]]] // Normal
Out[147]=
In[148]:=
DeleteObject[pd["Session"]]

Setting Values (2) 

Create a "DataFrame":

In[149]:=
pd = ResourceFunction["PandasObject"][]
Out[149]=
In[150]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[150]=

Create a new "Series" object that is longer than the "DataFrame" and starts with a time offset:

In[151]:=
n = 7;
In[152]:=
s1 = pd["Series"[Range[n], "Index" -> pd["DateRange"[df["Index"]["Part"[2]], "Periods" -> n]]]]
Out[152]=
In[153]:=
Normal[%]["DatePath"] // Column
Out[153]=

Add the object as a new column to the data frame, automatically aligning the data by the indexes:

In[154]:=
df["Assign"["Part"["F"] -> s1]]
Out[154]=
In[155]:=
df // Normal
Out[155]=

Set values by label:

In[156]:=
date = df["Index"]["Part"[1]]
Out[156]=
In[157]:=
df["AtLabel"]["Assign"["Part"[date, "A"] -> 100]]
Out[157]=
In[158]:=
df // Normal
Out[158]=

Set values by position:

In[159]:=
df["AtPosition"]["Assign"["Part"[1, 2] -> -100]]
Out[159]=
In[160]:=
df // Normal
Out[160]=

Set values with an array:

In[161]:=
shape = df["Shape"]
Out[161]=
In[162]:=
df["ByLabel"][
 "Assign"["Part"[;; , "D"] -> ConstantArray[50, shape[[1]]]]]
Out[162]=
In[163]:=
df // Normal
Out[163]=

Create a NumPy array:

In[164]:=
arr = ResourceFunction["PythonObject"][pd["Session"], "numpy.array([10]*len(" <> df[] <> "))"]
Out[164]=
In[165]:=
arr // Normal // Normal
Out[165]=

Assign a column to a NumPy array by position:

In[166]:=
df["ByPosition"]["Assign"["Part"[;; , 3] -> arr]]
Out[166]=
In[167]:=
df // Normal
Out[167]=
In[168]:=
DeleteObject[pd["Session"]]

Create a "DataFrame":

In[169]:=
pd = ResourceFunction["PandasObject"][]
Out[169]=
In[170]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[170]=

A Boolean array of values where a condition is met:

In[171]:=
(bool = df > 0) // Normal
Out[171]=

Replace values where the condition is not met:

In[172]:=
df["where"[bool]] // Normal
Out[172]=

Negate positive values using the where operation with setting:

In[173]:=
df["Assign"["Part"[df > 0] -> -df]]
Out[173]=
In[174]:=
df // Normal
Out[174]=
In[175]:=
DeleteObject[pd["Session"]]

Missing data (5) 

Create a "DataFrame" with missing values:

In[176]:=
pd = ResourceFunction["PandasObject"][]
Out[176]=
In[177]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {4, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 4]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[177]=
In[178]:=
(df = df["Reindex"["Columns" -> CharacterRange["A", "E"]]]) // Normal
Out[178]=

Assign some values in the "DataFrame":

In[179]:=
df["ByPosition"]["Assign"["Part"[1 ;; 2, -1] -> 1]]
Out[179]=
In[180]:=
df // Normal
Out[180]=

Drop rows with missing data:

In[181]:=
df["DropMissing"["How" -> "any"]] // Normal
Out[181]=

Fill missing data:

In[182]:=
df["FillMissing"["Value" -> 5]] // Normal
Out[182]=

Get the Boolean mask where values are missing:

In[183]:=
pd["IsMissing"[df]] // Normal
Out[183]=
In[184]:=
DeleteObject[pd["Session"]]

Binary Operations (16) 

Functions vs. Operators (10) 

Define a "DataFrame" object:

In[185]:=
pd = ResourceFunction["PandasObject"][]
Out[185]=
In[186]:=
(df = pd[
    "DataFrame"[<|"angles" -> {0, 3, 4}, "degrees" -> {360, 180, 360}|>,
      "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[186]=

Add a scalar using an arithmetic operator version:

In[187]:=
df + 1 // Normal
Out[187]=

Equivalently, use a function version:

In[188]:=
df["Add"[1]] // Normal
Out[188]=

Subtract a list:

In[189]:=
df["Subtract"[{1, 2}, "Axis" -> "columns"]] // Normal
Out[189]=

Subtract a "Series" object:

In[190]:=
(s = pd["Series"[{1, 1, 1}, "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[190]=
In[191]:=
df["Subtract"[s, "Axis" -> "index"]] // Normal
Out[191]=

Multiply a dictionary by axis:

In[192]:=
df["Multiply"[<|"angles" -> 0, "degrees" -> 2|>]] // Normal
Out[192]=
In[193]:=
df["Multiply"[<|"circle" -> 0, "triangle" -> 2, "rectangle" -> 3|>, "Axis" -> "index"]] // Normal
Out[193]=

Multiply a "DataFrame" of different shape using the operator version:

In[194]:=
(other = pd["DataFrame"[<|"angles" -> {0, 3, 4}|>, "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[194]=
In[195]:=
df*other // Normal
Out[195]=

Use the function version to fill in missing values:

In[196]:=
df["Multiply"[other, "FillValue" -> 0]] // Normal
Out[196]=

Create a "DataFrame" with a hierarchical "MultiIndex":

In[197]:=
(dfm = pd[
    "DataFrame"[<|"angles" -> {0, 3, 4, 4, 5, 6}, "degrees" -> {360, 180, 360, 360, 540, 720}|>, "Index" -> {{" A", " A", " A", " B", " B", " B"}, {"circle", "triangle", "rectangle", "square", "pentagon", "hexagon"}}]]) // Normal
Out[197]=

Divide by the hierarchical "DataFrame" specifying the level:

In[198]:=
df["Divide"[dfm, "Level" -> 1, "FillValue" -> 0]] // Normal
Out[198]=
In[199]:=
DeleteObject[pd["Session"]]
Matching / Broadcasting Behavior (6) 

Create a data frame:

In[200]:=
pd = ResourceFunction["PandasObject"][]
Out[200]=
In[201]:=
(df = pd["DataFrame"[<|
       "one" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"a", "b", "c"}]], "two" -> pd["Series"[RandomReal[{-1, 1}, {4}], "Index" -> {"a", "b", "c", "d"}]],
      "three" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"b", "c", "d"}]]
      |>
     ]]) // Normal
Out[201]=

Select a row and a column:

In[202]:=
(row = df["ByPosition"]["Part"[2]]) // Normal
Out[202]=
In[203]:=
(column = df["two"]) // Normal
Out[203]=

Use the "Axis" option to match on the index or columns:

In[204]:=
df["Subtract"[row, "Axis" -> "columns"]] // Normal
Out[204]=
In[205]:=
df["Subtract"[column, "Axis" -> "index"]] // Normal
Out[205]=

Alternatively:

In[206]:=
df["Subtract"[row, "Axis" -> 1]] // Normal
Out[206]=
In[207]:=
df["Subtract"[column, "Axis" -> 0]] // Normal
Out[207]=

Use the built-in Python divmod function with a Series object to take the floor division and modulo operation at the same time:

In[208]:=
divmod = ResourceFunction["PythonObject"][pd["Session"], "divmod"]
Out[208]=
In[209]:=
 (s = pd["Series"[Range[0, 9]]]) // Normal
Out[209]=
In[210]:=
{div, rem} = divmod[All][s, 3]
Out[210]=

Do elementwise divmod:

In[211]:=
{div, rem} = divmod[All][s, {2, 2, 3, 3, 4, 4, 5, 5, 6, 6}]
Out[211]=
In[212]:=
DeleteObject[pd["Session"]]

Stats (4) 

Create a data frame:

In[213]:=
pd = ResourceFunction["PandasObject"][]
Out[213]=
In[214]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal
Out[214]=

Perform descriptive statistics:

In[215]:=
df["Mean"[]] // Normal
Out[215]=

Same operation on the other axis:

In[216]:=
Normal[df["Mean"[1]]]["DatePath"]
Out[216]=

Operate on objects that have different dimensionality with alignment:

In[217]:=
Normal[(s = pd["Series"[{1, 3, 5, None, 6, 8}, "Index" -> df["Index"]]][
      "shift"[2]])]["DatePath"] // Column
Out[217]=
In[218]:=
df["Subtract"[s, "Axis" -> "index"]] // Normal
Out[218]=
In[219]:=
DeleteObject[pd["Session"]]

Applying Functions (4) 

Create a data frame:

In[220]:=
pd = ResourceFunction["PandasObject"][]
Out[220]=
In[221]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal
Out[221]=

Import a NumPy function:

In[222]:=
ExternalFunction[pd["Session"], "numpy.cumsum"]
Out[222]=

Apply the function:

In[223]:=
df["Apply"[%]] // Normal
Out[223]=

Create and apply a lambda function:

In[224]:=
ExternalFunction[pd["Session"], "lambda x: x.max() - x.min()"]
Out[224]=
In[225]:=
df["Apply"[%]] // Normal
Out[225]=
In[226]:=
DeleteObject[pd["Session"]]

Histogramming (2) 

Create a "Series" object:

In[227]:=
pd = ResourceFunction["PandasObject"][]
Out[227]=
In[228]:=
(s = pd["Series"["data" -> RandomInteger[{0, 7}, {10}]]]) // Normal
Out[228]=

Count unique values in the series:

In[229]:=
s["ValueCounts"[]] // Normal
Out[229]=
In[230]:=
DeleteObject[pd["Session"]]

String Methods (2) 

Create a "Series" object:

In[231]:=
pd = ResourceFunction["PandasObject"][]
Out[231]=
In[232]:=
(s = pd["Series"[{"A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"}]]) // Normal
Out[232]=

Use the "String" attribute to operate on each element of the series:

In[233]:=
str = s["String"]
Out[233]=
In[234]:=
str["Lower"[]] // Normal
Out[234]=
In[235]:=
DeleteObject[pd["Session"]]

Merging (7) 

Concatenating (4) 

Create a data frame:

In[236]:=
pd = ResourceFunction["PandasObject"][]
Out[236]=
In[237]:=
n = 3;
In[238]:=
(df = pd["DataFrame"[RandomReal[{-1, 1}, {n, 2}]]]) // Normal
Out[238]=

Break it into pieces:

In[239]:=
Normal /@ (pieces = df["Part"[# ;; #]] & /@ Range[n])
Out[239]=

Concatenate the pieces:

In[240]:=
(df1 = pd["Concatenate"[pieces]]) // Normal
Out[240]=

The concatenated object is the same as the original:

In[241]:=
df["Equals"[df1]]
Out[241]=
In[242]:=
DeleteObject[pd["Session"]]
Database-Style Joining  (3) 

Create two "DataFrame" objects:

In[243]:=
pd = ResourceFunction["PandasObject"][]
Out[243]=
In[244]:=
(left = pd[
    "DataFrame"[<|"key" -> {"foo", "foo"}, "left" -> {1, 2}|>]]) // Normal
Out[244]=
In[245]:=
(right = pd["DataFrame"[<|"key" -> {"foo", "foo"}, "right" -> {4, 5}|>]]) // Normal
Out[245]=

Merge the objects in SQL style:

In[246]:=
pd["Merge"[left, right, "On" -> "key"]] // Normal
Out[246]=

Alternatively, with different keys:

In[247]:=
(left = pd[
    "DataFrame"[<|"key" -> {"foo", "bar"}, "left" -> {1, 2}|>]]) // Normal
Out[247]=
In[248]:=
(right = pd["DataFrame"[
     "data" -> <|"key" -> {"foo", "bar"}, "right" -> {4, 5}|>]]) // Normal
Out[248]=
In[249]:=
pd["Merge"[left, right, "On" -> "key"]] // Normal
Out[249]=
In[250]:=
DeleteObject[pd["Session"]]

Grouping (3) 

Create a "DataFrame" object:

In[251]:=
pd = ResourceFunction["PandasObject"][]
Out[251]=
In[252]:=
(df = pd[
    "DataFrame"[<|
      "A" -> {"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"},
      "B" -> {"one", "one", "two", "three", "two", "two", "one", "three"},
      "C" -> RandomInteger[10, {8}],
      "D" -> RandomInteger[10, {8}]
      |>]]) // Normal
Out[252]=

Group by values of the column "A", sum values in the groups and combine the results in a new "DataFrame" object:

In[253]:=
df["GroupBy"["A"]]["Sum"[]]
Out[253]=
In[254]:=
% // Normal
Out[254]=

Group by multiple columns forming a hierarchical index and apply the summing function to each group:

In[255]:=
df["GroupBy"[{"A", "B"}]]["Sum"[]] // Normal
Out[255]=
In[256]:=
DeleteObject[pd["Session"]]

Reshaping (7) 

Pivoting (4) 

Create a "DataFrame" object:

In[257]:=
pd = ResourceFunction["PandasObject"][]
Out[257]=
In[258]:=
(df = pd[
    "DataFrame"[<|"foo" -> {"one", "one", "one", "two", "two", "two"},
       "bar" -> {"A", "B", "C", "A", "B", "C"}, "baz" -> Range[6], "zoo" -> {"x", "y", "z", "q", "w", "t"}|>]]) // Normal
Out[258]=

Organize the object by index and column values:

In[259]:=
df["Pivot"["Index" -> "foo", "Columns" -> "bar", "Values" -> "baz"]] // Normal
Out[259]=

Alternatively:

In[260]:=
df["Pivot"["Index" -> "foo", "Columns" -> "bar"]]["baz"] // Normal
Out[260]=

Give a list of value labels:

In[261]:=
df["Pivot"["Index" -> "foo", "Columns" -> "bar", "Values" -> {"baz", "zoo"}]] // Normal
Out[261]=
In[262]:=
pd["PivotTable"[df, "Index" -> "foo", "Columns" -> "bar", "Values" -> {"baz", "zoo"}]] // Normal
Out[262]=
In[263]:=
df // Normal
Out[263]=
In[264]:=
DeleteObject[pd["Session"]]
Stacking (3) 

Create a "DataFrame" object with a hierarchical index:

In[265]:=
pd = ResourceFunction["PandasObject"][]
Out[265]=
In[266]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {4, 2}], "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}, "Columns" -> {"A", "B"}]]) // Normal
Out[266]=

Stack the object by "compressing" a level in the columns:

In[267]:=
(stacked = df["Stack"[]]) // Normal
Out[267]=

Reverse the operation by "unstacking" the last level:

In[268]:=
stacked["Unstack"[]] // Normal
Out[268]=
In[269]:=
stacked["Unstack"[0]] // Normal
Out[269]=
In[270]:=
stacked["Unstack"[1]] // Normal
Out[270]=
In[271]:=
DeleteObject[pd["Session"]]

Time Series Resampling (3) 

Create a series with 9 one-second timestamps:

In[272]:=
pd = ResourceFunction["PandasObject"][]
Out[272]=
In[273]:=
dates = pd[
  "DateRange"["1/1/2000", "Periods" -> 9, "Frequency" -> "S"]]
Out[273]=
In[274]:=
( s = pd["Series"[Range[dates["Size"]], "Index" -> dates]]) // Normal
Out[274]=

Downsample the series into 3-second bins and sum the values falling into each bin:

In[275]:=
(s["Resample"["3S"]]["Sum"[]]) // Normal
Out[275]=
In[276]:=
%["DatePath"]
Out[276]=

Check the sums:

In[277]:=
Total /@ Partition[Range[dates["Size"]], 3]
Out[277]=
In[278]:=
DeleteObject[pd["Session"]]

Create a time series object with dates given in the local time zone:

In[279]:=
pd = ResourceFunction["PandasObject"][]
Out[279]=
In[280]:=
dates = pd[
  "DateRange"["3/6/2012 00:00", "Periods" -> 2, "Frequency" -> "D"]]
Out[280]=
In[281]:=
s = pd["Series"[{1, 2}, "index" -> dates]]
Out[281]=

Check the dates:

In[282]:=
s["Index"] // Normal
Out[282]=

Localize the series to the UTC time zone and check the dates:

In[283]:=
sUTC = s["TimeZoneLocalize"["UTC"]]
Out[283]=
In[284]:=
%["Index"] // Normal
Out[284]=

Convert the series to another time zone:

In[285]:=
sUTC["TimeZoneConvert"["US/Eastern"]]
Out[285]=
In[286]:=
%["Index"] // Normal
Out[286]=
In[287]:=
DeleteObject[pd["Session"]]

Create a series with quarterly frequency for a year, ending in November:

In[288]:=
pd = ResourceFunction["PandasObject"][]
Out[288]=
In[289]:=
periods = pd["PeriodRange"["1990Q1", "2000Q4", "Frequency" -> "Q-NOV"]]
Out[289]=
In[290]:=
s = pd["Series"[RandomReal[1, periods["Size"]], "Index" -> periods]]
Out[290]=

Check the start dates of a few periods in the series:

In[291]:=
Normal[s["Head"[]]]["Dates"]
Out[291]=

Convert the series to 9 AM of the end of the month following the quarter end and check starting dates again:

In[292]:=
s["Assign"[
  "index" -> (periods[
        "AsFrequency"["Frequency" -> "M", "how" -> "e"]] + 1)[
     "AsFrequency"["Frequency" -> "H", "how" -> "s"]] + 9]]
Out[292]=
In[293]:=
Normal[s["Head"[]]]["Dates"]
Out[293]=
In[294]:=
DeleteObject[pd["Session"]]

Categoricals (8) 

Create a "DataFrame" with a column whose values are taken from a limited alphabet:

In[295]:=
pd = ResourceFunction["PandasObject"][]
Out[295]=
In[296]:=
(df = pd[
    "DataFrame"[<|"id" -> {1, 2, 3, 4, 5, 6}, "raw_grade" -> {"a", "b", "b", "a", "a", "e"}|>]]) // Normal
Out[296]=

Convert the raw grades to a categorical data type:

In[297]:=
df["Assign"[
  "Part"["grade"] -> df["Part"["raw_grade"]]["AsType"["category"]]]]
Out[297]=
In[298]:=
df // Normal
Out[298]=

The current categories:

In[299]:=
df["grade"]["cat"]["categories"] // Normal
Out[299]=

Rename the categories to more meaningful names in place:

In[300]:=
 df["Assign"[
  "grade.cat.categories" -> {"very good", "good", "very bad"}]]
Out[300]=
In[301]:=
df // Normal
Out[301]=

Reorder the categories and simultaneously add the missing categories:

In[302]:=
df["grade", True]["cat"][
 "SetCategories"[{ "very bad", "bad", "medium", "good", "very good"}]]
Out[302]=
In[303]:=
df["Assign"["grade" -> %]]
Out[303]=
In[304]:=
df // Normal
Out[304]=

Sort by order in the categories:

In[305]:=
df["SortValues"["grade"]] // Normal
Out[305]=

Sort by values in the "raw_grade" column (in lexicographic order):

In[306]:=
df["SortValues"["raw_grade"]] // Normal
Out[306]=

Group by a categorical column, showing empty categories:

In[307]:=
df["GroupBy"["grade"]]["Size"[]] // Normal
Out[307]=
In[308]:=
DeleteObject[pd["Session"]]

Plotting (5) 

Construct a simple data frame object:

In[309]:=
pd = ResourceFunction["PandasObject"][]
Out[309]=
In[310]:=
n = 100;
In[311]:=
df = pd["DataFrame"[<|"column1" -> RandomInteger[{0, 20}, n], "column2" -> RandomInteger[{20, 50}, n]|>]]
Out[311]=

Create a plot of column values with labels:

In[312]:=
df["Plot"[]]
Out[312]=

Show the plot in the default ("PNG") format:

In[313]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[313]=

Show the plot as a vector graphics:

In[314]:=
ResourceFunction["PandasObject"][pd, "Show"["SVG"]]
Out[314]=

Export the plot to a file from Python:

In[315]:=
ResourceFunction["PandasObject"][pd, "Export"["myfile", "SVG"]]
Out[315]=

Import the file:

In[316]:=
ResourceFunction["SVGImport"][%]
Out[316]=

Delete the file:

In[317]:=
DeleteFile[FileNames[%%, $HomeDirectory]]

Clear the plot figure:

In[318]:=
ResourceFunction["PandasObject"][pd, "Clear"[]]

Plot the specified column:

In[319]:=
df["column1"]["Plot"[]]
Out[319]=
In[320]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[320]=

Plot one column versus another:

In[321]:=
df["SortValues"["column1"], True]
Out[321]=
In[322]:=
%["Plot"["x" -> "column1", "y" -> "column2"]]
Out[322]=
In[323]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[323]=
In[324]:=
DeleteObject[pd["Session"]]

Create a time series:

In[325]:=
pd = ResourceFunction["PandasObject"][]
Out[325]=
In[326]:=
n = 500;
In[327]:=
s = pd["Series"[RandomReal[{-1, 1}, {n}], "Index" -> pd["DateRange"["1/1/2000", "Periods" -> n]]], True]
Out[327]=

Compute its cumulative sum:

In[328]:=
s = s["CumulativeSum"[]]
Out[328]=

Prepare a plot of the time series:

In[329]:=
s["Plot"[]]
Out[329]=

Show the plot:

In[330]:=
ResourceFunction["PandasObject"][s, "Show"[]]
Out[330]=
In[331]:=
DeleteObject[pd["Session"]]

Create a data frame:

In[332]:=
pd = ResourceFunction["PandasObject"][]
Out[332]=
In[333]:=
df = pd["DataFrame"[RandomReal[1, {10, 4}], "Columns" -> {"a", "b", "c", "d"}]]
Out[333]=

List available plot types:

In[334]:=
df["Plot"]["Information"]["Methods"] // Rest
Out[334]=

Create a bar plot:

In[335]:=
df["Plot"]["Bar"[]]
Out[335]=
In[336]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[336]=

Alternatively, use the "Plot" method of the "DataFrame" object:

In[337]:=
ResourceFunction["PandasObject"][pd, "Clear"[]]
In[338]:=
df["Plot"["Kind" -> "bar"]]
Out[338]=

A stacked horizontal plot:

In[339]:=
df["Plot"["Kind" -> "barh", "stacked" -> True]]
Out[339]=
In[340]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[340]=

A box plot:

In[341]:=
df["Plot"["Kind" -> "box"]]
Out[341]=
In[342]:=
ResourceFunction["PandasObject"][df, "Show"[]]
Out[342]=

Pass keywords supported by the resource function MatplotlibObject "boxplot":

In[343]:=
df["Plot"["Kind" -> "box", "vert" -> False]]
Out[343]=
In[344]:=
ResourceFunction["PandasObject"][df, "Show"[]]
Out[344]=
In[345]:=
DeleteObject[pd["Session"]]

Create a data frame with normally-distributed values:

In[346]:=
pd = ResourceFunction["PandasObject"][]
Out[346]=
In[347]:=
df = pd["DataFrame"[
   RandomVariate[NormalDistribution[0, 1], {1000, 4}], "columns" -> CharacterRange["a", "d"]]]
Out[347]=

Create a scatter matrix plot using the "ScatterMatrix" method from pandas.plotting:

In[348]:=
pd["plotting"]["ScatterMatrix"][All, True][df, "alpha" -> 0.4, "figsize" -> {6, 6}, "diagonal" -> "kde"]
Out[348]=
In[349]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[349]=
In[350]:=
DeleteObject[pd["Session"]]

Create a time series of a cumulative random process:

In[351]:=
pd = ResourceFunction["PandasObject"][]
Out[351]=
In[352]:=
price = pd[
  "Series"[FoldList[Plus, RandomReal[{-1, 1}, {150}]], "Index" -> pd["DateRange"["2000-1-1", "Periods" -> 150, "Frequency" -> "B"]]]]
Out[352]=

Compute a moving average and standard deviation of the process:

In[353]:=
ma = price["Rolling"[20]]["Mean"[]]
Out[353]=
In[354]:=
std = price["Rolling"[20]]["StandardDeviation"[]]
Out[354]=

Prepare a temporal plot of the prices, the mean values, and the Bollinger band using a MatplotlibObject:

In[355]:=
plt = ResourceFunction["MatplotlibObject"][pd]
Out[355]=
In[356]:=
plt["plot"[price["Index"], price, "k"]]
Out[356]=
In[357]:=
plt["plot"[ma["Index"], ma, "b"]]
Out[357]=
In[358]:=
plt["fill_between"[std["Index"], ma - 2*std, "y2" -> ma + 2*std, "color" -> "b", "alpha" -> 0.2]]
Out[358]=

Show the plots:

In[359]:=
ResourceFunction["MatplotlibObject"][price, "Show"[]]
Out[359]=
In[360]:=
DeleteObject[pd["Session"]]

Importing and exporting data (14) 

CSV (4) 
In[361]:=
pd = ResourceFunction["PandasObject"][]
Out[361]=
In[362]:=
Normal[df = pd["DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]]
Out[362]=

Write to a CSV file:

In[363]:=
fname = FileNameJoin[{$TemporaryDirectory, "foo.csv"}];
In[364]:=
 df["ToCSV"[fname, "Index" -> False]]

Print contents of the file:

In[365]:=
FilePrint[fname]

Read the CSV file as a "DataFrame" object:

In[366]:=
pd["ReadCSV"[fname]]
Out[366]=
In[367]:=
% // Normal
Out[367]=

Clean up:

In[368]:=
DeleteFile[fname]
In[369]:=
DeleteObject[pd["Session"]]
HDF5 (5) 

Create a "DataFrame":

In[370]:=
pd = ResourceFunction["PandasObject"][]
Out[370]=
In[371]:=
(df = pd[
    "DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]) // Normal
Out[371]=

Write to a HDF5 Store:

In[372]:=
fname = FileNameJoin[{$TemporaryDirectory, "foo.h5"}];
In[373]:=
df["ToHDF"[fname, "df"]]

List datasets in the exported file and import the contents of the first dataset:

In[374]:=
Import[fname]
Out[374]=
In[375]:=
Import[fname, First[%]]
Out[375]=

Read the contents of the file as a "DataFrame":

In[376]:=
pd["ReadHDF"[fname, "df"]]
Out[376]=
In[377]:=
% // Normal
Out[377]=

Clean up:

In[378]:=
DeleteFile[fname]
In[379]:=
DeleteObject[pd["Session"]]
Excel (5) 

Create a new pandas object:

In[380]:=
pd = ResourceFunction["PandasObject"][]
Out[380]=
In[381]:=
(df = pd[
    "DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]) // Normal
Out[381]=

Write to an Excel file:

In[382]:=
fname = FileNameJoin[{$TemporaryDirectory, "foo.xlsx"}];
In[383]:=
df["ToExcel"[fname, "SheetName" -> "Sheet1"]]

Check the file:

In[384]:=
Import[fname]
Out[384]=

Read the file as "DataFrame":

In[385]:=
pd["ReadExcel"[fname]]
Out[385]=
In[386]:=
% // Normal
Out[386]=

Clean up:

In[387]:=
DeleteFile[fname]
In[388]:=
DeleteObject[pd["Session"]]

Applications (5) 

Use PandasObject to perform data analysis in Python when importing data to the Wolfram Language is impractical or undesirable. Download a county business patterns file from the US Census database and unzip it to a temporary directory:

In[389]:=
fname = ExtractArchive[
    "https://www2.census.gov/programs-surveys/cbp/datasets/2020/cbp20us.zip", $TemporaryDirectory] // First;

Check the timing of creating a dataset in the Wolfram Language:

In[390]:=
(ds = Import[fname, {"CSV", "Dataset"}, "HeaderLines" -> 1];) // AbsoluteTiming
Out[390]=

Import the data to a "DataFrame" in Python and check the timing:

In[391]:=
pd = ResourceFunction["PandasObject"][]
Out[391]=
In[392]:=
(df = pd["ReadCSV"[fname]]) // AbsoluteTiming
Out[392]=

The first few lines of the dataset:

In[393]:=
df["Head"[2]] // Normal
Out[393]=

Compare to the dataset imported to the Wolfram Language:

In[394]:=
Take[ds, 2]
Out[394]=
In[395]:=
DeleteObject[pd["Session"]]

Properties and Relations (7) 

PandasObject[] gives the same result as the resource function PythonObject with a special configuration:

In[396]:=
session = StartExternalSession["Python"]
Out[396]=
In[397]:=
ResourceFunction["PandasObject"][session]
Out[397]=
In[398]:=
ResourceFunction["PythonObject"][session, "pandas", "Configuration" -> ResourceFunction["PandasObject"]]
Out[398]=
In[399]:=
DeleteObject[session]

Get information on a pandas object:

In[400]:=
pd = ResourceFunction["PandasObject"][]
Out[400]=
In[401]:=
pd["Information"]
Out[401]=

Open the user guide in your default web browser:

In[402]:=
pd["WebInformation"] // SystemOpen

In[403]:=
DeleteObject[pd["Session"]]

Some of the functions and classes available in the pandas module:

In[404]:=
pd = ResourceFunction["PandasObject"][]
Out[404]=
In[405]:=
pd["FullInformation", "Functions"] // Short[#, 3] &
Out[405]=
In[406]:=
pd["FullInformation", "Classes"] // Short[#, 3] &
Out[406]=

Information on a class:

In[407]:=
pd["Information", "DataFrame"] // Short
Out[407]=

The web documentation for a class:

In[408]:=
pd["WebInformation", "DataFrame"] // SystemOpen

In[409]:=
DeleteObject[pd["Session"]]

pandas’s "DataFrame" is analogous to Dataset, but keeps the object on the Python side:

In[410]:=
pd = ResourceFunction["PandasObject"][]
Out[410]=
In[411]:=
pd["DataFrame"[<|"a" -> {1, 2, 3}, "b" -> {4, 5, 6}|>]]
Out[411]=

Print the object in Python:

In[412]:=
%["Evaluate"["print(``)"]]

Transfer the data from Python to create a Dataset:

In[413]:=
Normal[%%]
Out[413]=
In[414]:=
Head[%]
Out[414]=
In[415]:=
DeleteObject[pd["Session"]]

Many pandas operations are parallel to operations on Dataset:

In[416]:=
pd = ResourceFunction["PandasObject"][]
Out[416]=
In[417]:=
n = 500;
In[418]:=
df = pd["DataFrame"[<|
    "a" -> RandomVariate[NormalDistribution[0, 1], {n}], "b" -> RandomInteger[100, {n}]|>]]
Out[418]=
In[419]:=
ds = Normal[df]
Out[419]=

Select rows satisfying a condition:

In[420]:=
df[df["a"] > 2.5] // Normal
Out[420]=
In[421]:=
ds[Select[#a > 2.5 &]]
Out[421]=

Plot histograms of the columns:

In[422]:=
df["Histogram"[], True]
Out[422]=
In[423]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[423]=
In[424]:=
Histogram[ds[All, #], PlotLabel -> #] & /@ Normal[Keys[First[ds]]]
Out[424]=
In[425]:=
DeleteObject[pd["Session"]]

Similarly, pandas’s "Series" object is analogous to TimeSeries:

In[426]:=
pd = ResourceFunction["PandasObject"][]
Out[426]=
In[427]:=
n = 20;
In[428]:=
s = pd["Series"[RandomReal[1, {n}], "index" -> pd["DateRange"["20130102", "periods" -> n]]]]
Out[428]=
In[429]:=
ts = Normal[%]
Out[429]=

Plot the time series in Python:

In[430]:=
s["Plot"[]]
Out[430]=
In[431]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[431]=

Plot the imported time series with DateListPlot:

In[432]:=
DateListPlot[ts]
Out[432]=
In[433]:=
DeleteObject[pd["Session"]]

Create a "DataFrame" and a Boolean mask for positive values:

In[434]:=
pd = ResourceFunction["PandasObject"][]
Out[434]=
In[435]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {3, 2}], "Columns" -> {"A", "B"}]]) // Normal
Out[435]=
In[436]:=
df > 0 // Normal
Out[436]=

PythonObject allows you to apply Python commands directly, and bring the results back to the Wolfram Language if necessary:

In[437]:=
df["Evaluate"["``>0"]]
Out[437]=
In[438]:=
% // Normal
Out[438]=

Alternatively:

In[439]:=
ExternalEvaluate[pd["Session"], df[] <> ">0"]
Out[439]=
In[440]:=
DeleteObject[pd["Session"]]

Possible Issues (2) 

Create a pandas object:

In[441]:=
pd = ResourceFunction["PandasObject"][]
Out[441]=

Since NumPy arrays have a single data type for the entire array (dtype), importing a NumPy array to the Wolfram Language may fail if one of the columns cannot be imported directly:

In[442]:=
(df = pd["DataFrame"[<|
       "A" -> Range[3],
      "B" -> RandomReal[1, {3}],
       "C" -> {"foo", "bar", "baz"}
      |>
       ]]) // Normal
Out[442]=
In[443]:=
vals = df["Values", True]
Out[443]=
In[444]:=
Normal[%]
Out[444]=

Drop the problematic column to import only numeric values:

In[445]:=
df["Drop"["C", "Axis" -> 1]]["Values"] // Normal // Normal // MatrixForm
Out[445]=

Import the rest:

In[446]:=
df["C"]["ToList"[]] // Normal
Out[446]=

Alternatively, import the entire array by converting the NumPy array to list:

In[447]:=
vals["tolist"[]]
Out[447]=
In[448]:=
DeleteObject[pd["Session"]]

Create two "Series" objects:

In[449]:=
pd = ResourceFunction["PandasObject"][]
Out[449]=
In[450]:=
s = pd["Series"[RandomReal[{-1, 1}, {100}]]]
Out[450]=
In[451]:=
s1 = pd["Series"[RandomReal[5, {200}]]]
Out[451]=

The Matplotlib library, used by pandas, accumulates the results of successive plot commands:

In[452]:=
(#["Plot"[]]; ResourceFunction["PandasObject"][pd, "Show"[]]) & /@ {s,
   s1}
Out[452]=

Clear the plot figure to create separate plots instead:

In[453]:=
(ResourceFunction["PandasObject"][pd, "Clear"[]]; #["Plot"[]]; ResourceFunction["PandasObject"][pd, "Show"[]]) & /@ {s, s1}
Out[453]=
In[454]:=
DeleteObject[pd["Session"]]

Version History

  • 1.0.1 – 21 February 2024
  • 1.0.0 – 16 December 2022

Source Metadata

Related Resources

License Information