Function Repository Resource:

PandasObject

Source Notebook

Use the Python package pandas for data science in Wolfram Language

Contributed by: Igor Bakshee, with examples adapted from the pandas documentation

ResourceFunction["PandasObject"][]

returns a configured PythonObject for the Python package pandas in a new Python session.

ResourceFunction["PandasObject"][session]

uses the specified running ExternalSessionObject session.

ResourceFunction["PandasObject"][,"func"[args,opts]]

executes the function func with the specified arguments and options.

Details and Options

The Python package pandas is a toolkit for data science. It offers, in particular, data structures and operations for the manipulation and analysis of numerical tables and time series.
ResourceFunction["PandasObject"] sets up a configuration of the resource function PythonObject that makes working with pandas more convenient and returns the resulting Python object.
ResourceFunction["PandasObject"] makes the Python-side functions and variables accessible by new names that are closer to the usual Wolfram Language conventions. For instance, for lookup properties:
"loc""ByLabel"selection by label
"iloc"“ByPosition"selection by integer position
"at"“AtLabel"label-based scalar lookup
"iat"“AtPosition"integer-based scalar lookup
For a Python object p, p["ToPythonName","wlname"] gives the native Python name corresponding to the Wolfram Language name wlname and p["FromPythonName","pname"] gives the respective Wolfram Language name for the Python-side name pname. In the object p, both wlname and pname can be used interchangeably.
p["RenamingRules"] gives a list of all renaming rules in the form {"wlname1""pname1",}.
p["FullInformation","Functions"] gives a list of the available functions and p["Information","func"] gives the signature of the specified function.
p["WebInformation"] gives a link to the pandas documentation that can be opened with SystemOpen.
Typically, the Wolfram Language signature of a pandas function closely resembles the Python-side signature in which Python-side objects are represented in the form of the resource function PythonObject with possible extensions suitable for the Wolfram Language.
Similar to the Python's pandas library, ResourceFunction["PandasObject"] uses the Python package Matplotlib, which is invoked via the resource function MatplotlibObject.
Additional utility functions are available in the form ResourceFunction["PandasObject"][obj,"func"[args,opts]], where obj can be an ExternalSessionObject session or any PythonObject p defined in the session. The utility functions include the utility functions of the resource function MatplotlibObject:
"Show"["fmt"]display the Python graphics
"Export"["file","fmt"]export the graphics
ResourceFunction["PandasObject"][p,"Show"[]] and ResourceFunction["PandasObject"][p,"Export"[]] provide basic plotting and exporting functionality, whereas MatplotlibObject, which works seamlessly with ResourceFunction["PandasObject"], gives more options and controls.
Spanning elements in objects created with ResourceFunction["PandasObject"][] use one-based indexes, include the end points, and otherwise follow the usual Wolfram Language conventions, as set up by the resource function PythonObject.

Examples

Basic Examples (2) 

Create a dataset of the purchases of some individuals:

In[1]:=
pd = ResourceFunction["PandasObject"][]
Out[1]=
In[2]:=
n = 10;
In[3]:=
purchases = <|"apples" -> RandomInteger[{0, 20}, n], "oranges" -> RandomInteger[{0, 20}, n]|>
Out[3]=
In[4]:=
pd["DataFrame"[purchases]]
Out[4]=

Transfer the dataset from Python:

In[5]:=
Normal[%]
Out[5]=

Attach customer names to the purchases, using the names as "index" (labels of the rows):

In[6]:=
customers = {"Jim", "Mary", "Bob", "Pat", "John", "Jen", "Joe", "Linda", "Dave", "Liz"};
In[7]:=
(df = pd["DataFrame"[purchases, "Index" -> customers]]) // Normal
Out[7]=

Get several rows from the beginning and the end of the dataset:

In[8]:=
df["Head"[]] // Normal
Out[8]=
In[9]:=
df["Tail"[]] // Normal
Out[9]=

Generate descriptive statistics of the purchases:

In[10]:=
df["Describe"[]] // Normal
Out[10]=

Create and display a plot of the purchases:

In[11]:=
df["Plot"[]]
Out[11]=
In[12]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[12]=

Histograms of the purchases:

In[13]:=
df["Histogram"[], True]
Out[13]=
In[14]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[14]=

Close the Python session to clean up:

In[15]:=
DeleteObject[pd["Session"]]

Load the Titanic disaster data:

In[16]:=
pd = ResourceFunction["PandasObject"][]
Out[16]=
In[17]:=
(titanic = pd["ReadCSV"[
     "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"]]) // Normal
Out[17]=

The descriptive statistics of the dataset:

In[18]:=
titanic["Describe"[]] // Normal
Out[18]=

The average fare paid:

In[19]:=
titanic["Fare"]["Mean"[]] // Normal
Out[19]=

The median age and fare of the passengers:

In[20]:=
titanic["Part"[{"Age", "Fare"}]]["Median"[]] // Normal
Out[20]=

Descriptive statistics of the specified columns:

In[21]:=
titanic["Part"[{"Age", "Fare"}]]["Describe"[]] // Normal
Out[21]=

Use the specific aggregating statistics for given columns instead of the predefined statistics:

In[22]:=
titanic["Aggregate"[<|"Age" -> {"min", "max", "median", "skew"}, "Fare" -> {"min", "max", "median", "mean"}|>]] // Normal
Out[22]=

Find the number of male and female passengers:

In[23]:=
titanic["Sex"]["ValueCounts"[]] // Normal
Out[23]=

The average age for male versus female passengers:

In[24]:=
 titanic["Part"[{"Sex", "Age"}]]["GroupBy"["Sex"]]["Mean"[]] // Normal
Out[24]=

Count the number of unique values in all the columns:

In[25]:=
titanic["NumberOfUnique"[]] // Normal
Out[25]=

Or in a specific column:

In[26]:=
titanic["Embarked"]["NumberOfUnique"[]] // Normal
Out[26]=

Pick passengers who embarked at the port "C" (Cherbourg):

In[27]:=
titanic[titanic["Embarked"] == "'C'"] // Normal
Out[27]=

Female passengers, whose fare was less than $10:

In[28]:=
fareMask = titanic["Fare"] < 10
Out[28]=
In[29]:=
sexMask = titanic["Sex"] == "'female'"
Out[29]=
In[30]:=
fareMask && sexMask
Out[30]=
In[31]:=
titanic[%] // Normal
Out[31]=

Find the correlation between the sex and survival by replacing string values in the "Sex" column with numbers using the String method "GetDummies" ("women first"):

In[32]:=
titanic["Sex"]["String"]["GetDummies"[]]["Head"[]] // Normal
Out[32]=
In[33]:=
titanic["Sex"]["String"]["GetDummies"[]][
  "corrwith"[titanic["Survived"]]] // Normal
Out[33]=

Summarize the survival rate by sex and cabin class using a spreadsheet-like pivot table (better survival rate for women and in the first class):

In[34]:=
titanic["PivotTable"["Survived", "Index" -> "Sex", "Columns" -> "Pclass"]] // Normal
Out[34]=

Find the missing values in the "Age" column:

In[35]:=
titanic["Age"]["IsNull"[]]
Out[35]=
In[36]:=
titanic[%] // Normal
Out[36]=

Statistics is calculated despite missing values:

In[37]:=
titanic["Age"]["Mean"[]] // Normal
Out[37]=

Drop the first few rows by giving the index values as labels and specifying "Axis"0 for a row-wise operation:

In[38]:=
titanic["Drop"[{0, 1, 2}, "Axis" -> 0]]["Head"[]] // Normal
Out[38]=

Prepare and show a histogram of age values:

In[39]:=
titanic["Age"]["Histogram"[]]
Out[39]=
In[40]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[40]=

Drop the specified column:

In[41]:=
titanic["Drop"[{"Sex"}, "Axis" -> 1]]["Head"[]] // Normal
Out[41]=

Create a copy of the object:

In[42]:=
tt = titanic["Copy"[]]
Out[42]=

Drop the "Sex" column in the object in-place:

In[43]:=
tt["Drop"[{"Sex"}, "Axis" -> 1, "InPlace" -> True]]
In[44]:=
tt["Head"[]] // Normal
Out[44]=

Sort the data by values of the specified column:

In[45]:=
titanic["SortValues"["Fare"]]["Head"[]] // Normal
Out[45]=

Prepare and show stacked filled plots of the numeric columns:

In[46]:=
titanic["Plot"]["Area"["Subplots" -> True], True]
Out[46]=
In[47]:=
ResourceFunction["PandasObject"][pd, "Show"[ImageSize -> Large]]
Out[47]=

Clean up the Python session:

In[48]:=
DeleteObject[pd["Session"]]

Scope (125) 

Object creation (2) 

In[49]:=
pd = ResourceFunction["PandasObject"][]
Out[49]=

Create a Series object from a list of values, letting pandas use a default integer index:

In[50]:=
pd["Series"[{1, 3, 5, None, 6, 8}]]
Out[50]=
In[51]:=
Normal[%]
Out[51]=
In[52]:=
DeleteObject[pd["Session"]]

Create a new panda object:

In[53]:=
pd = ResourceFunction["PandasObject"][]
Out[53]=

Create a date-time index array of six values by specifying the starting date in the form "yyyy-mm-dd":

In[54]:=
dates = pd["DateRange"["2013-01-01", "Periods" -> 6]]
Out[54]=
In[55]:=
dates // Normal
Out[55]=

Alternatively, specify the starting date without dashes:

In[56]:=
pd["DateRange"["20130101", "Periods" -> 6]] // Normal
Out[56]=

Construct a "DataFrame" object from the array of dates with labeled columns:

In[57]:=
pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> dates, "Columns" -> CharacterRange["A", "D"]]]
Out[57]=
In[58]:=
% // Normal
Out[58]=

Create a "DataFrame" from an association of objects that can be converted into a "Series"-like structure:

In[59]:=
pd["DataFrame"[<|
   "A" -> 1.0,
   "B" -> pd["Timestamp"["20130102"]],
   "C" -> pd["Series"[1, "index" -> Range[0, 3], "dtype" -> "float32"]],
   "D" -> ConstantArray[0, 4],
   "E" -> pd["Categorical"[{"test", "train", "test", "train"}]],
   "F" -> "foo"|>]]
Out[59]=
In[60]:=
% // Normal
Out[60]=
In[61]:=
DeleteObject[pd["Session"]]

Viewing data (16) 

In[62]:=
pd = ResourceFunction["PandasObject"][]
Out[62]=

Create a "DataFrame" object:

In[63]:=
df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]
Out[63]=

Get the preset number of top and bottom rows of the data frame:

In[64]:=
df["Head"[]] // Normal
Out[64]=
In[65]:=
df["Tail"[]] // Normal
Out[65]=

Get the specified number of top and bottom rows:

In[66]:=
df["Head"[2]] // Normal
Out[66]=
In[67]:=
df["Tail"[2]] // Normal
Out[67]=

Get the index:

In[68]:=
df["Index"]
Out[68]=
In[69]:=
% // Normal
Out[69]=

Alternatively:

In[70]:=
%%["ToList"[]]
Out[70]=

Get column names:

In[71]:=
df["Columns"]
Out[71]=
In[72]:=
Normal[%]
Out[72]=

Alternatively:

In[73]:=
%%["ToList"[]]
Out[73]=

Get a quick statistic summary of your data:

In[74]:=
df["Describe"[]] // Normal
Out[74]=

Sort by index values in descending order:

In[75]:=
df["SortIndex"["Ascending" -> False]]
Out[75]=
In[76]:=
Normal[%]
Out[76]=

Sort by column labels in descending order:

In[77]:=
df["SortIndex"["Axis" -> 1, "Ascending" -> False]] // Normal
Out[77]=

Sort by values in a column:

In[78]:=
 df["SortValues"["B"]] // Normal
Out[78]=

Transpose the data frame:

In[79]:=
df["T"]
Out[79]=

Get the dataset of the transposed values:

In[80]:=
% // Normal
Out[80]=
In[81]:=
Normal[%]
Out[81]=

Convert a "DataFrame" object to a NumPy array:

In[82]:=
df["ToNumpy"[]]
Out[82]=

Get the values in your Wolfram Language session:

In[83]:=
Normal[%] // Normal // MatrixForm
Out[83]=

Compare with the original dataset:

In[84]:=
df // Normal
Out[84]=
In[85]:=
DeleteObject[pd["Session"]]

Selection (26) 

Getting DataFrame Parts (9) 

Create a new pandas object:

In[86]:=
pd = ResourceFunction["PandasObject"][]
Out[86]=
In[87]:=
df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]
Out[87]=

Select a single column, treating a column label as property. Note that the returned object is a "Series" and not a "DataFrame":

In[88]:=
df["A"]
Out[88]=

Import the "Series" object as a TimeSeries:

In[89]:=
Normal[%]
Out[89]=

Plot the time series:

In[90]:=
DateListPlot[%]
Out[90]=

Select several columns using the "Part" syntax:

In[91]:=
df["Part"[{"A", "B"}]] // Normal
Out[91]=

Select several rows using the Span syntax:

In[92]:=
df["Part"[;; 3]] // Normal
Out[92]=

Use a Python object "Slice" to span rows by index values (rather than integer indices):

In[93]:=
slice = ResourceFunction["PythonObject"][pd["Session"], "slice"]
Out[93]=
In[94]:=
slice[All]["2013-01-02", "2013-01-04"]
Out[94]=
In[95]:=
df["Part"[%]] // Normal
Out[95]=

Alternatively, use the Python syntax to create a "Slice" object:

In[96]:=
ResourceFunction["PythonObject"][
 pd["Session"], "slice('2013-01-02','2013-01-04')"]
Out[96]=
In[97]:=
 df["Part"[%]] // Normal
Out[97]=

Delete rows and add a column by re-indexing:

In[98]:=
df["Reindex"["Index" -> df["Index"]["Part"[2 ;; 4]], "Columns" -> Flatten[{df["Columns"] // Normal, "E"}]]] // Normal
Out[98]=
In[99]:=
DeleteObject[pd["Session"]]
Selecting by label (8) 

Create a data frame object:

In[100]:=
pd = ResourceFunction["PandasObject"][]
Out[100]=
In[101]:=
(dates = pd["DateRange"["20130101", "Periods" -> 6]]) // Normal
Out[101]=
In[102]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> dates, "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[102]=

Use the "ByLabel" property of the "DataFrame" object to access values by label:

In[103]:=
loc = df["ByLabel"]
Out[103]=

Get a cross section of the data frame using an index label:

In[104]:=
dates["Part"[2]]
Out[104]=
In[105]:=
loc["Part"[%]] // Normal
Out[105]=

Alternatively, use the index value:

In[106]:=
loc["Part"["2013-01-02"]] // Normal
Out[106]=

Get a part corresponding to all rows of the specified columns (or, in the pandas parlance, select on a multi-axis by label):

In[107]:=
loc["Part"[;; , {"A", "B"}]] // Normal
Out[107]=

Get parts of a single row in a form of a "Series" object, rather than a data frame:

In[108]:=
loc["Part"["2013-01-02", {"A", "B"}]]
Out[108]=
In[109]:=
Normal[%]
Out[109]=

Get a scalar value:

In[110]:=
loc["Part"[dates["Part"[2]], "A"]]
Out[110]=

Equivalently, get fast access to a scalar using the "AtLabel" property of the "DataFrame" object:

In[111]:=
at = df["AtLabel"]
Out[111]=
In[112]:=
at["Part"[dates["Part"[2]], "A"]]
Out[112]=
In[113]:=
DeleteObject[pd["Session"]]
Selecting by position (9) 

Create a data frame object:

In[114]:=
pd = ResourceFunction["PandasObject"][]
Out[114]=
In[115]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[115]=

Use the "ByPosition" property of the "DataFrame" object to access values by position:

In[116]:=
iloc = df["ByPosition"]
Out[116]=
In[117]:=
iloc["Part"[2]] // Normal
Out[117]=

Use the Span syntax to access elements:

In[118]:=
iloc["Part"[3 ;; 5, 1 ;; 3]] // Normal
Out[118]=

Select by lists of integer positions:

In[119]:=
iloc["Part"[{2, 3, 5}, {1, 3}]] // Normal
Out[119]=

Slice rows explicitly:

In[120]:=
iloc["Part"[2 ;; 3]] // Normal
Out[120]=

Alternatively:

In[121]:=
iloc["Part"[2 ;; 3, ;;]] // Normal
Out[121]=

Slice columns explicitly:

In[122]:=
iloc["Part"[;; , 2 ;; 3]] // Normal
Out[122]=

Get a scalar value explicitly:

In[123]:=
iloc["Part"[2, 2]]
Out[123]=

Equivalently:

In[124]:=
iat = df["AtPosition"]
Out[124]=
In[125]:=
iat["Part"[2, 2]]
Out[125]=
In[126]:=
DeleteObject[pd["Session"]]

Hierarchical Indexing (4) 

Create a "DataFrame" object for an array with more than two dimensions:

In[127]:=
pd = ResourceFunction["PandasObject"][]
Out[127]=
In[128]:=
df = pd["DataFrame"[<|"vals" -> Range[4]|>, "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}]]
Out[128]=

The imported "DataFrame" object is represented as an Association with keys in the form of a list:

In[129]:=
df // Normal
Out[129]=
In[130]:=
% // Normal
Out[130]=

In Python, the index is represented as a "MultiIndex" object:

In[131]:=
df["Index"]
Out[131]=

Create a "MultiIndex" directly:

In[132]:=
tuples = Transpose[{{"bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"}, {"one", "two", "one", "two", "one", "two", "one", "two"}}]
Out[132]=
In[133]:=
index = pd["MultiIndex"]["FromTuples"][All][tuples]
Out[133]=
In[134]:=
(pd["DataFrame"[<|"vals" -> Range[8]|>, "Index" -> index]]) // Normal
Out[134]=
In[135]:=
DeleteObject[pd["Session"]]

Boolean indexing (2) 

In[136]:=
pd = ResourceFunction["PandasObject"][]
Out[136]=
In[137]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[137]=

Create an array of Boolean values for which values in the column "A" are positive:

In[138]:=
bool = df["A"] > 0
Out[138]=
In[139]:=
Normal[%]["Values"]
Out[139]=

Pick rows for which the condition holds:

In[140]:=
df["Part"[bool]] // Normal
Out[140]=

A Boolean array for positive "DataFrame" values:

In[141]:=
(positive = df > 0) // Normal
Out[141]=

Select values from a "DataFrame" where a Boolean condition is met:

In[142]:=
df["Part"[positive]] // Normal
Out[142]=
In[143]:=
DeleteObject[pd["Session"]]

Create a data frame:

In[144]:=
pd = ResourceFunction["PandasObject"][]
Out[144]=
In[145]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], List /@ {"one", "one", "two", "three", "four", "three"}, 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "E"]]]) // Normal
Out[145]=

Use the "IsIn" method for filtering:

In[146]:=
df["Part"[df["E"]["IsIn"[{"two", "four"}]]]] // Normal
Out[146]=
In[147]:=
DeleteObject[pd["Session"]]

Setting Values (2) 

Create a "DataFrame":

In[148]:=
pd = ResourceFunction["PandasObject"][]
Out[148]=
In[149]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[149]=

Create a new "Series" object that is longer than the "DataFrame" and starts with a time offset:

In[150]:=
n = 7;
In[151]:=
s1 = pd["Series"[Range[n], "Index" -> pd["DateRange"[df["Index"]["Part"[2]], "Periods" -> n]]]]
Out[151]=
In[152]:=
Normal[%]["DatePath"] // Column
Out[152]=

Add the object as a new column to the data frame, automatically aligning the data by the indexes:

In[153]:=
df["Assign"["Part"["F"] -> s1]]
Out[153]=
In[154]:=
df // Normal
Out[154]=

Set values by label:

In[155]:=
date = df["Index"]["Part"[1]]
Out[155]=
In[156]:=
df["AtLabel"]["Assign"["Part"[date, "A"] -> 100]]
Out[156]=
In[157]:=
df // Normal
Out[157]=

Set values by position:

In[158]:=
df["AtPosition"]["Assign"["Part"[1, 2] -> -100]]
Out[158]=
In[159]:=
df // Normal
Out[159]=

Set values with an array:

In[160]:=
shape = df["Shape"]
Out[160]=
In[161]:=
df["ByLabel"][
 "Assign"["Part"[;; , "D"] -> ConstantArray[50, shape[[1]]]]]
Out[161]=
In[162]:=
df // Normal
Out[162]=

Create a NumPy array:

In[163]:=
arr = ResourceFunction["PythonObject"][pd["Session"], "numpy.array([10]*len(" <> df[] <> "))"]
Out[163]=
In[164]:=
arr // Normal // Normal
Out[164]=

Assign a column to a NumPy array by position:

In[165]:=
df["ByPosition"]["Assign"["Part"[;; , 3] -> arr]]
Out[165]=
In[166]:=
df // Normal
Out[166]=
In[167]:=
DeleteObject[pd["Session"]]

Create a "DataFrame":

In[168]:=
pd = ResourceFunction["PandasObject"][]
Out[168]=
In[169]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[169]=

A Boolean array of values where a condition is met:

In[170]:=
(bool = df > 0) // Normal
Out[170]=

Replace values where the condition is not met:

In[171]:=
df["where"[bool]] // Normal
Out[171]=

Negate positive values using the where operation with setting:

In[172]:=
df["Assign"["Part"[df > 0] -> -df]]
Out[172]=
In[173]:=
df // Normal
Out[173]=
In[174]:=
DeleteObject[pd["Session"]]

Missing data (5) 

Create a "DataFrame" with missing values:

In[175]:=
pd = ResourceFunction["PandasObject"][]
Out[175]=
In[176]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {4, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 4]], "Columns" -> CharacterRange["A", "D"]]]) // Normal
Out[176]=
In[177]:=
(df = df["Reindex"["Columns" -> CharacterRange["A", "E"]]]) // Normal
Out[177]=

Assign some values in the "DataFrame":

In[178]:=
df["ByPosition"]["Assign"["Part"[1 ;; 2, -1] -> 1]]
Out[178]=
In[179]:=
df // Normal
Out[179]=

Drop rows with missing data:

In[180]:=
df["DropMissing"["How" -> "any"]] // Normal
Out[180]=

Fill missing data:

In[181]:=
df["FillMissing"["Value" -> 5]] // Normal
Out[181]=

Get the Boolean mask where values are missing:

In[182]:=
pd["IsMissing"[df]] // Normal
Out[182]=
In[183]:=
DeleteObject[pd["Session"]]

Binary Operations (16) 

Functions vs. Operators (10) 

Define a "DataFrame" object:

In[184]:=
pd = ResourceFunction["PandasObject"][]
Out[184]=
In[185]:=
(df = pd[
    "DataFrame"[<|"angles" -> {0, 3, 4}, "degrees" -> {360, 180, 360}|>,
      "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[185]=

Add a scalar using an arithmetic operator version:

In[186]:=
df + 1 // Normal
Out[186]=

Equivalently, use a function version:

In[187]:=
df["Add"[1]] // Normal
Out[187]=

Subtract a list:

In[188]:=
df["Subtract"[{1, 2}, "Axis" -> "columns"]] // Normal
Out[188]=

Subtract a "Series" object:

In[189]:=
(s = pd["Series"[{1, 1, 1}, "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[189]=
In[190]:=
df["Subtract"[s, "Axis" -> "index"]] // Normal
Out[190]=

Multiply a dictionary by axis:

In[191]:=
df["Multiply"[<|"angles" -> 0, "degrees" -> 2|>]] // Normal
Out[191]=
In[192]:=
df["Multiply"[<|"circle" -> 0, "triangle" -> 2, "rectangle" -> 3|>, "Axis" -> "index"]] // Normal
Out[192]=

Multiply a "DataFrame" of different shape using the operator version:

In[193]:=
(other = pd["DataFrame"[<|"angles" -> {0, 3, 4}|>, "Index" -> {"circle", "triangle", "rectangle"}]]) // Normal
Out[193]=
In[194]:=
df*other // Normal
Out[194]=

Use the function version to fill in missing values:

In[195]:=
df["Multiply"[other, "FillValue" -> 0]] // Normal
Out[195]=

Create a "DataFrame" with a hierarchical "MultiIndex":

In[196]:=
(dfm = pd[
    "DataFrame"[<|"angles" -> {0, 3, 4, 4, 5, 6}, "degrees" -> {360, 180, 360, 360, 540, 720}|>, "Index" -> {{" A", " A", " A", " B", " B", " B"}, {"circle", "triangle", "rectangle", "square", "pentagon", "hexagon"}}]]) // Normal
Out[196]=

Divide by the hierarchical "DataFrame" specifying the level:

In[197]:=
df["Divide"[dfm, "Level" -> 1, "FillValue" -> 0]] // Normal
Out[197]=
In[198]:=
DeleteObject[pd["Session"]]
Matching / Broadcasting Behavior (6) 

Create a data frame:

In[199]:=
pd = ResourceFunction["PandasObject"][]
Out[199]=
In[200]:=
(df = pd["DataFrame"[<|
       "one" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"a", "b", "c"}]], "two" -> pd["Series"[RandomReal[{-1, 1}, {4}], "Index" -> {"a", "b", "c", "d"}]],
      "three" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"b", "c", "d"}]]
      |>
     ]]) // Normal
Out[200]=

Select a row and a column:

In[201]:=
(row = df["ByPosition"]["Part"[2]]) // Normal
Out[201]=
In[202]:=
(column = df["two"]) // Normal
Out[202]=

Use the "Axis" option to match on the index or columns:

In[203]:=
df["Subtract"[row, "Axis" -> "columns"]] // Normal
Out[203]=
In[204]:=
df["Subtract"[column, "Axis" -> "index"]] // Normal
Out[204]=

Alternatively:

In[205]:=
df["Subtract"[row, "Axis" -> 1]] // Normal
Out[205]=
In[206]:=
df["Subtract"[column, "Axis" -> 0]] // Normal
Out[206]=

Use the built-in Python divmod function with a Series object to take the floor division and modulo operation at the same time:

In[207]:=
divmod = ResourceFunction["PythonObject"][pd["Session"], "divmod"]
Out[207]=
In[208]:=
 (s = pd["Series"[Range[0, 9]]]) // Normal
Out[208]=
In[209]:=
{div, rem} = divmod[All][s, 3]
Out[209]=

Do elementwise divmod:

In[210]:=
{div, rem} = divmod[All][s, {2, 2, 3, 3, 4, 4, 5, 5, 6, 6}]
Out[210]=
In[211]:=
DeleteObject[pd["Session"]]

Stats (4) 

Create a data frame:

In[212]:=
pd = ResourceFunction["PandasObject"][]
Out[212]=
In[213]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal
Out[213]=

Perform descriptive statistics:

In[214]:=
df["Mean"[]] // Normal
Out[214]=

Same operation on the other axis:

In[215]:=
Normal[df["Mean"[1]]]["DatePath"]
Out[215]=

Operate on objects that have different dimensionality with alignment:

In[216]:=
Normal[(s = pd["Series"[{1, 3, 5, None, 6, 8}, "Index" -> df["Index"]]][
      "shift"[2]])]["DatePath"] // Column
Out[216]=
In[217]:=
df["Subtract"[s, "Axis" -> "index"]] // Normal
Out[217]=
In[218]:=
DeleteObject[pd["Session"]]

Applying Functions (4) 

Create a data frame:

In[219]:=
pd = ResourceFunction["PandasObject"][]
Out[219]=
In[220]:=
(df = pd[
    "DataFrame"[
     Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal
Out[220]=

Import a NumPy function:

In[221]:=
ExternalFunction[pd["Session"], "numpy.cumsum"]
Out[221]=

Apply the function:

In[222]:=
df["Apply"[%]] // Normal
Out[222]=

Create and apply a lambda function:

In[223]:=
ExternalFunction[pd["Session"], "lambda x: x.max() - x.min()"]
Out[223]=
In[224]:=
df["Apply"[%]] // Normal
Out[224]=
In[225]:=
DeleteObject[pd["Session"]]

Histogramming (2) 

Create a "Series" object:

In[226]:=
pd = ResourceFunction["PandasObject"][]
Out[226]=
In[227]:=
(s = pd["Series"["data" -> RandomInteger[{0, 7}, {10}]]]) // Normal
Out[227]=

Count unique values in the series:

In[228]:=
s["ValueCounts"[]] // Normal
Out[228]=
In[229]:=
DeleteObject[pd["Session"]]

String Methods (2) 

Create a "Series" object:

In[230]:=
pd = ResourceFunction["PandasObject"][]
Out[230]=
In[231]:=
(s = pd["Series"[{"A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"}]]) // Normal
Out[231]=

Use the "String" attribute to operate on each element of the series:

In[232]:=
str = s["String"]
Out[232]=
In[233]:=
str["Lower"[]] // Normal
Out[233]=
In[234]:=
DeleteObject[pd["Session"]]

Merging (7) 

Concatenating (4) 

Create a data frame:

In[235]:=
pd = ResourceFunction["PandasObject"][]
Out[235]=
In[236]:=
n = 3;
In[237]:=
(df = pd["DataFrame"[RandomReal[{-1, 1}, {n, 2}]]]) // Normal
Out[237]=

Break it into pieces:

In[238]:=
Normal /@ (pieces = df["Part"[# ;; #]] & /@ Range[n])
Out[238]=

Concatenate the pieces:

In[239]:=
(df1 = pd["Concatenate"[pieces]]) // Normal
Out[239]=

The concatenated object is the same as the original:

In[240]:=
df["Equals"[df1]]
Out[240]=
In[241]:=
DeleteObject[pd["Session"]]
Database-Style Joining  (3) 

Create two "DataFrame" objects:

In[242]:=
pd = ResourceFunction["PandasObject"][]
Out[242]=
In[243]:=
(left = pd[
    "DataFrame"[<|"key" -> {"foo", "foo"}, "left" -> {1, 2}|>]]) // Normal
Out[243]=
In[244]:=
(right = pd["DataFrame"[<|"key" -> {"foo", "foo"}, "right" -> {4, 5}|>]]) // Normal
Out[244]=

Merge the objects in SQL style:

In[245]:=
pd["Merge"[left, right, "On" -> "key"]] // Normal
Out[245]=

Alternatively, with different keys:

In[246]:=
(left = pd[
    "DataFrame"[<|"key" -> {"foo", "bar"}, "left" -> {1, 2}|>]]) // Normal
Out[246]=
In[247]:=
(right = pd["DataFrame"[
     "data" -> <|"key" -> {"foo", "bar"}, "right" -> {4, 5}|>]]) // Normal
Out[247]=
In[248]:=
pd["Merge"[left, right, "On" -> "key"]] // Normal
Out[248]=
In[249]:=
DeleteObject[pd["Session"]]

Grouping (3) 

Create a "DataFrame" object:

In[250]:=
pd = ResourceFunction["PandasObject"][]
Out[250]=
In[251]:=
(df = pd[
    "DataFrame"[<|
      "A" -> {"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"},
      "B" -> {"one", "one", "two", "three", "two", "two", "one", "three"},
      "C" -> RandomInteger[10, {8}],
      "D" -> RandomInteger[10, {8}]
      |>]]) // Normal
Out[251]=

Group by values of the column "A", sum values in the groups and combine the results in a new "DataFrame" object:

In[252]:=
df["GroupBy"["A"]]["Sum"[]]
Out[252]=
In[253]:=
% // Normal
Out[253]=

Group by multiple columns forming a hierarchical index and apply the summing function to each group:

In[254]:=
df["GroupBy"[{"A", "B"}]]["Sum"[]] // Normal
Out[254]=
In[255]:=
DeleteObject[pd["Session"]]

Reshaping (5) 

Pivoting (2) 

Create a "DataFrame" object:

In[256]:=
pd = ResourceFunction["PandasObject"][]
Out[256]=
In[257]:=
(df = pd[
    "DataFrame"[<|"foo" -> {"one", "one", "one", "two", "two", "two"},
       "bar" -> {"A", "B", "C", "A", "B", "C"}, "baz" -> Range[6], "zoo" -> {"x", "y", "z", "q", "w", "t"}|>]]) // Normal
Out[257]=

Organize the object by index and column values:

In[258]:=
df["Pivot"["Index" -> "foo", "Columns" -> "bar", "Values" -> "baz"]] // Normal
Out[258]=
In[259]:=
DeleteObject[pd["Session"]]
Stacking (3) 

Create a "DataFrame" object with a hierarchical index:

In[260]:=
pd = ResourceFunction["PandasObject"][]
Out[260]=
In[261]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {4, 2}], "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}, "Columns" -> {"A", "B"}]]) // Normal
Out[261]=

Stack the object by "compressing" a level in the columns:

In[262]:=
(stacked = df["Stack"[]]) // Normal
Out[262]=

Reverse the operation by "unstacking" the last level:

In[263]:=
stacked["Unstack"[]] // Normal
Out[263]=
In[264]:=
stacked["Unstack"[0]] // Normal
Out[264]=
In[265]:=
stacked["Unstack"[1]] // Normal
Out[265]=
In[266]:=
DeleteObject[pd["Session"]]

Time Series Resampling (3) 

Create a series with 9 one-second timestamps:

In[267]:=
pd = ResourceFunction["PandasObject"][]
Out[267]=
In[268]:=
dates = pd[
  "DateRange"["1/1/2000", "Periods" -> 9, "Frequency" -> "S"]]
Out[268]=
In[269]:=
( s = pd["Series"[Range[dates["Size"]], "Index" -> dates]]) // Normal
Out[269]=

Downsample the series into 3-second bins and sum the values falling into each bin:

In[270]:=
(s["Resample"["3S"]]["Sum"[]]) // Normal
Out[270]=
In[271]:=
%["DatePath"]
Out[271]=

Check the sums:

In[272]:=
Total /@ Partition[Range[dates["Size"]], 3]
Out[272]=
In[273]:=
DeleteObject[pd["Session"]]

Create a time series object with dates given in the local time zone:

In[274]:=
pd = ResourceFunction["PandasObject"][]
Out[274]=
In[275]:=
dates = pd[
  "DateRange"["3/6/2012 00:00", "Periods" -> 2, "Frequency" -> "D"]]
Out[275]=
In[276]:=
s = pd["Series"[{1, 2}, "index" -> dates]]
Out[276]=

Check the dates:

In[277]:=
s["Index"] // Normal
Out[277]=

Localize the series to the UTC time zone and check the dates:

In[278]:=
sUTC = s["TimeZoneLocalize"["UTC"]]
Out[278]=
In[279]:=
%["Index"] // Normal
Out[279]=

Convert the series to another time zone:

In[280]:=
sUTC["TimeZoneConvert"["US/Eastern"]]
Out[280]=
In[281]:=
%["Index"] // Normal
Out[281]=
In[282]:=
DeleteObject[pd["Session"]]

Create a series with quarterly frequency for a year, ending in November:

In[283]:=
pd = ResourceFunction["PandasObject"][]
Out[283]=
In[284]:=
periods = pd["PeriodRange"["1990Q1", "2000Q4", "Frequency" -> "Q-NOV"]]
Out[284]=
In[285]:=
s = pd["Series"[RandomReal[1, periods["Size"]], "Index" -> periods]]
Out[285]=

Check the start dates of a few periods in the series:

In[286]:=
Normal[s["Head"[]]]["Dates"]
Out[286]=

Convert the series to 9 AM of the end of the month following the quarter end and check starting dates again:

In[287]:=
s["Assign"[
  "index" -> (periods[
        "AsFrequency"["Frequency" -> "M", "how" -> "e"]] + 1)[
     "AsFrequency"["Frequency" -> "H", "how" -> "s"]] + 9]]
Out[287]=
In[288]:=
Normal[s["Head"[]]]["Dates"]
Out[288]=
In[289]:=
DeleteObject[pd["Session"]]

Categoricals (8) 

Create a "DataFrame" with a column whose values are taken from a limited alphabet:

In[290]:=
pd = ResourceFunction["PandasObject"][]
Out[290]=
In[291]:=
(df = pd[
    "DataFrame"[<|"id" -> {1, 2, 3, 4, 5, 6}, "raw_grade" -> {"a", "b", "b", "a", "a", "e"}|>]]) // Normal
Out[291]=

Convert the raw grades to a categorical data type:

In[292]:=
df["Assign"[
  "Part"["grade"] -> df["Part"["raw_grade"]]["AsType"["category"]]]]
Out[292]=
In[293]:=
df // Normal
Out[293]=

The current categories:

In[294]:=
df["grade"]["cat"]["categories"] // Normal
Out[294]=

Rename the categories to more meaningful names in place:

In[295]:=
 df["Assign"[
  "grade.cat.categories" -> {"very good", "good", "very bad"}]]
Out[295]=
In[296]:=
df // Normal
Out[296]=

Reorder the categories and simultaneously add the missing categories:

In[297]:=
df["grade", True]["cat"][
 "SetCategories"[{ "very bad", "bad", "medium", "good", "very good"}]]
Out[297]=
In[298]:=
df["Assign"["grade" -> %]]
Out[298]=
In[299]:=
df // Normal
Out[299]=

Sort by order in the categories:

In[300]:=
df["SortValues"["grade"]] // Normal
Out[300]=

Sort by values in the "raw_grade" column (in lexicographic order):

In[301]:=
df["SortValues"["raw_grade"]] // Normal
Out[301]=

Group by a categorical column, showing empty categories:

In[302]:=
df["GroupBy"["grade"]]["Size"[]] // Normal
Out[302]=
In[303]:=
DeleteObject[pd["Session"]]

Plotting (5) 

Construct a simple data frame object:

In[304]:=
pd = ResourceFunction["PandasObject"][]
Out[304]=
In[305]:=
n = 100;
In[306]:=
df = pd["DataFrame"[<|"column1" -> RandomInteger[{0, 20}, n], "column2" -> RandomInteger[{20, 50}, n]|>]]
Out[306]=

Create a plot of column values with labels:

In[307]:=
df["Plot"[]]
Out[307]=

Show the plot in the default ("PNG") format:

In[308]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[308]=

Show the plot as a vector graphics:

In[309]:=
ResourceFunction["PandasObject"][pd, "Show"["SVG"]]
Out[309]=

Export the plot to a file from Python:

In[310]:=
ResourceFunction["PandasObject"][pd, "Export"["myfile", "SVG"]]
Out[310]=

Import the file:

In[311]:=
ResourceFunction["SVGImport"][%]
Out[311]=

Delete the file:

In[312]:=
DeleteFile[FileNames[%%, $HomeDirectory]]

Clear the plot figure:

In[313]:=
ResourceFunction["PandasObject"][pd, "Clear"[]]

Plot the specified column:

In[314]:=
df["column1"]["Plot"[]]
Out[314]=
In[315]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[315]=

Plot one column versus another:

In[316]:=
df["SortValues"["column1"], True]
Out[316]=
In[317]:=
%["Plot"["x" -> "column1", "y" -> "column2"]]
Out[317]=
In[318]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[318]=
In[319]:=
DeleteObject[pd["Session"]]

Create a time series:

In[320]:=
pd = ResourceFunction["PandasObject"][]
Out[320]=
In[321]:=
n = 500;
In[322]:=
s = pd["Series"[RandomReal[{-1, 1}, {n}], "Index" -> pd["DateRange"["1/1/2000", "Periods" -> n]]], True]
Out[322]=

Compute its cumulative sum:

In[323]:=
s = s["CumulativeSum"[]]
Out[323]=

Prepare a plot of the time series:

In[324]:=
s["Plot"[]]
Out[324]=

Show the plot:

In[325]:=
ResourceFunction["PandasObject"][s, "Show"[]]
Out[325]=
In[326]:=
DeleteObject[pd["Session"]]

Create a data frame:

In[327]:=
pd = ResourceFunction["PandasObject"][]
Out[327]=
In[328]:=
df = pd["DataFrame"[RandomReal[1, {10, 4}], "Columns" -> {"a", "b", "c", "d"}]]
Out[328]=

List available plot types:

In[329]:=
df["Plot"]["Information"]["Methods"] // Rest
Out[329]=

Create a bar plot:

In[330]:=
df["Plot"]["Bar"[]]
Out[330]=
In[331]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[331]=

Alternatively, use the "Plot" method of the "DataFrame" object:

In[332]:=
ResourceFunction["PandasObject"][pd, "Clear"[]]
In[333]:=
df["Plot"["Kind" -> "bar"]]
Out[333]=

A stacked horizontal plot:

In[334]:=
df["Plot"["Kind" -> "barh", "stacked" -> True]]
Out[334]=
In[335]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[335]=

A box plot:

In[336]:=
df["Plot"["Kind" -> "box"]]
Out[336]=
In[337]:=
ResourceFunction["PandasObject"][df, "Show"[]]
Out[337]=

Pass keywords supported by the resource function MatplotlibObject "boxplot":

In[338]:=
df["Plot"["Kind" -> "box", "vert" -> False]]
Out[338]=
In[339]:=
ResourceFunction["PandasObject"][df, "Show"[]]
Out[339]=
In[340]:=
DeleteObject[pd["Session"]]

Create a data frame with normally-distributed values:

In[341]:=
pd = ResourceFunction["PandasObject"][]
Out[341]=
In[342]:=
df = pd["DataFrame"[
   RandomVariate[NormalDistribution[0, 1], {1000, 4}], "columns" -> CharacterRange["a", "d"]]]
Out[342]=

Create a scatter matrix plot using the "ScatterMatrix" method from pandas.plotting:

In[343]:=
pd["plotting"]["ScatterMatrix"][All, True][df, "alpha" -> 0.4, "figsize" -> {6, 6}, "diagonal" -> "kde"]
Out[343]=
In[344]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[344]=
In[345]:=
DeleteObject[pd["Session"]]

Create a time series of a cumulative random process:

In[346]:=
pd = ResourceFunction["PandasObject"][]
Out[346]=
In[347]:=
price = pd[
  "Series"[FoldList[Plus, RandomReal[{-1, 1}, {150}]], "Index" -> pd["DateRange"["2000-1-1", "Periods" -> 150, "Frequency" -> "B"]]]]
Out[347]=

Compute a moving average and standard deviation of the process:

In[348]:=
ma = price["Rolling"[20]]["Mean"[]]
Out[348]=
In[349]:=
std = price["Rolling"[20]]["StandardDeviation"[]]
Out[349]=

Prepare a temporal plot of the prices, the mean values, and the Bollinger band using a MatplotlibObject:

In[350]:=
plt = ResourceFunction["MatplotlibObject"][pd]
Out[350]=
In[351]:=
plt["plot"[price["Index"], price, "k"]]
Out[351]=
In[352]:=
plt["plot"[ma["Index"], ma, "b"]]
Out[352]=
In[353]:=
plt["fill_between"[std["Index"], ma - 2*std, "y2" -> ma + 2*std, "color" -> "b", "alpha" -> 0.2]]
Out[353]=

Show the plots:

In[354]:=
ResourceFunction["MatplotlibObject"][price, "Show"[]]
Out[354]=
In[355]:=
DeleteObject[pd["Session"]]

Importing and exporting data (9) 

CSV (4) 
In[356]:=
pd = ResourceFunction["PandasObject"][]
Out[356]=
In[357]:=
Normal[df = pd["DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]]
Out[357]=

Write to a CSV file:

In[358]:=
fname = FileNameJoin[{$TemporaryDirectory, "foo.csv"}];
In[359]:=
 df["ToCSV"[fname, "Index" -> False]]

Print contents of the file:

In[360]:=
FilePrint[fname]

Read the CSV file as a "DataFrame" object:

In[361]:=
pd["ReadCSV"[fname]]
Out[361]=
In[362]:=
% // Normal
Out[362]=

Clean up:

In[363]:=
DeleteFile[fname]
In[364]:=
DeleteObject[pd["Session"]]

problematic in 2.2.0:

Excel (5) 

Create a new pandas object:

In[365]:=
pd = ResourceFunction["PandasObject"][]
Out[365]=
In[366]:=
(df = pd[
    "DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]) // Normal
Out[366]=

Write to an Excel file:

In[367]:=
fname = FileNameJoin[{$TemporaryDirectory, "foo.xlsx"}];
In[368]:=
df["ToExcel"[fname, "SheetName" -> "Sheet1"]]

Check the file:

In[369]:=
Import[fname]
Out[369]=

Read the file as "DataFrame":

In[370]:=
pd["ReadExcel"[fname]]
Out[370]=
In[371]:=
% // Normal
Out[371]=

Clean up:

In[372]:=
DeleteFile[fname]
In[373]:=
DeleteObject[pd["Session"]]

Applications (5) 

Use PandasObject to perform data analysis in Python when importing data to the Wolfram Language is impractical or undesirable. Download a county business patterns file from the US Census database and unzip it to a temporary directory:

In[374]:=
fname = ExtractArchive[
    "https://www2.census.gov/programs-surveys/cbp/datasets/2020/cbp20us.zip", $TemporaryDirectory] // First;

Check the timing of creating a dataset in the Wolfram Language:

In[375]:=
(ds = Import[fname, {"CSV", "Dataset"}, "HeaderLines" -> 1];) // AbsoluteTiming
Out[375]=

Import the data to a "DataFrame" in Python and check the timing:

In[376]:=
pd = ResourceFunction["PandasObject"][]
Out[376]=
In[377]:=
(df = pd["ReadCSV"[fname]]) // AbsoluteTiming
Out[377]=

The first few lines of the dataset:

In[378]:=
df["Head"[2]] // Normal
Out[378]=

Compare to the dataset imported to the Wolfram Language:

In[379]:=
Take[ds, 2]
Out[379]=
In[380]:=
DeleteObject[pd["Session"]]

Properties and Relations (7) 

PandasObject[] gives the same result as the resource function PythonObject with a special configuration:

In[381]:=
session = StartExternalSession["Python"];
In[382]:=
ResourceFunction["PandasObject"][session]
Out[382]=
In[383]:=
ResourceFunction["PythonObject"][session, "pandas", "Configuration" -> ResourceFunction["PandasObject"]]
Out[383]=
In[384]:=
DeleteObject[session]

Get information on a pandas object:

In[385]:=
pd = ResourceFunction["PandasObject"][]
Out[385]=
In[386]:=
pd["Information"]
Out[386]=

Open the user guide in your default web browser:

In[387]:=
pd["WebInformation"] // SystemOpen

In[388]:=
DeleteObject[pd["Session"]]

Some of the functions and classes available in the pandas module:

In[389]:=
pd = ResourceFunction["PandasObject"][]
Out[389]=
In[390]:=
pd["FullInformation", "Functions"] // Short[#, 3] &
Out[390]=
In[391]:=
pd["FullInformation", "Classes"] // Short[#, 3] &
Out[391]=

Information on a class:

In[392]:=
pd["Information", "DataFrame"] // Short
Out[392]=

The web documentation for a class:

In[393]:=
pd["WebInformation", "DataFrame"] // SystemOpen

In[394]:=
DeleteObject[pd["Session"]]

pandas’s "DataFrame" is analogous to Dataset, but keeps the object on the Python side:

In[395]:=
pd = ResourceFunction["PandasObject"][]
Out[395]=
In[396]:=
pd["DataFrame"[<|"a" -> {1, 2, 3}, "b" -> {4, 5, 6}|>]]
Out[396]=

Print the object in Python:

In[397]:=
%["Evaluate"["print(``)"]]

Transfer the data from Python to create a Dataset:

In[398]:=
Normal[%%]
Out[398]=
In[399]:=
Head[%]
Out[399]=
In[400]:=
DeleteObject[pd["Session"]]

Many pandas operations are parallel to operations on Dataset:

In[401]:=
pd = ResourceFunction["PandasObject"][]
Out[401]=
In[402]:=
n = 500;
In[403]:=
df = pd["DataFrame"[<|
    "a" -> RandomVariate[NormalDistribution[0, 1], {n}], "b" -> RandomInteger[100, {n}]|>]]
Out[403]=
In[404]:=
ds = Normal[df]
Out[404]=

Select rows satisfying a condition:

In[405]:=
df[df["a"] > 2.5] // Normal
Out[405]=
In[406]:=
ds[Select[#a > 2.5 &]]
Out[406]=

Plot histograms of the columns:

In[407]:=
df["Histogram"[], True]
Out[407]=
In[408]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[408]=
In[409]:=
Histogram[ds[All, #], PlotLabel -> #] & /@ Normal[Keys[First[ds]]]
Out[409]=
In[410]:=
DeleteObject[pd["Session"]]

Similarly, pandas’s "Series" object is analogous to TimeSeries:

In[411]:=
pd = ResourceFunction["PandasObject"][]
Out[411]=
In[412]:=
n = 20;
In[413]:=
s = pd["Series"[RandomReal[1, {n}], "index" -> pd["DateRange"["20130102", "periods" -> n]]]]
Out[413]=
In[414]:=
ts = Normal[%]
Out[414]=

Plot the time series in Python:

In[415]:=
s["Plot"[]]
Out[415]=
In[416]:=
ResourceFunction["PandasObject"][pd, "Show"[]]
Out[416]=

Plot the imported time series with DateListPlot:

In[417]:=
DateListPlot[ts]
Out[417]=
In[418]:=
DeleteObject[pd["Session"]]

Create a "DataFrame" and a Boolean mask for positive values:

In[419]:=
pd = ResourceFunction["PandasObject"][]
Out[419]=
In[420]:=
(df = pd[
    "DataFrame"[RandomReal[{-1, 1}, {3, 2}], "Columns" -> {"A", "B"}]]) // Normal
Out[420]=
In[421]:=
df > 0 // Normal
Out[421]=

PythonObject allows you to apply Python commands directly, and bring the results back to the Wolfram Language if necessary:

In[422]:=
df["Evaluate"["``>0"]]
Out[422]=
In[423]:=
% // Normal
Out[423]=

Alternatively:

In[424]:=
ExternalEvaluate[pd["Session"], df[] <> ">0"]
Out[424]=
In[425]:=
DeleteObject[pd["Session"]]

Possible Issues (2) 

Create a pandas object:

In[426]:=
pd = ResourceFunction["PandasObject"][]
Out[426]=

Since NumPy arrays have a single data type for the entire array (dtype), importing a NumPy array to the Wolfram Language may fail if one of the columns cannot be imported directly:

In[427]:=
(df = pd["DataFrame"[<|
       "A" -> Range[3],
      "B" -> RandomReal[1, {3}],
       "C" -> {"foo", "bar", "baz"}
      |>
       ]]) // Normal
Out[427]=
In[428]:=
vals = df["Values", True]
Out[428]=
In[429]:=
Normal[%]
Out[429]=

Drop the problematic column to import only numeric values:

In[430]:=
df["Drop"["C", "Axis" -> 1]]["Values"] // Normal // Normal // MatrixForm
Out[430]=

Import the rest:

In[431]:=
df["C"]["ToList"[]] // Normal
Out[431]=

Alternatively, import the entire array by converting the NumPy array to list:

In[432]:=
vals["tolist"[]]
Out[432]=
In[433]:=
DeleteObject[pd["Session"]]

Create two "Series" objects:

In[434]:=
pd = ResourceFunction["PandasObject"][]
Out[434]=
In[435]:=
s = pd["Series"[RandomReal[{-1, 1}, {100}]]]
Out[435]=
In[436]:=
s1 = pd["Series"[RandomReal[5, {200}]]]
Out[436]=

The Matplotlib library, used by pandas, accumulates the results of successive plot commands:

In[437]:=
(#["Plot"[]]; ResourceFunction["PandasObject"][pd, "Show"[]]) & /@ {s,
   s1}
Out[437]=

Clear the plot figure to create separate plots instead:

In[438]:=
(ResourceFunction["PandasObject"][pd, "Clear"[]]; #["Plot"[]]; ResourceFunction["PandasObject"][pd, "Show"[]]) & /@ {s, s1}
Out[438]=
In[439]:=
DeleteObject[pd["Session"]]

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.1 – 21 February 2024
  • 1.0.0 – 16 December 2022

Source Metadata

Related Resources

Author Notes

The documentation last updated with:

In[1]:=
PandasObject[]["Version"]

License Information