Function Repository Resource:

PandasObject


Use the Python package pandas for data science in Wolfram Language

Contributed by: Igor Bakshee, with examples adapted from the pandas documentation

ResourceFunction["PandasObject"][]

returns a configured PythonObject for the Python package pandas in a new Python session.

ResourceFunction["PandasObject"][session]

uses the specified running ExternalSessionObject session.

ResourceFunction["PandasObject"][…,"func"[args,opts]]

executes the function func with the specified arguments and options.

Details and Options

The Python package pandas is a toolkit for data science. It offers, in particular, data structures and operations for the manipulation and analysis of numerical tables and time series.

ResourceFunction["PandasObject"] sets up a configuration of the resource function PythonObject that makes working with pandas more convenient and returns the resulting Python object.

ResourceFunction["PandasObject"] makes the Python-side functions and variables accessible by new names that are closer to the usual Wolfram Language conventions. For instance, for lookup properties:

"loc"     "ByLabel"       selection by label
"iloc"    "ByPosition"    selection by integer position
"at"      "AtLabel"       label-based scalar lookup
"iat"     "AtPosition"    integer-based scalar lookup
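For orientation, the four underlying pandas accessors behave as follows on the Python side. This is a minimal sketch in plain Python; the frame df and its labels are made up for illustration. Note that Python itself counts positions from zero.

```python
import pandas as pd

# Hypothetical 3x2 frame with string row labels (illustration only).
df = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                  index=["x", "y", "z"])

row_by_label = df.loc["y"]          # "ByLabel": selection by label
row_by_position = df.iloc[1]        # "ByPosition": selection by integer position (zero-based here)
scalar_by_label = df.at["y", "A"]   # "AtLabel": label-based scalar lookup
scalar_by_position = df.iat[1, 0]   # "AtPosition": integer-based scalar lookup
```

Both row lookups return the same "Series", and both scalar lookups return the same number.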

For a Python object p, p["ToPythonName","wlname"] gives the native Python name corresponding to the Wolfram Language name wlname and p["FromPythonName","pname"] gives the respective Wolfram Language name for the Python-side name pname. In the object p, both wlname and pname can be used interchangeably.

p["RenamingRules"] gives a list of all renaming rules in the form {"wlname₁"→"pname₁",…}.
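The idea of interchangeable names can be pictured as a two-way lookup table. The sketch below is only an illustration in plain Python, not how PythonObject is implemented; the rules are limited to the four lookup properties listed earlier, and the helper names are hypothetical.

```python
# Hypothetical subset of renaming rules: Wolfram Language name -> Python name.
RENAMING_RULES = {
    "ByLabel": "loc",
    "ByPosition": "iloc",
    "AtLabel": "at",
    "AtPosition": "iat",
}
# Reverse table: Python name -> Wolfram Language name.
PYTHON_TO_WL = {py: wl for wl, py in RENAMING_RULES.items()}


def to_python_name(name: str) -> str:
    """Return the Python-side name; names without a rule pass through unchanged."""
    return RENAMING_RULES.get(name, name)


def from_python_name(name: str) -> str:
    """Return the Wolfram Language name; names without a rule pass through unchanged."""
    return PYTHON_TO_WL.get(name, name)
```

Because unknown names pass through unchanged, either spelling resolves to the same Python-side attribute.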

p["FullInformation","Functions"] gives a list of the available functions and p["Information","func"] gives the signature of the specified function.

p["WebInformation"] gives a link to the pandas documentation that can be opened with SystemOpen.

Typically, the Wolfram Language signature of a pandas function closely resembles the Python-side signature, with Python-side objects represented by the resource function PythonObject and with possible extensions suited to the Wolfram Language.

Like Python's pandas library, ResourceFunction["PandasObject"] uses the Python package Matplotlib, which is invoked via the resource function MatplotlibObject.

Additional utility functions are available in the form ResourceFunction["PandasObject"][obj,"func"[args,opts]], where obj can be an ExternalSessionObject session or any PythonObject p defined in the session. They include the utility functions of the resource function MatplotlibObject:

"Show"["fmt"]             display the Python graphics
"Export"["file","fmt"]    export the graphics

ResourceFunction["PandasObject"][p,"Show"[…]] and ResourceFunction["PandasObject"][p,"Export"[…]] provide basic plotting and exporting functionality, whereas MatplotlibObject, which works seamlessly with ResourceFunction["PandasObject"], gives more options and controls.

Spanning elements in objects created with ResourceFunction["PandasObject"][…] use one-based indexes, include the end points, and otherwise follow the usual Wolfram Language conventions, as set up by the resource function PythonObject.
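To see what this convention implies, compare it with Python's native slices, which are zero-based and exclude the end point. The helper below is a hypothetical sketch of such a translation for positive bounds only; it is not the actual code used by PythonObject.

```python
def span_to_slice(start: int, stop: int) -> slice:
    """Convert a one-based, end-inclusive span into a Python slice.

    Only positive bounds with start <= stop are handled in this sketch.
    """
    if start < 1 or stop < start:
        raise ValueError("this sketch handles only 1 <= start <= stop")
    # Shift the start down by one; the inclusive end equals the exclusive stop.
    return slice(start - 1, stop)


rows = ["r1", "r2", "r3", "r4", "r5"]
picked = rows[span_to_slice(2, 4)]  # elements 2 through 4, end point included
```

Here the span 2;;4 in Wolfram Language terms corresponds to the Python slice 1:4.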

Examples

Basic Examples (2)

Create a dataset of the purchases of some individuals:

In[1]:=

Out[1]=

In[2]:=

In[3]:=

Out[3]=

In[4]:=

Out[4]=

Transfer the dataset from Python:

In[5]:=

Out[5]=

Attach customer names to the purchases, using the names as "index" (labels of the rows):

In[6]:=

In[7]:=

Out[7]=

Get several rows from the beginning and the end of the dataset:

In[8]:=

Out[8]=

In[9]:=

Out[9]=

Generate descriptive statistics of the purchases:

In[10]:=

Out[10]=

Create and display a plot of the purchases:

In[11]:=

Out[11]=

In[12]:=

Out[12]=

Histograms of the purchases:

In[13]:=

Out[13]=

In[14]:=

Out[14]=

Close the Python session to clean up:

In[15]:=

Load the Titanic disaster data:

In[16]:=

Out[16]=

In[17]:=

(titanic = pd["ReadCSV"[
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"]]) // Normal

Out[17]=

The descriptive statistics of the dataset:

In[18]:=

Out[18]=

The average fare paid:

In[19]:=

Out[19]=

The median age and fare of the passengers:

In[20]:=

Out[20]=

Descriptive statistics of the specified columns:

In[21]:=

Out[21]=

Use specific aggregating statistics for given columns instead of the predefined statistics:

In[22]:=

Out[22]=

Find the number of male and female passengers:

In[23]:=

Out[23]=

The average age for male versus female passengers:

In[24]:=

Out[24]=

Count the number of unique values in all the columns:

In[25]:=

Out[25]=

Or in a specific column:

In[26]:=

Out[26]=

Pick passengers who embarked at the port "C" (Cherbourg):

In[27]:=

Out[27]=

Female passengers whose fare was less than $10:

In[28]:=

Out[28]=

In[29]:=

Out[29]=

In[30]:=

Out[30]=

In[31]:=

Out[31]=

Find the correlation between sex and survival by replacing string values in the "Sex" column with numbers using the String method "GetDummies" ("women first"):

In[32]:=

Out[32]=

In[33]:=

Out[33]=

Summarize the survival rate by sex and cabin class using a spreadsheet-like pivot table (better survival rate for women and in the first class):

In[34]:=

Out[34]=
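For comparison, the same kind of pivot table on the Python side might look as follows. This is a sketch on a few made-up passenger records, not on the Titanic data loaded above; the column names mirror that dataset.

```python
import pandas as pd

# Made-up passenger records (illustration only).
people = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 1, 3],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of the 0/1 "Survived" flag in each (Sex, Pclass) cell is the survival rate.
rates = people.pivot_table(values="Survived", index="Sex",
                           columns="Pclass", aggfunc="mean")
```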

Find the missing values in the "Age" column:

In[35]:=

Out[35]=

In[36]:=

Out[36]=

Statistics are calculated despite missing values:

In[37]:=

Out[37]=

Drop the first few rows by giving the index values as labels and specifying "Axis"→0 for a row-wise operation:

In[38]:=

Out[38]=

Prepare and show a histogram of age values:

In[39]:=

Out[39]=

In[40]:=

Out[40]=

Drop the specified column:

In[41]:=

Out[41]=

Create a copy of the object:

In[42]:=

Out[42]=

Drop the "Sex" column in the object in-place:

In[43]:=

In[44]:=

Out[44]=

Sort the data by values of the specified column:

In[45]:=

Out[45]=

Prepare and show stacked filled plots of the numeric columns:

In[46]:=

Out[46]=

In[47]:=

Out[47]=

Clean up the Python session:

In[48]:=

Scope (125)

Object creation (2)

In[49]:=

Out[49]=

Create a Series object from a list of values, letting pandas use a default integer index:

In[50]:=

Out[50]=

In[51]:=

Out[51]=

In[52]:=

Create a new pandas object:

In[53]:=

Out[53]=

Create a date-time index array of six values by specifying the starting date in the form "yyyy-mm-dd":

In[54]:=

Out[54]=

In[55]:=

Out[55]=

Alternatively, specify the starting date without dashes:

In[56]:=

Out[56]=

Construct a "DataFrame" object from the array of dates with labeled columns:

In[57]:=

Out[57]=

In[58]:=

Out[58]=

Create a "DataFrame" from an association of objects that can be converted into a "Series"-like structure:

In[59]:=

pd["DataFrame"[<|
"A" -> 1.0,
"B" -> pd["Timestamp"["20130102"]],
"C" -> pd["Series"[1, "index" -> Range[0, 3], "dtype" -> "float32"]],
"D" -> ConstantArray[0, 4],
"E" -> pd["Categorical"[{"test", "train", "test", "train"}]],
"F" -> "foo"|>]]

Out[59]=
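The corresponding construction in plain Python is sketched below, assuming the association maps directly to a Python dict. Scalars are broadcast to the length of the longest column.

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                    # scalar, repeated in every row
    "B": pd.Timestamp("20130102"),               # scalar timestamp
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": [0, 0, 0, 0],                           # plain list of values
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

column_types = df2.dtypes  # each column keeps its own data type
```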

In[60]:=

Out[60]=

In[61]:=

Viewing data (16)

In[62]:=

Out[62]=

Create a "DataFrame" object:

In[63]:=

df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]

Out[63]=

Get the preset number of top and bottom rows of the data frame:

In[64]:=

Out[64]=

In[65]:=

Out[65]=

Get the specified number of top and bottom rows:

In[66]:=

Out[66]=

In[67]:=

Out[67]=

Get the index:

In[68]:=

Out[68]=

In[69]:=

Out[69]=

Alternatively:

In[70]:=

Out[70]=

Get column names:

In[71]:=

Out[71]=

In[72]:=

Out[72]=

Alternatively:

In[73]:=

Out[73]=

Get a quick statistical summary of your data:

In[74]:=

Out[74]=

Sort by index values in descending order:

In[75]:=

Out[75]=

In[76]:=

Out[76]=

Sort by column labels in descending order:

In[77]:=

Out[77]=

Sort by values in a column:

In[78]:=

Out[78]=

Transpose the data frame:

In[79]:=

Out[79]=

Get the dataset of the transposed values:

In[80]:=

Out[80]=

In[81]:=

Out[81]=

Convert a "DataFrame" object to a NumPy array:

In[82]:=

Out[82]=

Get the values in your Wolfram Language session:

In[83]:=

Out[83]=

Compare with the original dataset:

In[84]:=

Out[84]=

In[85]:=

Selection (26)

Getting DataFrame Parts (9)

Create a new pandas object:

In[86]:=

Out[86]=

In[87]:=

Out[87]=

Select a single column, treating a column label as a property. Note that the returned object is a "Series" and not a "DataFrame":

In[88]:=

Out[88]=

Import the "Series" object as a TimeSeries:

In[89]:=

Out[89]=

Plot the time series:

In[90]:=

Out[90]=

Select several columns using the "Part" syntax:

In[91]:=

Out[91]=

Select several rows using the Span syntax:

In[92]:=

Out[92]=

Use a Python object "Slice" to span rows by index values (rather than integer indices):

In[93]:=

Out[93]=

In[94]:=

Out[94]=

In[95]:=

Out[95]=

Alternatively, use the Python syntax to create a "Slice" object:

In[96]:=

Out[96]=

In[97]:=

Out[97]=

Delete rows and add a column by re-indexing:

In[98]:=

Out[98]=

In[99]:=

Selecting by label (8)

Create a data frame object:

In[100]:=

Out[100]=

In[101]:=

Out[101]=

In[102]:=

Out[102]=

Use the "ByLabel" property of the "DataFrame" object to access values by label:

In[103]:=

Out[103]=

Get a cross section of the data frame using an index label:

In[104]:=

Out[104]=

In[105]:=

Out[105]=

Alternatively, use the index value:

In[106]:=

Out[106]=

Get a part corresponding to all rows of the specified columns (or, in the pandas parlance, select on a multi-axis by label):

In[107]:=

Out[107]=

Get parts of a single row in the form of a "Series" object, rather than a data frame:

In[108]:=

Out[108]=

In[109]:=

Out[109]=

Get a scalar value:

In[110]:=

Out[110]=

Equivalently, get fast access to a scalar using the "AtLabel" property of the "DataFrame" object:

In[111]:=

Out[111]=

In[112]:=

Out[112]=

In[113]:=

Selecting by position (9)

Create a data frame object:

In[114]:=

Out[114]=

In[115]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal

Out[115]=

Use the "ByPosition" property of the "DataFrame" object to access values by position:

In[116]:=

Out[116]=

In[117]:=

Out[117]=

Use the Span syntax to access elements:

In[118]:=

Out[118]=

Select by lists of integer positions:

In[119]:=

Out[119]=

Slice rows explicitly:

In[120]:=

Out[120]=

Alternatively:

In[121]:=

Out[121]=

Slice columns explicitly:

In[122]:=

Out[122]=

Get a scalar value explicitly:

In[123]:=

Out[123]=

Equivalently:

In[124]:=

Out[124]=

In[125]:=

Out[125]=

In[126]:=

Hierarchical Indexing (4)

Create a "DataFrame" object for an array with more than two dimensions:

In[127]:=

Out[127]=

In[128]:=

df = pd["DataFrame"[<|"vals" -> Range[4]|>, "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}]]

Out[128]=

The imported "DataFrame" object is represented as an Association with keys in the form of a list:

In[129]:=

Out[129]=

In[130]:=

Out[130]=

In Python, the index is represented as a "MultiIndex" object:

In[131]:=

Out[131]=

Create a "MultiIndex" directly:

In[132]:=

Out[132]=

In[133]:=

Out[133]=

In[134]:=

Out[134]=

In[135]:=

Boolean indexing (2)

In[136]:=

Out[136]=

In[137]:=

Out[137]=

Create an array of Boolean values for which values in the column "A" are positive:

In[138]:=

Out[138]=

In[139]:=

Out[139]=

Pick rows for which the condition holds:

In[140]:=

Out[140]=

A Boolean array for positive "DataFrame" values:

In[141]:=

Out[141]=

Select values from a "DataFrame" where a Boolean condition is met:

In[142]:=

Out[142]=

In[143]:=

Create a data frame:

In[144]:=

Out[144]=

In[145]:=

(df = pd[
"DataFrame"[
Join[RandomReal[{-1, 1}, {6, 4}], List /@ {"one", "one", "two", "three", "four", "three"}, 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "E"]]]) // Normal

Out[145]=

Use the "IsIn" method for filtering:

In[146]:=

Out[146]=

In[147]:=

Setting Values (2)

Create a "DataFrame":

In[148]:=

Out[148]=

In[149]:=

Out[149]=

Create a new "Series" object that is longer than the "DataFrame" and starts with a time offset:

In[150]:=

In[151]:=

Out[151]=

In[152]:=

Out[152]=

Add the object as a new column to the data frame, automatically aligning the data by the indexes:

In[153]:=

Out[153]=

In[154]:=

Out[154]=
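Index alignment on assignment can be sketched in plain Python as follows. The values and the column name "F" are made up; only the alignment behavior matters.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]}, index=dates)

# A longer series whose index starts one day after the frame's index.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Assignment aligns on the index: frame rows without a match get NaN,
# and series entries outside the frame's index are dropped.
df["F"] = s1
```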

Set values by label:

In[155]:=

Out[155]=

In[156]:=

Out[156]=

In[157]:=

Out[157]=

Set values by position:

In[158]:=

Out[158]=

In[159]:=

Out[159]=

Set values with an array:

In[160]:=

Out[160]=

In[161]:=

Out[161]=

In[162]:=

Out[162]=

Create a NumPy array:

In[163]:=

Out[163]=

In[164]:=

Out[164]=

Assign a column to a NumPy array by position:

In[165]:=

Out[165]=

In[166]:=

Out[166]=

In[167]:=

Create a "DataFrame":

In[168]:=

Out[168]=

In[169]:=

Out[169]=

A Boolean array of values where a condition is met:

In[170]:=

Out[170]=

Replace values where the condition is not met:

In[171]:=

Out[171]=

Negate positive values using the where operation with setting:

In[172]:=

Out[172]=

In[173]:=

Out[173]=
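A compact Python-side sketch of the where operations above, on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

mask = df > 0                 # Boolean frame: True where the value is positive
kept = df.where(mask)         # entries failing the condition become NaN
filled = df.where(mask, 0.0)  # or are replaced by the given value

negated = df.copy()
negated[negated > 0] = -negated  # "where" with setting: flip the sign of positive entries
```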

In[174]:=

Missing data (5)

Create a "DataFrame" with missing values:

In[175]:=

Out[175]=

In[176]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {4, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 4]], "Columns" -> CharacterRange["A", "D"]]]) // Normal

Out[176]=

In[177]:=

Out[177]=

Assign some values in the "DataFrame":

In[178]:=

Out[178]=

In[179]:=

Out[179]=

Drop rows with missing data:

In[180]:=

Out[180]=

Fill missing data:

In[181]:=

Out[181]=

Get the Boolean mask where values are missing:

In[182]:=

Out[182]=
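The same three operations in plain Python, sketched on a small made-up frame. The fill value 5.0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

complete_rows = df.dropna(how="any")  # keep only rows without missing values
filled = df.fillna(value=5.0)         # replace missing values with a constant
mask = df.isna()                      # Boolean mask, True where a value is missing
col_means = df.mean()                 # missing values are skipped by default
```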

In[183]:=

Binary Operations (16)

Functions vs. Operators (10)

Define a "DataFrame" object:

In[184]:=

Out[184]=

In[185]:=

Out[185]=

Add a scalar using an arithmetic operator version:

In[186]:=

Out[186]=

Equivalently, use a function version:

In[187]:=

Out[187]=

Subtract a list:

In[188]:=

Out[188]=

Subtract a "Series" object:

In[189]:=

Out[189]=

In[190]:=

Out[190]=

Multiply a dictionary by axis:

In[191]:=

Out[191]=

In[192]:=

Out[192]=

Multiply a "DataFrame" of different shape using the operator version:

In[193]:=

Out[193]=

In[194]:=

Out[194]=

Use the function version to fill in missing values:

In[195]:=

Out[195]=
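The difference between the operator and the function version can be sketched in Python as follows, using a reduced made-up frame. With the operator, entries lacking a partner become missing; the function version can substitute a fill value first.

```python
import pandas as pd

labels = ["circle", "triangle", "rectangle"]
df = pd.DataFrame({"angles": [0, 3, 4], "degrees": [360, 180, 360]}, index=labels)
other = pd.DataFrame({"angles": [0, 3, 4]}, index=labels)  # has no "degrees" column

by_operator = df * other                   # "degrees" has no partner and becomes NaN
by_function = df.mul(other, fill_value=0)  # a missing partner is treated as 0
```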

Create a "DataFrame" with a hierarchical "MultiIndex":

In[196]:=

(dfm = pd[
"DataFrame"[<|"angles" -> {0, 3, 4, 4, 5, 6}, "degrees" -> {360, 180, 360, 360, 540, 720}|>, "Index" -> {{"A", "A", "A", "B", "B", "B"}, {"circle", "triangle", "rectangle", "square", "pentagon", "hexagon"}}]]) // Normal

Out[196]=

Divide by the hierarchical "DataFrame" specifying the level:

In[197]:=

Out[197]=

In[198]:=

Matching / Broadcasting Behavior (6)

Create a data frame:

In[199]:=

Out[199]=

In[200]:=

(df = pd["DataFrame"[<|
"one" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"a", "b", "c"}]], "two" -> pd["Series"[RandomReal[{-1, 1}, {4}], "Index" -> {"a", "b", "c", "d"}]],
"three" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"b", "c", "d"}]]
|>
]]) // Normal

Out[200]=

Select a row and a column:

In[201]:=

Out[201]=

In[202]:=

Out[202]=

Use the "Axis" option to match on the index or columns:

In[203]:=

Out[203]=

In[204]:=

Out[204]=

Alternatively:

In[205]:=

Out[205]=

In[206]:=

Out[206]=

Use the built-in Python divmod function with a "Series" object to perform floor division and the modulo operation at the same time:

In[207]:=

Out[207]=

In[208]:=

Out[208]=

In[209]:=

Out[209]=

Do elementwise divmod:

In[210]:=

Out[210]=
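On the Python side, both forms of divmod look like this. The series and the divisors are chosen for illustration.

```python
import pandas as pd

s = pd.Series(range(10))

# Python's built-in divmod dispatches to the Series and returns two Series.
quotient, remainder = divmod(s, 3)

# Elementwise form: each element is divided by its own divisor.
q2, r2 = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
```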

In[211]:=

Stats (4)

Create a data frame:

In[212]:=

Out[212]=

In[213]:=

(df = pd[
"DataFrame"[
Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal

Out[213]=

Perform descriptive statistics:

In[214]:=

Out[214]=

Same operation on the other axis:

In[215]:=

Out[215]=

Operate on objects that have different dimensionality with alignment:

In[216]:=

Out[216]=

In[217]:=

Out[217]=

In[218]:=

Applying Functions (4)

Create a data frame:

In[219]:=

Out[219]=

In[220]:=

Out[220]=

Import a NumPy function:

In[221]:=

Out[221]=

Apply the function:

In[222]:=

Out[222]=

Create and apply a lambda function:

In[223]:=

Out[223]=

In[224]:=

Out[224]=

In[225]:=

Histogramming (2)

Create a "Series" object:

In[226]:=

Out[226]=

In[227]:=

Out[227]=

Count unique values in the series:

In[228]:=

Out[228]=

In[229]:=

String Methods (2)

Create a "Series" object:

In[230]:=

Out[230]=

In[231]:=

Out[231]=

Use the "String" attribute to operate on each element of the series:

In[232]:=

Out[232]=

In[233]:=

Out[233]=

In[234]:=

Merging (7)

Concatenating (4)

Create a data frame:

In[235]:=

Out[235]=

In[236]:=

In[237]:=

Out[237]=

Break it into pieces:

In[238]:=

Out[238]=

Concatenate the pieces:

In[239]:=

Out[239]=

The concatenated object is the same as the original:

In[240]:=

Out[240]=

In[241]:=

Database-Style Joining (3)

Create two "DataFrame" objects:

In[242]:=

Out[242]=

In[243]:=

Out[243]=

In[244]:=

Out[244]=

Merge the objects in SQL style:

In[245]:=

Out[245]=

Alternatively, with different keys:

In[246]:=

Out[246]=

In[247]:=

Out[247]=

In[248]:=

Out[248]=
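A minimal Python-side sketch of an SQL-style merge. The frames and the column names key, lval and rval are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# Inner join on the shared "key" column, as in SQL's JOIN ... USING (key).
merged = pd.merge(left, right, on="key")
```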

In[249]:=

Grouping (3)

Create a "DataFrame" object:

In[250]:=

Out[250]=

In[251]:=

(df = pd[
"DataFrame"[<|
"A" -> {"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"},
"B" -> {"one", "one", "two", "three", "two", "two", "one", "three"},
"C" -> RandomInteger[10, {8}],
"D" -> RandomInteger[10, {8}]
|>]]) // Normal

Out[251]=

Group by values of the column "A", sum values in the groups and combine the results in a new "DataFrame" object:

In[252]:=

Out[252]=

In[253]:=

Out[253]=

Group by multiple columns forming a hierarchical index and apply the summing function to each group:

In[254]:=

Out[254]=
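The two groupings sketched in plain Python. Fixed integers replace the random values so that the sums can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [1, 2, 3, 4, 5, 6, 7, 8],  # fixed values instead of random integers
    "D": [8, 7, 6, 5, 4, 3, 2, 1],
})

by_a = df.groupby("A")[["C", "D"]].sum()   # one row per value of "A"
by_a_and_b = df.groupby(["A", "B"]).sum()  # rows indexed by a two-level MultiIndex
```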

In[255]:=

Reshaping (5)

Pivoting (2)

Create a "DataFrame" object:

In[256]:=

Out[256]=

In[257]:=

(df = pd[
"DataFrame"[<|"foo" -> {"one", "one", "one", "two", "two", "two"},
"bar" -> {"A", "B", "C", "A", "B", "C"}, "baz" -> Range[6], "zoo" -> {"x", "y", "z", "q", "w", "t"}|>]]) // Normal

Out[257]=

Organize the object by index and column values:

In[258]:=

Out[258]=
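In plain Python, pivoting the same data could be sketched as follows. The choice of "foo" as index, "bar" as columns and "baz" as values is an assumption about the cell above.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": ["one", "one", "one", "two", "two", "two"],
    "bar": ["A", "B", "C", "A", "B", "C"],
    "baz": [1, 2, 3, 4, 5, 6],
    "zoo": ["x", "y", "z", "q", "w", "t"],
})

# Rows come from "foo", columns from "bar", and cell values from "baz".
wide = df.pivot(index="foo", columns="bar", values="baz")
```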

In[259]:=

Stacking (3)

Create a "DataFrame" object with a hierarchical index:

In[260]:=

Out[260]=

In[261]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {4, 2}], "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}, "Columns" -> {"A", "B"}]]) // Normal

Out[261]=

Stack the object by "compressing" a level in the columns:

In[262]:=

Out[262]=

Reverse the operation by "unstacking" the last level:

In[263]:=

Out[263]=

In[264]:=

Out[264]=

In[265]:=

Out[265]=
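Stacking and unstacking on the Python side, sketched with fixed values. The level names first and second are assumptions added for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],  # hypothetical level names
)
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=index, columns=["A", "B"])

stacked = df.stack()            # column labels become the innermost index level
restored = stacked.unstack()    # the last level moves back to the columns
by_second = stacked.unstack(1)  # move the "second" level to the columns instead
```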

In[266]:=

Time Series Resampling (3)

Create a series with 9 one-second timestamps:

In[267]:=

Out[267]=

In[268]:=

Out[268]=

In[269]:=

Out[269]=

Downsample the series into 3-second bins and sum the values falling into each bin:

In[270]:=

Out[270]=

In[271]:=

Out[271]=

Check the sums:

In[272]:=

Out[272]=
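The downsampling step in plain Python, as a sketch. The start date and the values 0 to 8 are made up; the bin width is given as a Timedelta rather than a frequency string.

```python
import pandas as pd

# Nine timestamps, one second apart.
index = pd.date_range("2000-01-01", periods=9, freq=pd.Timedelta(seconds=1))
series = pd.Series(range(9), index=index)

# Group the values into 3-second bins and sum within each bin.
binned = series.resample(pd.Timedelta(seconds=3)).sum()
```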

In[273]:=

Create a time series object with dates given in the local time zone:

In[274]:=

Out[274]=

In[275]:=

Out[275]=

In[276]:=

Out[276]=

Check the dates:

In[277]:=

Out[277]=

Localize the series to the UTC time zone and check the dates:

In[278]:=

Out[278]=

In[279]:=

Out[279]=

Convert the series to another time zone:

In[280]:=

Out[280]=

In[281]:=

Out[281]=

In[282]:=

Create a series with quarterly frequency for a year, ending in November:

In[283]:=

Out[283]=

In[284]:=

Out[284]=

In[285]:=

Out[285]=

Check the start dates of a few periods in the series:

In[286]:=

Out[286]=

Convert the series to 9 AM of the end of the month following the quarter end and check starting dates again:

In[287]:=

s["Assign"[
"index" -> (periods[
"AsFrequency"["Frequency" -> "M", "how" -> "e"]] + 1)[
"AsFrequency"["Frequency" -> "H", "how" -> "s"]] + 9]]

Out[287]=

In[288]:=

Out[288]=

In[289]:=

Categoricals (8)

Create a "DataFrame" with a column whose values are taken from a limited alphabet:

In[290]:=

Out[290]=

In[291]:=

(df = pd["DataFrame"[<|"id" -> {1, 2, 3, 4, 5, 6}, "raw_grade" -> {"a", "b", "b", "a", "a", "e"}|>]]) // Normal

Out[291]=

Convert the raw grades to a categorical data type:

In[292]:=

Out[292]=

In[293]:=

Out[293]=

The current categories:

In[294]:=

Out[294]=

Rename the categories to more meaningful names in place:

In[295]:=

Out[295]=

In[296]:=

Out[296]=

Reorder the categories and simultaneously add the missing categories:

In[297]:=

Out[297]=

In[298]:=

Out[298]=

In[299]:=

Out[299]=

Sort by order in the categories:

In[300]:=

Out[300]=

Sort by values in the "raw_grade" column (in lexicographic order):

In[301]:=

Out[301]=

Group by a categorical column, showing empty categories:

In[302]:=

Out[302]=
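The categorical workflow above can be sketched in plain Python as follows. The category names are assumptions borrowed from the pandas documentation, and renaming is done by reassigning the column rather than strictly in place.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

df["grade"] = df["raw_grade"].astype("category")  # categories: a, b, e
# Rename the three existing categories, then reorder them and add the unused ones.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

by_category = df.sort_values(by="grade")             # follows the category order
counts = df.groupby("grade", observed=False).size()  # empty categories are kept
```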

In[303]:=

Plotting (5)

Construct a simple data frame object:

In[304]:=

Out[304]=

In[305]:=

In[306]:=

df = pd["DataFrame"[<|"column1" -> RandomInteger[{0, 20}, n], "column2" -> RandomInteger[{20, 50}, n]|>]]

Out[306]=

Create a plot of column values with labels:

In[307]:=

Out[307]=

Show the plot in the default ("PNG") format:

In[308]:=

Out[308]=

Show the plot as vector graphics:

In[309]:=

Out[309]=

Export the plot to a file from Python:

In[310]:=

Out[310]=

Import the file:

In[311]:=

Out[311]=

Delete the file:

In[312]:=

Clear the plot figure:

In[313]:=

Plot the specified column:

In[314]:=

Out[314]=

In[315]:=

Out[315]=

Plot one column versus another:

In[316]:=

Out[316]=

In[317]:=

Out[317]=

In[318]:=

Out[318]=

In[319]:=

Create a time series:

In[320]:=

Out[320]=

In[321]:=

In[322]:=

Out[322]=

Compute its cumulative sum:

In[323]:=

Out[323]=

Prepare a plot of the time series:

In[324]:=

Out[324]=

Show the plot:

In[325]:=

Out[325]=

In[326]:=

Create a data frame:

In[327]:=

Out[327]=

In[328]:=

Out[328]=

List available plot types:

In[329]:=

Out[329]=

Create a bar plot:

In[330]:=

Out[330]=

In[331]:=

Out[331]=

Alternatively, use the "Plot" method of the "DataFrame" object:

In[332]:=

In[333]:=

Out[333]=

A stacked horizontal plot:

In[334]:=

Out[334]=

In[335]:=

Out[335]=

A box plot:

In[336]:=

Out[336]=

In[337]:=

Out[337]=

Pass keywords supported by the resource function MatplotlibObject "boxplot":

In[338]:=

Out[338]=

In[339]:=

Out[339]=

In[340]:=

Create a data frame with normally-distributed values:

In[341]:=

Out[341]=

In[342]:=

Out[342]=

Create a scatter matrix plot using the "ScatterMatrix" method from pandas.plotting:

In[343]:=

Out[343]=

In[344]:=

Out[344]=

In[345]:=

Create a time series of a cumulative random process:

In[346]:=

Out[346]=

In[347]:=

price = pd[
"Series"[FoldList[Plus, RandomReal[{-1, 1}, {150}]], "Index" -> pd["DateRange"["2000-1-1", "Periods" -> 150, "Frequency" -> "B"]]]]

Out[347]=

Compute a moving average and standard deviation of the process:

In[348]:=

Out[348]=

In[349]:=

Out[349]=
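A Python-side sketch of the moving statistics and the resulting band. The prices, the window length of 3 and the band width of two standard deviations are assumptions chosen for illustration.

```python
import pandas as pd

price = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0],
                  index=pd.date_range("2000-01-03", periods=6, freq="B"))

window = 3  # hypothetical window length
ma = price.rolling(window).mean()   # moving average
mstd = price.rolling(window).std()  # moving (sample) standard deviation

# Band: moving average plus and minus two moving standard deviations.
upper = ma + 2 * mstd
lower = ma - 2 * mstd
```

The first window - 1 entries of both statistics are missing, because a full window is not yet available there.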

Prepare a temporal plot of the prices, the mean values, and the Bollinger band using a MatplotlibObject:

In[350]:=

Out[350]=

In[351]:=

Out[351]=

In[352]:=

Out[352]=

In[353]:=

Out[353]=

Show the plots:

In[354]:=

Out[354]=

In[355]:=

Importing and exporting data (9)

CSV (4)

In[356]:=

Out[356]=

In[357]:=

Out[357]=

Write to a CSV file:

In[358]:=

In[359]:=

Print contents of the file:

In[360]:=

Read the CSV file as a "DataFrame" object:

In[361]:=

Out[361]=

In[362]:=

Out[362]=

Clean up:

In[363]:=

In[364]:=

problematic in 2.2.0:

Excel (5)

Create a new pandas object:

In[365]:=

Out[365]=

In[366]:=

Out[366]=

Write to an Excel file:

In[367]:=

In[368]:=

Check the file:

In[369]:=

Out[369]=

Read the file as "DataFrame":

In[370]:=

Out[370]=

In[371]:=

Out[371]=

Clean up:

In[372]:=

In[373]:=

Applications (5)

Use PandasObject to perform data analysis in Python when importing data to the Wolfram Language is impractical or undesirable. Download a county business patterns file from the US Census database and unzip it to a temporary directory:

In[374]:=

fname = ExtractArchive[
"https://www2.census.gov/programs-surveys/cbp/datasets/2020/cbp20us.zip", $TemporaryDirectory] // First;

Check the timing of creating a dataset in the Wolfram Language:

In[375]:=

Out[375]=

Import the data to a "DataFrame" in Python and check the timing:

In[376]:=

Out[376]=

In[377]:=

Out[377]=

The first few lines of the dataset:

In[378]:=

Out[378]=

Compare to the dataset imported to the Wolfram Language:

In[379]:=

Out[379]=

In[380]:=

Properties and Relations (7)

PandasObject[…] gives the same result as the resource function PythonObject with a special configuration:

In[381]:=

In[382]:=

Out[382]=

In[383]:=

Out[383]=

In[384]:=

Get information on a pandas object:

In[385]:=

Out[385]=

In[386]:=

Out[386]=

Open the user guide in your default web browser:

In[387]:=

In[388]:=

Some of the functions and classes available in the pandas module:

In[389]:=

Out[389]=

In[390]:=

Out[390]=

In[391]:=

Out[391]=

Information on a class:

In[392]:=

Out[392]=

The web documentation for a class:

In[393]:=

In[394]:=

pandas’s "DataFrame" is analogous to Dataset, but keeps the object on the Python side:

In[395]:=

Out[395]=

In[396]:=

Out[396]=

Print the object in Python:

In[397]:=

Transfer the data from Python to create a Dataset:

In[398]:=

Out[398]=

In[399]:=

Out[399]=

In[400]:=

Many pandas operations are parallel to operations on Dataset:

In[401]:=

Out[401]=

In[402]:=

In[403]:=

df = pd["DataFrame"[<|
"a" -> RandomVariate[NormalDistribution[0, 1], {n}], "b" -> RandomInteger[100, {n}]|>]]

Out[403]=

In[404]:=

Out[404]=

Select rows satisfying a condition:

In[405]:=

Out[405]=

In[406]:=

Out[406]=

Plot histograms of the columns:

In[407]:=

Out[407]=

In[408]:=

Out[408]=

In[409]:=

Out[409]=

In[410]:=

Similarly, pandas’s "Series" object is analogous to TimeSeries:

In[411]:=

Out[411]=

In[412]:=

In[413]:=

Out[413]=

In[414]:=

Out[414]=

Plot the time series in Python:

In[415]:=

Out[415]=

In[416]:=

Out[416]=

Plot the imported time series with DateListPlot:

In[417]:=

Out[417]=

In[418]:=

Create a "DataFrame" and a Boolean mask for positive values:

In[419]:=

Out[419]=

In[420]:=

Out[420]=

In[421]:=

Out[421]=

PythonObject allows you to apply Python commands directly, and bring the results back to the Wolfram Language if necessary:

In[422]:=

Out[422]=

In[423]:=

Out[423]=

Alternatively:

In[424]:=

Out[424]=

In[425]:=

Possible Issues (2)

Create a pandas object:

In[426]:=

Out[426]=

Since NumPy arrays have a single data type for the entire array (dtype), importing a NumPy array to the Wolfram Language may fail if one of the columns cannot be imported directly:

In[427]:=
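The single-dtype constraint can be observed directly in Python. The sketch below uses made-up frames; a date column is only one plausible way to obtain mixed column types.

```python
import pandas as pd

numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
mixed = pd.DataFrame({"A": [1.0, 2.0],
                      "B": pd.to_datetime(["2013-01-02", "2013-01-03"])})

numeric_dtype = numeric.to_numpy().dtype  # one numeric dtype fits every column
mixed_dtype = mixed.to_numpy().dtype      # falls back to the generic object dtype
```

An object array holds arbitrary Python objects, so its elements may need to be converted individually before they can be transferred.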