Function Repository Resource:

PandasObject (1.0.0), current version: 1.0.1

Use the Python package pandas for data science in the Wolfram Language

Contributed by: Igor Bakshee, with examples adapted from the pandas documentation

ResourceFunction["PandasObject"][]

returns a configured PythonObject for the Python package pandas in a new Python session.

ResourceFunction["PandasObject"][session]

uses the specified running ExternalSessionObject session.

ResourceFunction["PandasObject"][…,"func"[args,opts]]

executes the function func with the specified arguments and options.

Details and Options

The Python package pandas is a toolkit for data science. It offers, in particular, data structures and operations for the manipulation and analysis of numerical tables and time series.

ResourceFunction["PandasObject"] sets up a configuration of the resource function PythonObject that makes working with pandas more convenient and returns the resulting Python object.

ResourceFunction["PandasObject"] makes the Python-side functions and variables accessible by new names that are closer to the usual Wolfram Language conventions. For instance, for lookup properties:

Python name    Wolfram Language name    Description
"loc"          "ByLabel"                selection by label
"iloc"         "ByPosition"             selection by integer position
"at"           "AtLabel"                label-based scalar lookup
"iat"          "AtPosition"             integer-based scalar lookup
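
For orientation, the four Python-side lookup properties listed above behave as follows in plain pandas. This is a minimal sketch on made-up data; it shows only the native Python names and does not go through the Wolfram Language wrapper.

```python
import pandas as pd

# Toy data; the labels and values are made up for illustration.
df = pd.DataFrame(
    {"price": [10.0, 12.5, 9.0], "qty": [1, 3, 2]},
    index=["ann", "bob", "cy"],
)

by_label = df.loc["bob"]            # row selected by label (a Series)
by_position = df.iloc[1]            # the same row selected by integer position
scalar_label = df.at["bob", "qty"]  # label-based scalar lookup
scalar_pos = df.iat[1, 1]           # integer-based scalar lookup

print(by_label.equals(by_position), scalar_label, scalar_pos)
```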

For a Python object p, p["ToPythonName","wlname"] gives the native Python name corresponding to the Wolfram Language name wlname and p["FromPythonName","pname"] gives the respective Wolfram Language name for the Python-side name pname. In the object p, both wlname and pname can be used interchangeably.

p["RenamingRules"] gives a list of all renaming rules in the form {"wlname₁"→"pname₁",…}.

p["FullInformation","Functions"] gives a list of the available functions and p["Information","func"] gives the signature of the specified function.

p["WebInformation"] gives a link to the pandas documentation that can be opened with SystemOpen.

Typically, the Wolfram Language signature of a pandas function closely resembles the Python-side signature in which Python-side objects are represented in the form of the resource function PythonObject with possible extensions suitable for the Wolfram Language.

Like the Python pandas library itself, ResourceFunction["PandasObject"] uses the Python package Matplotlib, which is invoked via the resource function MatplotlibObject.

Additional utility functions are available in the form ResourceFunction["PandasObject"][obj,"func"[args,opts]], where obj can be an ExternalSessionObject session or any PythonObject p defined in the session. These include the utility functions of the resource function MatplotlibObject:

"Show"["fmt"]             display the Python graphics
"Export"["file","fmt"]    export the graphics

ResourceFunction["PandasObject"][p,"Show"[…]] and ResourceFunction["PandasObject"][p,"Export"[…]] provide basic plotting and exporting functionality, whereas MatplotlibObject, which works seamlessly with ResourceFunction["PandasObject"], gives more options and controls.

Spanning elements in objects created with ResourceFunction["PandasObject"][…] use one-based indexes, include the end points, and otherwise follow the usual Wolfram Language conventions, as set up by the resource function PythonObject.
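
For comparison, native pandas uses zero-based positions and excludes the end point in positional slices, while label-based slices include both end points. The sketch below shows only this Python-side behavior on made-up data; how a Wolfram Language span is translated is handled by PythonObject and is not reproduced here.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# Positional slice: zero-based, end point excluded -> first three elements.
first_three = s.iloc[0:3]

# Label slice: both end points included.
a_to_c = s.loc["a":"c"]

print(list(first_three), list(a_to_c))
```

Under the convention described above, the same three elements would presumably be addressed on the Wolfram Language side with a one-based, inclusive span such as 1;;3.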

Examples

Basic Examples (2)

Create a dataset of the purchases of some individuals:

In[1]:=

Out[1]=

In[2]:=

In[3]:=

Out[3]=

In[4]:=

Out[4]=

Transfer the dataset from Python:

In[5]:=

Out[5]=

Attach customer names to the purchases, using the names as "index" (labels of the rows):

In[6]:=

In[7]:=

Out[7]=

Get several rows from the beginning and the end of the dataset:

In[8]:=

Out[8]=

In[9]:=

Out[9]=

Generate descriptive statistics of the purchases:

In[10]:=

Out[10]=

Create and display a plot of the purchases:

In[11]:=

Out[11]=

In[12]:=

Out[12]=

Histograms of the purchases:

In[13]:=

Out[13]=

In[14]:=

Out[14]=

Close the Python session to clean up:

In[15]:=

Load the Titanic disaster data:

In[16]:=

Out[16]=

In[17]:=

(titanic = pd["ReadCSV"[
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"]]) // Normal

Out[17]=

The descriptive statistics of the dataset:

In[18]:=

Out[18]=

Correlation between columns:

In[19]:=

Out[19]=

The average fare paid:

In[20]:=

Out[20]=

The median age and fare of the passengers:

In[21]:=

Out[21]=

Descriptive statistics of the specified columns:

In[22]:=

Out[22]=

Use the specific aggregating statistics for given columns instead of the predefined statistics:

In[23]:=

Out[23]=

Find the number of male and female passengers:

In[24]:=

Out[24]=

The average age for male versus female passengers:

In[25]:=

Out[25]=

Count the number of unique values in all the columns:

In[26]:=

Out[26]=

Or in a specific column:

In[27]:=

Out[27]=

Pick passengers who embarked at the port "C" (Cherbourg):

In[28]:=

Out[28]=

Female passengers whose fare was less than $10:

In[29]:=

Out[29]=

In[30]:=

Out[30]=

In[31]:=

Out[31]=

In[32]:=

Out[32]=

Find the correlation between sex and survival ("women first") by replacing the string values in the "Sex" column with numbers using the String method "GetDummies":

In[33]:=

Out[33]=

In[34]:=

Out[34]=

Summarize the survival rate by sex and cabin class using a spreadsheet-like pivot table (better survival rate for women and in the first class):

In[35]:=

Out[35]=

Find the missing values in the "Age" column:

In[36]:=

Out[36]=

In[37]:=

Out[37]=

Statistics are calculated despite missing values:

In[38]:=

Out[38]=

Drop the first few rows by giving the index values as labels and specifying "Axis"→0 for a row-wise operation:

In[39]:=

Out[39]=

Prepare and show a histogram of age values:

In[40]:=

Out[40]=

In[41]:=

Out[41]=

Drop the specified column:

In[42]:=

Out[42]=

Create a copy of the object:

In[43]:=

Out[43]=

Drop the "Sex" column in the object in-place:

In[44]:=

In[45]:=

Out[45]=

Sort the data by values of the specified column:

In[46]:=

Out[46]=

Prepare and show stacked filled plots of the numeric columns:

In[47]:=

Out[47]=

In[48]:=

Out[48]=

Clean up the Python session:

In[49]:=

Scope (132)

Object creation (2)

In[50]:=

Out[50]=

Create a Series object from a list of values, letting pandas use a default integer index:

In[51]:=

Out[51]=

In[52]:=

Out[52]=

In[53]:=

Create a new pandas object:

In[54]:=

Out[54]=

Create a date-time index array of six values by specifying the starting date in the form "yyyy-mm-dd":

In[55]:=

Out[55]=

In[56]:=

Out[56]=

Alternatively, specify the starting date without dashes:

In[57]:=

Out[57]=

Construct a "DataFrame" object from the array of dates with labeled columns:

In[58]:=

Out[58]=

In[59]:=

Out[59]=

Create a "DataFrame" from an association of objects that can be converted into a "Series"-like structure:

In[60]:=

pd["DataFrame"[<|
"A" -> 1.0,
"B" -> pd["Timestamp"["20130102"]],
"C" -> pd["Series"[1, "index" -> Range[0, 3], "dtype" -> "float32"]],
"D" -> ConstantArray[0, 4],
"E" -> pd["Categorical"[{"test", "train", "test", "train"}]],
"F" -> "foo"|>]]

Out[60]=
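
The association above appears to mirror the dictionary form of the pandas DataFrame constructor. A rough Python-side equivalent, given as a sketch rather than as the exact call the wrapper makes, is:

```python
import pandas as pd

# Scalars ("A", "B", "F") are broadcast to the length set by the other columns.
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": [0, 0, 0, 0],
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df2.dtypes)
```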

In[61]:=

Out[61]=

In[62]:=

Viewing data (16)

In[63]:=

Out[63]=

Create a "DataFrame" object:

In[64]:=

df = pd["DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]

Out[64]=

Get the preset number of top and bottom rows of the data frame:

In[65]:=

Out[65]=

In[66]:=

Out[66]=

Get the specified number of top and bottom rows:

In[67]:=

Out[67]=

In[68]:=

Out[68]=
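
In plain pandas, the data frame creation and the top and bottom row queries above can be sketched as follows. The random generator and its seed are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)   # six consecutive days
rng = np.random.default_rng(0)                 # seed chosen arbitrarily
df = pd.DataFrame(
    rng.uniform(-1, 1, size=(6, 4)), index=dates, columns=list("ABCD")
)

top = df.head()       # preset number of rows (5 by default)
bottom = df.tail(3)   # explicitly requested number of rows
print(top.shape, bottom.shape)
```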

Get the index:

In[69]:=

Out[69]=

In[70]:=

Out[70]=

Alternatively:

In[71]:=

Out[71]=

Get column names:

In[72]:=

Out[72]=

In[73]:=

Out[73]=

Alternatively:

In[74]:=

Out[74]=

Get a quick statistical summary of your data:

In[75]:=

Out[75]=

Sort by index values in descending order:

In[76]:=

Out[76]=

In[77]:=

Out[77]=

Sort by column labels in descending order:

In[78]:=

Out[78]=

Sort by values in a column:

In[79]:=

Out[79]=

Transpose the data frame:

In[80]:=

Out[80]=

Get the dataset of the transposed values:

In[81]:=

Out[81]=

In[82]:=

Out[82]=

Convert a "DataFrame" object to a NumPy array:

In[83]:=

Out[83]=

Get the values in your Wolfram Language session:

In[84]:=

Out[84]=

Compare with the original dataset:

In[85]:=

Out[85]=

In[86]:=

Selection (26)

Getting DataFrame Parts (9)

Create a new pandas object:

In[87]:=

Out[87]=

In[88]:=

Out[88]=

Select a single column, treating a column label as a property. Note that the returned object is a "Series" and not a "DataFrame":

In[89]:=

Out[89]=

Import the "Series" object as a TimeSeries:

In[90]:=

Out[90]=

Plot the time series:

In[91]:=

Out[91]=

Select several columns using the "Part" syntax:

In[92]:=

Out[92]=

Select several rows using the Span syntax:

In[93]:=

Out[93]=

Use a Python object "Slice" to span rows by index values (rather than integer indices):

In[94]:=

Out[94]=

In[95]:=

Out[95]=

In[96]:=

Out[96]=
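
On the Python side, spanning rows by index values corresponds to label-based slicing, which includes the end label. A minimal sketch on made-up data, using the native names loc and iloc and Python's built-in slice:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df.iloc[0:3]                        # rows 0, 1, 2 (end excluded)
by_label = df.loc["20130102":"20130104"]          # Jan 2 through Jan 4 (end included)
explicit = df.loc[slice("20130102", "20130104")]  # same, with an explicit slice object
print(len(by_position), len(by_label))
```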

Alternatively, use the Python syntax to create a "Slice" object:

In[97]:=

Out[97]=

In[98]:=

Out[98]=

Delete rows and add a column by re-indexing:

In[99]:=

Out[99]=

In[100]:=

Selecting by label (8)

Create a data frame object:

In[101]:=

Out[101]=

In[102]:=

Out[102]=

In[103]:=

Out[103]=

Use the "ByLabel" property of the "DataFrame" object to access values by label:

In[104]:=

Out[104]=

Get a cross section of the data frame using an index label:

In[105]:=

Out[105]=

In[106]:=

Out[106]=

Alternatively, use the index value:

In[107]:=

Out[107]=

Get a part corresponding to all rows of the specified columns (or, in the pandas parlance, select on a multi-axis by label):

In[108]:=

Out[108]=

Get parts of a single row in the form of a "Series" object, rather than a data frame:

In[109]:=

Out[109]=

In[110]:=

Out[110]=

Get a scalar value:

In[111]:=

Out[111]=

Equivalently, get fast access to a scalar using the "AtLabel" property of the "DataFrame" object:

In[112]:=

Out[112]=

In[113]:=

Out[113]=

In[114]:=

Selecting by position (9)

Create a data frame object:

In[115]:=

Out[115]=

In[116]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {6, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "D"]]]) // Normal

Out[116]=

Use the "ByPosition" property of the "DataFrame" object to access values by position:

In[117]:=

Out[117]=

In[118]:=

Out[118]=

Use the Span syntax to access elements:

In[119]:=

Out[119]=

Select by lists of integer positions:

In[120]:=

Out[120]=

Slice rows explicitly:

In[121]:=

Out[121]=

Alternatively:

In[122]:=

Out[122]=

Slice columns explicitly:

In[123]:=

Out[123]=

Get a scalar value explicitly:

In[124]:=

Out[124]=

Equivalently:

In[125]:=

Out[125]=

In[126]:=

Out[126]=

In[127]:=

Hierarchical Indexing (4)

Create a "DataFrame" object for an array with more than two dimensions:

In[128]:=

Out[128]=

In[129]:=

df = pd["DataFrame"[<|"vals" -> Range[4]|>, "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}]]

Out[129]=

The imported "DataFrame" object is represented as an Association with keys in the form of a list:

In[130]:=

Out[130]=

In[131]:=

Out[131]=

In Python, the index is represented as a "MultiIndex" object:

In[132]:=

Out[132]=
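
The same construction in plain pandas, as a sketch: passing a list of label lists as the index yields a MultiIndex, and a row is addressed by a tuple of labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"vals": [1, 2, 3, 4]},
    index=[["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
)

index_type = type(df.index).__name__       # name of the index class
value = df.loc[("baz", "one"), "vals"]     # tuple key selects one row
print(index_type, value)
```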

Create a "MultiIndex" directly:

In[133]:=

Out[133]=

In[134]:=

Out[134]=

In[135]:=

Out[135]=

In[136]:=

Boolean indexing (2)

In[137]:=

Out[137]=

In[138]:=

Out[138]=

Create an array of Boolean values for which values in the column "A" are positive:

In[139]:=

Out[139]=

In[140]:=

Out[140]=

Pick rows for which the condition holds:

In[141]:=

Out[141]=

A Boolean array for positive "DataFrame" values:

In[142]:=

Out[142]=

Select values from a "DataFrame" where a Boolean condition is met:

In[143]:=

Out[143]=

In[144]:=

Create a data frame:

In[145]:=

Out[145]=

In[146]:=

(df = pd[
"DataFrame"[
Join[RandomReal[{-1, 1}, {6, 4}], List /@ {"one", "one", "two", "three", "four", "three"}, 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "E"]]]) // Normal

Out[146]=

Use the "IsIn" method for filtering:

In[147]:=

Out[147]=
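
Boolean indexing and "IsIn" filtering correspond to comparison masks and the pandas isin method. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0.5, -1.2, 0.3, -0.7], "E": ["one", "two", "three", "four"]}
)

mask = df["A"] > 0                           # Boolean Series, one entry per row
positive_rows = df[mask]                     # keep rows where the mask is True
picked = df[df["E"].isin(["two", "four"])]   # keep rows whose "E" is in the list
print(list(positive_rows.index), list(picked["E"]))
```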

In[148]:=

Setting Values (2)

Create a "DataFrame":

In[149]:=

Out[149]=

In[150]:=

Out[150]=

Create a new "Series" object that is longer than the "DataFrame" and starts with a time offset:

In[151]:=

In[152]:=

Out[152]=

In[153]:=

Out[153]=

Add the object as a new column to the data frame, automatically aligning the data by the indexes:

In[154]:=

Out[154]=

In[155]:=

Out[155]=
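
The alignment step can be sketched in plain pandas as follows. The values are made up: entries of the new series whose dates are absent from the frame are dropped, and frame dates absent from the series get NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.zeros(6)}, index=pd.date_range("20130101", periods=6))

# The series starts one day later and is one element longer than the frame.
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=pd.date_range("20130102", periods=7))

df["F"] = s1   # aligned on the frame's index
print(df)
```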

Set values by label:

In[156]:=

Out[156]=

In[157]:=

Out[157]=

In[158]:=

Out[158]=

Set values by position:

In[159]:=

Out[159]=

In[160]:=

Out[160]=

Set values with an array:

In[161]:=

Out[161]=

In[162]:=

Out[162]=

In[163]:=

Out[163]=

Create a NumPy array:

In[164]:=

Out[164]=

In[165]:=

Out[165]=

Assign a column to a NumPy array by position:

In[166]:=

Out[166]=

In[167]:=

Out[167]=

In[168]:=

Create a "DataFrame":

In[169]:=

Out[169]=

In[170]:=

Out[170]=

A Boolean array of values where a condition is met:

In[171]:=

Out[171]=

Replace values where the condition is not met:

In[172]:=

Out[172]=

Negate positive values using the where operation with setting:

In[173]:=

Out[173]=

In[174]:=

Out[174]=

In[175]:=

Missing data (5)

Create a "DataFrame" with missing values:

In[176]:=

Out[176]=

In[177]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {4, 4}], "Index" -> pd["DateRange"["20130101", "Periods" -> 4]], "Columns" -> CharacterRange["A", "D"]]]) // Normal

Out[177]=

In[178]:=

Out[178]=

Assign some values in the "DataFrame":

In[179]:=

Out[179]=

In[180]:=

Out[180]=

Drop rows with missing data:

In[181]:=

Out[181]=

Fill missing data:

In[182]:=

Out[182]=

Get the Boolean mask where values are missing:

In[183]:=

Out[183]=
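
The three missing-data operations above map onto the pandas methods dropna, fillna and isna. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df.dropna(how="any")   # drop rows containing any missing value
filled = df.fillna(value=5)      # replace missing values with 5
mask = df.isna()                 # True where a value is missing
print(len(dropped), int(mask.to_numpy().sum()))
```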

In[184]:=

Binary Operations (16)

Functions vs. Operators (10)

Define a "DataFrame" object:

In[185]:=

Out[185]=

In[186]:=

Out[186]=

Add a scalar using an arithmetic operator version:

In[187]:=

Out[187]=

Equivalently, use a function version:

In[188]:=

Out[188]=

Subtract a list:

In[189]:=

Out[189]=

Subtract a "Series" object:

In[190]:=

Out[190]=

In[191]:=

Out[191]=

Multiply a dictionary by axis:

In[192]:=

Out[192]=

In[193]:=

Out[193]=

Multiply a "DataFrame" of different shape using the operator version:

In[194]:=

Out[194]=

In[195]:=

Out[195]=

Use the function version to fill in missing values:

In[196]:=

Out[196]=

Create a "DataFrame" with a hierarchical "MultiIndex":

In[197]:=

(dfm = pd[
"DataFrame"[<|"angles" -> {0, 3, 4, 4, 5, 6}, "degrees" -> {360, 180, 360, 360, 540, 720}|>, "Index" -> {{" A", " A", " A", " B", " B", " B"}, {"circle", "triangle", "rectangle", "square", "pentagon", "hexagon"}}]]) // Normal

Out[197]=

Divide by the hierarchical "DataFrame" specifying the level:

In[198]:=

Out[198]=

In[199]:=

Matching / Broadcasting Behavior (6)

Create a data frame:

In[200]:=

Out[200]=

In[201]:=

(df = pd["DataFrame"[<|
"one" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"a", "b", "c"}]], "two" -> pd["Series"[RandomReal[{-1, 1}, {4}], "Index" -> {"a", "b", "c", "d"}]],
"three" -> pd["Series"[RandomReal[{-1, 1}, {3}], "Index" -> {"b", "c", "d"}]]
|>
]]) // Normal

Out[201]=

Select a row and a column:

In[202]:=

Out[202]=

In[203]:=

Out[203]=

Use the "Axis" option to match on the index or columns:

In[204]:=

Out[204]=

In[205]:=

Out[205]=

Alternatively:

In[206]:=

Out[206]=

In[207]:=

Out[207]=
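
In plain pandas, the axis argument decides whether a Series operand is matched against column labels or row labels. A sketch with made-up numbers, using the native method name sub:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]}, index=["a", "b", "c"]
)
row = df.iloc[1]      # Series indexed by column labels
column = df["two"]    # Series indexed by row labels

minus_row = df.sub(row, axis="columns")    # subtract the row from every row
minus_col = df.sub(column, axis="index")   # subtract the column from every column
print(minus_row, minus_col, sep="\n")
```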

Use the built-in Python divmod function with a Series object to take the floor division and modulo operation at the same time:

In[208]:=

Out[208]=

In[209]:=

Out[209]=

In[210]:=

Out[210]=

Do elementwise divmod:

In[211]:=

Out[211]=
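
For reference, the Python-side calls look like this. The sketch uses Python's built-in divmod, which pandas Series support directly; the divisors are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# Floor division and remainder by a scalar, in one call.
div, rem = divmod(s, 3)

# Elementwise divmod against a list of divisors of the same length.
div2, rem2 = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
print(list(div), list(rem))
```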

In[212]:=

Stats (4)

Create a data frame:

In[213]:=

Out[213]=

In[214]:=

(df = pd[
"DataFrame"[
Join[RandomReal[{-1, 1}, {6, 4}], Table[{5}, {6}], List /@ Flatten[{None, Range[5]}], 2], "Index" -> pd["DateRange"["20130101", "Periods" -> 6]], "Columns" -> CharacterRange["A", "F"]]]) // Normal

Out[214]=

Perform descriptive statistics:

In[215]:=

Out[215]=

Same operation on the other axis:

In[216]:=

Out[216]=

Operate on objects that have different dimensionality with alignment:

In[217]:=

Out[217]=

In[218]:=

Out[218]=

In[219]:=

Applying Functions (4)

Create a data frame:

In[220]:=

Out[220]=

In[221]:=

Out[221]=

Import a NumPy function:

In[222]:=

Out[222]=

Apply the function:

In[223]:=

Out[223]=

Create and apply a lambda function:

In[224]:=

Out[224]=

In[225]:=

Out[225]=

In[226]:=

Histogramming (2)

Create a "Series" object:

In[227]:=

Out[227]=

In[228]:=

Out[228]=

Count unique values in the series:

In[229]:=

Out[229]=

In[230]:=

String Methods (2)

Create a "Series" object:

In[231]:=

Out[231]=

In[232]:=

Out[232]=

Use the "String" attribute to operate on each element of the series:

In[233]:=

Out[233]=

In[234]:=

Out[234]=

In[235]:=

Merging (7)

Concatenating (4)

Create a data frame:

In[236]:=

Out[236]=

In[237]:=

In[238]:=

Out[238]=

Break it into pieces:

In[239]:=

Out[239]=

Concatenate the pieces:

In[240]:=

Out[240]=

The concatenated object is the same as the original:

In[241]:=

Out[241]=
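
A Python-side sketch of the same round trip: slice a frame into pieces, concatenate them with pd.concat, and compare with the original. The piece boundaries are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40.0).reshape(10, 4))

pieces = [df[:3], df[3:7], df[7:]]   # three consecutive row blocks
rejoined = pd.concat(pieces)         # stack the blocks back together
print(rejoined.equals(df))
```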

In[242]:=

Database-Style Joining (3)

Create two "DataFrame" objects:

In[243]:=

Out[243]=

In[244]:=

Out[244]=

In[245]:=

Out[245]=

Merge the objects in SQL style:

In[246]:=

Out[246]=

Alternatively, with different keys:

In[247]:=

Out[247]=

In[248]:=

Out[248]=

In[249]:=

Out[249]=
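
An SQL-style join in plain pandas can be sketched as follows. The frames and the column names key, lval and rval are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# Inner join on the shared "key" column (the default join type).
merged = pd.merge(left, right, on="key")
print(merged)
```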

In[250]:=

Grouping (3)

Create a "DataFrame" object:

In[251]:=

Out[251]=

In[252]:=

(df = pd[
"DataFrame"[<|
"A" -> {"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"},
"B" -> {"one", "one", "two", "three", "two", "two", "one", "three"},
"C" -> RandomInteger[10, {8}],
"D" -> RandomInteger[10, {8}]
|>]]) // Normal

Out[252]=

Group by values of the column "A", sum values in the groups and combine the results in a new "DataFrame" object:

In[253]:=

Out[253]=

In[254]:=

Out[254]=

Group by multiple columns forming a hierarchical index and apply the summing function to each group:

In[255]:=

Out[255]=
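
The grouping steps correspond to groupby followed by an aggregation. A minimal pandas sketch on a smaller, made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)

by_a = df.groupby("A")[["C"]].sum()    # one row per value of "A"
by_ab = df.groupby(["A", "B"]).sum()   # hierarchical index built from ("A", "B")
print(by_a, by_ab, sep="\n")
```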

In[256]:=

Reshaping (7)

Pivoting (4)

Create a "DataFrame" object:

In[257]:=

Out[257]=

In[258]:=

(df = pd[
"DataFrame"[<|"foo" -> {"one", "one", "one", "two", "two", "two"},
"bar" -> {"A", "B", "C", "A", "B", "C"}, "baz" -> Range[6], "zoo" -> {"x", "y", "z", "q", "w", "t"}|>]]) // Normal

Out[258]=

Organize the object by index and column values:

In[259]:=

Out[259]=

Alternatively:

In[260]:=

Out[260]=

Give a list of value labels:

In[261]:=

Out[261]=

In[262]:=

Out[262]=

In[263]:=

Out[263]=
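
The pivoting steps, sketched in plain pandas with the same data as the data frame above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": ["A", "B", "C", "A", "B", "C"],
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }
)

# Rows come from "foo", columns from "bar", cell values from "baz".
wide = df.pivot(index="foo", columns="bar", values="baz")

# A list of value labels produces hierarchical column labels.
both = df.pivot(index="foo", columns="bar", values=["baz", "zoo"])
print(wide)
```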

In[264]:=

Stacking (3)

Create a "DataFrame" object with a hierarchical index:

In[265]:=

Out[265]=

In[266]:=

(df = pd[
"DataFrame"[RandomReal[{-1, 1}, {4, 2}], "Index" -> {{"bar", "bar", "baz", "baz"}, {"one", "two", "one", "two"}}, "Columns" -> {"A", "B"}]]) // Normal

Out[266]=

Stack the object by "compressing" a level in the columns:

In[267]:=

Out[267]=

Reverse the operation by "unstacking" the last level:

In[268]:=

Out[268]=

In[269]:=

Out[269]=

In[270]:=

Out[270]=
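
A Python-side sketch of stacking and unstacking, with deterministic values in place of random ones so the round trip is easy to check:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
)
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"])

stacked = df.stack()           # column labels become the innermost index level
restored = stacked.unstack()   # move the last index level back to the columns
print(stacked.index.nlevels, restored.shape)
```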

In[271]:=

Time Series Resampling (3)

Create a series with 9 one-second timestamps:

In[272]:=

Out[272]=

In[273]:=

Out[273]=

In[274]:=

Out[274]=

Downsample the series into 3-second bins and sum the values falling into each bin:

In[275]:=

Out[275]=

In[276]:=

Out[276]=

Check the sums:

In[277]:=

Out[277]=
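
The downsampling step can be sketched in plain pandas as below. Timedelta objects are used for the frequencies here to sidestep differences in frequency alias spelling between pandas versions; that choice is specific to this sketch.

```python
import pandas as pd

# Nine one-second timestamps carrying the values 0..8.
index = pd.date_range("2000-01-01", periods=9, freq=pd.Timedelta(seconds=1))
series = pd.Series(range(9), index=index)

# Three-second bins; values falling into each bin are summed.
binned = series.resample(pd.Timedelta(seconds=3)).sum()
print(list(binned))
```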

In[278]:=

Create a time series object with dates given in the local time zone:

In[279]:=

Out[279]=

In[280]:=

Out[280]=

In[281]:=

Out[281]=

Check the dates:

In[282]:=

Out[282]=

Localize the series to the UTC time zone and check the dates:

In[283]:=

Out[283]=

In[284]:=

Out[284]=

Convert the series to another time zone:

In[285]:=

Out[285]=

In[286]:=

Out[286]=
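
Time zone handling in plain pandas uses tz_localize to attach a zone and tz_convert to change it. A sketch; the dates and the target zone America/New_York are assumptions made for this illustration.

```python
import pandas as pd

index = pd.date_range("2012-03-06", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=index)       # naive timestamps

ts_utc = ts.tz_localize("UTC")                     # mark the timestamps as UTC
ts_eastern = ts_utc.tz_convert("America/New_York") # same instants, other zone
print(ts_eastern.index[0])
```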

In[287]:=

Create a series with quarterly frequency for a year, ending in November:

In[288]:=

Out[288]=

In[289]:=

Out[289]=

In[290]:=

Out[290]=

Check the start dates of a few periods in the series:

In[291]:=

Out[291]=

Convert the series to 9 AM of the end of the month following the quarter end and check starting dates again:

In[292]:=

s["Assign"[
"index" -> (periods[
"AsFrequency"["Frequency" -> "M", "how" -> "e"]] + 1)[
"AsFrequency"["Frequency" -> "H", "how" -> "s"]] + 9]]

Out[292]=

In[293]:=

Out[293]=

In[294]:=

Categoricals (8)

Create a "DataFrame" with a column whose values are taken from a limited alphabet:

In[295]:=

Out[295]=

In[296]:=

(df = pd["DataFrame"[<|"id" -> {1, 2, 3, 4, 5, 6}, "raw_grade" -> {"a", "b", "b", "a", "a", "e"}|>]]) // Normal

Out[296]=

Convert the raw grades to a categorical data type:

In[297]:=

Out[297]=

In[298]:=

Out[298]=

The current categories:

In[299]:=

Out[299]=

Rename the categories to more meaningful names in place:

In[300]:=

Out[300]=

In[301]:=

Out[301]=

Reorder the categories and simultaneously add the missing categories:

In[302]:=

Out[302]=

In[303]:=

Out[303]=

In[304]:=

Out[304]=

Sort by order in the categories:

In[305]:=

Out[305]=

Sort by values in the "raw_grade" column (in lexicographic order):

In[306]:=

Out[306]=

Group by a categorical column, showing empty categories:

In[307]:=

Out[307]=
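
The categorical workflow above can be sketched in plain pandas as follows. The category names are taken from the pandas documentation example and are assumptions as far as this page is concerned.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

# Convert to a categorical type, then rename the categories a, b, e.
df["grade"] = df["raw_grade"].astype("category")
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the ones that do not occur in the data.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

sorted_df = df.sort_values(by="grade")                   # category order, not lexical
counts = df.groupby("grade", observed=False).size()      # keeps empty categories
print(counts)
```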

In[308]:=

Plotting (5)

Construct a simple data frame object:

In[309]:=

Out[309]=

In[310]:=

In[311]:=

df = pd["DataFrame"[<|"column1" -> RandomInteger[{0, 20}, n], "column2" -> RandomInteger[{20, 50}, n]|>]]

Out[311]=

Create a plot of column values with labels:

In[312]:=

Out[312]=

Show the plot in the default ("PNG") format:

In[313]:=

Out[313]=

Show the plot as vector graphics:

In[314]:=

Out[314]=

Export the plot to a file from Python:

In[315]:=

Out[315]=

Import the file:

In[316]:=

Out[316]=

Delete the file:

In[317]:=

Clear the plot figure:

In[318]:=

Plot the specified column:

In[319]:=

Out[319]=

In[320]:=

Out[320]=

Plot one column versus another:

In[321]:=

Out[321]=

In[322]:=

Out[322]=

In[323]:=

Out[323]=

In[324]:=

Create a time series:

In[325]:=

Out[325]=

In[326]:=

In[327]:=

Out[327]=

Compute its cumulative sum:

In[328]:=

Out[328]=

Prepare a plot of the time series:

In[329]:=

Out[329]=

Show the plot:

In[330]:=

Out[330]=

In[331]:=

Create a data frame:

In[332]:=

Out[332]=

In[333]:=

Out[333]=

List available plot types:

In[334]:=

Out[334]=

Create a bar plot:

In[335]:=

Out[335]=

In[336]:=

Out[336]=

Alternatively, use the "Plot" method of the "DataFrame" object:

In[337]:=

In[338]:=

Out[338]=

A stacked horizontal plot:

In[339]:=

Out[339]=

In[340]:=

Out[340]=

A box plot:

In[341]:=

Out[341]=

In[342]:=

Out[342]=

Pass keywords supported by the resource function MatplotlibObject "boxplot":

In[343]:=

Out[343]=

In[344]:=

Out[344]=

In[345]:=

Create a data frame with normally-distributed values:

In[346]:=

Out[346]=

In[347]:=

Out[347]=

Create a scatter matrix plot using the "ScatterMatrix" method from pandas.plotting:

In[348]:=

Out[348]=

In[349]:=

Out[349]=

In[350]:=

Create a time series of a cumulative random process:

In[351]:=

Out[351]=

In[352]:=

price = pd[
"Series"[FoldList[Plus, RandomReal[{-1, 1}, {150}]], "Index" -> pd["DateRange"["2000-1-1", "Periods" -> 150, "Frequency" -> "B"]]]]

Out[352]=

Compute a moving average and standard deviation of the process:

In[353]:=

Out[353]=

In[354]:=

Out[354]=

Prepare a temporal plot of the prices, the mean values, and the Bollinger band using a MatplotlibObject:

In[355]:=

Out[355]=

In[356]:=

Out[356]=

In[357]:=

Out[357]=

In[358]:=

Out[358]=

Show the plots:

In[359]:=

Out[359]=

In[360]:=

Importing and exporting data (14)

CSV (4)

In[361]:=

Out[361]=

In[362]:=

Normal[df = pd["DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]]

Out[362]=

Write to a CSV file:

In[363]:=

In[364]:=

Print contents of the file:

In[365]:=

Read the CSV file as a "DataFrame" object:

In[366]:=

Out[366]=

In[367]:=

Out[367]=
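
A Python-side sketch of the CSV round trip. The file name foo.csv and the temporary directory are hypothetical, and the values are fixed rather than random so the result can be checked.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [0.1, 0.2, 0.3], "C": ["foo", "bar", "baz"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")   # hypothetical file name
    df.to_csv(path, index=False)          # write without the row index
    restored = pd.read_csv(path)          # read back as a DataFrame

print(restored)
```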

Clean up:

In[368]:=

In[369]:=

HDF5 (5)

Create a "DataFrame":

In[370]:=

Out[370]=

In[371]:=

(df = pd[
"DataFrame"[<|"A" -> Range[3], "B" -> RandomReal[1, {3}], "C" -> {"foo", "bar", "baz"}|>]]) // Normal

Out[371]=

Write to an HDF5 store:

In[372]:=

In[373]:=

List datasets in the exported file and import the contents of the first dataset:

In[374]:=

Out[374]=

In[375]:=

Out[375]=

Read the contents of the file as a "DataFrame":

In[376]:=

Out[376]=

In[377]:=

Out[377]=

Clean up:

In[378]:=

In[379]:=

Excel (5)

Create a new pandas object:

In[380]:=

Out[380]=

In[381]:=

Out[381]=

Write to an Excel file:

In[382]:=

In[383]:=

Check the file:

In[384]:=

Out[384]=

Read the file as a "DataFrame":

In[385]:=

Out[385]=

In[386]:=

Out[386]=

Clean up:

In[387]:=

In[388]:=

Applications (5)

Use PandasObject to perform data analysis in Python when importing data to the Wolfram Language is impractical or undesirable. Download a county business patterns file from the US Census database and unzip it to a temporary directory:

In[389]:=

fname = ExtractArchive[
"https://www2.census.gov/programs-surveys/cbp/datasets/2020/cbp20us.zip", $TemporaryDirectory] // First;

Check the timing of creating a dataset in the Wolfram Language:

In[390]:=

Out[390]=

Import the data to a "DataFrame" in Python and check the timing:

In[391]:=

Out[391]=

In[392]:=

Out[392]=

The first few lines of the dataset:

In[393]:=

Out[393]=

Compare to the dataset imported to the Wolfram Language:

In[394]:=

Out[394]=

In[395]:=

Properties and Relations (7)

PandasObject[…] gives the same result as the resource function PythonObject with a special configuration:

In[396]:=

Out[396]=

In[397]:=

Out[397]=

In[398]:=

Out[398]=

In[399]:=

Get information on a pandas object:

In[400]:=

Out[400]=

In[401]:=

Out[401]=

Open the user guide in your default web browser:

In[402]:=

In[403]:=

Some of the functions and classes available in the pandas module:

In[404]:=

Out[404]=

In[405]:=

Out[405]=

In[406]:=

Out[406]=

Information on a class:

In[407]:=

Out[407]=

The web documentation for a class:

In[408]:=

In[409]:=

pandas’s "DataFrame" is analogous to Dataset, but keeps the object on the Python side:

In[410]:=

Out[410]=

In[411]:=

Out[411]=

Print the object in Python:

In[412]:=

Transfer the data from Python to create a Dataset:

In[413]:=

Out[413]=

In[414]:=

Out[414]=

In[415]:=

Many pandas operations are parallel to operations on Dataset:

In[416]:=

Out[416]=

In[417]:=

In[418]:=

df = pd["DataFrame"[<|
"a" -> RandomVariate[NormalDistribution[0, 1], {n}], "b" -> RandomInteger[100, {n}]|>]]

Out[418]=

In[419]:=

Out[419]=

Select rows satisfying a condition:

In[420]:=

Out[420]=

In[421]:=

Out[421]=

Plot histograms of the columns:

In[422]:=

Out[422]=

In[423]:=

Out[423]=

In[424]:=

Out[424]=

In[425]:=

Similarly, pandas’s "Series" object is analogous to TimeSeries:

In[426]:=

Out[426]=

In[427]:=

In[428]:=

Out[428]=

In[429]:=

Out[429]=

Plot the time series in Python:

In[430]:=

Out[430]=

In[431]:=

Out[431]=

Plot the imported time series with DateListPlot:

In[432]:=

Out[432]=

In[433]:=

Create a "DataFrame" and a Boolean mask for positive values:

In[434]:=

Out[434]=

In[435]:=

Out[435]=

In[436]:=

Out[436]=

PythonObject allows you to apply Python commands directly, and bring the results back to the Wolfram Language if necessary:

In[437]:=

Out[437]=

In[438]:=

Out[438]=

Alternatively:

In[439]:=

Out[439]=

In[440]:=

Possible Issues (2)

Create a pandas object:

In[441]:=

Out[441]=

Since NumPy arrays have a single data type for the entire array (dtype), importing a NumPy array to the Wolfram Language may fail if one of the columns cannot be imported directly:

In[442]:=
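
The dtype issue can be observed on the Python side. In this sketch with made-up data, a frame with only float columns converts to a float64 array, whereas mixing floats with timestamps forces the generic object dtype, which has no direct numeric array counterpart.

```python
import pandas as pd

numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
mixed = pd.DataFrame(
    {"A": [1.0, 2.0], "B": pd.to_datetime(["2013-01-02", "2013-01-03"])}
)

numeric_dtype = numeric.to_numpy().dtype   # one numeric dtype for the whole array
mixed_dtype = mixed.to_numpy().dtype       # falls back to a generic dtype
print(numeric_dtype, mixed_dtype)
```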