Function Repository Resource:

JapaneseTextTokenizer


Separate a piece of Japanese text into grammatical parts

Contributed by: Richard Hennigan (Wolfram Research)

ResourceFunction["JapaneseTextTokenizer"]["string"]

returns a list of associations containing data about the grammatical parts of the Japanese text "string".

ResourceFunction["JapaneseTextTokenizer"]["string","property"]

returns the specified property for each of the parts found in "string".

Details and Options

In ResourceFunction["JapaneseTextTokenizer"]["text"], an Association is returned for each word found in "text" with the following properties:

"SurfaceForm"      the string corresponding to the word as it appears in the text
"BaseForm"         the dictionary form of the word
"Reading"          the pronunciation of the word
"PartsOfSpeech"    a list of possible parts of speech for the word

In ResourceFunction["JapaneseTextTokenizer"]["text","property"], the value for "property" can be any property name listed above, a list of such names, or "Dataset", which returns a Dataset of all properties.
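A minimal sketch of the calling forms described above. The sample sentence and the variable names are illustrative assumptions, not taken from the original examples, and no output is shown because it depends on the tokenizer's dictionary.

```wolfram
(* Sample input chosen for illustration only *)
text = "私は学生です";

(* One Association per word, with keys such as "SurfaceForm" and "Reading" *)
tokens = ResourceFunction["JapaneseTextTokenizer"][text];

(* A single property for every word *)
surfaceForms = ResourceFunction["JapaneseTextTokenizer"][text, "SurfaceForm"];

(* All properties collected in a Dataset *)
dataset = ResourceFunction["JapaneseTextTokenizer"][text, "Dataset"];
```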

ResourceFunction["JapaneseTextTokenizer"] has the following options:

"EnglishPartsOfSpeech"    True          whether to translate the list of parts of speech to English
Language                  $Language     which language to translate words into when "TranslateWords"→True
MissingString             None          a string to be used in place of Missing values
"Reading"                 "Katakana"    which writing system to use to represent readings
"TranslateWords"          False         whether to include word translations in the output

Some possible values for "Reading" are "Hiragana", "Katakana" and "Romaji".
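The options can be combined with any of the calling forms. The following is a sketch only: the sample sentence is an illustrative assumption, and passing the language as the string "French" is assumed to be accepted in the same way as an Entity object.

```wolfram
(* Sample input chosen for illustration only *)
text = "私は学生です";

(* Readings written in romaji instead of the default katakana *)
romaji = ResourceFunction["JapaneseTextTokenizer"][text, "Reading",
   "Reading" -> "Romaji"];

(* Parts of speech left in Japanese rather than translated to English *)
japanesePOS = ResourceFunction["JapaneseTextTokenizer"][text, "PartsOfSpeech",
   "EnglishPartsOfSpeech" -> False];

(* Word translations included; the "French" string form is an assumption *)
translated = ResourceFunction["JapaneseTextTokenizer"][text, "Dataset",
   "TranslateWords" -> True, Language -> "French"];
```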

Examples

Basic Examples (4)

Get information about the structure of a piece of Japanese text:

In[1]:=

Out[1]=

Get a specific property:

In[2]:=

Out[2]=

Return a Dataset:

In[3]:=

Out[3]=

Specify a list of properties:

In[4]:=

Out[4]=

Properties and Relations (5)

Japanese text is typically written without spaces, so segmentation that relies on whitespace does not work:

In[5]:=

Out[5]=

JapaneseTextTokenizer can identify individual words without spaces:

In[6]:=

Out[6]=

Readings for kanji and particles are context sensitive:

In[7]:=

Out[7]=

Compare to Transliterate on individual particles:

In[8]:=

Out[8]=

In[9]:=

Out[9]=

The readings of "wa" and "o" correspond to the pronunciation (this example requires a Japanese voice to be installed):

In[10]:=

Out[10]=

See how the reading for a character changes depending on context:

In[11]:=

Out[11]=

In[12]:=

Out[12]=

Listen to the difference (this example requires a Japanese voice to be installed):

In[13]:=

Out[13]=

In[14]:=

Out[14]=

JapaneseTextTokenizer can include translations of individual words in the output:

In[15]:=

Out[15]=

Compare to using WordTranslation:

In[16]:=

Out[16]=

In[17]:=

Out[17]=

In[18]:=

Out[18]=

TextTranslation provides better results for text since it takes context into consideration:

In[19]:=

Out[19]=

In[20]:=

Out[20]=

FuriganaForm uses JapaneseTextTokenizer to parse text:

In[21]:=

Out[21]=

Get similar results using JapaneseTextTokenizer:

In[22]:=

Out[22]=

In[23]:=

Grid[Transpose[{If[StringContainsQ[#SurfaceForm, _?(ResourceFunction["KanjiQ"])], #Reading, ""], #SurfaceForm} & /@ data]]

Out[23]=

Options (12)

EnglishPartsOfSpeech (2)

By default, parts of speech are given as English string tokens:

In[24]:=

Out[24]=

Get the parts of speech in Japanese instead of English:

In[25]:=

Out[25]=

Language (3)

By default, when specifying "TranslateWords"→True, individual words will be translated to the language defined by $Language:

In[26]:=

Out[26]=

Get translations in another language:

In[27]:=

Out[27]=

Entity objects are also supported:

In[28]:=

Out[28]=

MissingString (3)

By default, a Missing object is returned when values are not found:

In[29]:=

ResourceFunction["JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading"]

Out[29]=

Specify a string to use instead:

In[30]:=

ResourceFunction["JapaneseTextTokenizer"]["'バード\[CenterDot]セイ' は名作です!", "Reading", MissingString -> " "]

Out[30]=

In[31]:=

Out[31]=

Missing values can also typically be avoided by using full-width Japanese characters:

In[32]:=

Out[32]=

Reading (3)

Specify which writing system to show the readings in:

In[33]:=

Out[33]=

In[34]:=

Out[34]=

In[35]:=

Out[35]=

Generate phonetic romaji text from Japanese:

In[36]:=

Out[36]=

Entity objects are also supported:

In[37]:=

Out[37]=

TranslateWords (1)

Include word translations when available:

In[38]:=

Out[38]=

Applications (2)

Create a function to compute word counts in Japanese:

In[39]:=

In[40]:=

Out[40]=
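One way such a word-count function might be written is sketched below. The name japaneseWordCounts and the choice to count by "BaseForm" are assumptions made for illustration, not the original definition. Punctuation tokens and any Missing values are counted like any other entry.

```wolfram
(* Hypothetical helper: tally dictionary forms, most frequent first *)
japaneseWordCounts[text_String] :=
  ReverseSort[
   Counts[ResourceFunction["JapaneseTextTokenizer"][text, "BaseForm"]]];

(* Usage with an illustrative sentence *)
japaneseWordCounts["猫が好きです。犬も好きです。"]
```

Counting by "BaseForm" groups inflected forms of the same word together; counting by "SurfaceForm" instead would tally each written form separately.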

Compare the total number of words for the same document in English using WordCounts:

In[41]:=

Out[41]=

In[42]:=

Out[42]=

Create a variant of TextCases that can find parts of speech in Japanese text:

In[43]:=

japaneseTextCases[text_String, form_String] :=
  Cases[
   ResourceFunction["JapaneseTextTokenizer"][text, {"BaseForm", "PartsOfSpeech"}],
   KeyValuePattern[{"BaseForm" -> word_, "PartsOfSpeech" -> {___, form, ___}}] :> word];
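A possible usage sketch follows. The definition is repeated so the block stands alone. The sample sentence and the part-of-speech label "Noun" are assumptions: the exact labels returned depend on the tokenizer and on the "EnglishPartsOfSpeech" setting, so it is worth inspecting the "PartsOfSpeech" values for a given text first.

```wolfram
(* Definition repeated from above so that this block is self-contained *)
japaneseTextCases[text_String, form_String] :=
  Cases[
   ResourceFunction["JapaneseTextTokenizer"][text, {"BaseForm", "PartsOfSpeech"}],
   KeyValuePattern[{"BaseForm" -> word_, "PartsOfSpeech" -> {___, form, ___}}] :> word];

(* Illustrative input; inspect the labels the tokenizer actually returns *)
sample = "猫が魚を食べる";
ResourceFunction["JapaneseTextTokenizer"][sample, "PartsOfSpeech"]

(* "Noun" is an assumed label; replace it with one seen in the output above *)
japaneseTextCases[sample, "Noun"]
```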