parser.sandhi

Intro

Sandhi splitter for Samskrit. Builds a database of sandhi rules and uses them both to join words (perform sandhi) and to split them.

The generated splits may not all be valid words; validating them is left to the calling module. See, for example, SanskritLexicalAnalyzer.
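One way a calling module might validate candidates is to keep only the pairs whose two parts both pass a lexical lookup. The sketch below is a minimal illustration of that idea, not how SanskritLexicalAnalyzer works: filter_valid_splits and the toy lexicon are hypothetical names and data, and the candidate pairs are copied from the split_at output in the Usage section.

```python
def filter_valid_splits(splits, is_valid_word):
    """Keep only the (left, right) pairs whose parts both pass is_valid_word."""
    if splits is None:  # documented return value when no split can be performed
        return set()
    return {(left, right) for left, right in splits
            if is_valid_word(left) and is_valid_word(right)}


# Candidate pairs in the shape a sandhi splitter returns (SLP1 strings).
candidates = {("tar", "eva"), ("tas", "eva"), ("taH", "eva"), ("ta", "eva")}

# Toy stand-in for a real lexical lookup; the entries are illustrative only.
toy_lexicon = {"tas", "taH", "eva"}

valid = filter_valid_splits(candidates, toy_lexicon.__contains__)
print(sorted(valid))
```

Passing the lookup as a callable keeps the filter independent of any particular dictionary or analyzer.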

Example usage:

from sandhi import Sandhi
sandhi = Sandhi()
joins = sandhi.join('tasmin', 'iti')
splits = sandhi.split_at('tasminniti', 5)

Draws inspiration from https://github.com/sanskrit/sanskrit

@author: Avinash Varna (github: @avinashvarna)

Usage

The Sandhi class can be used to join/split words:

>>> from sanskrit_parser.base.sanskrit_base import SanskritImmutableString
>>> from sanskrit_parser.parser.sandhi import Sandhi
>>> sandhi = Sandhi()
>>> word1 = SanskritImmutableString('te')
>>> word2 = SanskritImmutableString('eva')
>>> joins = sandhi.join(word1, word2)
>>> for join in joins:
...    print(join)
...
teeva
taeva
ta eva
tayeva

To split at a specific position, use the Sandhi.split_at() method:

>>> w = SanskritImmutableString('taeva')
>>> splits = sandhi.split_at(w, 1)
>>> for split in splits:
...    print(split)
...
(u'tar', u'eva')
(u'tas', u'eva')
(u'taH', u'eva')
(u'ta', u'eva')

To split at all possible locations, use the Sandhi.split_all() method:

>>> splits_all = sandhi.split_all(w)
>>> for split in splits_all:
...    print(split)
...
(u't', u'aeva')
(u'tar', u'eva')
(u'taev', u'a')
(u'to', u'eva')
(u'ta', u'eva')
(u'te', u'eva')
(u'taH', u'eva')
(u'tae', u'va')
(u'taeva', u'')
(u'tas', u'eva')

Note: The Sandhi class can both over-generate (return forms that are not valid words, as mentioned in the introduction) and under-generate (miss valid forms).

Command line usage

$ python -m sanskrit_parser.parser.sandhi --join te eva
Joining te eva
set([u'teeva', u'taeva', u'ta eva', u'tayeva'])

$ python -m sanskrit_parser.parser.sandhi --split taeva 1
Splitting taeva at 1
set([(u'tar', u'eva'), (u'tas', u'eva'), (u'taH', u'eva'), (u'ta', u'eva')])

$ python -m sanskrit_parser.parser.sandhi --split taeva --all
All possible splits for taeva
set([(u't', u'aeva'), (u'tar', u'eva'), (u'taev', u'a'), (u'to', u'eva'),
(u'ta', u'eva'), (u'te', u'eva'), (u'taH', u'eva'), (u'tae', u'va'),
(u'taeva', u''), (u'tas', u'eva')])

class sanskrit_parser.parser.sandhi.Sandhi(rules_dir=None, use_default_rules=True, logger=None)

Bases: object

Class to hold all the sandhi rules and methods for joining and splitting. Uses SLP1 encoding for all internal operations.
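Because results are reported in SLP1, strings such as 'taH' in the examples use one ASCII character per sound. As a reading aid, the sketch below converts basic SLP1 letters to IAST. It is a hand-written, partial table (accent marks and other extensions are omitted), and slp1_to_iast is a hypothetical helper, not part of sanskrit_parser.

```python
# Partial SLP1 -> IAST table: only the letters that differ between the schemes.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ",
    "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}


def slp1_to_iast(text):
    """Map each SLP1 character to IAST; unlisted characters pass through."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


print(slp1_to_iast("taH"))
```

A per-character lookup is enough here because SLP1 assigns exactly one character to each sound.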

join(first_in, second_in)

Performs sandhi. Warning: May generate forms that are not lexically valid.

Parameters
  • first_in – SanskritImmutableString first word of the sandhi

  • second_in – SanskritImmutableString second word of the sandhi

Returns

list of strings of possible sandhi forms, or None if no sandhi can be performed

split_all(word_in, start=None, stop=None)

Split word at all possible locations and return splits. Warning: Some of the generated splits may not be lexically valid.

Parameters

word_in – SanskritImmutableString word to split

Returns

set of tuple of strings of possible split forms, or None if no split can be performed
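The example outputs in the Usage section are consistent with split_all collecting the results of split_at over a range of positions. The sketch below illustrates that reading only. It is an assumption rather than the library's code: split_all_sketch, the treatment of start and stop as a half-open index range, and the rule-free naive_split_at are all hypothetical.

```python
def split_all_sketch(split_at, word, start=None, stop=None):
    """Union the results of split_at(word, idx) for idx in [start, stop)."""
    start = 0 if start is None else start      # assumed default: first index
    stop = len(word) if stop is None else stop  # assumed default: end of word
    results = set()
    for idx in range(start, stop):
        splits = split_at(word, idx)
        if splits:  # None or empty means no split at this index
            results |= set(splits)
    return results or None  # mirror the documented None when nothing was found


def naive_split_at(word, idx):
    """Toy splitter: cut after position idx, applying no sandhi rules."""
    return {(word[:idx + 1], word[idx + 1:])}


print(sorted(split_all_sketch(naive_split_at, "taeva")))
```

With the toy splitter this yields only the plain cuts, such as ('ta', 'eva') and ('taeva', ''); a rule-based split_at would add the restored forms at each position.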

split_at(word_in, idx)

Split sandhi at the given index of word. Warning: Some of the generated splits may not be lexically valid.

Parameters
  • word_in – SanskritImmutableString word to split

  • idx – position within word at which to try the split

Returns

set of tuple of strings of possible split forms, or None if no split can be performed
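Since join, split_all and split_at are all documented to return None when nothing can be done, and sets have no fixed iteration order, callers may want to normalise results before looping over or printing them. A minimal sketch, assuming only the documented return shapes; as_sorted_list is a hypothetical helper, and the sample values are copied from the examples in the Usage section.

```python
def as_sorted_list(result):
    """Turn a join/split result into a sorted list; None becomes an empty list."""
    return sorted(result) if result is not None else []


# Sample values in the documented shapes.
sample_splits = {("tas", "eva"), ("ta", "eva")}
sample_joins = ["teeva", "taeva", "ta eva", "tayeva"]

for left, right in as_sorted_list(sample_splits):
    print(left, right)

for form in as_sorted_list(None):  # loop body never runs for a None result
    print(form)
```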
