parser.sandhi
Intro
Sandhi splitter for Samskrit. Builds a database of sandhi rules and uses them both to perform sandhi (join words) and to split words.
The splitter over-generates: not every split it returns consists of valid words. Validating the splits is left to the calling module; see, for example, SanskritLexicalAnalyzer.
- Example usage:
from sandhi import Sandhi
sandhi = Sandhi()
joins = sandhi.join('tasmin', 'iti')
splits = sandhi.split_at('tasminniti', 5)
Draws inspiration from https://github.com/sanskrit/sanskrit
@author: Avinash Varna (github: @avinashvarna)
Usage
The Sandhi class can be used to join/split words:
>>> from sanskrit_parser.base.sanskrit_base import SanskritImmutableString
>>> from sanskrit_parser.parser.sandhi import Sandhi
>>> sandhi = Sandhi()
>>> word1 = SanskritImmutableString('te')
>>> word2 = SanskritImmutableString('eva')
>>> joins = sandhi.join(word1, word2)
>>> for join in joins:
... print(join)
...
teeva
taeva
ta eva
tayeva
To split at a specific position, use the Sandhi.split_at() method:
>>> w = SanskritImmutableString('taeva')
>>> splits = sandhi.split_at(w, 1)
>>> for split in splits:
... print(split)
...
(u'tar', u'eva')
(u'tas', u'eva')
(u'taH', u'eva')
(u'ta', u'eva')
To split at all possible locations, use the Sandhi.split_all() method:
>>> splits_all = sandhi.split_all(w)
>>> for split in splits_all:
... print(split)
...
(u't', u'aeva')
(u'tar', u'eva')
(u'taev', u'a')
(u'to', u'eva')
(u'ta', u'eva')
(u'te', u'eva')
(u'taH', u'eva')
(u'tae', u'va')
(u'taeva', u'')
(u'tas', u'eva')
Note: As mentioned previously, the Sandhi class can over-generate (return forms that are not valid words). It can also under-generate (miss valid forms).
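Because validation is left to the caller, a typical next step is to filter the candidate splits against some lexical check. The following is a minimal sketch only: the candidate set is copied from the split_all() output shown above, while `filter_valid_splits` and `TOY_LEXICON` are hypothetical names introduced for illustration and are not part of sanskrit_parser. A real caller would substitute a proper lexical lookup, such as the one SanskritLexicalAnalyzer performs.

```python
# Candidate splits for 'taeva', copied from the split_all() output above.
candidate_splits = {
    ("t", "aeva"), ("tar", "eva"), ("taev", "a"), ("to", "eva"),
    ("ta", "eva"), ("te", "eva"), ("taH", "eva"), ("tae", "va"),
    ("taeva", ""), ("tas", "eva"),
}

# Hypothetical toy lexicon standing in for a real lexical lookup (assumption).
TOY_LEXICON = {"te", "eva"}


def filter_valid_splits(splits, is_valid_word):
    """Keep only the splits whose two parts both pass the validity check.

    `splits` may be None, since the Sandhi methods are documented to
    return None when no split can be performed.
    """
    if splits is None:
        return []
    return sorted(
        pair for pair in splits
        if all(is_valid_word(part) for part in pair)
    )


valid = filter_valid_splits(candidate_splits, TOY_LEXICON.__contains__)
print(valid)  # [('te', 'eva')]
```

Passing the validity check as a callable keeps the filter independent of how words are actually looked up.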
Command line usage
$ python -m sanskrit_parser.parser.sandhi --join te eva
Joining te eva
set([u'teeva', u'taeva', u'ta eva', u'tayeva'])
$ python -m sanskrit_parser.parser.sandhi --split taeva 1
Splitting taeva at 1
set([(u'tar', u'eva'), (u'tas', u'eva'), (u'taH', u'eva'), (u'ta', u'eva')])
$ python -m sanskrit_parser.parser.sandhi --split taeva --all
All possible splits for taeva
set([(u't', u'aeva'), (u'tar', u'eva'), (u'taev', u'a'), (u'to', u'eva'),
(u'ta', u'eva'), (u'te', u'eva'), (u'taH', u'eva'), (u'tae', u'va'),
(u'taeva', u''), (u'tas', u'eva')])
class sanskrit_parser.parser.sandhi.Sandhi(rules_dir=None, use_default_rules=True, logger=None)
Bases: object
Class to hold all the sandhi rules and methods for joining and splitting. Uses SLP1 encoding for all internal operations.
join(first_in, second_in)
Performs sandhi. Warning: May generate forms that are not lexically valid.
- Parameters
first_in – SanskritImmutableString first word of the sandhi
second_in – SanskritImmutableString second word of the sandhi
- Returns
list of strings of possible sandhi forms, or None if no sandhi can be performed
split_all(word_in, start=None, stop=None)
Split word at all possible locations and return splits. Warning: Will generate splits that are not lexically valid.
- Parameters
word_in – SanskritImmutableString word to split
- Returns
set of tuple of strings of possible split forms, or None if no split can be performed
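One way to picture "split at all possible locations" is as the union of per-index splits over a range of positions. The sketch below illustrates that idea only and is not the library's implementation. `toy_split_at` is a hypothetical stand-in that cuts the string after position idx without applying any sandhi rule (which matches the rule-free pairs such as ('ta', 'eva') at index 1 in the sample output). The half-open reading of start and stop is an assumption, since this section does not document those parameters.

```python
def toy_split_at(word, idx):
    """Hypothetical stand-in for Sandhi.split_at.

    Cuts the word after position idx and applies no sandhi rule.
    Returns None when idx is out of range.
    """
    if not 0 <= idx < len(word):
        return None
    return {(word[: idx + 1], word[idx + 1:])}


def collect_splits(word, split_at, start=None, stop=None):
    """Union the results of split_at over every index in [start, stop).

    Assumption: start defaults to 0 and stop to len(word).
    Returns None when nothing was found, mirroring the documented
    "or None if no split can be performed" convention.
    """
    start = 0 if start is None else start
    stop = len(word) if stop is None else stop
    splits = set()
    for idx in range(start, stop):
        found = split_at(word, idx)
        if found:  # skip None or empty results
            splits |= found
    return splits or None


all_cuts = collect_splits("taeva", toy_split_at)
for pair in sorted(all_cuts):
    print(pair)
```

The five pairs this prints are the rule-free subset of the split_all() output shown in the Usage section. The remaining pairs there (for example ('te', 'eva')) come from applying sandhi rules, which the toy function deliberately omits.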
split_at(word_in, idx)
Split sandhi at the given index of word. Warning: Will generate splits that are not lexically valid.
- Parameters
word_in – SanskritImmutableString word to split
idx – position within word at which to try the split
- Returns
set of tuple of strings of possible split forms, or None if no split can be performed