Code¶
Parser¶
sanskrit_parser is a Python library to help parse Sanskrit input.
It provides three main levels of output, in order of increasing complexity:
- tags - morphological analysis of a word
- sandhi - sandhi split of a phrase
- vakya - morpho-syntactic analysis of a sentence (after sandhi split)
Code resides at: https://github.com/kmadathil/sanskrit_parser/
Please report any issues at: https://github.com/kmadathil/sanskrit_parser/issues
API¶
Code Usage¶
The Parser class can be used to generate vakya parses as follows:
from sanskrit_parser import Parser
string = "astyuttarasyAMdiSi"
input_encoding = "SLP1"
output_encoding = "SLP1"
parser = Parser(input_encoding=input_encoding,
                output_encoding=output_encoding,
                replace_ending_visarga='s')
print('Splits:')
for split in parser.split(string, limit=10):
    print(f'Lexical Split: {split}')
    for i, parse in enumerate(split.parse(limit=2)):
        print(f'Parse {i}')
        print(f'{parse}')
    break
This produces the output:
Lexical Split: ['asti', 'uttarasyAm', 'diSi']
Parse 0
asti => (asti, ['samAsapUrvapadanAmapadam', 'strIliNgam']) : samasta of uttarasyAm
uttarasyAm => (uttara#1, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam'])
diSi => (diS, ['saptamIviBaktiH', 'ekavacanam', 'strIliNgam']) : viSezaRa of uttarasyAm
Parse 1
asti => (asti, ['samAsapUrvapadanAmapadam', 'strIliNgam']) : samasta of uttarasyAm
uttarasyAm => (uttara#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam']) : viSezaRa of diSi
diSi => (diS#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam'])
Parse 2
asti => (as#1, ['kartari', 'praTamapuruzaH', 'law', 'parasmEpadam', 'ekavacanam', 'prATamikaH'])
uttarasyAm => (uttara#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam']) : viSezaRa of diSi
diSi => (diS, ['saptamIviBaktiH', 'ekavacanam', 'strIliNgam']) : aDikaraRam of asti
-
class sanskrit_parser.api.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
Bases: json.encoder.JSONEncoder
-
default(o)¶
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
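The same hook can be adapted to the Serializable pattern this module uses. Below is a minimal standalone sketch; the serializable() method name and the Tag class are hypothetical stand-ins, not the library's actual API:

```python
import json

class SerializableEncoder(json.JSONEncoder):
    def default(self, o):
        # Hypothetical hook: convert objects exposing a serializable() method
        if hasattr(o, 'serializable'):
            return o.serializable()
        # Otherwise defer to the base class, which raises TypeError
        return super().default(o)

class Tag:
    # Stand-in for a Serializable object such as ParseTag
    def __init__(self, root, tags):
        self.root, self.tags = root, tags

    def serializable(self):
        return {'root': self.root, 'tags': self.tags}

print(json.dumps(Tag('hari', ['gen', 'sg']), cls=SerializableEncoder))
# → {"root": "hari", "tags": ["gen", "sg"]}
```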
-
-
class sanskrit_parser.api.Parse(split: sanskrit_parser.api.Split, parse_graph, cost)¶
-
class sanskrit_parser.api.ParseEdge(predecessor: sanskrit_parser.api.ParseNode, node: sanskrit_parser.api.ParseNode, label: str)¶
Bases: sanskrit_parser.api.Serializable
-
predecessor: sanskrit_parser.api.ParseNode¶
-
-
class sanskrit_parser.api.ParseNode(node: sanskrit_parser.parser.datastructures.VakyaGraphNode, strict_io: bool, encoding: str)¶
-
class sanskrit_parser.api.ParseTag(root: str, tags: Sequence[str])¶
Bases: sanskrit_parser.api.Serializable
-
class sanskrit_parser.api.Parser(strict_io: bool = False, input_encoding: Optional[str] = None, output_encoding: str = 'SLP1', lexical_lookup: str = 'combined', score: bool = True, split_above: int = 5, replace_ending_visarga: Optional[str] = None, fast_merge: bool = True)¶
Bases: object
-
class sanskrit_parser.api.Serializable¶
Bases: abc.ABC
Base class to indicate an object is serializable into JSON
-
class sanskrit_parser.api.Split(parser: sanskrit_parser.api.Parser, input_string: str, split: Sequence[sanskrit_parser.base.sanskrit_base.SanskritObject], vgraph: sanskrit_parser.parser.datastructures.VakyaGraph = None)¶
Bases: sanskrit_parser.api.Serializable
-
parser: sanskrit_parser.api.Parser¶
-
split: Sequence[sanskrit_parser.base.sanskrit_base.SanskritObject]¶
-
vgraph: sanskrit_parser.parser.datastructures.VakyaGraph = None¶
-
Parser¶
Sanskrit Parser Data Structures
@author: Karthik Madathil (github: @kmadathil)
-
class sanskrit_parser.parser.datastructures.SandhiGraph¶
Bases: object
DAG class to hold Sandhi Lexical Analysis Results
Represents the results of lexical sandhi analysis as a DAG. Nodes are SanskritObjects.
-
add_node(node)¶
Extend the DAG with a node inserted at the root.
- Params:
node (SanskritObject): node to add
root (Boolean): make a root node
end (Boolean): add an edge to end
-
append_to_node(t, nodes)¶
Create edges from t to nodes.
- Params:
t (SanskritObject): node to append to
nodes (iterator(nodes)): nodes to append to t
-
end = '__end__'¶
-
find_all_paths(max_paths=10, sort=True, score=True)¶
Find all paths through the DAG to the end node.
- Params:
max_paths (int, default 10): number of paths to find; if this is > 1000, all paths will be found
sort (bool): if True (default), sort paths in ascending order of length
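The path enumeration described above can be sketched with a simple depth-first search. This is an illustrative standalone version, not the library's implementation; the toy adjacency dict mimics a sandhi-split DAG with the '__start__' and '__end__' sentinel nodes mentioned in this class:

```python
def find_all_paths(adj, start, end, max_paths=10):
    # Iterative DFS enumerating start-to-end paths in an adjacency-dict DAG
    paths, stack = [], [(start, [start])]
    while stack and len(paths) < max_paths:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for nxt in adj.get(node, []):
            stack.append((nxt, path + [nxt]))
    return paths

# Toy graph shaped like a sandhi-split DAG
adj = {
    '__start__': ['asti'],
    'asti': ['uttarasyAm', 'uttara'],
    'uttara': ['syAm'],
    'syAm': ['diSi'],
    'uttarasyAm': ['diSi'],
    'diSi': ['__end__'],
}
for p in find_all_paths(adj, '__start__', '__end__'):
    print(p[1:-1])  # strip the sentinels
```

This prints the two candidate splits encoded in the toy graph, one path per line.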
-
has_node(t)¶
Does a given node exist in the graph?
- Params:
t (SanskritObject): Node
- Returns:
boolean
-
lock_start()¶
Make the graph ready for search by adding a start node.
Add a start node, add arcs to all current root nodes, and clear self.roots.
-
start = '__start__'¶
-
-
class sanskrit_parser.parser.datastructures.VakyaGraph(path, max_parse_dc=4, fast_merge=True)¶
Bases: object
DAG class for Sanskrit Vakya Analysis
Represents Sanskrit vakya analysis as a DAG. Nodes are SanskritObjects with morphological tags; edges are potential relationships between them.
-
add_node(node)¶
Extend the DAG with a node inserted at the root.
- Params:
node (VakyaGraphNode): node to add
-
add_sentence_conjunctions(laks, krts)¶
Add sentence conjunction links.
For all nodes which match sentence_conjunction keys, add a vAkyasambanDaH link between each y-t pair (if the vibhakti matches, where relevant). Reverse all edges to the node and prefix sambadDa- to the link label (e.g. sambadDa-karma) if the node is not vIpsA. If the node is saMyojakaH and not vIpsA, add saMbadDakriyA links to verbs; if the associated t* doesn't exist, add vAkyasambanDaH links from verbs.
-
check_parse_validity()¶
Validity check for parses:
- Remove parses with double kArakas
- Remove parses with multiple incoming edges to a node
- Remove parses with cycles
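The cycle check in the last step can be sketched with standard DFS coloring. This is an illustration of the general technique, not the library's code; the node names in the demo are arbitrary:

```python
def has_cycle(adj):
    # DFS coloring: 0 = unvisited, 1 = on the current path, 2 = finished
    color = {}

    def visit(u):
        color[u] = 1
        for w in adj.get(u, []):
            if color.get(w) == 1:      # back edge: w is on the current path
                return True
            if color.get(w, 0) == 0 and visit(w):
                return True
        color[u] = 2
        return False

    return any(color.get(u, 0) == 0 and visit(u) for u in adj)

print(has_cycle({'kriyA': ['kartA'], 'kartA': ['kriyA']}))                # → True
print(has_cycle({'kriyA': ['kartA', 'karma'], 'kartA': [], 'karma': []}))  # → False
```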
-
-
class sanskrit_parser.parser.datastructures.VakyaParse(nodepair)¶
Bases: object
-
activate_and_extinguish_alternatives(node)¶
Make a node active and extinguish other nodes in its partition.
-
-
sanskrit_parser.parser.datastructures.getSLP1Tagset(n)¶
Given a (base, tagset) pair, extract the tagset.
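A standalone sketch of what this helper does, illustrative only, using a (base, tagset) pair of the shape returned by getMorphologicalTags:

```python
def get_slp1_tagset(entry):
    # entry is a (base, tagset) pair, e.g. ('hari#1', {'na', 'mas', 'sg', 'gen'})
    base, tagset = entry
    return tagset

print(get_slp1_tagset(('hari#1', {'na', 'mas', 'sg', 'gen'})) == {'na', 'mas', 'sg', 'gen'})  # → True
```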
Usage¶
Use the LexicalSandhiAnalyzer to split a sentence (wrapped in a SanskritObject) and retrieve the top 10 splits:
>>> from __future__ import print_function
>>> from sanskrit_parser.parser.sandhi_analyzer import LexicalSandhiAnalyzer
>>> from sanskrit_parser.base.sanskrit_base import SanskritObject, SLP1
>>> sentence = SanskritObject("astyuttarasyAMdishidevatAtmA")
>>> analyzer = LexicalSandhiAnalyzer()
>>> splits = analyzer.getSandhiSplits(sentence).findAllPaths(10)
>>> for split in splits:
... print(split)
...
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA']
[u'asti', u'uttara', u'syAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'at', u'mA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'vatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA']
[u'asti', u'uttas', u'rasyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'avatA', u'AtmA']
The analyzer can also be used to look up the morphological tags for a given word form (note that the database stores words ending in visarga with a final 's'):
>>> word = SanskritObject('hares')
>>> tags = analyzer.getMorphologicalTags(word)
>>> for tag in tags:
... print(tag)
...
('hf#1', set(['cj', 'snd', 'prim', 'para', 'md', 'sys', 'prs', 'v', 'np', 'sg', 'op']))
('hari#1', set(['na', 'mas', 'sg', 'gen']))
('hari#1', set(['na', 'mas', 'abl', 'sg']))
('hari#1', set(['na', 'fem', 'sg', 'gen']))
('hari#1', set(['na', 'fem', 'abl', 'sg']))
('hari#2', set(['na', 'mas', 'sg', 'gen']))
('hari#2', set(['na', 'mas', 'abl', 'sg']))
('hari#2', set(['na', 'fem', 'sg', 'gen']))
('hari#2', set(['na', 'fem', 'abl', 'sg']))
-
class
sanskrit_parser.parser.sandhi_analyzer.
LexicalSandhiAnalyzer
(lexical_lookup='combined')[source]¶ Bases:
object
Singleton class to hold methods for Sanskrit lexical sandhi analysis.
We define lexical sandhi analysis to be the process of taking an input sequence and transforming it to a collection (represented by a DAG) of potential sandhi splits of the sequence. Each member of a split is guaranteed to be a valid lexical form.
-
getMorphologicalTags(obj, tmap=True)¶
Get morphological tags for a word.
- Params:
obj (SanskritString): word
tmap (Boolean, default True): if True, map tags to our format
- Returns:
list: list of (base, tagset) pairs
-
getSandhiSplits(o, tag=False, pre_segmented=False)¶
Get all valid sandhi splits for a string.
- Params:
o (SanskritString): input object
tag (Boolean, default False): when True, return a morphologically tagged graph
- Returns:
SandhiGraph: DAG of all possible splits
-
hasTag(obj, name, tagset)¶
Check if a word matches morphological tags.
- Params:
obj (SanskritString): word
name (str): name in tag
tagset (set): set of tag elements
- Returns:
list: list of (base, tagset) pairs for obj that match (name, tagset), or None
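The matching contract described above can be sketched standalone. Note this is illustrative only: the real hasTag takes the word object and performs the lookup itself, while this sketch only shows the filtering over (base, tagset) pairs, using the hari data shown earlier:

```python
def has_tag(tag_list, name, tagset):
    # Return the (base, tagset) pairs whose base equals name and whose
    # tags are a superset of tagset; None if nothing matches
    matches = [(b, t) for (b, t) in tag_list if b == name and tagset <= t]
    return matches if matches else None

tags = [
    ('hari#1', {'na', 'mas', 'sg', 'gen'}),
    ('hari#1', {'na', 'fem', 'abl', 'sg'}),
    ('hari#2', {'na', 'mas', 'sg', 'gen'}),
]
print(has_tag(tags, 'hari#1', {'gen', 'sg'}))
```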
-
preSegmented(sl, tag=False)¶
Get a SandhiGraph for a pre-segmented sentence.
- Params:
sl (list of SanskritString): input object
tag (Boolean, default False): when True, return a morphologically tagged graph
- Returns:
SandhiGraph: DAG of all possible splits
-
sandhi = <sanskrit_parser.parser.sandhi.Sandhi object>¶
-
Intro¶
Sandhi splitter for Samskrit. Builds up a database of sandhi rules and utilizes them for both performing sandhi and splitting words.
It will generate splits that may not all be valid words; that validation is left to the calling module. See, for example, SanskritLexicalAnalyzer.
- Example usage:
from sandhi import Sandhi
sandhi = Sandhi()
joins = sandhi.join('tasmin', 'iti')
splits = sandhi.split_at('tasminniti', 5)
Draws inspiration from https://github.com/sanskrit/sanskrit
@author: Avinash Varna (github: @avinashvarna)
Usage¶
The Sandhi class can be used to join/split words:
>>> from sanskrit_parser.parser.sandhi import Sandhi
>>> from sanskrit_parser.base.sanskrit_base import SanskritImmutableString
>>> sandhi = Sandhi()
>>> word1 = SanskritImmutableString('te')
>>> word2 = SanskritImmutableString('eva')
>>> joins = sandhi.join(word1, word2)
>>> for join in joins:
... print(join)
...
teeva
taeva
ta eva
tayeva
To split at a specific position, use the Sandhi.split_at() method:
>>> w = SanskritImmutableString('taeva')
>>> splits = sandhi.split_at(w, 1)
>>> for split in splits:
... print(split)
...
(u'tar', u'eva')
(u'tas', u'eva')
(u'taH', u'eva')
(u'ta', u'eva')
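The rule-reversal idea behind split_at can be illustrated with a toy rule table. The table and function below are illustrative only, not the library's actual rules or implementation; they just show how one surface character at a join point can map back to several possible original endings:

```python
# Toy rule table: what a surface 'a' at a join point may have come from
TOY_RULES = {'a': ['a', 'as', 'ar', 'aH']}

def toy_split_at(word, idx):
    # Try reversing sandhi at index idx using the toy rules above
    left, right = word[:idx + 1], word[idx + 1:]
    endings = TOY_RULES.get(left[-1], [left[-1]])
    return {(left[:-1] + e, right) for e in endings}

print(sorted(toy_split_at('taeva', 1)))
# → [('ta', 'eva'), ('taH', 'eva'), ('tar', 'eva'), ('tas', 'eva')]
```

Compare with the real split_at output above: the actual rule database is much richer, but the shape of the result (a set of candidate left/right pairs) is the same.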
To split at all possible locations, use the Sandhi.split_all() method:
>>> splits_all = sandhi.split_all(w)
>>> for split in splits_all:
... print(split)
...
(u't', u'aeva')
(u'tar', u'eva')
(u'taev', u'a')
(u'to', u'eva')
(u'ta', u'eva')
(u'te', u'eva')
(u'taH', u'eva')
(u'tae', u'va')
(u'taeva', u'')
(u'tas', u'eva')
Note: As mentioned previously, both over-generation and under-generation are possible with the Sandhi class.
Command line usage¶
$ python -m sanskrit_parser.parser.sandhi --join te eva
Joining te eva
set([u'teeva', u'taeva', u'ta eva', u'tayeva'])
$ python -m sanskrit_parser.parser.sandhi --split taeva 1
Splitting taeva at 1
set([(u'tar', u'eva'), (u'tas', u'eva'), (u'taH', u'eva'), (u'ta', u'eva')])
$ python -m sanskrit_parser.parser.sandhi --split taeva --all
All possible splits for taeva
set([(u't', u'aeva'), (u'tar', u'eva'), (u'taev', u'a'), (u'to', u'eva'),
(u'ta', u'eva'), (u'te', u'eva'), (u'taH', u'eva'), (u'tae', u'va'),
(u'taeva', u''), (u'tas', u'eva')])
-
class sanskrit_parser.parser.sandhi.Sandhi(rules_dir=None, use_default_rules=True, logger=None)¶
Bases: object
Class to hold all the sandhi rules and methods for joining and splitting. Uses SLP1 encoding for all internal operations.
-
join(first_in, second_in)¶
Performs sandhi. Warning: may generate forms that are not lexically valid.
- Parameters
first_in – SanskritImmutableString, first word of the sandhi
second_in – SanskritImmutableString, second word of the sandhi
- Returns
list of strings of possible sandhi forms, or None if no sandhi can be performed
-
split_all(word_in, start=None, stop=None)¶
Split word at all possible locations and return splits. Warning: will generate splits that are not lexically valid.
- Parameters
word_in – SanskritImmutableString, word to split
- Returns
set of tuple of strings of possible split forms, or None if no split can be performed
-
split_at(word_in, idx)¶
Split sandhi at the given index of word. Warning: will generate splits that are not lexically valid.
- Parameters
word_in – SanskritImmutableString, word to split
idx – position within word at which to try the split
- Returns
set of tuple of strings of possible split forms, or None if no split can be performed
-