Code

Parser

sanskrit_parser is a Python library that helps parse Sanskrit input.

It provides three main levels of output, in order of increasing complexity:
  1. tags - Morphological analysis of a word

  2. sandhi - Sandhi Split of a phrase

  3. vakya - Morpho-syntactic Analysis of a sentence (after Sandhi split)

Code resides at: https://github.com/kmadathil/sanskrit_parser/

Please report any issues at: https://github.com/kmadathil/sanskrit_parser/issues

Cmd Line

@author: Karthik Madathil (github: @kmadathil)

sanskrit_parser.cmd_line.cmd_line()[source]

Command Line Wrapper Function

sanskrit_parser.cmd_line.getSandhiArgs(argv=None)[source]

Argparse routine. Returns args variable

sanskrit_parser.cmd_line.getTagsArgs(argv=None)[source]

Argparse routine. Returns args variable

sanskrit_parser.cmd_line.getVakyaArgs(argv=None)[source]

Argparse routine. Returns args variable

sanskrit_parser.cmd_line.sandhi(argv=None)[source]
sanskrit_parser.cmd_line.tags(argv=None)[source]
sanskrit_parser.cmd_line.vakya(argv=None)[source]

API

Code Usage

The Parser class can be used to generate vakya parses thus:

from sanskrit_parser import Parser
string = "astyuttarasyAMdiSi"
input_encoding = "SLP1"
output_encoding = "SLP1"
parser = Parser(input_encoding=input_encoding,
                output_encoding=output_encoding,
                replace_ending_visarga='s')
print('Splits:')
for split in parser.split(string, limit=10):
    print(f'Lexical Split: {split}')
    for i, parse in enumerate(split.parse(limit=3)):
        print(f'Parse {i}')
        print(f'{parse}')
    break

This produces the output:

Lexical Split: ['asti', 'uttarasyAm', 'diSi']
Parse 0
asti => (asti, ['samAsapUrvapadanAmapadam', 'strIliNgam']) : samasta of uttarasyAm
uttarasyAm => (uttara#1, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam'])
diSi => (diS, ['saptamIviBaktiH', 'ekavacanam', 'strIliNgam']) : viSezaRa of uttarasyAm
Parse 1
asti => (asti, ['samAsapUrvapadanAmapadam', 'strIliNgam']) : samasta of uttarasyAm
uttarasyAm => (uttara#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam']) : viSezaRa of diSi
diSi => (diS#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam'])
Parse 2
asti => (as#1, ['kartari', 'praTamapuruzaH', 'law', 'parasmEpadam', 'ekavacanam', 'prATamikaH'])
uttarasyAm => (uttara#2, ['saptamIviBaktiH', 'strIliNgam', 'ekavacanam']) : viSezaRa of diSi
diSi => (diS, ['saptamIviBaktiH', 'ekavacanam', 'strIliNgam']) : aDikaraRam of asti
class sanskrit_parser.api.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

default(o)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
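As a concrete illustration, the iterator-handling default above can be wired into json.dumps via the cls argument (a self-contained stdlib sketch; IterEncoder is a name invented for illustration):

```python
import json

class IterEncoder(json.JSONEncoder):
    # Fall back to list() for any iterable the base encoder rejects.
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, o)

# A generator is not JSON-serializable by default, but IterEncoder handles it.
print(json.dumps({'squares': (i * i for i in range(4))}, cls=IterEncoder))
# → {"squares": [0, 1, 4, 9]}
```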
class sanskrit_parser.api.Parse(split: sanskrit_parser.api.Split, parse_graph, cost)[source]

Bases: sanskrit_parser.api.Serializable

serializable()[source]

Return an object that can be serialized by json.JSONEncoder

to_conll()[source]
to_dot()[source]
class sanskrit_parser.api.ParseEdge(predecessor: sanskrit_parser.api.ParseNode, node: sanskrit_parser.api.ParseNode, label: str)[source]

Bases: sanskrit_parser.api.Serializable

label: str
node: sanskrit_parser.api.ParseNode
predecessor: sanskrit_parser.api.ParseNode
serializable()[source]

Return an object that can be serialized by json.JSONEncoder

class sanskrit_parser.api.ParseNode(node: sanskrit_parser.parser.datastructures.VakyaGraphNode, strict_io: bool, encoding: str)[source]

Bases: sanskrit_parser.api.Serializable

serializable()[source]

Return an object that can be serialized by json.JSONEncoder

class sanskrit_parser.api.ParseTag(root: str, tags: Sequence[str])[source]

Bases: sanskrit_parser.api.Serializable

root: str
serializable()[source]

Return an object that can be serialized by json.JSONEncoder

tags: Sequence[str]
class sanskrit_parser.api.Parser(strict_io: bool = False, input_encoding: Optional[str] = None, output_encoding: str = 'SLP1', lexical_lookup: str = 'combined', score: bool = True, split_above: int = 5, replace_ending_visarga: Optional[str] = None, fast_merge: bool = True)[source]

Bases: object

split(input_string: str, limit: int = 10, pre_segmented: bool = False, dot_file=None)[source]
class sanskrit_parser.api.Serializable[source]

Bases: abc.ABC

Base class to indicate an object is serializable into JSON

abstract serializable()[source]

Return an object that can be serialized by json.JSONEncoder
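The Serializable/JSONEncoder pair follows a simple delegation pattern: each object exposes serializable(), and the encoder calls it for anything json cannot handle natively. A minimal self-contained sketch of the same idea (the Encoder and Tag names are invented for illustration, not the library's API):

```python
import json
from abc import ABC, abstractmethod

class Serializable(ABC):
    """Base class to indicate an object is serializable into JSON."""
    @abstractmethod
    def serializable(self):
        """Return an object that json.JSONEncoder can serialize."""

class Encoder(json.JSONEncoder):
    # Delegate unknown objects to their serializable() method.
    def default(self, o):
        if isinstance(o, Serializable):
            return o.serializable()
        return json.JSONEncoder.default(self, o)

class Tag(Serializable):
    """Toy stand-in for ParseTag: a root plus a sequence of tags."""
    def __init__(self, root, tags):
        self.root, self.tags = root, tags
    def serializable(self):
        return {'root': self.root, 'tags': list(self.tags)}

print(json.dumps(Tag('hari', ['na', 'mas']), cls=Encoder))
# → {"root": "hari", "tags": ["na", "mas"]}
```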

class sanskrit_parser.api.Split(parser: sanskrit_parser.api.Parser, input_string: str, split: Sequence[sanskrit_parser.base.sanskrit_base.SanskritObject], vgraph: sanskrit_parser.parser.datastructures.VakyaGraph = None)[source]

Bases: sanskrit_parser.api.Serializable

input_string: str
parse(limit=10, min_cost_only=False)[source]
parser: sanskrit_parser.api.Parser
serializable()[source]

Return an object that can be serialized by json.JSONEncoder

split: Sequence[sanskrit_parser.base.sanskrit_base.SanskritObject]
vgraph: sanskrit_parser.parser.datastructures.VakyaGraph = None
write_dot(basepath)[source]

Parser

Sanskrit Parser Data Structures

@author: Karthik Madathil (github: @kmadathil)

class sanskrit_parser.parser.datastructures.SandhiGraph[source]

Bases: object

DAG class to hold Sandhi Lexical Analysis Results

Represents the results of lexical sandhi analysis as a DAG. Nodes are SanskritObjects.

add_end_edge(node)[source]

Add an edge from node to end

add_node(node)[source]

Extend dag with node inserted at root

Params:

node (SanskritObject): Node to add
root (Boolean): Make this a root node
end (Boolean): Add an edge to end

add_roots(roots)[source]
append_to_node(t, nodes)[source]

Create edges from t to nodes

Params:

t (SanskritObject): Node to append to
nodes (iterator of nodes): Nodes to append to t

draw(*args, **kwargs)[source]
end = '__end__'
find_all_paths(max_paths=10, sort=True, score=True)[source]

Find all paths through DAG to End

Params:
max_paths (int, default 10): Number of paths to find. If this is > 1000, all paths will be found.
sort (bool): If True (default), sort paths in ascending order of length.
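Conceptually, find_all_paths enumerates source-to-sink paths of the split DAG. A plain-Python sketch of that idea, using node names from the examples in this document (illustrative only, not the library's implementation):

```python
# Enumerate up to max_paths paths from start to end in a DAG, shortest first,
# a simplified sketch of what SandhiGraph.find_all_paths does.
def all_paths(graph, start, end, max_paths=10):
    paths, stack = [], [(start, [start])]
    while stack and len(paths) < max_paths:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, path + [nxt]))
    return sorted(paths, key=len)   # ascending order of length, as sort=True does

g = {'__start__': ['asti'],
     'asti': ['uttarasyAm', 'uttara'],
     'uttara': ['syAm'], 'syAm': ['diSi'],
     'uttarasyAm': ['diSi'],
     'diSi': ['__end__']}
for p in all_paths(g, '__start__', '__end__'):
    print(p[1:-1])   # drop the artificial start/end markers
# → ['asti', 'uttarasyAm', 'diSi']
# → ['asti', 'uttara', 'syAm', 'diSi']
```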

has_node(t)[source]

Does a given node exist in the graph?

Params:

t (SanskritObject): Node

Returns:

boolean

lock_start()[source]

Make the graph ready for search by adding a start node

Add a start node, add arcs to all current root nodes, and clear self.roots

score_graph()[source]
start = '__start__'
write_dot(path)[source]
class sanskrit_parser.parser.datastructures.VakyaGraph(path, max_parse_dc=4, fast_merge=True)[source]

Bases: object

DAG class for Sanskrit Vakya Analysis

Represents Sanskrit Vakya Analysis as a DAG. Nodes are SanskritObjects with morphological tags; edges are potential relationships between them.

add_avyayas(bases)[source]

Add Avyaya Links

add_bhavalakshana(krts, laks)[source]

Add bhavalakshana edges from saptami krts to lakaras

add_conjunctions(bases)[source]

Add samuccita links for conjunctions/disjunctions

add_edges()[source]
add_karakas(bases)[source]

Add karaka edges from base node (dhatu) base

add_kriya_kriya(lakaras, krts)[source]

Add kriya-kriya edges from lakaras to krts

add_kriyavisheshana(bases)[source]

Add kriyaviSezaRa edges from base node (dhatu) base

add_node(node)[source]

Extend dag with node inserted at root

Params:

Node (VakyaGraphNode) : Node to add

add_non_karaka_vibhaktis()[source]

Add Non-Karaka Vibhaktis

add_samastas()[source]

Add samasta links from next samasta/tiN

add_sentence_conjunctions(laks, krts)[source]

Add sentence conjunction links

For all nodes which match sentence_conjunction keys: add a vAkyasambanDaH link between each y-t pair (if the vibhakti matches, where relevant); reverse all edges to the node and prefix sambadDa- to the link label (e.g. sambadDa-karma), if the node is not vIpsA; if the node is saMyojakaH and not vIpsA, add saMbadDakriyA links to verbs; if the associated t* doesn't exist, add vAkyasambanDaH links from verbs.

add_shashthi()[source]

Add zazWI-sambanDa links to next tiN

add_vipsa()[source]
add_visheshana()[source]
check_parse_validity()[source]

Validity Check for parses

Remove parses with double kArakas. Remove parses with multiple incoming edges into a node. Remove parses with cycles.
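The tree-shape checks can be sketched in plain Python (is_valid_parse is a hypothetical helper, not the library's code): in a valid parse each node has at most one incoming edge, there is a single root, and following predecessor links never revisits a node.

```python
def is_valid_parse(nodes, edges):
    incoming = {}
    for u, v in edges:
        if v in incoming:      # multiple incoming edges into a node
            return False
        incoming[v] = u
    roots = [n for n in nodes if n not in incoming]
    if len(roots) != 1:        # no root (every node in a cycle) or a forest
        return False
    for n in nodes:            # cycle check: walk up toward the root
        seen = set()
        while n in incoming:
            if n in seen:
                return False
            seen.add(n)
            n = incoming[n]
    return True

print(is_valid_parse(['asti', 'uttarasyAm', 'diSi'],
                     [('asti', 'diSi'), ('diSi', 'uttarasyAm')]))  # → True
print(is_valid_parse(['a', 'b'], [('a', 'b'), ('b', 'a')]))        # → False
```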

draw(*args, **kwargs)[source]
find_krtverbs()[source]

Find non-ti~Nanta verbs

find_lakaras()[source]

Find the ti~Nanta

get_dot_dict()[source]
get_parses_dc()[source]

Returns all parses

Uses modified Kruskal Algorithm to compute (generalized) spanning tree of k-partite VakyaGraph
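The Kruskal step can be illustrated on an ordinary weighted graph (a plain union-find sketch; the library's variant is generalized to k-partite graphs and enumerates multiple candidate trees):

```python
# Kruskal's algorithm: take edges in ascending cost order, keeping an edge
# only if it connects two previously unconnected components.
def kruskal(n, edges):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = []
    for cost, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((cost, u, v))
    return tree

print(kruskal(4, [(1, 0, 1), (4, 1, 2), (2, 0, 2), (3, 1, 3)]))
# → [(1, 0, 1), (2, 0, 2), (3, 1, 3)]
```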

lock()[source]
write_dot(path)[source]
class sanskrit_parser.parser.datastructures.VakyaParse(nodepair)[source]

Bases: object

activate_and_extinguish_alternatives(node)[source]

Make node active, extinguish other nodes in its partition

can_merge(other, length)[source]

Can we merge two VakyaParses

copy()[source]

Return a one-level-deep copy, in between a shallow and a fully deep copy

extend(pred, node)[source]

Extend current parse with edge from pred to node

is_extinguished(node)[source]

Is a node extinguished

is_safe(pred, node)[source]

Checks if a partial parse is compatible with a given node and predecessor pair

merge_f(other)[source]

Merge two VakyaParses: Fast method

merge_s(other, length)[source]

Merge two VakyaParses: Slow method

sanskrit_parser.parser.datastructures.getSLP1Tagset(n)[source]

Given a (base, tagset) pair, extract the tagset

Intro

Sandhi Analyzer for Sanskrit words

@author: Karthik Madathil (github: @kmadathil)

Usage

Use the LexicalSandhiAnalyzer to split a sentence (wrapped in a SanskritObject) and retrieve the top 10 splits:

>>> from __future__ import print_function
>>> from sanskrit_parser.parser.sandhi_analyzer import LexicalSandhiAnalyzer
>>> from sanskrit_parser.base.sanskrit_base import SanskritObject, SLP1
>>> sentence = SanskritObject("astyuttarasyAMdishidevatAtmA")
>>> analyzer = LexicalSandhiAnalyzer()
>>> splits = analyzer.getSandhiSplits(sentence).findAllPaths(10)
>>> for split in splits:
...    print(split)
...
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA']
[u'asti', u'uttara', u'syAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'at', u'mA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'vatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA']
[u'asti', u'uttas', u'rasyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'avatA', u'AtmA']

The sandhi_analyzer can also be used to look up the tags for a given word form: (Note that the database stores words ending in visarga with an ‘s’ at the end)

>>> word = SanskritObject('hares')
>>> tags = analyzer.getMorphologicalTags(word)
>>> for tag in tags:
...    print(tag)
...
('hf#1', set(['cj', 'snd', 'prim', 'para', 'md', 'sys', 'prs', 'v', 'np', 'sg', 'op']))
('hari#1', set(['na', 'mas', 'sg', 'gen']))
('hari#1', set(['na', 'mas', 'abl', 'sg']))
('hari#1', set(['na', 'fem', 'sg', 'gen']))
('hari#1', set(['na', 'fem', 'abl', 'sg']))
('hari#2', set(['na', 'mas', 'sg', 'gen']))
('hari#2', set(['na', 'mas', 'abl', 'sg']))
('hari#2', set(['na', 'fem', 'sg', 'gen']))
('hari#2', set(['na', 'fem', 'abl', 'sg']))
class sanskrit_parser.parser.sandhi_analyzer.LexicalSandhiAnalyzer(lexical_lookup='combined')[source]

Bases: object

Singleton class to hold methods for Sanskrit lexical sandhi analysis.

We define lexical sandhi analysis to be the process of taking an input sequence and transforming it to a collection (represented by a DAG) of potential sandhi splits of the sequence. Each member of a split is guaranteed to be a valid lexical form.

getMorphologicalTags(obj, tmap=True)[source]

Get Morphological tags for a word

Params:

obj (SanskritString): word
tmap (Boolean, default True): If True, map tags to our format

Returns:

list: List of (base, tagset) pairs

getSandhiSplits(o, tag=False, pre_segmented=False)[source]

Get all valid Sandhi splits for a string

Params:

o (SanskritString): Input object
tag (Boolean, default False): When True, return a morphologically tagged graph

Returns:

SandhiGraph: DAG of all possible splits

hasTag(obj, name, tagset)[source]

Check if a word matches morphological tags

Params:

obj (SanskritString): word
name (str): name in tag
tagset (set): set of tag elements

Returns:

list: List of (base, tagset) pairs for obj that match (name, tagset), or None
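A toy sketch of the assumed matching semantics (an entry matches when its base equals name and the query tagset is a subset of its tags; has_tag here is a hypothetical stand-in, not the library method):

```python
def has_tag(entries, name, tagset):
    # entries: list of (base, tagset) pairs, as returned by getMorphologicalTags
    matches = [(b, t) for b, t in entries if b == name and tagset <= t]
    return matches or None

entries = [('hari#1', {'na', 'mas', 'sg', 'gen'}),
           ('hari#1', {'na', 'fem', 'abl', 'sg'})]
print(has_tag(entries, 'hari#1', {'gen', 'mas'}))   # only the first entry matches
print(has_tag(entries, 'hari#1', {'loc'}))          # → None
```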

preSegmented(sl, tag=False)[source]

Get a SandhiGraph for a pre-segmented sentence

Params:

sl (list of SanskritString): Input objects
tag (Boolean, default False): When True, return a morphologically tagged graph

Returns:

SandhiGraph: DAG of all possible splits

sandhi = <sanskrit_parser.parser.sandhi.Sandhi object>
tagSandhiGraph(g)[source]

Tag a Sandhi Graph with morphological tags for each node

Params:

g (SandhiGraph) : input lexical sandhi graph

sanskrit_parser.parser.sandhi_analyzer.getArgs(argv=None)[source]

Argparse routine. Returns args variable

sanskrit_parser.parser.sandhi_analyzer.main(argv=None)[source]

Intro

Sandhi splitter for Samskrit. Builds up a database of sandhi rules and utilizes them for both performing sandhi and splitting words.

Will generate splits that may not all be valid words; validation is left to the calling module. See, for example, LexicalSandhiAnalyzer.

Example usage:

from sanskrit_parser.parser.sandhi import Sandhi
sandhi = Sandhi()
joins = sandhi.join('tasmin', 'iti')
splits = sandhi.split_at('tasminniti', 5)

Draws inspiration from https://github.com/sanskrit/sanskrit

@author: Avinash Varna (github: @avinashvarna)

Usage

The Sandhi class can be used to join/split words:

>>> from sanskrit_parser.parser.sandhi import Sandhi
>>> from sanskrit_parser.base.sanskrit_base import SanskritImmutableString
>>> sandhi = Sandhi()
>>> word1 = SanskritImmutableString('te')
>>> word2 = SanskritImmutableString('eva')
>>> joins = sandhi.join(word1, word2)
>>> for join in joins:
...    print(join)
...
teeva
taeva
ta eva
tayeva

To split at a specific position, use the Sandhi.split_at() method:

>>> w = SanskritImmutableString('taeva')
>>> splits = sandhi.split_at(w, 1)
>>> for split in splits:
...    print(split)
...
(u'tar', u'eva')
(u'tas', u'eva')
(u'taH', u'eva')
(u'ta', u'eva')
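The behaviour above can be imitated with a tiny inverse-rule table (toy rules for illustration; the real Sandhi class consults its full rule database):

```python
# Try every inverse sandhi rule for the character at idx, yielding candidate
# (left, right) splits; a toy model of Sandhi.split_at.
def toy_split_at(word, idx, inverse_rules):
    ch = word[idx]
    return {(word[:idx] + left, right + word[idx + 1:])
            for left, right in inverse_rules.get(ch, [(ch, '')])}

# an 'a' at a boundary can arise from plain 'a' or from 'aH'/'as'/'ar'
rules = {'a': [('a', ''), ('aH', ''), ('as', ''), ('ar', '')]}
print(sorted(toy_split_at('taeva', 1, rules)))
# → [('ta', 'eva'), ('taH', 'eva'), ('tar', 'eva'), ('tas', 'eva')]
```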

To split at all possible locations, use the Sandhi.split_all() method:

>>> splits_all = sandhi.split_all(w)
>>> for split in splits_all:
...    print(split)
...
(u't', u'aeva')
(u'tar', u'eva')
(u'taev', u'a')
(u'to', u'eva')
(u'ta', u'eva')
(u'te', u'eva')
(u'taH', u'eva')
(u'tae', u'va')
(u'taeva', u'')
(u'tas', u'eva')

Note: As mentioned previously, both over-generation and under-generation are possible with the Sandhi class.

Command line usage

$ python -m sanskrit_parser.parser.sandhi --join te eva
Joining te eva
set([u'teeva', u'taeva', u'ta eva', u'tayeva'])

$ python -m sanskrit_parser.parser.sandhi --split taeva 1
Splitting taeva at 1
set([(u'tar', u'eva'), (u'tas', u'eva'), (u'taH', u'eva'), (u'ta', u'eva')])

$ python -m sanskrit_parser.parser.sandhi --split taeva --all
All possible splits for taeva
set([(u't', u'aeva'), (u'tar', u'eva'), (u'taev', u'a'), (u'to', u'eva'),
(u'ta', u'eva'), (u'te', u'eva'), (u'taH', u'eva'), (u'tae', u'va'),
(u'taeva', u''), (u'tas', u'eva')])
class sanskrit_parser.parser.sandhi.Sandhi(rules_dir=None, use_default_rules=True, logger=None)[source]

Bases: object

Class to hold all the sandhi rules and methods for joining and splitting. Uses SLP1 encoding for all internal operations.

join(first_in, second_in)[source]

Performs sandhi. Warning: May generate forms that are not lexically valid.

Parameters
  • first_in – SanskritImmutableString first word of the sandhi

  • second_in – SanskritImmutableString second word of the sandhi

Returns

list of strings of possible sandhi forms, or None if no sandhi can be performed

split_all(word_in, start=None, stop=None)[source]

Split word at all possible locations and return splits. Warning: Will generate splits that are not lexically valid.

Parameters

word_in – SanskritImmutableString word to split

Returns

set of tuple of strings of possible split forms, or None if no split can be performed

split_at(word_in, idx)[source]

Split sandhi at the given index of word. Warning: Will generate splits that are not lexically valid.

Parameters
  • word_in – SanskritImmutableString word to split

  • idx – position within word at which to try the split

Returns

set of tuple of strings of possible split forms, or None if no split can be performed
