formats Package

gate Module

class ternip.formats.gate.GateDocument(file)[source]

Bases: object

A class to facilitate communication with GATE

get_dct_sents()[source]

Returns the creation time sents for this document.

get_sents()[source]

Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.

reconcile(sents)[source]

Update this document with the newly annotated tokens.

reconcile_dct(dct)[source]

Adds a TIMEX to the DCT tag and returns the DCT.
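
For reference, the [[(word, pos, timexes), ...], ...] structure used throughout this package is plain Python data: a list of sentences, each a list of (word, POS tag, timexes) triples. A minimal sketch, with the timex collections left empty (whether they are sets or lists, and how timex objects are constructed, is defined elsewhere in ternip):

    # Sketch of the internal representation returned by get_sents() and
    # consumed by reconcile(). The empty set() placeholders stand in for the
    # timex objects a recogniser would attach (set vs. list is an assumption).
    sents = [
        [('The', 'DT', set()),
         ('meeting', 'NN', set()),
         ('is', 'VBZ', set()),
         ('tomorrow', 'NN', set()),
         ('.', '.', set())],
    ]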

tempeval2 Module

class ternip.formats.tempeval2.TempEval2Document(file, docid='', dct='XXXXXXXX')[source]

Bases: object

A class which handles the stand-off format of TempEval-2.

static create(sents, docid='')[source]

Creates a TempEval-2 document from the internal representation

sents is the [[(word, pos, timexes), ...], ...] format.

get_attrs()[source]

Outputs this document's timex attributes in the format suitable for timex-attributes.tab.

get_dct_sents()[source]

Returns the creation time sents for this document.

get_extents()[source]

Outputs this document's timex extents in the format suitable for timex-extents.tab.

get_sents()[source]

Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.

static load_multi(file, dct_file)[source]

Load multiple documents from a single base-segmentation.tab

reconcile(sents)[source]

Update this document with the newly annotated tokens.

reconcile_dct(dct)[source]

Adds a TIMEX to the DCT tag and returns the DCT.
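
A hedged sketch of building a TempEval-2 document from the internal representation and retrieving the stand-off output; the docid and sentence content are illustrative:

    from ternip.formats.tempeval2 import TempEval2Document

    # The internal [[(word, pos, timexes), ...], ...] structure sketched
    # earlier under the gate module.
    sents = [[('The', 'DT', set()), ('meeting', 'NN', set()),
              ('is', 'VBZ', set()), ('tomorrow', 'NN', set()), ('.', '.', set())]]

    doc = TempEval2Document.create(sents, docid='doc-001')

    # The stand-off .tab output. The docstrings say "print out", so depending
    # on the implementation these may return the contents or write to stdout.
    extents = doc.get_extents()   # timex-extents.tab format
    attrs = doc.get_attrs()       # timex-attributes.tab format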

tern Module

class ternip.formats.tern.TernDocument(file, nodename='TEXT', has_S=False, has_LEX=False, pos_attr=False)[source]

Bases: ternip.formats.timex2.Timex2XmlDocument

A class which can handle TERN documents

static create(sents, docid, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False, dct='')[source]

Creates a TERN document from the internal representation

sents is the [[(word, pos, timexes), ...], ...] format.

tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It’s in the format of a list of lists of integers, where each integer is the offset from the start of the sentence of that token. If set to None (the default), then a single space is assumed between all tokens.

If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the value of add_S used as the tag name.

add_LEX is similar, but for token boundaries.

pos_attr is similar, but refers to the name of the attribute on the LEX (or whatever) tag that holds the POS tag.

dct is the document creation time string

get_dct_sents()[source]

Returns the creation time sents for this document.

reconcile_dct(dct, add_S=False, add_LEX=False, pos_attr=False)[source]

Adds a TIMEX to the DCT tag and returns the DCT.
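
A hedged sketch of TernDocument.create() with sentence and token tags added; the tag names, attribute name, docid and dct value are illustrative assumptions, not fixed by the API:

    from ternip.formats.tern import TernDocument

    sents = [[('The', 'DT', set()), ('meeting', 'NN', set()),
              ('is', 'VBZ', set()), ('tomorrow', 'NN', set()), ('.', '.', set())]]

    # add_S/add_LEX give the tag names used for sentence/token boundaries;
    # pos_attr names the attribute carrying the POS tag. The names 's', 'lex'
    # and 'pos' are illustrative choices, as is the YYYYMMDD-style dct string.
    doc = TernDocument.create(sents, 'doc-001',
                              tok_offsets=None,   # None: single space between tokens
                              add_S='s', add_LEX='lex', pos_attr='pos',
                              dct='19980219')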

timeml Module

class ternip.formats.timeml.TimeMlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]

Bases: ternip.formats.timex3.Timex3XmlDocument

A class which holds a TimeML representation of a document.

Suitable for use with the AQUAINT dataset.

static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)[source]

Creates a TimeML document from the internal representation

sents is the [[(word, pos, timexes), ...], ...] format.

tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It’s in the format of a list of lists of integers, where each integer is the offset from the start of the sentence of that token. If set to None (the default), then a single space is assumed between all tokens.

If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the value of add_S used as the tag name.

add_LEX is similar, but for token boundaries.

pos_attr is similar, but refers to the name of the attribute on the LEX (or whatever) tag that holds the POS tag.
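
A hedged sketch showing the shape of tok_offsets, which mirrors the sentence structure with one list of integer offsets per sentence (offsets hand-computed for this example sentence):

    from ternip.formats.timeml import TimeMlDocument

    sents = [[('The', 'DT', set()), ('meeting', 'NN', set()),
              ('is', 'VBZ', set()), ('tomorrow', 'NN', set()), ('.', '.', set())]]

    # Token start offsets for "The meeting is tomorrow." counted from the
    # start of the sentence; note the full stop at 23 with no space before it.
    tok_offsets = [[0, 4, 12, 15, 23]]

    doc = TimeMlDocument.create(sents, tok_offsets=tok_offsets)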

timex2 Module

class ternip.formats.timex2.Timex2XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]

Bases: ternip.formats.xml_doc.XmlDocument

A class which takes any random XML document and adds TIMEX2 tags to it.

timex3 Module

class ternip.formats.timex3.Timex3XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]

Bases: ternip.formats.xml_doc.XmlDocument

A class which takes any random XML document and adds TIMEX3 tags to it.

Suitable for use with TimeBank, which contains many superfluous tags that aren't in the TimeML spec, even though it claims to be TimeML.

xml_doc Module

exception ternip.formats.xml_doc.BadNodeNameError[source]

Bases: exceptions.Exception

exception ternip.formats.xml_doc.NestingError(s)[source]

Bases: exceptions.Exception

exception ternip.formats.xml_doc.TokeniseError(s)[source]

Bases: exceptions.Exception

class ternip.formats.xml_doc.XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]

Bases: object

An abstract base class which all XML document types can inherit from. It implements almost everything apart from the conversion of timex objects to and from timex tags in the XML, which is left to child classes.

static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)[source]

This is an abstract function for building XML documents from the internal representation only. You are not guaranteed to get out of get_sents what you put in here. Sentences and words will be retokenised and retagged unless you explicitly add S and LEX tags and the POS attribute to the document using the optional arguments.

sents is the [[(word, pos, timexes), ...], ...] format.

tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It’s in the format of a list of lists of integers, where each integer is the offset from the start of the sentence of that token. If set to None (the default), then a single space is assumed between all tokens.

If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the value of add_S used as the tag name.

add_LEX is similar, but for token boundaries.

pos_attr is similar, but refers to the name of the attribute on the LEX (or whatever) tag that holds the POS tag.

get_dct_sents()[source]

Returns the creation time sents for this document.

get_sents()[source]

Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.

If there are any TIMEXes in the input document that cross sentence boundaries (and the input is not already broken up into sentences with the S tag), then those TIMEXes are disregarded.

reconcile(sents, add_S=False, add_LEX=False, pos_attr=False)[source]

Reconciles this document against the new internal representation. If add_S is set to anything other than False, tags are added to indicate sentence boundaries, with the value of add_S used as the tag name. add_LEX is the same, but for marking token boundaries, and pos_attr is the name of the attribute which holds the POS tag for that token. This is mainly useful for transforming TERN documents into something that GUTime can parse.

If your document already contains S and LEX tags, and add_S/add_LEX is set to add them, old S/LEX tags will be stripped first. If pos_attr is set and the attribute name differs from the old POS attribute name on the lex tag, then the old attribute will be removed.

Sentence/token boundaries will not be altered in the final document unless add_S/add_LEX is set. If you have changed the token boundaries in the internal representation from the original form, but are not then adding them back in, reconciliation may give undefined results.

Some inputs cannot be turned into valid XML. For example, if this document has elements which span multiple sentences but do not align with whole sentences, then sentence tags cannot be added without producing invalid XML, and failure will occur in unexpected ways.

If you are adding LEX tags and your XML document contains tags inside tokens, then reconciliation will fail, as it expects each token to be a continuous run of text delimited by whitespace.
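
A hedged sketch of the round trip described above: read a document, take the internal representation, and reconcile it back with sentence and token tags added so that, for example, GUTime can parse the result. The file name and tag names are illustrative, and the recognition step is only indicated in a comment:

    from ternip.formats.tern import TernDocument

    # The constructor's "file" argument is assumed here to be the document
    # contents; the file name is purely illustrative.
    with open('tern_document.sgml') as f:
        doc = TernDocument(f.read())

    sents = doc.get_sents()

    # ... a recogniser/normaliser would attach timex objects to the third
    # element of each (word, pos, timexes) triple in sents ...

    # Write the annotations back, adding sentence and token tags (the names
    # 's', 'lex' and 'pos' are illustrative) so tools such as GUTime can
    # parse the result.
    doc.reconcile(sents, add_S='s', add_LEX='lex', pos_attr='pos')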

reconcile_dct(dct, add_S=False, add_LEX=False, pos_attr=False)[source]

Adds a TIMEX to the DCT tag and returns the DCT.

strip_tag(tagname)[source]

Remove this tag from the document.

strip_timexes()[source]

Strips all timexes from this document. Useful when evaluating the software: the gold standard can be fed in directly and the output compared against it.
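
A hedged sketch of that evaluation use: load a gold-standard document twice, strip the TIMEXes from the working copy, and keep the untouched copy for comparison. The file name is illustrative, and the recognition and comparison steps are only indicated in comments:

    from ternip.formats.timeml import TimeMlDocument

    # Illustrative file name; the constructor's "file" argument is assumed to
    # be the document contents.
    xml = open('gold_standard.tml').read()

    gold = TimeMlDocument(xml)     # untouched gold-standard annotations
    system = TimeMlDocument(xml)   # working copy for the system run
    system.strip_timexes()         # remove the gold TIMEX3 tags

    # ... run recognition over system.get_sents(), reconcile() the results,
    # then compare against gold.get_sents() ...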