formats Package
gate Module
tempeval2 Module
class ternip.formats.tempeval2.TempEval2Document(file, docid='', dct='XXXXXXXX')
Bases: object
A class which uses the stand-off format of TempEval-2.
static create(sents, docid='')
Creates a TempEval-2 document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
get_sents()
Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.
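As a rough illustration of the [[(word, pos, timexes), ...], ...] format, the sketch below builds two sentences by hand. It assumes, which the text above does not state, that timexes is a set of timex objects; a plain string stands in for such an object, and words_with_timexes is a hypothetical helper, not part of TERNIP.

```python
# Hypothetical stand-in: real code would hold TERNIP timex objects here.
timex = "t1"

# Outer list: sentences. Inner list: (word, pos, timexes) tuples, one per token.
sents = [
    [("We", "PRP", set()), ("met", "VBD", set()),
     ("yesterday", "NN", {timex}), (".", ".", set())],
    [("It", "PRP", set()), ("rained", "VBD", set()), (".", ".", set())],
]


def words_with_timexes(sents):
    """Return (sentence_index, word) for every token covered by a timex."""
    return [
        (i, word)
        for i, sent in enumerate(sents)
        for (word, pos, timexes) in sent
        if timexes
    ]
```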
-
static
tern Module
class ternip.formats.tern.TernDocument(file, nodename='TEXT', has_S=False, has_LEX=False, pos_attr=False)
Bases: ternip.formats.timex2.Timex2XmlDocument
A class which can handle TERN documents.
static create(sents, docid, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False, dct='')
Creates a TERN document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of its sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to something other than False, tags marking sentence boundaries are added, with the tag name being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but gives the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
dct is the document creation time string.
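The role of tok_offsets can be sketched as follows. This is not TERNIP's implementation: rebuild_sentence is a hypothetical helper, and it assumes the lost whitespace consisted only of space characters.

```python
def rebuild_sentence(tokens, offsets=None):
    """Rebuild a sentence string from tokens and per-token character offsets.

    With offsets=None, fall back to a single space between all tokens,
    mirroring the documented default.
    """
    if offsets is None:
        return " ".join(tokens)
    text = ""
    for token, offset in zip(tokens, offsets):
        # Pad with spaces until the token's recorded start offset is reached.
        text += " " * (offset - len(text))
        text += token
    return text


tokens = ["Due", "1", "May", ",", "2010", "."]
offsets = [0, 4, 6, 9, 11, 15]

# tok_offsets holds one list of offsets per sentence.
tok_offsets = [offsets]
```

With the offsets supplied, the comma and full stop are reattached to the preceding token; without them, every token is separated by one space.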
-
static
timeml Module
class ternip.formats.timeml.TimeMlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)
Bases: ternip.formats.timex3.Timex3XmlDocument
A class which holds a TimeML representation of a document.
Suitable for use with the AQUAINT dataset.
static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)
Creates a TimeML document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of its sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to something other than False, tags marking sentence boundaries are added, with the tag name being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but gives the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
-
static
timex2 Module
class ternip.formats.timex2.Timex2XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)
Bases: ternip.formats.xml_doc.XmlDocument
A class which takes an arbitrary XML document and adds TIMEX2 tags to it.
timex3 Module
class ternip.formats.timex3.Timex3XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)
Bases: ternip.formats.xml_doc.XmlDocument
A class which takes an arbitrary XML document and adds TIMEX3 tags to it.
Suitable for use with Timebank, which claims to be TimeML but contains many superfluous tags that are not in the TimeML spec.
xml_doc Module
class ternip.formats.xml_doc.XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)
Bases: object
An abstract base class which all XML types can inherit from. It implements almost everything apart from the conversion of timex objects to and from timex tags in the XML, which is done by child classes.
static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)
This is an abstract function for building XML documents from the internal representation only. You are not guaranteed to get out of get_sents what you put in here. Sentences and words will be retokenised and retagged unless you explicitly add S and LEX tags and the POS attribute to the document using the optional arguments.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of its sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to something other than False, tags marking sentence boundaries are added, with the tag name being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but gives the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
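To make the effect of add_S, add_LEX and pos_attr more concrete, the sketch below renders the internal representation as tagged text. It is an illustration only: render_sents is a hypothetical helper, the tag and attribute names "s", "lex" and "pos" are arbitrary choices, timex tags are left out entirely, and the markup TERNIP actually emits may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def render_sents(sents, add_S=False, add_LEX=False, pos_attr=False):
    """Render [[(word, pos, timexes), ...], ...] as text with optional tags.

    add_S and add_LEX are tag names (or False to omit the tags); pos_attr
    is the attribute name for the POS tag (or False to omit it).
    """
    rendered = []
    for sent in sents:
        toks = []
        for word, pos, _timexes in sent:
            tok = escape(word)
            if add_LEX:
                # The POS attribute only makes sense on a token tag.
                attr = " %s=%s" % (pos_attr, quoteattr(pos)) if pos_attr else ""
                tok = "<%s%s>%s</%s>" % (add_LEX, attr, tok, add_LEX)
            toks.append(tok)
        body = " ".join(toks)  # single space between tokens (no tok_offsets)
        if add_S:
            body = "<%s>%s</%s>" % (add_S, body, add_S)
        rendered.append(body)
    return " ".join(rendered)


example = [[("Call", "VB", set()), ("today", "NN", set())]]
tagged = render_sents(example, add_S="s", add_LEX="lex", pos_attr="pos")
plain = render_sents(example)
```

Without the optional arguments the sentence and token boundaries are not recorded in the output, which is why a later get_sents has to retokenise and retag.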
get_sents()
Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.
If there are any TIMEXes in the input document that cross sentence boundaries (and the input is not already broken up into sentences with the S tag), then those TIMEXes are disregarded.
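The rule about TIMEXes that cross sentence boundaries can be pictured with character spans. This is a simplified sketch under the assumption that sentences and TIMEXes are known as (start, end) offsets; TERNIP itself works on the XML tree, and within_one_sentence is a hypothetical helper.

```python
def within_one_sentence(timex_span, sentence_spans):
    """Return True if the timex span lies entirely inside a single sentence."""
    start, end = timex_span
    return any(s <= start and end <= e for s, e in sentence_spans)


# Two sentences covering characters 0-20 and 21-45 (illustrative values).
sentence_spans = [(0, 20), (21, 45)]

# The second timex starts in one sentence and ends in the next.
timex_spans = [(5, 12), (15, 30)]

kept = [t for t in timex_spans if within_one_sentence(t, sentence_spans)]
```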
reconcile(sents, add_S=False, add_LEX=False, pos_attr=False)
Reconciles this document against the new internal representation. If add_S is set to anything other than False, tags are added to indicate the sentence boundaries, with the tag name being the value of add_S. add_LEX is the same, but for marking token boundaries, and pos_attr is the name of the attribute which holds the POS tag for that token. This is mainly useful for transforming TERN documents into something that GUTime can parse.
If your document already contains S and LEX tags, and add_S/add_LEX is set to add them, the old S/LEX tags will be stripped first. If pos_attr is set and the attribute name differs from the old POS attribute name on the LEX tag, the old attribute will be removed.
Sentence/token boundaries will not be altered in the final document unless add_S/add_LEX is set. If you have changed the token boundaries in the internal representation from the original form, but are not then adding them back in, reconciliation may give undefined results.
Some inputs would produce invalid XML. For example, if this document has elements which span multiple sentences but cover only part of each, then sentence tags cannot be added without producing invalid XML, and reconciliation will fail in unexpected ways.
If you are adding LEX tags and your XML document contains tags internal to tokens, reconciliation will fail, as it expects each token to be a continuous piece of text.
-
static