Release Notes and Manual for Version 1.0.
Marc Verhagen, November 2007
The Tarsqi Toolkit (TTK) is a set of components for extracting temporal information from news wire text. TTK extracts time expressions, events, subordination links and temporal links; in addition, it ensures consistency of temporal information. See http://tarsqi.org for more general information on the TARSQI project and for descriptions of TTK and its components. The Tarsqi Toolkit comes bundled with the Tango annotation tool (http://timeml.org/site/tango/) and a graphical user interface.
This manual contains the following sections:
The Tarsqi Toolkit is copyright ©2007 of Brandeis University and is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
The Tempex module is copyright of The MITRE corporation and is distributed under the license in tempex-license.pdf.
The Python wrapper for the TreeTagger (treetaggerwrapper.py) is copyright ©2004 of CNRS and distributed under the GNU-GPL Version 2. It was developed by Laurent Pointal.
The data in data/in/TimeBank are copyrighted by the various content providers and can be used for academic purposes only.
2. Prerequisites
The toolkit requires at least version 2.3 of Python and version 5.8 of Perl. Older Perl versions may work. The toolkit has been tested on the following platforms:
The toolkit also requires the Perl XML::Parser module. This is problematic for those who use OS X because the version of Perl that is bundled with OS X does not contain XML::Parser. You have several options here. One is to download and install XML::Parser from CPAN. Another is to install ActivePerl from ActiveState.
The toolkit graphical user interface requires the wxPython package.
Note to Windows users.
There currently is no neatly packaged Windows version of TTK. However,
most code is written to be cross-platform and the toolkit can be made
to run on Windows (albeit with some effort). We are currently working
on a neat package for Windows that better integrates the
part-of-speech tagger and the MaxEnt classifier. A Windows-friendly
version will be released as soon as possible.
3. Installation
This is a three-step process: (i) unpacking the archive, (ii)
installing the part-of-speech tagger and (iii) setting up TTK for your
platform and environment.
% gunzip -c ttk-1.0.tar.gz | tar xp
This will unpack TTK into a directory named ttk-1.0.
The part-of-speech tagger (the IMS TreeTagger) needs to be installed in ttk-1.0/code/components/preprocessing/treetagger/ and this directory needs to have subdirectories bin and lib.
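The directory layout described above can be checked with a few lines of Python before running the setup script. This is only a minimal sketch: the helper name check_tagger_dir is hypothetical and not part of TTK, and the path is the one given in this manual.

```python
import os


def check_tagger_dir(tagger_dir):
    """Return the required subdirectories that are missing from tagger_dir."""
    required = ["bin", "lib"]  # the two subdirectories named in this manual
    return [name for name in required
            if not os.path.isdir(os.path.join(tagger_dir, name))]


if __name__ == "__main__":
    # Path as given in the manual, relative to where the archive was unpacked.
    path = os.path.join("ttk-1.0", "code", "components",
                        "preprocessing", "treetagger")
    missing = check_tagger_dir(path)
    if missing:
        print("Missing subdirectories in %s: %s" % (path, ", ".join(missing)))
    else:
        print("TreeTagger directory layout looks complete.")
```

An empty result means both subdirectories are present; it does not verify that the tagger itself works.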
Other POS taggers can be used instead of the IMS TreeTagger. The easiest case would be a tagger that uses the same input format as the TreeTagger and creates files in the same output format. In that case, only the tag_fragment method in ttk-1.0/code/components/preprocessor/wrapper.py will need to be edited according to your needs. A future version of the toolkit will make it easier to install other taggers.
TTK is set up for your platform and environment by changing to ttk-1.0/code and running the setup.py script. The toolkit comes bundled with classifier binaries and the setup script installs the correct version. The only two platforms for which classifier binaries are included are Mac OS X and Linux.
The script also makes a guess as to where to find a Perl executable that is sufficient for TTK (that is, it includes an XML parser). In general, it will simply use perl as the Perl command unless it finds an ActivePerl distribution. This default can be overruled by either supplying arguments to the setup script or by editing the file settings.txt. The two most likely ways to use the script are:
% python setup.py platform=linux
% python setup.py platform=osx perl=/usr/local/ActivePerl-5.8/bin/perl
The first example sets up the classifier for Linux, and sets the Perl path to perl (unless specified otherwise in settings.txt). The second installs the classifier for OS X and sets the Perl path to the given value. You could also use the perl switch to point to other non-standard Perl locations. See the documentation in setup.py for more details.
A future version of TTK will add Windows as a supported platform.
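To illustrate the feature=value argument style used by the setup script, here is a minimal sketch of how such arguments could be read. It is not the code in setup.py; the function name parse_settings and the fallback logic are assumptions made for illustration, and only the platform and perl switches shown in the examples above are used.

```python
SUPPORTED_PLATFORMS = ("linux", "osx")  # the two values used in the examples


def parse_settings(args):
    """Turn a list like ['platform=osx', 'perl=/path'] into a dictionary."""
    settings = {}
    for arg in args:
        feature, sep, value = arg.partition("=")
        if not sep or not feature:
            raise ValueError("expected feature=value, got %r" % arg)
        settings[feature] = value
    # Assumption: fall back to the plain 'perl' command when no path is given.
    settings.setdefault("perl", "perl")
    if settings.get("platform") not in SUPPORTED_PLATFORMS:
        raise ValueError("platform must be one of: %s"
                         % ", ".join(SUPPORTED_PLATFORMS))
    return settings


if __name__ == "__main__":
    print(parse_settings(["platform=linux"]))
```

The real script may also consult settings.txt and look for ActivePerl, which this sketch does not attempt.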
To run the toolkit, go to the code directory of the distribution and type one of the following:
python tarsqi.py <input_type> [flags] <infile> <outfile>
python tarsqi.py <input_type> [flags] <indir> <outdir>
The following input types are defined:
simple-xml
An input type that should be used for default XML. It assumes that a document contains a tag named TEXT that wraps the data that need to be parsed and that the data have not yet been processed in any way. The value of the tag that wraps the data can be overridden using the content_tag flag (see below). This input type can be used for the files in code/data/in/simple-xml.
timebank
This input type is very similar to simple-xml; the main difference is that it activates a component that parses document creation times for the various TimeBank formats. This input type can be used for the files in code/data/in/TimeBank, which consists of all TimeBank files with all tags (except some document level tags) stripped out.
rte3
Use this input type when processing pre-processed data from code/data/in/RTE3, which contains data from the Third Pascal Textual Entailment Challenge.
Flags are feature-value pairs where the feature and value are separated by an equals sign. The following flags are defined:
extension=String
Puts a restriction on what files are processed; this is useful when processing an entire directory. The default is the empty string, which matches any extension.
trap_errors=(True|False)
Determines whether errors inside of components are trapped. The default is that errors are not trapped.
content_tag=String
Can be used to overrule the default content tag of the input type.
pipeline=String
Can be used to overrule the default pipeline determined by the data source identifier. A pipeline is a comma-separated string of component names. Allowed component names are PREPROCESSOR, GUTIME, EVITA, SLINKET, S2T, BLINKER, CLASSIFIER and LINK_MERGER. The order of the components in the pipeline specification is significant. Some examples are:
pipeline=PREPROCESSOR,GUTIME,EVITA
pipeline=SLINKET,S2T,BLINKER,CLASSIFIER,LINK_MERGER
The first example instructs TTK to take a file, preprocess it and add time expressions and events. For the second example, preprocessing, times and events are taken for granted and only links are added.
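The command line syntax, input types and flags described above can be combined as in the following sketch, which assembles an argument list and rejects unknown input types or component names before anything is run. The helper build_command and the file names in.xml and out.xml are hypothetical; only the input types, flag names and component names come from this manual.

```python
INPUT_TYPES = ("simple-xml", "timebank", "rte3")
COMPONENTS = ("PREPROCESSOR", "GUTIME", "EVITA", "SLINKET",
              "S2T", "BLINKER", "CLASSIFIER", "LINK_MERGER")


def build_command(input_type, infile, outfile, pipeline=None, **flags):
    """Build the argument list for: python tarsqi.py <input_type> [flags] <in> <out>."""
    if input_type not in INPUT_TYPES:
        raise ValueError("unknown input type: %s" % input_type)
    command = ["python", "tarsqi.py", input_type]
    if pipeline is not None:
        unknown = [name for name in pipeline if name not in COMPONENTS]
        if unknown:
            raise ValueError("unknown components: %s" % ", ".join(unknown))
        # The order of components is significant, so it is kept as given.
        command.append("pipeline=" + ",".join(pipeline))
    for feature, value in sorted(flags.items()):
        command.append("%s=%s" % (feature, value))
    command.extend([infile, outfile])
    return command


if __name__ == "__main__":
    cmd = build_command("simple-xml", "in.xml", "out.xml",
                        pipeline=["PREPROCESSOR", "GUTIME", "EVITA"],
                        trap_errors=True)
    print(" ".join(cmd))
```

The resulting list could be passed to subprocess.call from within the code directory; the sketch itself only builds the command and does not run the toolkit.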
% pythonw gui.py
% python gui.py
Use pythonw when available, python otherwise. In both cases the wxPython package needs to be installed. The GUI has three advantages over using the command line version:
There is no separate manual for the GUI, but usage should be pretty straightforward. Functionality can be summed up as follows:
data/in/User directory, which is then selected as the input file.
pydoc command. Unfortunately, this command crashes on many of the toolkit modules.
To create browsable documentation in ttk-1.0/docs/code you can use the make_documentation.py script:
% cd ttk-1.0
% python make_documentation.py
This creates an index.html with a list of links to all modules. For each module, and each class and function in that module, the documentation strings are printed. There are also links to the source code of each function.
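For readers curious how documentation strings can be gathered per module, the following sketch uses the standard inspect module to collect the docstrings of the classes and functions defined in a module. It is an illustration only and does not reproduce make_documentation.py; the function name collect_docstrings is hypothetical, and the standard json module stands in for a toolkit module.

```python
import inspect


def collect_docstrings(module):
    """Map names of classes and functions defined in module to their docstrings."""
    docs = {}
    for name, obj in inspect.getmembers(module):
        if not (inspect.isclass(obj) or inspect.isfunction(obj)):
            continue
        # Skip objects that were merely imported into the module.
        if getattr(obj, "__module__", None) != module.__name__:
            continue
        docs[name] = inspect.getdoc(obj) or ""
    return docs


if __name__ == "__main__":
    import json  # stand-in for a toolkit module
    for name, doc in sorted(collect_docstrings(json).items()):
        first_line = doc.splitlines()[0] if doc else "(no docstring)"
        print("%s: %s" % (name, first_line))
```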
The Tango annotation tool was developed by Linda van Guilder, Andrew
See, Bob Knippen and Alex Baron.
8. Contact
If you have problems installing the toolkit or if you want to report a
bug, please send an email to ttk-bugs@timeml.org. When reporting a
bug, please tell us what platform you're using (including Perl and
Python versions) and include a file that illustrates the errant
behavior. A database with known issues will be made available on the
web site.
Suggestions, criticisms, disappointments, feature requests and kudos
are also welcome at the above address.
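The platform details requested for bug reports can be gathered with a short script such as the one below. This is a sketch under the assumption that Perl is reachable through the perl command (or a path you supply); the function names are hypothetical and not part of TTK.

```python
import platform
import subprocess


def perl_version(perl="perl"):
    """Return the version reported by the given Perl command, if it can be run."""
    try:
        # $] is Perl's built-in numeric version variable.
        result = subprocess.run([perl, "-e", "print $]"],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return "not found"
    return result.stdout.strip() or "unknown"


def environment_report(perl="perl"):
    """Collect platform, Python version and Perl version for a bug report."""
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "perl": perl_version(perl),
    }


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

The output can be pasted into the email together with the file that illustrates the problem.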
9. Future Work
The next major revision of the TARSQI Toolkit will be numbered 1.1 and will
be released in early 2008. Minor revisions will be released with
version number 1.0.X and will concentrate on bug fixes (as well as on
providing a non-problematic Windows version).
The following major changes to the code base are now in progress or under consideration:
text document type.