The TARSQI Toolkit

Release Notes and Manual for Version 1.0.

Marc Verhagen, November 2007

Introduction

The Tarsqi Toolkit (TTK) is a set of components for extracting temporal information from newswire text. TTK extracts time expressions, events, subordination links and temporal links; in addition, it ensures the consistency of the temporal information. See http://tarsqi.org for more general information on the TARSQI project and for descriptions of TTK and its components. The Tarsqi Toolkit comes bundled with the Tango annotation tool (http://timeml.org/site/tango/) and a graphical user interface.

This manual contains the following sections:

Copyright and License
Prerequisites
Installation
Using the Tarsqi Toolkit
Using the Toolkit GUI
Code Documentation
Contributors
Contact
Future Work

1. Copyright and License

The Tempex module is copyright of The MITRE corporation and is distributed under the license in tempex-license.pdf.

The data in data/in/TimeBank are copyrighted by the various content providers and can be used for academic purposes only.

2. Prerequisites

The toolkit requires at least version 2.3 of Python and version 5.8 of Perl. Older Perl versions may work. The toolkit has been tested on the following platforms:

Red Hat Linux 5, with Python 2.4.3 and Perl 5.8.8
Mac OS X, with Python 2.3.5 and Perl 5.8.8

The Perl installation needs to include the XML::Parser module. This is problematic for those who use OS X because the version of Perl that is bundled with OS X does not contain XML::Parser. You have several options here. One is to download and install XML::Parser from CPAN. Another is to install ActivePerl from ActiveState.
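
The XML::Parser requirement can be checked before installation. The following is a minimal sketch and not part of TTK: the helper name perl_has_module is hypothetical, and the code is written for a current Python 3 interpreter rather than for the Python 2.3 minimum that the toolkit itself targets. It relies only on the standard Perl idiom of loading a module with the -M switch and running an empty program.

```python
import subprocess


def perl_has_module(module, perl="perl"):
    """Return True if the given Perl executable can load `module`.

    Hypothetical helper, not part of TTK. It runs `perl -MModule -e 1`,
    which exits with status 0 only when the module can be loaded.
    """
    try:
        result = subprocess.run(
            [perl, "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The Perl executable itself could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if perl_has_module("XML::Parser"):
        print("XML::Parser is available")
    else:
        print("XML::Parser is missing; install it from CPAN or use ActivePerl")
```

To test a non-default Perl, such as an ActivePerl installation, pass its path as the perl argument.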

The toolkit graphical user interface requires the wxPython package.

Note to Windows users.
There is currently no neatly packaged Windows version of TTK. However, most of the code is written to be cross-platform and the toolkit can be made to run on Windows, albeit with some effort. We are working on a Windows package that improves the integration of the part-of-speech tagger and the MaxEnt classifier. A Windows-friendly version will be released as soon as possible.

3. Installation

This is a three-step process: (i) unpacking the archive, (ii) installing the part-of-speech tagger and (iii) setting up TTK for your platform and environment.

Downloading and Unpacking the Archive

Unpack the archive into the desired directory. For example, on Linux, you would need to type

% gunzip -c ttk-1.0.tar.gz | tar xp

This will unpack TTK into a directory named ttk-1.0.

Installing the Part-of-Speech Tagger

The Tarsqi toolkit is designed to work seamlessly with the IMS TreeTagger. Download the packages needed for your platform from the TreeTagger website and follow the directions. The TreeTagger needs to be installed in

ttk-1.0/code/components/preprocessing/treetagger/

and this directory needs to have the subdirectories bin and lib.
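
The directory layout described above can be verified with a few lines of Python. This is a minimal sketch under stated assumptions: the function name missing_treetagger_dirs is hypothetical and not part of TTK, and the code only checks that the bin and lib directories exist, not that the TreeTagger itself works.

```python
import os


def missing_treetagger_dirs(ttk_root):
    """Return the required TreeTagger subdirectories that do not exist.

    Hypothetical helper, not part of TTK. `ttk_root` is the directory the
    archive was unpacked into (ttk-1.0 in the example above).
    """
    tagger_dir = os.path.join(
        ttk_root, "code", "components", "preprocessing", "treetagger")
    return [
        name for name in ("bin", "lib")
        if not os.path.isdir(os.path.join(tagger_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_treetagger_dirs("ttk-1.0")
    if missing:
        print("TreeTagger is not fully installed; missing:", ", ".join(missing))
    else:
        print("TreeTagger directories found")
```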

Other POS taggers can be used instead of the IMS TreeTagger. The easiest case would be a tagger that uses the same input format as the TreeTagger and creates files in the same output format. In that case, only the tag_fragment method in

ttk-1.0/code/components/preprocessor/wrapper.py

will need to be edited according to your needs. A future version of the toolkit will make it easier to install other taggers.

Platform-specific setup

Platform-specific setup is performed by changing the working directory to ttk-1.0/code and running the setup.py script. The toolkit comes bundled with classifier binaries and the setup script installs the correct version. Classifier binaries are included for two platforms only: Mac OS X and Linux. The script also guesses where to find a Perl executable that is sufficient for TTK (that is, one that includes an XML parser). In general, it will simply use perl as the Perl command unless it finds an ActivePerl distribution. This default can be overridden either by supplying arguments to the setup script or by editing the file settings.txt. The two most likely ways to use the script are:

% python setup.py platform=linux
% python setup.py platform=osx perl=/usr/local/ActivePerl-5.8/bin/perl

The first example sets up the classifier for Linux and sets the Perl path to perl (unless specified otherwise in settings.txt). The second installs the classifier for OS X and sets the Perl path to the given value. You can also use the perl switch to point to other non-standard Perl locations. See the documentation in setup.py for more details.

A future version of TTK will add Windows as a supported platform.

4. Using the TARSQI Toolkit

To run the TARSQI Toolkit, open a terminal, change the working directory to the code directory of the distribution and type

python tarsqi.py <input_type> [flags] <infile> <outfile>
python tarsqi.py <input_type> [flags] <indir> <outdir>

The following input types are defined:

simple-xml
An input type that should be used for default XML. It assumes that a document contains a tag named TEXT that wraps the data that need to be parsed and that the data have not yet been processed in any way. The value of the tag that wraps the data can be overridden using the content_tag flag (see below). This input type can be used for the files in code/data/in/simple-xml.
timebank
This input type is very similar to simple-xml; the main difference is that it activates a component that parses document creation times for the various TimeBank formats. This input type can be used for the files in code/data/in/TimeBank, which consists of all TimeBank files with all tags (except some document level tags) stripped out.
rte3
Use this input type when processing pre-processed data from code/data/in/RTE3, which contains data from the Third Pascal Textual Entailment Challenge.
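
To illustrate what the simple-xml input type expects, here is a minimal sketch that pulls the text wrapped by the content tag out of a document. This is not TTK's own parser: the function name extract_content and the sample document are made up for illustration, and the standard-library xml.etree.ElementTree module is used only for brevity.

```python
import xml.etree.ElementTree as ET


def extract_content(xml_string, content_tag="TEXT"):
    """Return the text wrapped by the first `content_tag` element.

    Hypothetical helper, not TTK code. Returns None when the document
    has no element with that name.
    """
    root = ET.fromstring(xml_string)
    # The content tag may be the root itself or nested somewhere below it.
    element = root if root.tag == content_tag else root.find(".//" + content_tag)
    if element is None:
        return None
    # itertext() also collects text inside any nested elements.
    return "".join(element.itertext()).strip()


if __name__ == "__main__":
    sample = "<DOC><DOCID>example-1</DOCID><TEXT>She left on Monday.</TEXT></DOC>"
    print(extract_content(sample))
    print(extract_content(sample, content_tag="BODY"))
```

The content_tag parameter in this sketch plays the same role as the content_tag flag: it names a different wrapping tag when the document does not use TEXT.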

Flags are feature-value pairs where the feature and value are separated by an equals sign. The following flags are defined:

extension=String
Restricts which files are processed, which is useful when processing an entire directory. The default is the empty string, which matches any extension.
trap_errors=(True|False)
Determines whether errors inside of components are trapped. The default is that errors are not trapped.
content_tag=String
Can be used to override the default content tag of the input type.
pipeline=String
Can be used to override the default pipeline determined by the data source identifier. A pipeline is a comma-separated string of component names. Allowed component names are PREPROCESSOR, GUTIME, EVITA, SLINKET, S2T, BLINKER, CLASSIFIER and LINK_MERGER. The order of the components in the pipeline specification is significant. Some examples are:
pipeline=PREPROCESSOR,GUTIME,EVITA
pipeline=SLINKET,S2T,BLINKER,CLASSIFIER,LINK_MERGER
The first example instructs TTK to take a file, preprocess it and add time expressions and events. In the second example, preprocessing, times and events are assumed to be present already and only links are added.
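
The flag syntax described above can be sketched as follows. This is an illustration only and not the parsing code in tarsqi.py: the function names parse_flags and parse_pipeline are hypothetical, and the error handling shown here is an assumption about what reasonable behavior would be.

```python
ALLOWED_COMPONENTS = (
    "PREPROCESSOR", "GUTIME", "EVITA", "SLINKET",
    "S2T", "BLINKER", "CLASSIFIER", "LINK_MERGER",
)


def parse_flags(args):
    """Split feature=value arguments into a dictionary.

    Hypothetical helper, not TTK code. Arguments without an equals sign
    are rejected; the value may itself contain equals signs.
    """
    flags = {}
    for arg in args:
        feature, sep, value = arg.partition("=")
        if not sep or not feature:
            raise ValueError("not a feature=value pair: %r" % arg)
        flags[feature] = value
    return flags


def parse_pipeline(value):
    """Return the component names in a pipeline string, in order."""
    components = value.split(",")
    unknown = [name for name in components if name not in ALLOWED_COMPONENTS]
    if unknown:
        raise ValueError("unknown components: " + ", ".join(unknown))
    return components


if __name__ == "__main__":
    flags = parse_flags(["trap_errors=True", "pipeline=PREPROCESSOR,GUTIME,EVITA"])
    print(flags["trap_errors"])
    print(parse_pipeline(flags["pipeline"]))
```

A list is used for the parsed pipeline because the order of the components is significant.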

5. Using the Toolkit GUI

The Toolkit GUI can be used by typing one of

% pythonw gui.py
% python gui.py

Use pythonw when available, python otherwise. In both cases the wxPython package needs to be installed. The GUI has three advantages over using the command line version:

It is faster when used on a file-by-file basis because all libraries are loaded either at start-up or when the first file is processed.
It is easier to use.
It allows the user to type some input and see what happens.

The main disadvantage is that it is not possible to process all files in a directory.

[Screen shot of the Toolkit GUI]

There is no separate manual for the GUI, but usage should be pretty straightforward. Functionality can be summed up as follows:

Use "Load File" to select a file to be processed.
Use "Load Text" to type in text. This will create a file in the data/in/User directory, which is then selected as the input file.
Use the "Document Type", "Trap Errors" and "Content Tag" pull-downs and text field to enter processing options. Setting the "Document Type" option will set the "Content Tag" text field to the default for the document type. It will also select the default components for the document type. The default components can be overridden by checking and unchecking the component check boxes. Note that the three pre-processing components are all either on or off.
Use "Process File" to process the input file according to the selected settings.
Finally, "View Results" will pop up a new window that displays the processing results in a more informative way (the large text field in the control panel just prints the result file as is). The buttons in the results view control what is shown in it, with two exceptions: (i) the "Graph" button starts the Tango annotation tool on the results file, showing a graph of the annotation and including a "TBox" button that displays an alternative format, and (ii) the "TBox" button does not do anything yet; in the future it will provide a more direct route to the TBox representation.

6. Code Documentation

Documentation of the code is uneven: some modules are well-documented, others have spotty and underwhelming documentation, and yet others are not documented at all. For some modules, documentation files can be created using the pydoc command. Unfortunately, this command crashes on many of the toolkit modules. To create browsable documentation in ttk-1.0/docs/code you can use the make_documentation.py script:

% cd ttk-1.0
% python make_documentation.py

This creates an index.html with a list of links to all modules. For each module, and each class and function in that module, the documentation strings are printed. There are also links to the source code of each function.
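
The general idea of collecting documentation strings per module can be sketched with the standard-library inspect module. This is not the code in make_documentation.py, whose internals are not described here; the function name collect_docstrings is hypothetical and the sketch stops at gathering the strings, without writing any HTML.

```python
import inspect
import types


def collect_docstrings(module):
    """Map the names of a module's classes and functions to their docstrings.

    Hypothetical helper, not the code in make_documentation.py. Only
    objects defined in the module itself are included, and objects
    without any docstring map to an empty string.
    """
    docs = {}
    for name, obj in inspect.getmembers(module):
        if not (inspect.isclass(obj) or inspect.isfunction(obj)):
            continue
        if getattr(obj, "__module__", None) != module.__name__:
            continue  # skip names imported from other modules
        docs[name] = inspect.getdoc(obj) or ""
    return docs


if __name__ == "__main__":
    # Build a tiny stand-in module so that the example runs anywhere.
    demo = types.ModuleType("demo")
    exec(
        "def tag(text):\n"
        "    'Add tags to text.'\n"
        "    return text\n",
        demo.__dict__,
    )
    for name, doc in sorted(collect_docstrings(demo).items()):
        print(name, "-", doc)
```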

7. Contributors

Many people have contributed to the Tarsqi Toolkit. They are listed here in alphabetical order: Catherine Havasi, Seokbae Jang, Bob Knippen, Congmin Lee, Inderjeet Mani, Emin Mimaroglu, Jessica Moszkowicz, Feng Pan, Alex Plotnick, James Pustejovsky, Hongyuan Qiu, Anna Rumshisky, Roser Saurí, Barry Schiffman, Amber Stubbs, Marc Verhagen, and Ben Wellner.

The Tango annotation tool was developed by Linda van Guilder, Andrew See, Bob Knippen and Alex Baron.

8. Contact

If you have problems installing the toolkit or if you want to report a bug, please send an email to ttk-bugs@timeml.org. When reporting a bug, please tell us what platform you're using (including Perl and Python versions) and include a file that illustrates the errant behavior. A database with known issues will be made available on the web site.

Suggestions, criticisms, disappointments, feature requests and kudos are also welcome at the above address.

9. Future Work

The next major revision of the TARSQI Toolkit will be numbered 1.1 and will be released in early 2008. Minor revisions will be released with version number 1.0.X and will concentrate on bug fixes (as well as on providing a non-problematic Windows version).

The following major changes to the code base are now in progress or under consideration:

Re-implement the temporal closure code in Python and integrate it better with the toolkit. This includes major speed optimizations (the temporal closure component is by far the slowest component of the toolkit and it slows down to a crawl on large documents).
Re-implement GUTime in Python, splitting it into two components, one to add the TIMEX3 tags and one to fill in the VAL field.
Eliminate all remaining Perl components (glue scripts and filtering scripts) and replace them with Python components.
Use standoff annotation instead of in-line XML. This would make it much easier to deal with existing tags.
Finish porting all rule types from GutenLink to Blinker.
Hook up the toolkit to other pre-processing components (for example the NLTK toolkit).
Add browsers and editors for the dictionaries and pattern sets.
Improve the merging code. It seems to throw away a few too many temporal links that are considered high-precision (most notably links between timex tags generated by Blinker).
Add a text document type.

A much longer list of feature requests and known bugs will be put on the toolkit website in the near future.