AQUAINT TimeML 1.0 Corpus Documentation

Marc Verhagen and Jessica Moszkowicz
Brandeis University, version 1.0, January 2008.

Contents

  1. Introduction
  2. Overview of TimeML
  3. Sources and Statistics
  4. Annotation Effort
  5. Contributors
  6. Contents of the Distribution
  7. Contact
  8. Copyright Notice

Introduction

The AQUAINT TimeML Corpus (previously referred to as the Opinion Corpus or the AQUAINT TimeBank corpus) contains 73 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specifications. This file includes a brief discussion of TimeML as well as a description of how the AQUAINT TimeML Corpus was created.

The most recent information on TimeML is always available at www.timeml.org.

Overview of TimeML

TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. For a detailed description of TimeML, see the TimeML 1.2.1 Specification and Guidelines, available at http://timeml.org/site/publications/specs.html. Here, we give a summary of the most important tags.

TIMEX3 — This tag is used to capture dates, times, durations, and sets of dates and times. All TIMEX3 tags include a type and a value along with some other possible attributes. The value is given according to the ISO 8601 standard. The TIMEX3 tag allows specification of a temporal anchor. This facilitates the use of temporal functions to calculate the value of an underspecified temporal expression. For example, an article might include a document creation time such as "January 3, 2006". Later in the article, the temporal expression "today" may occur. By anchoring the TIMEX3 for "today" to the document creation time, we can determine the exact value of the TIMEX3.
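
The anchoring described above can be sketched in a few lines of Python. The fragment below is hand-written for illustration and is not taken from the corpus. The attribute names (tid, type, value, anchorTimeID, temporalFunction, functionInDocument) follow the TimeML 1.2.1 specification, but the value of "today" is deliberately left empty here to simulate an unresolved expression, and the helper resolve_value is hypothetical and only handles this one trivial case.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; it is not taken from the corpus.
# Assumption: the value of "today" is left empty to simulate an unresolved TIMEX3.
FRAGMENT = """
<TimeML>
<TIMEX3 tid="t1" type="DATE" value="2006-01-03" functionInDocument="CREATION_TIME">January 3, 2006</TIMEX3>
John left <TIMEX3 tid="t2" type="DATE" value="" temporalFunction="true" anchorTimeID="t1">today</TIMEX3>.
</TimeML>
"""


def resolve_value(timex, timexes_by_id):
    """Hypothetical helper: return a TIMEX3 value, using the anchor for "today"."""
    if timex.get("value"):
        return timex.get("value")
    anchor = timexes_by_id.get(timex.get("anchorTimeID"))
    if anchor is not None and (timex.text or "").strip().lower() == "today":
        # "today" denotes the same date as its anchor (here the creation time).
        return anchor.get("value")
    return None


root = ET.fromstring(FRAGMENT.strip())
timexes_by_id = {t.get("tid"): t for t in root.iter("TIMEX3")}
resolved = {tid: resolve_value(t, timexes_by_id) for tid, t in timexes_by_id.items()}
print(resolved)  # {'t1': '2006-01-03', 't2': '2006-01-03'}
```

Other relative expressions such as "yesterday" or "next week" would need date arithmetic on top of the anchor value, which this sketch does not attempt.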

EVENT — The EVENT tag is used to annotate those elements in a text that mark the semantic events described by it. Any event that can be temporally anchored or ordered is captured with this tag. An EVENT includes a class attribute with values such as occurrence, state, or reporting. The class of an EVENT may indicate what relationships the event participates in. In addition to the EVENT tag, events are also annotated with one or more MAKEINSTANCE tags that include information about a particular instance of the event. This includes part of speech, tense, aspect, modality, and polarity. When an event participates in a relationship, it is actually the event instance that is referenced. This is to allow for statements such as "John taught on Monday but not on Tuesday." Here, there are actually two instances of the teaching-event: one that has a positive polarity and one that is negative. Further, each instance participates in its own temporal relationship with respect to "Monday" and "Tuesday".
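
To make the relation between EVENT and MAKEINSTANCE more concrete, the Python sketch below encodes the "John taught" example as a hand-written fragment and groups the instances by the event they realize. The fragment is an assumption made for this illustration and does not come from the corpus; the attribute names (eid, class, eiid, eventID, pos, tense, aspect, polarity) are those of the TimeML 1.2.1 specification.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; it is not taken from the corpus.
FRAGMENT = """
<TimeML>
John <EVENT eid="e1" class="OCCURRENCE">taught</EVENT> on Monday but not on Tuesday.
<MAKEINSTANCE eiid="ei1" eventID="e1" pos="VERB" tense="PAST" aspect="NONE" polarity="POS"/>
<MAKEINSTANCE eiid="ei2" eventID="e1" pos="VERB" tense="PAST" aspect="NONE" polarity="NEG"/>
</TimeML>
"""

root = ET.fromstring(FRAGMENT.strip())

# Map each event ID to the (instance ID, polarity) pairs that realize it.
instances_by_event = {}
for inst in root.iter("MAKEINSTANCE"):
    instances_by_event.setdefault(inst.get("eventID"), []).append(
        (inst.get("eiid"), inst.get("polarity"))
    )

print(instances_by_event)  # {'e1': [('ei1', 'POS'), ('ei2', 'NEG')]}
```

The single EVENT tag on "taught" thus yields two instances, and it is the instance IDs (ei1, ei2), not the event ID, that link tags refer to.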

SIGNAL — The SIGNAL tag is used to annotate temporal function words such as "after", "during", and "when". These signals are then used in the representation of a temporal relationship.

The following three tags are link tags. They capture temporal, subordination, and aspectual relationships found in the text. These tags do not consume any actual text, but they do relate the three tag types above to each other.

TLINK — Temporal links are represented with a TLINK tag. A TLINK can temporally relate two temporal expressions, two event instances, or a temporal expression and an event instance. Along with an identification marker for each of these two elements, a relation type is given such as before, includes, or ended_by. When a signal is present that helps to define the relationship, an ID for the SIGNAL is given as well.
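
As a minimal sketch of how TLINK tags might be read, the Python code below turns each TLINK into a (source, relation, target) triple. The fragment is hand-written for illustration, with hypothetical IDs (ei1 and ei2 for two event instances, t1 and t2 for two temporal expressions, s1 and s2 for two signals); the attribute names (lid, relType, eventInstanceID, timeID, relatedToEventInstance, relatedToTime, signalID) follow the TimeML 1.2.1 specification. The helper tlink_triple is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; all IDs are hypothetical.
FRAGMENT = """
<TimeML>
<TLINK lid="l1" relType="IS_INCLUDED" eventInstanceID="ei1" relatedToTime="t1" signalID="s1"/>
<TLINK lid="l2" relType="IS_INCLUDED" eventInstanceID="ei2" relatedToTime="t2" signalID="s2"/>
<TLINK lid="l3" relType="BEFORE" timeID="t1" relatedToTime="t2"/>
</TimeML>
"""


def tlink_triple(tlink):
    """Hypothetical helper: return (source ID, relation type, target ID)."""
    # The source is either an event instance or a temporal expression.
    source = tlink.get("eventInstanceID") or tlink.get("timeID")
    # The same holds for the target.
    target = tlink.get("relatedToEventInstance") or tlink.get("relatedToTime")
    return (source, tlink.get("relType"), target)


root = ET.fromstring(FRAGMENT.strip())
triples = [tlink_triple(t) for t in root.iter("TLINK")]
for triple in triples:
    print(triple)
```

The first two links relate an event instance to a time and carry a signal ID; the third relates two temporal expressions directly and has no signal.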

SLINK — This tag is used to capture subordination relationships that involve event modality, evidentiality, and factuality. An SLINK includes an event instance ID for the subordinating event and an event instance ID for the subordinated event. Possible relation types for SLINK include modal, evidential, and factive. An SLINK will typically not include a signal ID unless it has the relation type conditional. Three specific EVENT classes interact with SLINK: reporting, i_state, and i_action.

ALINK — An aspectual connection between two event instances is represented with ALINK. As with SLINK, this tag includes two event instance IDs, one that introduces the ALINK and one that is the event argument to that event. The introducing event has the class aspectual. Some possible relation types for ALINK are initiates, terminates, and continues.

Sources and Statistics

The AQUAINT TimeML corpus contains 73 articles drawn from a variety of news sources. These particular sources were chosen because they offered text rich in temporal information, both in the form of temporal expressions and in the form of events that could be anchored or ordered in time. The documents were taken from four topics from the TREC novelty track (see http://trec.nist.gov/tracks.html):

N16 Kenya Tanzania Embassy bombings
N19 Elian Gonzalez Cuba
N35 NATO, Poland, Czech Republic, Hungary
N45 Slepian abortion murder

The corpus contains about 34,000 tokens, to which some 16,000 tags were added (about 11,400 if we adjust for the redundancy introduced by the EVENT and MAKEINSTANCE tags). Annotation statistics per topic are given in the table below:

                 N16    N19    N35    N45  total

EVENT            765   2117    490   1060   4432
MAKEINSTANCE     765   2117    490   1060   4432
TIMEX3           115    253     83    154    605
SIGNAL            33     77     62     96    268
ALINK             11     39     13      8     71
SLINK             96    203    117    259    675
TLINK           1013   2788    516   1048   5365

Total tags      2798   7594   1771   3685  15848
Total tokens    7027  16242   3631   7254  34154
Documents         23     25     10     15     73
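
The totals in the table, and the figure adjusted for the EVENT/MAKEINSTANCE redundancy, can be rechecked with a short Python sketch. The counts are copied from the table; the variable names are chosen for this illustration only.

```python
# Per-topic tag counts copied from the table, in the order N16, N19, N35, N45.
TOPICS = ["N16", "N19", "N35", "N45"]
COUNTS = {
    "EVENT": [765, 2117, 490, 1060],
    "MAKEINSTANCE": [765, 2117, 490, 1060],
    "TIMEX3": [115, 253, 83, 154],
    "SIGNAL": [33, 77, 62, 96],
    "ALINK": [11, 39, 13, 8],
    "SLINK": [96, 203, 117, 259],
    "TLINK": [1013, 2788, 516, 1048],
}

# Sum each column to get the number of tags per topic.
per_topic = [sum(row[i] for row in COUNTS.values()) for i in range(len(TOPICS))]
total_tags = sum(per_topic)

# Every EVENT here has exactly one MAKEINSTANCE, so count each pair only once.
adjusted = total_tags - sum(COUNTS["MAKEINSTANCE"])

print(dict(zip(TOPICS, per_topic)))  # {'N16': 2798, 'N19': 7594, 'N35': 1771, 'N45': 3685}
print(total_tags, adjusted)  # 15848 11416
```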

The documents in N16, N19 and N45 contain only TimeML tags; the documents in N35 also contain document-level tags such as DOCNO, HEADER, and TEXT.

Annotation Effort

Each article was annotated by one of three experienced annotators from Brandeis University or Georgetown University. All documents were then validated against version 1.2.1 of the TimeML Document Type Definition. Validity checking against the DTD was performed with the validate.pl script, which uses the Perl XML::Checker::Parser module, available as part of XML-Checker-0.13 from www.cpan.org.

Note that although we validated all annotations, the AQUAINT TimeML corpus is not as mature as TimeBank 1.2. More specifically, it did not go through several rounds of annotation and annotation review. In addition, neither TimeBank 1.2 nor the AQUAINT TimeML corpus has used dual annotation.

Inter-annotator agreement (IAA) scores were not computed for the AQUAINT TimeML corpus. Refer to the TimeBank 1.2 documentation for the IAA scores for TimeBank 1.2.

Contributors

The following people have contributed to the AQUAINT TimeML corpus:


Annotation: Jenna Fernandes, Jessica Moszkowicz, Stephanie Poisson
Validation: Seo-Hyun Im, Emin Mimaroglu, Jessica Moszkowicz, Hongyuan Qiu, Marc Verhagen
Other: James Pustejovsky, Inderjeet Mani, Roser Saurí, Amber Stubbs, Marc Verhagen

The AQUAINT TimeML corpus was created as part of the TARSQI project, which was funded under the ARDA/DTO AQUAINT program under grant number NBCH040027.

Contents of the Distribution

aquaint_timeml_1.0/data/

Contains the 73 annotated documents, grouped in four directories.

aquaint_timeml_1.0/doc/

This file plus the TimeML specifications and guidelines.

aquaint_timeml_1.0/validation/

Contains two versions of the DTD and the Perl script used for validation.

Contact

Please contact us at timebank@timeml.org if you have any questions or remarks.

Copyright Notice

The annotations in this data collection are copyrighted by Brandeis University and are released under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Note: The textual news documents annotated in this corpus have been collected from a wide range of sources and are not copyrighted by Brandeis University. The user acknowledges that the use of these news documents is restricted to research and academic purposes only.