SOME QUERIES

Queries consist of a number of variable bindings followed by
a colon and then the match conditions.  A variable binding has
a variable (with a dollar sign) and an (optional) data type.
The data model here is of data elements, each with a type and a
set of string or numeric attributes.  Elements can have children
(in this corpus, always ordered) and/or pointers with named roles
to other elements (always unordered).  Think of this as a tree structure
for the basic syntax with some arbitrary graph structure over
the top for things that don't fit into the tree.  (NITE allows
for a multi-rooted tree where each node can have more than one
parent of different types, unordered with respect to each other, 
but only one ordered set of children --- but we don't make use
of this facility at the moment.)

($m markable):($m@animacy == "human-group")

This means "Markables where the animacy code is human-group".
@ means "attribute".  

($m markable):($m@animacy ~ /.*human.*/)

This means "Markables where the animacy code has human in it
somewhere."  The dot (.) means any character, and the star (*) means
zero or more times.  This is what's called a "regular expression".

($m markable):($m@animacy ~ /human.*/)

This means "Markables where the animacy code is human followed by any
number of other characters".  Note that this doesn't pick up what you
coded as org-human.  That's because the code has to *start* with human.

($m markable):($m@animacy ~ /human/)

This doesn't pick up any markables, because all of the codes are human
followed by something or preceded by something, and in this query
language, regular expressions specify complete matches.
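
The complete-match behaviour of the last three queries can be imitated
outside the query tool.  The following is only a sketch in Python, whose
re module is not the engine the query language uses; for these simple
patterns it is assumed to behave the same way.  The three codes are the
ones mentioned in this document, not the full set used in the corpus.

```python
import re

# Animacy codes mentioned in this document (not the complete code set).
codes = ["human-group", "org-human", "nonconc"]


def matches(pattern, value):
    """Complete-match semantics: the pattern must cover the whole value."""
    return re.fullmatch(pattern, value) is not None


def select(pattern):
    """Return the codes that the pattern matches in full."""
    return [code for code in codes if matches(pattern, code)]


contains_human = select(r".*human.*")    # human anywhere in the code
starts_with_human = select(r"human.*")   # code must start with human
exactly_human = select(r"human")         # code must be exactly "human"

print(contains_human)
print(starts_with_human)
print(exactly_human)
```

The last list is empty for the same reason the last query returns
nothing: no code is exactly "human".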

($m markable):($m@animacy == "human") && ($m@status=="old")

&& means "and"; similarly, || for "or".  

($n nt)($w word):($n ^ $w) && ($w@pos=="VBZ") && ($n@cat="NP")

^ means "is an ancestor of".  In this corpus, an nt is a (nonterminal)
syntactic constituent.  So this finds pairs of nts and words where the
word is in the nt, the nt is a noun phrase, and the word has part of speech
VBZ.  Verify this by seeing that each result has bindings for two things,
one of which is an nt and the other of which is a word.  Note that the
same word can show up in two returns, if it is in two NPs (one embedded
in the other).

Note that

($n nt)($w word):

simply gives all the nt/word pairs --- a vast number.  A common mistake
in queries is to forget the relational conditions (in this case, the
one with ^).  Also, perhaps counter-intuitively, an element is ^ itself.
For this reason, the idiom

($a)($b): ($a ^ $b) && ($a != $b)

is common in queries where the variables are bound to the same data type.
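
To see why the relational condition and the inequality both matter,
here is a small Python sketch over a made-up tree.  The element ids and
the dominates() helper are hypothetical; the helper only mirrors the
reflexive reading of ^ described above and is not part of the query tool.

```python
from itertools import product

# Hypothetical toy tree: id -> (type, parent id).  Not corpus data.
elements = {
    "np1": ("nt", None),
    "np2": ("nt", "np1"),
    "w1": ("word", "np2"),
    "w2": ("word", "np1"),
}


def dominates(a, b):
    """Reflexive ancestor test: walk up from b until a is found."""
    while b is not None:
        if a == b:
            return True
        b = elements[b][1]
    return False


nts = [e for e, (kind, _) in elements.items() if kind == "nt"]
words = [e for e, (kind, _) in elements.items() if kind == "word"]

# No relational condition: every nt paired with every word.
all_pairs = list(product(nts, words))

# With the ^ condition: only pairs where the nt contains the word.
related = [(n, w) for n, w in all_pairs if dominates(n, w)]

# Same-type variables: the inequality removes the trivial self-matches.
proper = [(a, b) for a, b in product(nts, nts) if dominates(a, b) and a != b]

print(len(all_pairs), related, proper)
```

Note that w1 appears twice in related, once for each nt that contains it.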
By the way,

($a):

matches every element in the corpus regardless of data type, and it is
possible to match on a type disjunction, i.e.,

($n nt | word):

will match on all nts and all words.

($m markable)($n nt):($m@animacy == "human") && ($m >"at" $n)  && ($n@subcat="SBJ")

In the corpus, markables point at nts (nonterminals); this query finds
markables with human animacy in subject position.  Pointers always have
some role name given by whoever designed the corpus; in this case, it is
"at".  

($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
   ($w1@pos = "DT") && ($w2@pos="NN")

nts that contain both a DT and an NN.  Of course, this can match on
different (embedded) nts for the same DT/NN pair.

($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
   ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

The same, but the DT has to be before the NN.

($n nt)(exists $w1 word)(exists $w2 word): 
   ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
   ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

For when you get tired of seeing the words in the match list.  Exists
does the same match but doesn't return the variable in the result
set.

($m1 markable)($m2 markable)(exists $l link):($l >"antecedent" $m1) && ($l >"anaphor" $m2)

Pairs of markables in the same coreferential link.

($m1 markable)($m2 markable)(exists $l link):($l >"antecedent" $m1) && ($l >"anaphor" $m2) && ($m1@animacy != $m2@animacy)

Same, but where the two markables don't have the same animacy code. 
There are two points to this query:  (1) you don't have to specify
a textual string in the inequality condition as long as you can get
one from somewhere, and (2) one might wish to consider queries where
one expects no matches because they can diagnose problems with the 
annotation (in this case, of course, more match conditions are needed).

($w1 word)($w2 word):($w1 <> $w2)

Pairs of words where the first precedes the second.  Note that
this says nothing about being in the same sentence; that would be

($w1 word)($w2 word)($n nt):($w1 <> $w2) && ($n@cat=="S")&&
    ($n^$w1)&&($n^$w2)

($w1 word)($n nt)(exists $w2 word):($n@cat=="S")&& ($n^$w1)&&
    (($n^$w2) ->($w1 <> $w2))

All words excluding last words of sentences.

($w1 word)($n nt)(forall $w2 word):($n@cat=="S")&& ($n^$w1)&&
    ((($w1 !=$w2) && ($n^$w2)) ->($w1 <> $w2))

Only the first words of sentences.  Note the inequality condition;
forall really means for *all*.

($w1 word)($w2 word)(forall $w3 word): 
   ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2) && 
((($w1 != $w3) && ($w2 != $w3)&& ($w1 <> $w3)) -> ($w2 <> $w3))

DTs and NNs adjacent to each other with the DT first (i.e., forall
other words, if they're after the DT they're also after the NN).
If this is too slow on your machine, try

($t1 turn)($t2 turn)(forall $t3 turn): 
  ($t1 <> $t2) && 
((($t1 != $t3) && ($t2 != $t3)&& ($t1 <> $t3)) -> ($t2 <> $t3))

for adjacent turns (which is faster because there are fewer of them).
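
The logic of the adjacency query can be checked on paper or with a few
lines of Python.  This is a sketch over four made-up turn ids; precedes()
and implies() are hypothetical helpers standing in for <> and ->.

```python
# Hypothetical turns, listed in document order.
turns = ["t1", "t2", "t3", "t4"]
position = {t: i for i, t in enumerate(turns)}


def precedes(a, b):
    """Stand-in for the <> operator."""
    return position[a] < position[b]


def implies(p, q):
    """Stand-in for the -> operator."""
    return (not p) or q


# forall $t3: if t3 is a different turn after t1, it is also after t2.
adjacent = [
    (a, b)
    for a in turns
    for b in turns
    if precedes(a, b)
    and all(
        implies(c != a and c != b and precedes(a, c), precedes(b, c))
        for c in turns
    )
]

# Dropping the inequalities: t3 can now be bound to t2 itself, and
# t2 does not precede t2, so the forall fails for every pair.
without_inequality = [
    (a, b)
    for a in turns
    for b in turns
    if precedes(a, b)
    and all(implies(precedes(a, c), precedes(b, c)) for c in turns)
]

print(adjacent)
print(without_inequality)
```

The second list comes out empty, which is the sense in which forall
really means for all.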

Note that on forall queries, the interim reports about numbers of
matches found are a bit wonky!  I think they are multiples of the real
number of matches by the number of bindings tested for the forall
variable.

($w word): ($w@pos="PRP$")::($m markable):($m >"at" $w)

A complex query: the first query (before the ::) is matched, and
then its results are passed to the second query, which can
bind new variables as well as referring to the old ones.  The
return list is hierarchically structured; for each match n-tuple
to the first query, one gets a list of match n-tuples to the
second.  Beware:  if there are no matches to the second query
for some match to the first query, then that match to the first
query is removed from the result list.  (This makes sense in
database terms but some people find this strongly counter-intuitive.)
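
The shape of the result, and the way unmatched first-query results
disappear, can be sketched in Python.  The ids and the complex_query()
function below are hypothetical and only imitate the behaviour described
above; they are not how the query tool is implemented.

```python
# Hypothetical first-query matches: ids of possessive pronoun words.
words = ["w1", "w2", "w3"]

# Hypothetical markables as (markable id, id of the element pointed "at").
markables = [("m1", "w1"), ("m2", "w3"), ("m3", "w3")]


def complex_query(first_matches, second_query):
    """Pair each first-query match with its list of second-query matches."""
    results = []
    for match in first_matches:
        sub_matches = second_query(match)
        if sub_matches:  # matches with no sub-matches are dropped
            results.append((match, sub_matches))
    return results


def markables_at(word):
    """Second query: markables whose "at" pointer goes to this word."""
    return [m for m, target in markables if target == word]


results = complex_query(words, markables_at)
print(results)
```

Here w2 matched the first query but has no markable pointing at it, so
it does not appear in the result at all.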

($n nt)($w word): ($n ^ $w) && (ID($w) == "s1_1")

Every data element has a unique id, which can be used in queries
in this way.  There's no reason you would want to do this except
when you can't figure out why a query is going wrong and want to
quickly find out whether a specific example is on the return list.

($w word): (TEXT($w) == "the")

This is how to query the orthography.  Posix regular expressions work
here, too.  

($w word): (TEXT($w) ~/the.*/)

keeping in mind that the regexp must match the entire string.


ADVANCED EXAMPLES:  THE QUERIES USED TO ADD MARKABLES

There are two, run in order; for each result, 
we create a new markable that points (with role "at")
to the nt or word.

($n nt)(forall $up nt): 
   (($n@cat == 'NP') or ($n@cat == 'WHNP')) and 
   (not (($n@subcat ~ /.*ADV.*/) or ($n@subcat ~ /.*LOC.*/) or 
         ($n@subcat ~ /.*DIR.*/) or ($n@subcat ~ /.*UNF.*/))) 
    and ((($n != $up) and($up ^ $n)) ->
           ((not ($up@cat == 'EDITED')) and 
            (not (($up@cat == 'ADVP') and 
            (($up@subcat ~ /.*LOC.*/) or ($up@subcat ~ /.*DIR.*/))))))
    
NPs and WHNPs that aren't adverbials, locatives, directionals, or
unfinished, and where there aren't any dominating nts marked
as EDITED (that is, disfluent) or as locative or directional
adverbials.

($w word)(exists $n nt)(exists $m markable)(forall $up nt): 
    ($w@pos = 'PRP$') and 
    ($n ^ $w) and 
    ($m >'at' $n) and 
    (($up != $n) -> (not (($n ^ $up) and ($up ^ $w))))

Possessive pronouns where the first nt you get to by climbing up
counts as a markable. 
  


THE CORPUS STRUCTURE

You can't write queries without understanding the structure of
the corpus.  First, we gloss the most important relationships
for an easy start, but the only way to get at everything is to
read the metadata file (and perhaps look at some of the data for
reassurance), so we also explain how to do that.

The corpus uses parenthood for the following relationships:

turn ^ parse ^ nt ^ (word | sil | trace | punc)
     with any number of levels of nt, usually starting at the top 
     with an nt with cat S.

It uses pointers for the following relationships:

markable >"at" (nt | word)
   (the word cases are just possessive pronouns)

disfluency >"reparandum" (word|sil|trace|punc)*
       (i.e., zero or more terminals)
disfluency >"repair" (word|sil|trace|punc)* 

movement >"source" nt
movement >"target" trace

link >"antecedent" markable
link >"anaphor" markable

It uses the following attributes:

turn has speaker (A, B)

nt has cat (S, NP, VP, SBARQ, ...)
       subcat (SBJ, ...)

word has pos (VBZ, ...)

This list is *not* complete.  We think that everything in the original
Switchboard data has been preserved in some way.

To find out exactly what the corpus structure is, open 
Data/meta/swbd-metadata.xml.  (Many web browsers will make a display
for XML files that is easier to read than in, say, emacs, so do try
that first.)  

If you see

<code name="FOO">

then FOO is a valid data type.  The definition of FOO runs until
you see </code>, but

<code name="FOO"/>

is shorthand for

<code name="FOO">
</code>

If you see

<attribute name="BAR">

then the containing code has that attribute, so you can say

($n FOO):($n@BAR).

Enumerated attributes must choose a value from the given list;
otherwise they can be free-value strings or numbers.

The layer structure defines the permissible relations among data types
(or codes).  Each layer can be uniquely identified by name and defines
a set of data types that are interchangeable in the structure because
they can occur in the same positions.  A structural layer can point to
another layer, which means that elements with data types in the former
layer have children with data types drawn from the data types in the
latter layer.  If it points recursively, then there are any number
of layers of the former type ending in one of the latter type (this
is handy, say, for syntax).

If you see

<pointer number="BAR" role="BAZ" target="BAM"/>

then the containing code can have pointers with role BAZ where the
element pointed to has a data type drawn from the layer BAM.  BAR can
be an integer (the pointer points to exactly that many elements), *
(points to zero or more, i.e. Kleene *), or + (points to one or more).
But I'm not sure how well the implementation enforces these number 
definitions.  The design is meant to restrict pointers to featural
layers, but the implementation is actually more flexible, with pointers
allowed anywhere.
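
If reading the metadata by eye gets tedious, the code, attribute, and
pointer declarations can be listed with a short script.  The sketch
below uses Python's standard xml.etree.ElementTree on a made-up
fragment; the wrapper element, the layer names, and the number values
are hypothetical and are not copied from swbd-metadata.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragment; the real file will differ.
fragment = """
<fragment>
  <code name="markable">
    <attribute name="animacy"/>
    <pointer number="1" role="at" target="syntax-layer"/>
  </code>
  <code name="link">
    <pointer number="1" role="antecedent" target="markable-layer"/>
    <pointer number="1" role="anaphor" target="markable-layer"/>
  </code>
</fragment>
"""

root = ET.fromstring(fragment)

# Map each data type to its declared attributes and pointers.
summary = {}
for code in root.iter("code"):
    summary[code.get("name")] = {
        "attributes": [a.get("name") for a in code.iter("attribute")],
        "pointers": [
            (p.get("role"), p.get("target"), p.get("number"))
            for p in code.iter("pointer")
        ],
    }

for name, info in summary.items():
    print(name, info["attributes"], info["pointers"])
```

To try it on the real file, ET.parse() with the path given above would
replace ET.fromstring(); whether the real file needs namespace handling
has not been checked here.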

Layers are themselves separated into codings.  This isn't very
important for this corpus (the codings are what allows for multi-rooted
trees) but it does tell you what file to look in for the elements
of a particular type; each coding is stored in a different XML file.
Where files need to refer to each other, they use stand-off annotation.

For instance, 
<nt cat="INTJ" nite:id="s1_500">
  <nite:child href="sw2065.terminals.xml#id(s1_1)" /> 
  <nite:child href="sw2065.terminals.xml#id(s1_2)" /> 
</nt>

means the nt dominates/has as children two elements, the ones
in the file sw2065.terminals.xml with the ids s1_1 and s1_2.

Domination can also be represented in a single file by containment:

<foo>
   <baz/>
</foo>

means foo dominates baz.

 <nite:child href="sw2065.terminals.xml#id(s1_1)..id(s1_5)" /> 

means *all* elements between s1_1 and s1_5 in the named file
regardless of type or id and is only defined if s1_1 and s1_5
are sisters under the same element.
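
The href forms shown above are regular enough to take apart with a
pattern.  This Python sketch handles only the two forms that appear in
this document (a single id and an id range); parse_href() and expand()
are hypothetical helpers, and the sibling list is made up.

```python
import re

# file name, "#id(start)", and an optional "..id(end)" range part.
HREF = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$"
)


def parse_href(href):
    """Return (file, first id, last id); a single id is a range of one."""
    m = HREF.match(href)
    if m is None:
        raise ValueError(f"unrecognised href: {href!r}")
    end = m.group("end") or m.group("start")
    return m.group("file"), m.group("start"), end


def expand(siblings, start, end):
    """All sisters from start to end inclusive, in their stored order."""
    i, j = siblings.index(start), siblings.index(end)
    return siblings[i : j + 1]


single = parse_href("sw2065.terminals.xml#id(s1_1)")
ranged = parse_href("sw2065.terminals.xml#id(s1_1)..id(s1_5)")

# Hypothetical sister elements under one parent, in file order.
sisters = ["s1_1", "s1_2", "s1_3", "s1_4", "s1_5", "s1_6"]
children = expand(sisters, ranged[1], ranged[2])

print(single, ranged, children)
```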

The file syntax for pointers is very similar; e.g.

 <markable nite:id="sw2062.markable.1" animacy="nonconc">
    <nite:pointer role="at" href="sw2062.syntax.xml#id(s1_502)" /> 
 </markable>
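
Pointers can be pulled out of such a file with any XML library.  The
Python sketch below wraps the example above in a made-up root element
so that it parses on its own.  The namespace URI bound to the nite
prefix is an assumption; check the top of a real data file for the
declaration actually used.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the nite prefix; verify against a data file.
NITE = "http://nite.sourceforge.net/"

# The markable from the example, inside a hypothetical wrapper element.
doc = f"""
<nite:root xmlns:nite="{NITE}">
  <markable nite:id="sw2062.markable.1" animacy="nonconc">
    <nite:pointer role="at" href="sw2062.syntax.xml#id(s1_502)" />
  </markable>
</nite:root>
"""

root = ET.fromstring(doc)

# Collect (markable id, role, href) for every pointer on every markable.
pointers = []
for markable in root.iter("markable"):
    for pointer in markable.findall(f"{{{NITE}}}pointer"):
        pointers.append(
            (markable.get(f"{{{NITE}}}id"), pointer.get("role"), pointer.get("href"))
        )

print(pointers)
```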

The metadata contains everything one needs to know about corpus 
structure, but some people find it easier to look at sample data
itself.