Noun Gender and Number Data for Coreference Resolution

Summary

[GENDER/NUMBER DATA] Split in 6 parts, gzipped, 70 MB uncompressed

If you use this data in your work, please cite the paper listed under "The Data" below.

Please send an e-mail to sbergsma@ualberta.ca if you use the data. We'd also be happy to help if you need any assistance.

This data was generated from a large number of online news articles while Shane Bergsma was doing an Engineering Internship at Google Inc. (www.google.com). We gratefully acknowledge Google for allowing us to share our noun gender information.

Gender in NLP

Knowledge of the gender and number of a noun phrase provides one of the most essential constraints for pronoun resolution:

E.g. "John and Alice went to the Olympic games. He takes her to them every time."

Knowing "John" is usually masculine, "Alice" feminine, and "Olympic games" plural allows us to resolve the "he", "her", and "them" pronouns properly.

Building lexicons to capture gender/number information is complicated by the fact that the same word can take different genders in different contexts. Rather than treating gender as a hard constraint, we can regard it as a probabilistic value. In our "Bootstrapping Path-Based Pronoun Resolution" paper (see publication list), we show how probabilistic gender/number information can be extracted precisely from a large body of raw news text. (See also our "Automatic Acquisition of Gender Information for Anaphora Resolution" paper for alternative gender/number acquisition strategies.)

Essentially, we count the number of times each noun is connected to a masculine, feminine, neutral, or plural pronoun along a path that likely implies coreference. The proportion of times the noun is observed with each pronoun gender can be taken as the gender probability estimate for that noun.
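
As a minimal sketch of that last step, the proportions can be computed directly from the four counts. The function name and the dictionary labels are our own choices for illustration and are not part of the data; the example counts are those listed for "publicans" in the excerpt further down this page.

```python
def gender_probabilities(masculine, feminine, neutral, plural):
    """Turn four raw pronoun counts into proportions that sum to 1."""
    counts = {
        "masculine": masculine,
        "feminine": feminine,
        "neutral": neutral,
        "plural": plural,
    }
    total = sum(counts.values())
    if total == 0:
        # No observations: return None rather than guessing a distribution.
        return None
    return {label: count / total for label, count in counts.items()}


# Counts for "publicans" as they appear in the data excerpt.
probs = gender_probabilities(2, 3, 8, 251)
most_likely = max(probs, key=probs.get)
```

Returning None for an all-zero row is only one possible convention; a uniform or smoothed distribution would work as well, depending on your resolver.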

The Data

We provide here these gender counts for use in your own coreference resolution systems, and, perhaps, for sociological interest (look at the counts for doctor, teacher, president, your own name, etc. -- it's fun!):

[GENDER/NUMBER DATA] Split in 6 parts, gzipped, 70 MB uncompressed

If you use this data in your work, please cite as:
Shane Bergsma and Dekang Lin, Bootstrapping Path-Based Pronoun Resolution, In Proceedings of the Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06), Sydney, Australia, July 17-21, 2006.

File Format

The file contains an alphabetical listing of extracted noun tokens and their gender counts. In each line, the (possibly multi-word) noun phrase is followed by a tab and then four columns holding the counts for the corresponding gender/number:

noun phrase [TAB] Masculine_Count [SPACE] Feminine_Count [SPACE] Neutral_Count [SPACE] Plural_Count
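
A line in this layout could be read roughly as follows. This is a sketch that assumes the format is exactly as described above (one tab, then four space-separated integers); the names GenderCounts and parse_line are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class GenderCounts(NamedTuple):
    masculine: int
    feminine: int
    neutral: int
    plural: int


def parse_line(line):
    """Split one data line into (noun_phrase, GenderCounts)."""
    # The phrase itself may contain spaces, so split on the tab first.
    phrase, _, count_field = line.rstrip("\n").partition("\t")
    masculine, feminine, neutral, plural = (int(c) for c in count_field.split())
    return phrase, GenderCounts(masculine, feminine, neutral, plural)


phrase, counts = parse_line("publication standards tribunal\t0 0 1 0\n")
```

A malformed line (no tab, or fewer than four counts) raises ValueError here, which is probably what you want while first loading the files.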

Special Characters

Two special characters are also used:
First, every consecutive sequence of digits is replaced by the "#" character. For example, "$435,003.01" is converted to "$#,#.#".
Second, we store multi-word noun phrases, with their corresponding gender counts, in three forms: 1) all the words in the phrase, 2) the first word in the phrase followed by the "!" character, and 3) "!" followed by the last word in the phrase. For example, we store "dr. martin luther king jr. boulevard" as "dr. martin luther king jr. boulevard", "dr. !" and "! boulevard". In this way, we collected counts showing us that "dr. !" is usually feminine or masculine and "! boulevard" is usually neutral. Given a new phrase that is not present in the gender data but begins with "dr." or ends with "boulevard", we can often make a reasonable guess as to its gender based on the back-off counts. I've also devised better ways to use this back-off information.
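
To make the two conventions concrete, here is a naive sketch of how a phrase might be turned into lookup keys and then looked up with back-off. It is not the method from the paper. It assumes phrases are stored lower-cased (as in the examples on this page), that only multi-word phrases get "!" forms, and that the first-word key should be tried before the last-word key; all function names are hypothetical.

```python
import re


def normalize(phrase):
    """Lower-case the phrase and replace each run of digits with '#'."""
    return re.sub(r"[0-9]+", "#", phrase.lower())


def lookup_keys(phrase):
    """Return the keys to try, most specific first."""
    normalized = normalize(phrase)
    words = normalized.split()
    keys = [normalized]
    if len(words) > 1:
        keys.append(words[0] + " !")   # first word, then "!"
        keys.append("! " + words[-1])  # "!", then last word
    return keys


def lookup_with_backoff(phrase, table):
    """Return (key, counts) for the first key found in table, else None."""
    for key in lookup_keys(phrase):
        if key in table:
            return key, table[key]
    return None


# A one-entry table using the "! tribunal" counts from the data excerpt.
table = {"! tribunal": (91, 8, 335, 29)}
hit = lookup_with_backoff("Publication Appeals Tribunal", table)
```

Taking the first matching key is the simplest possible policy; combining the first-word and last-word counts is another option you may want to try.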

Examples

The following excerpts from the data show how the special characters are used:

! # 18505 3400 100817 22690
...
! bennett 5155 768 72 127
...
! inc. 682 109 190174 3621
...
! thomas 15185 2287 345 461
...
! tribunal 91 8 335 29
...
publican ! 12 1 0 0
...
publican scott thomas 1 0 0 0
publican tony bennett 1 0 0 0
publicans 2 3 8 251
publicard 0 0 1 0
publicard, ! 0 0 3 0
publicard, inc. 0 0 3 0
publication 93 20 3152 110
publication ! 0 0 2 0
publication # 0 0 1 0
publication standards tribunal 0 0 1 0

Note in particular how phrases ending with "inc." have a much higher neutral count, "publicans" has a high plural count, and how "publication #" (where a sequence of digits has been converted to "#") and "publication standards tribunal" together give the total count for "publication !", while each partly contributes to the "! #" and "! tribunal" counts.
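
The arithmetic behind that observation can be reproduced with a few lines. This is only an illustration of how the "!" entries appear to aggregate, using the two full phrases from the excerpt; it is not the extraction code itself, and the variable names are our own.

```python
from collections import defaultdict

# Two full phrases from the excerpt, as (masculine, feminine, neutral, plural).
full_phrases = {
    "publication #": (0, 0, 1, 0),
    "publication standards tribunal": (0, 0, 1, 0),
}

# Each multi-word phrase adds its counts to a first-word key and a last-word key.
backoff = defaultdict(lambda: [0, 0, 0, 0])
for phrase, counts in full_phrases.items():
    words = phrase.split()
    for key in (words[0] + " !", "! " + words[-1]):
        for i, count in enumerate(counts):
            backoff[key][i] += count
```

Here backoff["publication !"] ends up with a neutral count of 2, matching the excerpt, while "! #" and "! tribunal" each receive only the share contributed by these two phrases; in the real data they also include counts from many other phrases.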

Notes

As with everything in statistical NLP, note that noun phrases with few observations (1 or 2 counts) are more likely to have misleading gender counts (and hence an incorrectly assigned probabilistic gender), while those with higher counts are generally more accurate. Luckily, the phrases with higher counts are also the ones you are more likely to see in your own work. See the literature for further discussion of the gender assignment accuracy of various gender approaches.
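
One simple way to act on this is to ignore entries below some minimum total count. The sketch below does that; the threshold of 10 is an arbitrary assumption for illustration, not a recommendation derived from the data, and the function name is hypothetical.

```python
def reliable_probabilities(counts, min_total=10):
    """Return gender proportions, or None if the phrase was seen too rarely.

    counts is (masculine, feminine, neutral, plural); min_total is an
    arbitrary cut-off that you should tune for your own application.
    """
    total = sum(counts)
    if total < min_total:
        return None
    return tuple(count / total for count in counts)


rare = reliable_probabilities((1, 0, 0, 0))            # "publican scott thomas"
common = reliable_probabilities((93, 20, 3152, 110))   # "publication"
```

When a phrase falls below the threshold, the "!" back-off entries described above are a natural next thing to consult.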

FAQ

Question: There are some symbols that I can't figure out - brackets "(" or square brackets "[" or dash "-". Sometimes there is just a set of random letters at the end of a noun phrase, for instance: "! ##deutlk= 0 0 1 0". Can you provide me with some explanation?

Answer: I assume you know that the "!" symbol stands in for any sequence of tokens in that position of the NP. The other symbols are just noise. This data was collected from 30 GB of automatically-processed text. At some point, the sequence of tokens "deutsche telekom ag dtegn.de ##deutlk=" was parsed as a noun phrase and linked to a neutral gender. This phrase is likely just noise in the original web data, but it could also have been created by the tokenization. Brackets, dashes, and similar symbols are likewise due to the automatic parsing of noun phrases. Fortunately, these things aren't a problem in practice -- when you're using the data, you tend to look up the counts for complete phrases and hence just ignore the counts for garbage sequences. However, if you know that the phrases you look up will never contain certain punctuation, you could filter the gender counts in advance to save computation and storage costs.
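
Such a filter might look like the sketch below. The set of unwanted characters is only an example and should be chosen to suit your own input; the function name is hypothetical, and the lines are assumed to follow the tab-separated format described under "File Format".

```python
def filter_entries(lines, unwanted="()[]="):
    """Yield only the lines whose noun phrase has none of the unwanted characters."""
    for line in lines:
        # Everything before the first tab is the noun phrase.
        phrase = line.split("\t", 1)[0]
        if not any(ch in unwanted for ch in phrase):
            yield line


lines = [
    "! ##deutlk=\t0 0 1 0\n",
    "publication\t93 20 3152 110\n",
]
kept = list(filter_entries(lines))
```

Be careful not to put "#" or "!" in the unwanted set, since those two characters carry meaning in the data.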



Shane Bergsma
May 11, 2006
Updated: August 23, 2006, webpage text
Updated: October 26, 2007, files now gzipped
Updated: January 7, 2010, created a FAQ