Resources

Additional software and resources may be found on the individual websites of the people currently in our research group.

Software

The official repository of the BLLIP reranking parser is now located at http://github.com/BLLIP/bllip-parser. The respective references are:

Eugene Charniak. A Maximum-Entropy-Inspired Parser. NAACL'00, pp. 132-139. (PDF)

Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. ACL'05. (PDF)

David McClosky's self-trained parsing model can be found here. The references are:

David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. HLT-NAACL'06. (PDF)

David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. COLING-ACL'06. (PDF)

Older versions of the Charniak PCFG parser and the Charniak-Johnson MaxEnt reranking parser are distributed together (as a single gzipped tar archive) here.

An (even) older version of the parser is also available. It supports a distinct POS-training stage, for domains in which POS tags are available but no treebank is. Its use is discussed in the paper:

Matthew Lease and Eugene Charniak. Parsing Biomedical Literature. Second International Joint Conference on Natural Language Processing (IJCNLP'05). (PDF) ©Springer-Verlag

Software for unsupervised pronoun resolution can be found here. The reference is:

Eugene Charniak and Micha Elsner. EM Works for Pronoun Anaphora Resolution. EACL '09 (PDF)

Mark Johnson has a variety of software available.

David Ellis has released a new version of the EVALB bracket scoring program (created by Satoshi Sekine and Michael Collins) which fixes a bug in the original distribution. Both the new and original versions can be obtained from the EVALB website.

To obtain Don Blaheta's function-tagger or tsed treebank manipulation tool, contact him. The respective references are:

Don Blaheta. Function tagging. PhD thesis, 28 Aug 2003. (PDF)

Don Blaheta. Handling noisy training and testing data. EMNLP'02, pp. 111-116. (PDF)

Corpora

Movie Review Corpus and Elizabethan Drama Corpus (description pending)

The BLLIP'99 Corpus (1987-89 WSJ Corpus Release 1, LDC2000T43) can be obtained from LDC. This automatically annotated corpus contains the three-year Wall Street Journal (WSJ) collection from the ACL and comprises approximately 30 million words. Annotations were generated using the parser (and POS-tagger), Mark Johnson's tool for empty node restoration, and Don Blaheta's function tagger. Don Blaheta, Sharon Caraballo, Sharon Goldwater, and Mark Johnson also contributed to the parser.

The Brown-GENIA treebank contains hand-parses for 21 abstracts (215 sentences) from the GENIA corpus of MEDLINE abstracts related to transcription factors in human blood cells. There is no overlap with the GENIA treebank (beta version, 500 abstracts), so both may be used in combination. The reference for this treebank is:

Matthew Lease and Eugene Charniak. Parsing Biomedical Literature. Second International Joint Conference on Natural Language Processing (IJCNLP'05). (PDF) ©Springer-Verlag

The original, unannotated Brown corpus, a balanced sampling of English language usage, was collected in the 1960s (long before BLLIP) by Brown linguists Francis and Kucera. See their accompanying manual. The annotated Penn Treebank version (Treebank-3) of the Brown corpus is the one we know and love, and it can be obtained from LDC.

Last update: Thursday, January 23 2014, 06:29 PM

Brown Laboratory for Linguistic Information Processing (BLLIP)
