Resources
Software
The latest version of the Charniak PCFG parser and the
Charniak-Johnson Max-Ent reranking parser are distributed together (as
a single gzipped
tar) here. The
respective references are:
- Eugene Charniak. A Maximum-Entropy-Inspired
Parser. NAACL'00, pp. 132-139. ( PDF )
- Eugene Charniak and Mark Johnson. Coarse-to-fine
n-best parsing and MaxEnt discriminative reranking. ACL'05. ( PDF )
David McClosky's self-trained
parsing model can be found here. The references
are:
- David McClosky, Eugene Charniak, and Mark Johnson. Effective
Self-Training for Parsing. HLT-NAACL'06. ( PDF )
- David McClosky, Eugene Charniak, and Mark Johnson. Reranking
and Self-Training for Parser Adaptation. COLING-ACL'06. ( PDF )
An older version of the parser is available
that supports a distinct POS-training stage (for domains in which POS tags
are available but no treebank is). Its use is discussed in the paper:
- Matthew Lease and Eugene Charniak. Parsing Biomedical
Literature. Second International Joint Conference on Natural Language Processing
(IJCNLP'05). ( PDF ) ©Springer-Verlag
The parser described in Eugene Charniak's EMNLP-10 paper
can be found here.
Software for unsupervised pronoun resolution can be found
here. The reference is:
- Eugene Charniak and Micha Elsner. EM Works for Pronoun Anaphora Resolution. EACL'09. ( PDF )
Mark Johnson has a variety of software available.
David Ellis has released a new
version of the EVALB bracket scoring program (created by Satoshi
Sekine and Michael Collins) which fixes a bug in
the original distribution. Both the new and original versions can be obtained
from the EVALB website.
To obtain Don Blaheta's
function-tagger or tsed treebank manipulation tool, contact him. The
respective references are:
- Don Blaheta. Function tagging. PhD thesis,
28 Aug 2003. ( PDF )
- Don Blaheta. Handling noisy training and
testing data. EMNLP'02, pp. 111-116. ( PDF )
Corpora
The BLLIP'99
Corpus (1987-89 WSJ Corpus Release 1, LDC2000T43) can be obtained from
LDC. This automatically annotated
corpus contains the three-year Wall Street Journal (WSJ) collection from
the ACL and is approximately 30 million words. Annotations were generated
using the parser (and POS-tagger), Mark Johnson's tool for empty node restoration,
and Don Blaheta's function tagger. Don Blaheta, Sharon Caraballo, Sharon
Goldwater, and Mark Johnson also contributed to the parser.
The Brown-GENIA
treebank contains hand-parses for 21 abstracts (215 sentences) from the
GENIA corpus of
MEDLINE abstracts related to transcription factors in human blood cells. There
is no overlap with the GENIA
treebank (beta version, 500 abstracts), so both may be used in combination.
The reference for this treebank is:
- Matthew Lease and Eugene Charniak. Parsing Biomedical
Literature. Second International Joint Conference on Natural Language Processing
(IJCNLP'05). ( PDF ) ©Springer-Verlag
The original, unannotated Brown corpus, a balanced sampling of English language
usage, was collected in the 1960s (long before BLLIP) by Brown linguists Francis
and Kucera. See their accompanying manual. The annotated
Penn Treebank version (Treebank-3)
of the Brown corpus that we know and love can be obtained from LDC.