Brown Laboratory for Linguistic Information Processing (BLLIP)

Lab Publications

2008

2007

Micha Elsner, Joseph Austerweil, and Eugene Charniak. A Unified Local and Global Model for Discourse Coherence. In Proceedings of HLT-NAACL '07, Rochester, New York, April 2007. Association for Computational Linguistics. [ bib | .pdf | slides ]

Micha Elsner and Eugene Charniak. A Generative Discourse-New Model for Text Coherence. Technical Report CS-07-04, Brown University, Providence, RI, USA, 2007. [ bib | .pdf ]

Recent models of document coherence have focused on the referents of noun phrases, ignoring their syntax. However, syntax depends on discourse function; NPs which introduce new entities are often more complex. We develop a generative model for NP syntax which describes this difference. It can be used to model discourse coherence in the Wall Street Journal; combining it with the local coherence model of Elsner ('07) yields substantial improvements. Our model is competitive with previous systems on the discourse-new detection task; its performance is comparable to Uryupina ('03).

Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing. In Proceedings of the Association for Computational Linguistics (ACL'07), 2007. [ bib ]

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Distributional Cues to Word Segmentation: Context is Important. In Proceedings of the 31st Boston University Conference on Language Development, 2007. [ bib | .pdf ]

Mark Johnson. Why Doesn't EM Find Good HMM POS-Taggers? In Proceedings of Empirical Methods in Natural Language Processing (EMNLP'07), 2007. [ bib ]

Mark Johnson. Transforming Projective Bilexical Dependency Grammars into Efficiently-Parsable CFGs with Unfold-Fold. In Proceedings of the Association for Computational Linguistics (ACL'07), 2007. [ bib ]

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'07), 2007. [ bib | .pdf ]

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. Adaptor Grammars: a Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems 19, 2007. [ bib | .pdf ]

Matthew Lease and Eugene Charniak. Brown at CL-SR'07: Retrieving Conversational Speech in English and Czech. In Working Notes of the Cross-Language Evaluation Forum (CLEF): Cross-Language Speech Retrieval (CL-SR) track, 2007. Corrected version. [ bib | .pdf ]

Matthew Lease. Natural Language Processing for Information Retrieval: the time is ripe (again). In Proceedings of the 1st Ph.D. Workshop at the ACM Conference on Information and Knowledge Management (PIKM), 2007. Best Paper award. [ bib | .pdf ]

Jenine Turner and Eugene Charniak. Language Modeling for Determiner Selection. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 177-180, Rochester, New York, April 2007. Association for Computational Linguistics. [ bib | .pdf ]

2006

Ann Bies, Stephanie Strassel, Haejoong Lee, Kazuaki Maeda, Seth Kulick, Yang Liu, Mary Harper, and Matthew Lease. Linguistic Resources for Speech Parsing. In Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy, 2006. [ bib | .pdf ]

Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. Multilevel Coarse-to-Fine PCFG Parsing. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL'06), pages 168-175, New York City, USA, June 2006. Association for Computational Linguistics. [ bib | .pdf | slides ]

Sharon Goldwater, Tom Griffiths, and Mark Johnson. Interpolating between types and tokens by estimating power-law generators. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 459-466, Cambridge, MA, 2006. MIT Press. [ bib | .pdf ]

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Contextual Dependencies in Unsupervised Word Segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'06), pages 673-680, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]

John Hale, Izhak Shafran, Lisa Yung, Bonnie J. Dorr, Mary Harper, Anna Krasnyanskaya, Matthew Lease, Yang Liu, Brian Roark, Matthew Snover, and Robin Stewart. PCFGs with Syntactic and Prosodic Indicators of Speech Repairs. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'06), pages 161-168, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]

William P. Headden III, Eugene Charniak, and Mark Johnson. Learning Phrasal Categories. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 301-307, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]

Matthew Lease, Mark Johnson, and Eugene Charniak. Recognizing disfluencies in conversational speech. IEEE Transactions on Audio, Speech and Language Processing, 14(5):1566-1573, September 2006. [ bib | .pdf ]

We present a system for modeling disfluency in conversational speech: repairs, fillers, and self-interruption points (IPs). For each sentence, candidate repair analyses are generated by a stochastic tree adjoining grammar (TAG) noisy-channel model. A probabilistic syntactic language model scores the fluency of each analysis, and a maximum-entropy model selects the most likely analysis given the language model score and other features. Fillers are detected independently via a small set of deterministic rules, and IPs are detected by combining the output of repair and filler detection modules. In the recent Rich Transcription Fall 2004 (RT-04F) blind evaluation, systems competed to detect these three forms of disfluency under two input conditions: a best-case scenario of manually transcribed words and a fully automatic case of automatic speech recognition (ASR) output. For all three tasks and on both types of input, our system was the top performer in the evaluation.

Keywords: Disfluency modeling, natural language processing, rich transcription, speech processing

Matthew Lease, Eugene Charniak, Mark Johnson, and David McClosky. A Look At Parsing and Its Applications. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 16-20 July 2006. [ bib | .pdf ]

Matthew Lease and Mark Johnson. Early Deletion of Fillers In Processing Conversational Speech. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL'06), Companion Volume: Short Papers, pages 73-76, New York City, USA, June 2006. Association for Computational Linguistics. Version here corrects Table 2 in published version. [ bib | .pdf ]

David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL'06), pages 337-344, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf | .ps ]

David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152-159, New York City, USA, June 2006. Association for Computational Linguistics. [ bib | .pdf | slides | .ps ]

B. Roark, Yang Liu, M. Harper, R. Stewart, M. Lease, M. Snover, I. Shafran, B. Dorr, J. Hale, A. Krasnyanskaya, and L. Yung. Reranking for Sentence Boundary Detection in Conversational Speech. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'06), pages 545-548, May 14-19 2006. [ bib | .pdf ]

We present a reranking approach to sentence-like unit (SU) boundary detection, one of the EARS metadata extraction tasks. Techniques for generating relatively small n-best lists with high oracle accuracy are presented. For each candidate, features are derived from a range of information sources, including the output of a number of parsers. Our approach yields significant improvements over the best performing system from the NIST RT-04F community evaluation.
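The two ideas in this abstract, oracle accuracy of an n-best list and reranking candidates by features, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the candidate dictionaries, the feature names, the weights, and the linear scoring rule are hypothetical and are not taken from the paper.

```python
def rerank(candidates, weights):
    """Return the candidate with the highest weighted feature sum."""
    def score(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["features"].items())
    return max(candidates, key=score)


def oracle_accuracy(nbest_lists):
    """Average, over inputs, of the best accuracy available in each n-best list."""
    best_per_input = [max(c["accuracy"] for c in nbest) for nbest in nbest_lists]
    return sum(best_per_input) / len(best_per_input)


# Hypothetical n-best list for one input: two candidate segmentations.
nbest = [
    {"id": "A", "accuracy": 0.75,
     "features": {"baseline_logprob": -1.0, "parser_score": 0.5}},
    {"id": "B", "accuracy": 0.5,
     "features": {"baseline_logprob": -0.5, "parser_score": 0.0}},
]
# Hypothetical feature weights; a real system would learn these from data.
weights = {"baseline_logprob": 1.0, "parser_score": 2.0}

chosen = rerank(nbest, weights)
oracle = oracle_accuracy([nbest, [{"accuracy": 1.0}, {"accuracy": 0.25}]])
```

Oracle accuracy bounds what any reranker can achieve on a given n-best list, which is why small lists with high oracle accuracy are desirable.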

Brian Roark, Mary Harper, Eugene Charniak, Bonnie Dorr, Mark Johnson, Jeremy G. Kahn, Yang Liu, Mari Ostendorf, John Hale, Anna Krasnyanskaya, Matthew Lease, Izhak Shafran, Matthew Snover, Robin Stewart, and Lisa Yung. SParseval: Evaluation Metrics for Parsing Speech. In Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy, 2006. [ bib | .pdf ]

2005

Eugene Charniak and Mark Johnson. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 173-180, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]

Micha Elsner, Mary Swift, James Allen, and Daniel Gildea. Online Statistics for a Unification-Based Dialogue Parser. In Proceedings of the Ninth International Workshop on Parsing Technology (IWPT'05), pages 198-199, Vancouver, British Columbia, October 2005. Association for Computational Linguistics. [ bib | .pdf | poster ]

Heidi Fox. Dependency-Based Statistical Machine Translation. In Proceedings of the ACL Student Research Workshop, pages 91-96, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]

Dmitriy Genzel. Inducing a Multilingual Dictionary from a Parallel Multitext in Related Languages. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 875-882, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf ]

Sharon Goldwater and David McClosky. Improving Statistical MT through Morphological Analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP'05), pages 676-683, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf | .ps ]

Sharon Goldwater and Mark Johnson. Representational Bias in Unsupervised Learning of Syllable Structure. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 112-119, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]

Jeremy G. Kahn, Matthew Lease, Eugene Charniak, Mark Johnson, and Mari Ostendorf. Effective Use of Prosody in Parsing Conversational Speech. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP'05), pages 233-240, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf ]

Matthew Lease, Eugene Charniak, and Mark Johnson. Parsing and its applications for conversational speech. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 5, pages 961-964, March 18 - March 23 2005. [ bib | .pdf ]

This paper provides an introduction to recent work in statistical parsing and its applications for conversational speech, with particular emphasis on the relationship between parsing and detecting speech repairs. While historically parsing and repair detection have been studied independently, we present a line of research which has spanned the boundary between the two and demonstrated the efficacy of this synergistic approach. Our presentation highlights successes to date, remaining challenges, and promising future work.

Matthew Lease and Eugene Charniak. Parsing Biomedical Literature. In R. Dale, K.-F. Wong, J. Su, and O. Kwong, editors, Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP'05), volume 3651 of Lecture Notes in Computer Science, pages 58 - 69, Jeju Island, Korea, October 11 - October 13 2005. Springer-Verlag. [ bib | .pdf ]

We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB-training: part-of-speech tags, dictionary collocations, and named-entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically-adapted parser achieves a 14.2% reduction in error. With oracle-knowledge of named-entities, this error reduction improves to 21.2%.

Heng Lian. Chinese Language Parsing with Maximum-Entropy-Inspired Parser. Master's thesis, Brown University, Providence, RI, 2005. [ bib | .pdf ]

The Chinese language has many special characteristics that make parsing difficult. State-of-the-art parsers perform much worse on Chinese than on English, with an f-score about 10% lower. We present the results of a maximum-entropy-inspired parser [3] on Penn Chinese TreeBank 1.0 and 4.0, achieving precision/recall of 78.6/75.6 on CTB 1.0 and 79.1/75.0 on CTB 4.0. We also apply the MaxEnt reranker [4] to the 50 best parses and obtain about a 6% error reduction. Applied directly to unsegmented sentences, the parser also achieves state-of-the-art performance.
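For readers who want a single number per treebank, the precision/recall pairs quoted in this abstract can be combined into an f-score. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the abstract does not state which weighting the thesis uses, so treat the result as indicative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as quoted in the abstract.
ctb1 = f_score(78.6, 75.6)  # Penn Chinese TreeBank 1.0
ctb4 = f_score(79.1, 75.0)  # Penn Chinese TreeBank 4.0
print(f"CTB 1.0: {ctb1:.1f}  CTB 4.0: {ctb4:.1f}")
```

Under this assumption both treebanks come out at roughly 77.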

Jenine Turner and Eugene Charniak. Supervised and Unsupervised Learning for Sentence Compression. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 290-297, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]

2004

Massimiliano Ciaramita and Mark Johnson. Multi-component Word Sense Disambiguation. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 97-100, Barcelona, Spain, July 2004. Association for Computational Linguistics. [ bib | .pdf ]

Sharon Goldwater and Mark Johnson. Priors in Bayesian Learning of Phonological Rules. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology, pages 35-42, Barcelona, Spain, July 2004. Association for Computational Linguistics. [ bib | .pdf ]

Michelle Gregory and Yasemin Altun. Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 677-683, Barcelona, Spain, July 2004. [ bib | .pdf ]

Michelle Gregory, Mark Johnson, and Eugene Charniak. Sentence-Internal Prosody Does not Help Parsing the Way Punctuation Does. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 81-88, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. [ bib | .pdf ]

Keith B. Hall and Mark Johnson. Attention Shifting for Parsing Speech. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 40-46, Barcelona, Spain, July 2004. [ bib | .pdf ]

Mark Johnson and Eugene Charniak. A TAG-based noisy-channel model of speech repairs. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 33-39, Barcelona, Spain, July 2004. [ bib | .pdf ]

Mark Johnson, Eugene Charniak, and Matthew Lease. An Improved Model For Recognizing Disfluencies in Conversational Speech. In Rich Transcription 2004 Fall Workshop (RT-04F), 2004. [ bib | .pdf ]

Ron Kaplan, Stefan Riezler, Tracy H King, John T Maxwell III, Alex Vasserman, and Richard Crouch. Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 97-104, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. [ bib | .ps | .pdf ]

Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm. In ACL, pages 47-54, 2004. [ bib ]

2003

Yasemin Altun, Mark Johnson, and Thomas Hofmann. Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 145-152, 2003. [ bib | .pdf ]

Yasemin Altun and Thomas Hofmann. Large Margin Methods for Label Sequence Learning. In Proceedings of the Eighth European Conference on Speech Communication and Technology (EuroSpeech'03), 2003. [ bib | .pdf ]

Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields.

Eugene Charniak, Kevin Knight, and Kenji Yamada. Syntax-based Language Models for Statistical Machine Translation. In Proceedings of the Ninth Machine Translation Summit of the International Association for Machine Translation, New Orleans, Louisiana, September 2003. [ bib | .pdf ]

Massimiliano Ciaramita, Thomas Hofmann, and Mark Johnson. Hierarchical Semantic Classification: Word Sense Disambiguation with World Knowledge. In Georg Gottlob and Toby Walsh, editors, IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, pages 817-822. Morgan Kaufmann, 2003. [ bib | .pdf | .ps ]

Massimiliano Ciaramita and Mark Johnson. Supersense Tagging of Unknown Nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), pages 168-175, 2003. [ bib | .pdf ]

Stuart Geman and Mark Johnson. Probability and statistics in computational linguistics, a brief review. Mathematical foundations of speech and language processing, 138:1-26, 2003. [ bib | .pdf ]

Dmitriy Genzel and Eugene Charniak. Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP'03), pages 65-72, 2003. [ bib | .pdf ]

Sharon Goldwater and Mark Johnson. Learning OT Constraint Rankings Using a Maximum Entropy Model. In Proceedings of the Workshop on Variation within Optimality Theory, Stockholm University, 2003. [ bib | .pdf | .ps ]

Keith Hall and Mark Johnson. Language modelling using efficient best-first bottom-up parsing. In Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE ASRU 2003, 2003. [ bib | .pdf ]

Thomas Hofmann, Lijuan Cai, and Massimiliano Ciaramita. Learning with taxonomies: Classifying documents and words. In Workshop on Syntax, Semantics and Statistics (NIPS-03), 2003. [ bib | .pdf ]

Mark Johnson. Learning and Parsing Stochastic Unification-Based Grammars. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pages 671-683. Springer, 2003. [ bib | .pdf ]

2002

Yasemin Altun, Thomas Hofmann, and Mark Johnson. Discriminative Learning for Label Sequences via Boosting. In Proceedings of Neural Information Processing Systems (NIPS02), 2002. [ bib | .pdf ]

Don Blaheta. Handling noisy training and testing data. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, Pennsylvania, July 2002. [ bib | .pdf ]

Massimiliano Ciaramita. Boosting automatic lexical acquisition with morphological information. In Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pages 17-25, Philadelphia, July 2002. Association for Computational Linguistics. [ bib | .ps | .pdf ]

Donald Engel, Eugene Charniak, and Mark Johnson. Parsing and Disfluency Placement. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 49-54, 2002. [ bib | .pdf ]

Heidi Fox. Phrasal Cohesion and Statistical Machine Translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 304-311, Philadelphia, Pennsylvania, July 2002. Association for Computational Linguistics. [ bib | .pdf ]

Stuart Geman and Mark Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL'02), pages 279-286, Morristown, NJ, USA, 2002. Association for Computational Linguistics. [ bib | .pdf ]

Stuart Geman and Mark Johnson. Probabilistic Grammars and their Applications. In N.J. Smelser and P.B. Baltes, editors, International Encyclopedia of the Social & Behavioral Sciences, pages 12075-12082, Pergamon, Oxford, 2002. [ bib | .pdf ]

Dmitriy Genzel and Eugene Charniak. Entropy Rate Constancy in Text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), 2002. [ bib | .pdf ]

Mark Johnson. The DOP Estimation Method is Biased and Inconsistent. Computational Linguistics, 28(1):71-76, 2002. [ bib | .pdf ]

Mark Johnson. A Simple Pattern-matching Algorithm for Recovering Empty Nodes and their Antecedents. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 136-143, 2002. [ bib | .pdf | .ps ]

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell III, and Mark Johnson. Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 271-278, 2002. [ bib | .pdf ]

2001

Yasemin Altun and Mark Johnson. Inducing SFA with Epsilon-Translations Using Minimum Description Length. In Finite State Methods in Natural Language Processing Workshop, ESSLLI 2001, 2001. [ bib | .pdf ]

Don Blaheta and Mark Johnson. Unsupervised learning of multi-word verbs. In Proceedings of the 2001 ACL Workshop on Collocation, 2001. [ bib | .pdf ]

Eugene Charniak and Mark Johnson. Edit Detection and Parsing for Transcribed Speech. In Proceedings of the Second Conference of the North American chapter of the Association for Computational Linguistics (NAACL '01), 2001. [ bib | .pdf ]

Eugene Charniak. Immediate-Head Parsing for Language Models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 124-131, 2001. [ bib | .pdf | .ps ]

Eugene Charniak. Unsupervised Learning of Name Structure From Coreference Data. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), 2001. [ bib | .pdf ]

Keith Hall. A Statistical Model of Nominal Anaphora. Master's thesis, Brown University, Providence, RI, 2001. [ bib | .pdf ]

Mark Johnson. Joint and Conditional Estimation of Tagging and Parsing Models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01), 2001. [ bib | .pdf ]

Brian Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276, 2001. [ bib | .pdf ]

2000

Don Blaheta and Eugene Charniak. Assigning function tags to parsed text. In Proceedings of the First Conference of the North American chapter of the Association for Computational Linguistics (NAACL '00), pages 234-240, 2000. [ bib | .pdf ]

Eugene Charniak. Parsing to Meaning, Statistically. In Canadian Conference on AI, page 442, 2000. [ bib | .pdf ]

Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of the first conference on North American chapter of the Association for Computational Linguistics, pages 132-139, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [ bib | .pdf | tech-report ]

Eugene Charniak, Yasemin Altun, Rodrigo de Salvo Braz, Benjamin Garrett, Margaret Kosmala, Tomer Moscovich, Lixin Pang, Changbee Pyo, Ye Sun, Wei Wy, Z. Yang, S. Zeller, and L. Zorn. Reading Comprehension Programs in a Statistical-Language-Processing Class. In ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems (ANLP/NAACL-00), 2000. [ bib | .pdf ]

Massimiliano Ciaramita and Mark Johnson. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics, 2000. [ bib | .pdf ]

Keith Hall and Thomas Hofmann. Learning Curved Multinomial Subfamilies for Natural Language Processing and Information Retrieval. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 351-358. Morgan Kaufmann, 2000. [ bib | .pdf ]

Mark Johnson and Brian Roark. Compact non-left-recursive grammars using the selective left-corner transform and factoring. In Proceedings of the 18th conference on Computational linguistics (COLING '00), pages 355-361, 2000. [ bib | .pdf ]

Mark Johnson and Stefan Riezler. Exploiting auxiliary distributions in stochastic unification-based grammars. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), pages 154-161, 2000. [ bib | .pdf ]

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. A novel use of statistical parsing to extract information from text. In Proceedings of the first conference on North American chapter of the Association for Computational Linguistics (NAACL'00), pages 226-233, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [ bib | .pdf ]

Stefan Riezler, Detlef Prescher, Jonas Kuhn, and Mark Johnson. Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-00), 2000. [ bib | .pdf ]

Brian Roark and Eugene Charniak. Measuring efficiency in high-accuracy, broad-coverage statistical parsing. In Proceedings of the COLING'00 Workshop on Efficiency in Large-scale Parsing Systems, pages 29-36, 2000. [ bib | .pdf ]

1999

Matthew Berland and Eugene Charniak. Finding parts in very large corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 57-64, 1999. [ bib | .pdf | tech-report ]

Don Blaheta and Eugene Charniak. Automatic compensation for parser figure-of-merit flaws. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL'99), pages 513-518, Morristown, NJ, 1999. Association for Computational Linguistics. [ bib | .pdf ]

Sharon A. Caraballo and Eugene Charniak. Determining the Specificity of Nouns from Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-99), 1999. [ bib | .ps ]

Mark Johnson. Type-driven semantic interpretation and Feature dependencies in R-LFG. Semantics and Syntax in Lexical Functional Grammar, pages 359-388, 1999. [ bib | .pdf ]

Mark Johnson. A Resource Sensitive Interpretation of Lexical Functional Grammar. Journal of Logic, Language and Information, 8(1):45-81, 1999. [ bib | .pdf | .ps ]

Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. Estimators for Stochastic Unification-Based Grammars. In 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 535-541, 1999. [ bib | .pdf ]

Brian Roark and Mark Johnson. Efficient probabilistic top-down and left-corner parsing. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99), pages 421-428, 1999. [ bib | .pdf ]

1998

Sharon Caraballo and Eugene Charniak. New Figures of Merit for Best-First Probabilistic Chart Parsing. Computational Linguistics, 24(2):275-298, 1998. [ bib | .pdf ]

Eugene Charniak, Sharon Goldwater, and Mark Johnson. Edge-Based Best-First Chart Parsing. In Sixth Workshop on Very Large Corpora, pages 127-133, 1998. [ bib | .pdf ]

Zhiyi Chi and Stuart Geman. Estimation of probabilistic context-free grammars. Computational Linguistics, 24(2):299-305, 1998. [ bib | .pdf ]

Niyu Ge, John Hale, and Eugene Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, Orlando, Florida, 1998. Harcourt Brace. [ bib | .pdf ]

This paper presents an algorithm for identifying pronominal anaphora and two experiments based upon this algorithm. We incorporate multiple anaphora resolution factors into a statistical framework - specifically the distance between the pronoun and the proposed antecedent, gender/number/animaticity of the proposed antecedent, governing head information and noun phrase repetition. We combine them into a single probability that enables us to identify the referent. Our first experiment shows the relative contribution of each source of information and demonstrates a success rate of 82.9% for all sources combined. The second experiment investigates a method for unsupervised learning of gender/number/animaticity information. We present some experiments illustrating the accuracy of the method and note that with this information added, our pronoun resolution method achieves 84.2% accuracy.

John Hale and Eugene Charniak. Getting Useful Gender Statistics from English Text. Technical Report CS-98-06, Brown University, Providence, RI, 1998. [ bib | .ps.Z | .html ]

Gender, understood as a lexical feature, is important for anaphora because it narrows down the number of possible referents involved in a typical pronoun resolution situation. This work describes an automatic method for obtaining reliable guesses about the gender of entities in a corpus using free text. By using a simple but unreliable anaphora algorithm repeatedly over a large corpus, the probable genders of referenced entities can be compiled and given a salience ranking. These statistics are an inexpensive way to add on gender-feature information to a statistical anaphora resolution algorithm.
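The compile-and-rank step described in this abstract can be sketched as simple counting. The sketch assumes some external, possibly unreliable resolver has already produced (entity, pronoun) pairs; the pronoun table, the input pairs, and the use of raw counts and relative frequency as the ranking signal are all hypothetical choices made here, not the report's actual salience measure.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from English pronouns to gender labels.
PRONOUN_GENDER = {
    "he": "masculine", "him": "masculine", "his": "masculine",
    "she": "feminine", "her": "feminine", "hers": "feminine",
    "it": "neuter", "its": "neuter",
}


def gender_statistics(resolved_pairs):
    """Map each entity to (most frequent gender, its relative frequency, total count)."""
    counts = defaultdict(Counter)
    for entity, pronoun in resolved_pairs:
        gender = PRONOUN_GENDER.get(pronoun.lower())
        if gender is not None:  # skip pronouns that carry no gender signal
            counts[entity][gender] += 1
    stats = {}
    for entity, gender_counts in counts.items():
        total = sum(gender_counts.values())
        gender, n = gender_counts.most_common(1)[0]
        stats[entity] = (gender, n / total, total)
    return stats


# Hypothetical resolver output; one of the three "Mr. Smith" links is wrong.
pairs = [("Mr. Smith", "he"), ("Mr. Smith", "his"), ("Mr. Smith", "she"),
         ("the company", "it"), ("the company", "they")]
stats = gender_statistics(pairs)
# Rank entities by how much evidence supports the guess (largest count first).
ranked = sorted(stats, key=lambda e: stats[e][2], reverse=True)
```

Because individual resolutions are noisy, the counts only become trustworthy when aggregated over a large corpus, which is the point the abstract makes.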

Mark Johnson. Proof Nets and the Complexity of Processing Center Embedded Constructions. Journal of Logic, Language and Information, 7(4):433-447, 1998. [ bib | .pdf ]

Mark Johnson. The Effect of Alternative Tree Representations on Tree Bank Grammars. In David M. W. Powers, editor, Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning: (NeMLaP3/CoNLL98), pages 39-48, Somerset, New Jersey, 1998. Association for Computational Linguistics. [ bib | .pdf ]

Mark Johnson. PCFG Models of Linguistic Tree Representations. Computational Linguistics, 24(4):613-632, 1998. [ bib | .pdf | .ps.gz ]

Mark Johnson. Finite-state Approximation of Constraint-based Grammars using Left-corner Grammar Transforms. In COLING-ACL, pages 619-623, 1998. [ bib | .pdf | .ps ]

1997

Eugene Charniak. Statistical Techniques for Natural Language Parsing. AI Magazine, 18(4):33-44, 1997. [ bib | .pdf | .ps ]

Eugene Charniak. Statistical Parsing with a Context-Free Grammar and Word Statistics. In Proceedings of AAAI, pages 598-603, 1997. [ bib | .pdf | tech-report | .ps ]

We describe a parsing system based upon a language model for English that is, in turn, based upon assigning probabilities to possible parses for a sentence. This model is used in a parsing system by finding the parse for the sentence with the highest probability. This system outperforms previous schemes. As this is the third in a series of parsers by different authors that are similar enough to invite detailed comparisons but different enough to give rise to different levels of performance, we also report on some experiments designed to identify what aspects of these systems best explain their relative performance.

Mark Johnson. Features as resources in R-LFG. In Proceedings of the 1997 LFG Conference, 1997. [ bib | .ps ]

1996

Sharon Caraballo and Eugene Charniak. Figures of Merit for Best-First Probabilistic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'96), pages 127-132, 1996. [ bib | tech-report | .pdf ]

Best-first parsing methods for natural language try to parse efficiently by considering the most likely constituents first. Some figure of merit is needed by which to compare the likelihood of constituents, and the choice of this figure has a substantial impact on the efficiency of the parser. While several parsers described in the literature have used such techniques, there is no published data on their efficacy, much less attempts to judge their relative merits. We propose and evaluate several figures of merit for best-first parsing.

Keywords: parsing, nlp

Eugene Charniak. Tree-bank Grammars. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-96), 1996. [ bib | tech-report ]

By a “tree-bank grammar” we mean a context-free grammar created by reading the production rules directly from hand-parsed sentences in a tree bank. Common wisdom has it that such grammars do not perform well, though we know of no published data on the issue. The primary purpose of this paper is to show that the common wisdom is wrong. In particular we present results on a tree-bank grammar based on the Penn Wall Street Journal tree bank. To the best of our knowledge, this grammar out-performs all other non-word-based statistical parsers/grammars on this corpus. That is, it out-performs parsers that consider the input as a string of tags and ignore the actual words of the corpus.
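The construction described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's code: it assumes trees in Penn-style bracketed notation, skips word-level (lexical) rules by default so that the grammar is over tags only, and estimates rule probabilities by relative frequency. The toy trees and the names `parse_tree`, `productions`, and `treebank_grammar` are hypothetical.

```python
from collections import Counter

def parse_tree(text):
    """Parse one bracketed tree, e.g. "(S (NP (NN dog)) (VP (VBD barked)))".
    Returns nested tuples (label, child, ...); a leaf child is a plain str."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, *children)

    return read()

def productions(tree, include_lexical=False):
    """Yield (lhs, rhs) rules read directly off the tree, top-down."""
    label, *children = tree
    is_lexical = all(isinstance(c, str) for c in children)
    if include_lexical or not is_lexical:
        yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for child in children:
        if not isinstance(child, str):
            yield from productions(child, include_lexical)

def treebank_grammar(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical two-sentence "tree bank" for illustration only.
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (NNP Kim)) (VP (VBD saw) (NP (DT a) (NN cat))))",
]

grammar = treebank_grammar([parse_tree(t) for t in TOY_TREEBANK])
for (lhs, rhs), prob in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {prob:.3f}")
```

On a real tree bank the same counting yields a very large rule set; how the paper handles rare rules or parses with the resulting grammar is not shown here.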

Eugene Charniak, Glenn Carroll, John Adcock, Anthony R. Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael L. Littman, and John McCann. Taggers for Parsers. Artificial Intelligence, 85(1-2):45-57, 1996. [ bib | tech-report | .ps ]

We consider what tagging models are most appropriate as front ends for probabilistic context-free-grammar parsers. In particular we ask if using a tagger that returns more than one tag, a “multiple tagger,” improves parsing performance. Our conclusion is somewhat surprising: single tag Markov-model taggers are quite adequate for the task. First of all, parsing accuracy, as measured by the correct assignment of parts of speech to words, does not increase significantly when parsers select the tags themselves. In addition, the work required to parse a sentence goes up with increasing tag ambiguity, though not as much as one might expect. Thus, for the moment, single taggers are the best taggers.

Eugene Charniak. Expected-Frequency Interpolation. Technical Report CS-96-37, Brown University, Providence, RI, 1996. [ bib | .html ]

Expected-frequency interpolation is a technique for improving the performance of deleted interpolation smoothing. It allows a system to make finer-grained estimates of how often one would expect to see a particular combination of events than is possible with traditional frequency interpolation. This allows the system to better weigh the emphasis given to the various probability distributions being mixed. We show that more traditional frequency interpolation, based solely on the frequency of conditioning events, can lead to some anomalous results. We then show that while the equations for expected-frequency interpolation are not exact, they are close, depending on how well some seemingly reasonable assumptions hold. We then present an experiment in which the introduction of expected-frequency interpolation to a statistical parsing system improved performance by 0.4% with essentially no extra work, and essentially no change in the workings of the system. We also note that even before the change, the system in question was the top performer at its task, so a 0.4% improvement was well worth obtaining.

Mark Johnson. Resource-sensitivity in Lexical-Functional Grammar. Proceedings of the 1996 Roma Workshop, 1996. [ bib ]

1995

Sam Bayer and Mark Johnson. Features and Agreement. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pages 70-76, 1995. [ bib | .pdf ]

This paper compares the consistency-based account of agreement phenomena in ‘unification-based’ grammars with an implication-based account based on a simple feature extension to Lambek Categorial Grammar (LCG). We show that the LCG treatment accounts for constructions that have been recognized as problematic for ‘unification-based’ treatments.

Eugene Charniak. Parsing with context-free grammars and word statistics. Technical Report CS-95-28, Brown University, Providence, RI, 1995. [ bib | .ps.Z ]

We present a language model in which the probability of a sentence is the sum of the individual parse probabilities, and these are calculated using a probabilistic context-free grammar (PCFG) plus statistics on individual words and how they fit into parses. We have used the model to improve syntactic disambiguation. After training on Wall Street Journal (WSJ) text we tested on about 200 WSJ sentences restricted to the 5400 most common words from our training. We observed a 41% reduction in bracket-crossing errors compared to the performance of our PCFG without the use of the word statistics.
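The first sentence of this abstract can be written out as a formula. The notation is an assumption introduced here for illustration: $s$ is a sentence, $\Pi(s)$ the set of its parses under the grammar, and $P(\pi, s)$ the model's joint probability of parse $\pi$ and sentence $s$. The second line, which picks the single most probable parse for disambiguation, is likewise an assumed reading of how such a model would be used, not a formula taken from the report.

```latex
P(s) = \sum_{\pi \in \Pi(s)} P(\pi, s)
\qquad
\hat{\pi}(s) = \operatorname*{arg\,max}_{\pi \in \Pi(s)} P(\pi, s)
```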

Murat Ersan and Eugene Charniak. A statistical syntactic disambiguation program and what it learns. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, 1995. [ bib | tech-report ]

Mark Johnson. Memoization in Top-Down Parsing. Computational Linguistics, 21(3):405-415, 1995. [ bib | .pdf ]

Mark Johnson and Sam Bayer. Features and Agreement in Lambek Categorial Grammar. In Proceedings of the 1995 ESSLLI Formal Grammar Workshop, pages 123-137, 1995. [ bib | .ps.Z ]

Mark Johnson and Jochen Dorre. Memoization of coroutined constraints. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pages 100-107, Morristown, NJ, USA, 1995. Association for Computational Linguistics. [ bib | .pdf ]

1994

Glenn Carroll and Eugene Charniak. Combining Grammars For Improved Learning. Technical Report CS-94-08, Department of Computer Science, Brown University, February 1994. [ bib | .pdf | .ps | .html ]

We report experimental work on improving learning methods for probabilistic context-free grammars (PCFGs). From stacked regression we borrow the basic idea of combining grammars. Smoothing, a domain-independent method for combining grammars, does not offer noticeable performance gains. However, PCFGs allow much tighter, domain-dependent coupling, and we show that this may be exploited for significant performance gains. Finally, we compare two strategies for acquiring the varying grammars needed for any combining method. We suggest that an unorthodox strategy, “leave-one-in” learning, is more effective than the more familiar “leave-one-out”.

Eugene Charniak, Glenn Carroll, John Adcock, Anthony Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael Littman, and John McCann. Expected-Frequency Interpolation. Technical Report CS-94-06, Brown University, Providence, RI, 1994. [ bib | .ps.Z ]

Eugene Charniak and Glenn Carroll. Context-Sensitive Statistics for Improved Grammatical Language Models. Technical Report CS-94-07, Brown University, Providence, RI, 1994. [ bib | .ps.Z | .html ]

We develop a language model using probabilistic context-free grammars (PCFGs) that is “pseudo context-sensitive” in that the probability that a non-terminal $N$ expands using a rule $r$ depends on $N$'s parent. We derive the equations for estimating the necessary probabilities using a variant of the inside-outside algorithm. We give experimental results showing that, beginning with a high-performance PCFG, one can develop a pseudo PCSG that yields significant performance gains. Analysis shows that the benefits from the context-sensitive statistics are localized, suggesting that we can use them to extend the original PCFG. Experimental results confirm both that this is feasible and that the resulting grammar retains the performance gains. This implies that our scheme may be useful as a novel method for PCFG induction.
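The difference between the two models in this abstract can be stated compactly. The notation below is an assumption introduced for illustration: $\pi$ is a parse, the product runs over the rule applications $N \to \alpha$ in $\pi$, and $\mathrm{parent}(N)$ is the non-terminal immediately dominating that occurrence of $N$. The first line is a plain PCFG; the second conditions each expansion on the parent as well. The report's inside-outside estimation equations are not reproduced here.

```latex
P_{\text{PCFG}}(\pi) = \prod_{(N \to \alpha) \in \pi} P(N \to \alpha \mid N)
\qquad
P_{\text{pseudo-CS}}(\pi) = \prod_{(N \to \alpha) \in \pi} P\bigl(N \to \alpha \mid N, \mathrm{parent}(N)\bigr)
```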

Mark Johnson. Computing with Features as Formulae. Computational Linguistics, 20(1):1-25, 1994. [ bib | .pdf ]

1993

Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. Equations for Part-of-Speech Tagging. In National Conference on Artificial Intelligence, pages 784-789, 1993. [ bib | .ps ]

We derive from first principles the basic equations for a few of the basic hidden-Markov-model word taggers as well as equations for other models which may be novel (the descriptions in previous papers being too spare to be sure). We give performance results for all of the models. The results from our best model (96.45% on an unused test sample from the Brown corpus with 181 distinct tags) are on the upper edge of reported results. We also hope these results clear up some confusion in the literature about the best equations to use. However, the major purpose of this paper is to show how the equations for a variety of models may be derived and thus encourage future authors to give the equations for their model and the derivations thereof.

Eugene Charniak. Statistical Language Learning. The MIT Press, Cambridge, Massachusetts, 1993. [ bib | http ]

1992

Glenn Carroll and Eugene Charniak. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. Technical Report CS-92-16, Brown University, Providence, RI, USA, 1992. [ bib | .pdf ]

We present a scheme for learning probabilistic dependency grammars from positive training examples plus constraints on rules. In particular, we present the results of two experiments. The first, in which the constraints were minimal, was unsuccessful. The second, with significant constraints, was successful within the bounds of the task we had set.




Last update: Saturday, March 01 2008, 10:07 PM