Lab Publications
By Year: 2016 · 2015 · 2014 · 2013 · 2012 · 2011 · 2010 · 2009 · 2008 · 2007 · 2006 · 2005 · 2004 · 2003 · 2002 · 2001 · 2000 · 1999 · 1998 · 1997 · 1996 · 1995 · 1994 · 1993 · 1992 · By Author: Eugene Charniak · Micha Elsner · Heidi Fox · Stuart Geman · Will Headden · Mark Johnson · Matt Lease · David McClosky · Rebecca Mason · Ben Swanson · Do Kook Choe · Chris Tanner
2016
2015
Byron C. Wallace, Do Kook Choe, and Eugene Charniak. Sparse, Contextually Informed Models for Irony Detection: Exploiting User Communities, Entities and Sentiment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1035-1044, Beijing, China, July 2015. Association for Computational Linguistics. [ bib | .pdf ]
Do Kook Choe and David McClosky. Parsing Paraphrases with Joint Inference. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1223-1233, Beijing, China, July 2015. Association for Computational Linguistics. [ bib | .pdf ]
Do Kook Choe, David McClosky, and Eugene Charniak. Syntactic Parse Fusion. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1360-1366, Lisbon, Portugal, September 2015. Association for Computational Linguistics. [ bib | .pdf ]
Chris Tanner and Eugene Charniak. A Hybrid Generative/Discriminative Approach To Citation Prediction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 75-83, Denver, Colorado, May-June 2015. Association for Computational Linguistics. [ bib | .pdf ]
2014
Rebecca Mason and Eugene Charniak. Domain-Specific Image Captioning. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 11-20, Ann Arbor, Michigan, June 2014. Association for Computational Linguistics. [ bib | .pdf ]
Rebecca Mason and Eugene Charniak. Nonparametric Method for Data-driven Image Captioning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 592-598, Baltimore, Maryland, June 2014. Association for Computational Linguistics. [ bib | .pdf ]
Byron C. Wallace, Do Kook Choe, Laura Kertz, and Eugene Charniak. Humans Require Context to Infer Ironic Intent (so Computers Probably do, too). In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 512-516, Baltimore, Maryland, June 2014. Association for Computational Linguistics. [ bib | .pdf ]
Ben Swanson and Eugene Charniak. Data Driven Language Transfer Hypotheses. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 169-173, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. [ bib | .pdf ]
2013
Rebecca Mason. Domain-Independent Captioning of Domain-Specific Images. In Proceedings of the 2013 NAACL HLT Student Research Workshop, pages 69-76, Atlanta, Georgia, June 2013. Association for Computational Linguistics. [ bib | .pdf ]
Rebecca Mason and Eugene Charniak. Annotation of Online Shopping Images without Labeled Training Examples. In Proceedings of Workshop on Vision and Language, Atlanta, Georgia, June 2013. Association for Computational Linguistics. [ bib | .pdf ]
Do Kook Choe and Eugene Charniak. Naive Bayes Word Sense Induction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1433-1437, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. [ bib | .pdf ]
Ben Swanson and Eugene Charniak. Extracting the Native Language Signal for Second Language Acquisition. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 85-94, Atlanta, Georgia, June 2013. Association for Computational Linguistics. [ bib | .pdf ]
Ben Swanson, Elif Yamangil, Eugene Charniak, and Stuart Shieber. A Context Free TAG Variant. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 302-310, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. [ bib | .pdf ]
2012
Rebecca Mason and Eugene Charniak. Apples to Oranges: Evaluating Image Annotations from Natural Language Processing Systems. In NAACL-2012: Main Proceedings, Montreal, Canada, 2012. Association for Computational Linguistics. [ bib | .pdf ]
Ben Swanson and Eugene Charniak. Native language detection with tree substitution grammars. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, pages 193-197, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. [ bib | .pdf ]
Ben Swanson and Elif Yamangil. Correction detection and error type selection as an ESL educational aid. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 357-361, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. [ bib | .pdf ]
2011
Micha Elsner and Eugene Charniak. Disentangling chat with local coherence models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1179-1189, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [ bib | .pdf ]
Micha Elsner and Deepak Santhanam. Learning to fuse disparate sentences. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, MTTG '11, pages 54-63, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [ bib | .pdf ]
Micha Elsner and Eugene Charniak. Extending the entity grid with entity-specific features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 125-129, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [ bib | .pdf ]
Rebecca Mason and Eugene Charniak. Extractive Multi-Document Summaries Should Explicitly Not Contain Document-Specific Content. In Proceedings of the ACL 2011 Workshop on Automatic Summarization for Different Genres, Media, and Languages, Portland, Oregon, 2011. Association for Computational Linguistics. [ bib | .pdf ]
Rebecca Mason and Eugene Charniak. BLLIP at TAC 2011: A General Summarization System for a Guided Summarization Task. In Proceedings of TAC 2011, 2011. [ bib | .pdf ]
2010
Eugene Charniak. Top-down nearly-context-sensitive parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 674-683, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [ bib | .pdf ]
Micha Elsner and Eugene Charniak. The Same-head Heuristic for Coreference. In Proceedings of ACL 10, Uppsala, Sweden, July 2010. Association for Computational Linguistics. [ bib | .pdf ]
David McClosky, Eugene Charniak, and Mark Johnson. Automatic domain adaptation for parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 28-36, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [ bib | .pdf ]
2009
Eugene Charniak and Micha Elsner. EM Works for Pronoun Anaphora Resolution. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece, 2009. [ bib | .pdf ]
Micha Elsner and Warren Schudy. Bounding and Comparing Methods for Correlation Clustering Beyond ILP. In Proceedings of the NAACL/HLT 2009 Workshop on Integer Linear Programming for Natural Language Processing (ILP-NLP '09), Boulder, Colorado, June 2009. [ bib | .pdf ]
Micha Elsner, Eugene Charniak, and Mark Johnson. Structured Generative Models for Unsupervised Named-Entity Clustering. In Proceedings of NAACL-09: HLT, Boulder, Colorado, June 2009. Association for Computational Linguistics. [ bib | .pdf ]
William P. Headden III, Mark Johnson, and David McClosky. Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference (to appear), Boulder, Colorado, May 2009. [ bib ]
Matthew Lease. An Improved Markov Random Field Model for Supporting Verbose Queries. In Proceedings of the 32nd Annual ACM SIGIR Conference, 2009. 16% acceptance rate, to appear. [ bib ]
Recent work in supervised learning of term-based retrieval models has shown that significantly improved accuracy can often be achieved in practice via better model estimation. In this paper, we show retrieval accuracy with the Markov random field (MRF) approach can be similarly improved via supervised estimation. While the original MRF method estimates a parameter for each feature class from data, parameters within each class are set using the same fixed weighting scheme as the standard unigram. Because this scheme does not model context-sensitivity, its use particularly limits retrieval accuracy with verbose queries. By employing supervised estimation instead, this deficit can be remedied. Retrieval experiments with verbose queries on three TREC document collections show our improved MRF consistently out-performs both the original MRF and the supervised unigram model. Additional experiments using blind-feedback and evaluation with optimal weighting demonstrate both the immediate value and further potential of more accurate MRF model estimation.
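Schematically (notation mine, not the paper's), the MRF ranking function whose weights are being re-estimated has the form

\[
\mathrm{score}(D, Q) \;=\; \sum_{c \in \mathcal{C}(Q)} \lambda_c \, f_c(c, D),
\]

where each clique $c$ over query terms contributes a match feature $f_c$ (single terms, ordered windows, unordered windows), and the weights $\lambda_c$ are what supervised estimation sets per term rather than by a fixed within-class scheme.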
Matthew Lease, James Allan, and W. Bruce Croft. Regression Rank: Learning to Meet the Opportunity of Descriptive Queries. In Proceedings of the 31st European Conference on Information Retrieval (ECIR), pages 90-101, 2009. 22% acceptance rate. [ bib | .pdf ]
We present a new learning to rank framework for estimating context-sensitive term weights without use of feedback. Specifically, knowledge of effective term weights on past queries is used to estimate term weights for new queries. This generalization is achieved by introducing secondary features correlated with term weights and applying regression to predict term weights given features. To improve support for more focused retrieval like question answering, we conduct document retrieval experiments with TREC description queries on three document collections. Results show significantly improved retrieval accuracy.
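A minimal sketch of the regression step described above, using hypothetical secondary features (IDF, a noun indicator, relative position); this is illustrative only, not the authors' implementation:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Past queries: one row of secondary features per query term,
    # paired with the term weight found effective on that query.
    X_train = np.array([[4.2, 1.0, 0.0],
                        [1.1, 0.0, 0.5],
                        [6.7, 1.0, 1.0]])
    y_train = np.array([0.45, 0.05, 0.50])

    model = Ridge(alpha=1.0).fit(X_train, y_train)

    # New query: featurize each term the same way and predict a
    # context-sensitive weight for it -- no feedback required.
    X_new = np.array([[3.0, 1.0, 0.2],
                      [0.8, 0.0, 0.9]])
    print(model.predict(X_new))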
2008
Micha Elsner and Eugene Charniak. You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement. In Proceedings of ACL-08: HLT, pages 834-842, Columbus, Ohio, June 2008. Association for Computational Linguistics. [ bib | .pdf | slides ]
Micha Elsner and Eugene Charniak. Coreference-inspired Coherence Modeling. In Proceedings of ACL-08: HLT, Short Papers, pages 41-44, Columbus, Ohio, June 2008. Association for Computational Linguistics. [ bib | .pdf | poster ]
William P. Headden III, David McClosky, and Eugene Charniak. Evaluating Unsupervised Part-of-Speech Tagging for Grammar Induction. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08), Manchester, UK, August 2008. [ bib | .pdf | .ps ]
Matthew Lease. Incorporating Relevance and Pseudo-relevance Feedback in the Markov Random Field Model: Brown at the TREC'08 Relevance Feedback Track. In Proceedings of the 17th Text Retrieval Conference (TREC'08), 2008. Best results in track. This paper supersedes an earlier version appearing in the conference's Working Notes. [ bib | .pdf ]
We present a new document retrieval approach combining relevance feedback, pseudo-relevance feedback, and Markov random field modeling of term interaction. Overall effectiveness of our combined model and the relative contribution from each component is evaluated on the GOV2 webpage collection. Given 0-5 feedback documents, we find each component contributes unique value to the overall ensemble, achieving significant improvement individually and in combination. Comparative evaluation in the 2008 TREC Relevance Feedback track further shows our complete system typically performs as well or better than peer systems.
Matthew Lease and Eugene Charniak. A Dirichlet-smoothed Bigram Model for Retrieving Spontaneous Speech. In Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Revised Selected Papers, volume 5152 of Lecture Notes in Computer Science. Springer-Verlag, 2008. [ bib | .pdf ]
David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. In Proceedings of ACL-08: HLT, Short Papers, pages 101-104, Columbus, Ohio, June 2008. Association for Computational Linguistics. [ bib | .pdf ]
David McClosky, Eugene Charniak, and Mark Johnson. When is Self-training Effective for Parsing? In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08), Manchester, UK, August 2008. [ bib | .pdf | .ps ]
David McClosky. Modeling Valence Effects in Unsupervised Grammar Induction. Technical Report CS-09-01, Brown University, Providence, RI, USA, 2008. [ bib | tech-report ]
We extend the dependency grammar induction model of Klein and Manning (2004) to incorporate further valence information. Our extensions achieve significant improvements in the task of unsupervised dependency grammar induction. We use an expanded grammar which tracks higher orders of valence and allows each valence slot to be filled by a separate distribution rather than using one distribution for all slots. Additionally, we show that our performance improves if our grammar restricts the maximum number of attachments in each direction, forcing our system to focus on the common case. Taken together, these techniques constitute a 23.4% error reduction in dependency grammar induction over the model by Klein and Manning (2004) on English.
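In rough notation (mine, not the report's), the extension conditions each attachment on the valence slot being filled:

\[
P(\mathrm{dep} = d \mid \mathrm{head} = h, \;\mathrm{dir}, \; v), \qquad v = 1, \dots, V_{\max},
\]

with a separate distribution for each slot $v$ and a cap $V_{\max}$ on the number of attachments in each direction.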
2007
Micha Elsner, Joseph Austerweil, and Eugene Charniak. A Unified Local and Global Model for Discourse Coherence. In Proceedings of HLT-NAACL '07, Rochester, New York, April 2007. Association for Computational Linguistics. [ bib | .pdf | slides ]
Micha Elsner and Eugene Charniak. A Generative Discourse-New Model for Text Coherence. Technical Report CS-07-04, Brown University, Providence, RI, USA, 2007. [ bib | .pdf ]
Recent models of document coherence have focused on the referents of noun phrases, ignoring their syntax. However, syntax depends on discourse function; NPs which introduce new entities are often more complex. We develop a generative model for NP syntax which describes this difference. It can be used to model discourse coherence in the Wall Street Journal; combining it with the local coherence model of Elsner ('07) yields substantial improvements. Our model is competitive with previous systems on the discourse-new detection task; its performance is comparable to Uryupina ('03).
Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing. In Proceedings of the Association for Computational Linguistics (ACL'07), 2007. [ bib ]
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Distributional Cues to Word Segmentation: Context is Important. In Proceedings of the 31st Boston University Conference on Language Development, 2007. [ bib | .pdf ]
Mark Johnson. Why Doesn't EM Find Good HMM POS-Taggers? In Proceedings of Empirical Methods in Natural Language Processing (EMNLP'07), 2007. [ bib ]
Mark Johnson. Transforming Projective Bilexical Dependency Grammars into Efficiently-Parsable CFGs with Unfold-Fold. In Proceedings of the Association for Computational Linguistics (ACL'07), 2007. [ bib ]
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of the North American Conference on Computational Linguistics (NAACL'07), 2007. [ bib | .pdf ]
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. Adaptor Grammars: a Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems 19, 2007. [ bib | .pdf ]
Matthew Lease and Eugene Charniak. Brown at CL-SR'07: Retrieving Conversational Speech in English and Czech. In Working Notes of the Cross-Language Evaluation Forum (CLEF): Cross-Language Speech Retrieval (CL-SR) track, 2007. Corrected version. [ bib | .pdf ]
Matthew Lease. Natural Language Processing for Information Retrieval: the time is ripe (again). In Proceedings of the 1st Ph.D. Workshop at the ACM Conference on Information and Knowledge Management (PIKM), 2007. Best Paper award. [ bib | .pdf ]
Paraphrasing van Rijsbergen, the time is ripe for another attempt at using natural language processing (NLP) for information retrieval (IR). This paper introduces my dissertation study, which will explore methods for integrating modern NLP with state-of-the-art IR techniques. In addition to text, I will also apply retrieval to conversational speech data, which poses a unique set of considerations in comparison to text. Greater use of NLP has potential to improve both text and speech retrieval.
Jenine Turner and Eugene Charniak. Language Modeling for Determiner Selection. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 177-180, Rochester, New York, April 2007. Association for Computational Linguistics. [ bib | .pdf ]
2006
Ann Bies, Stephanie Strassel, Haejoong Lee, Kazuaki Maeda, Seth Kulick, Yang Liu, Mary Harper, and Matthew Lease. Linguistic Resources for Speech Parsing. In Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy, 2006. [ bib | .pdf ]
Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R. Shrivaths, Jeremy Moore, Michael Pozar, and Theresa Vu. Multilevel Coarse-to-Fine PCFG Parsing. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL'06), pages 168-175, New York City, USA, June 2006. Association for Computational Linguistics. [ bib | .pdf | slides ]
Sharon Goldwater, Tom Griffiths, and Mark Johnson. Interpolating between types and tokens by estimating power-law generators. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 459-466, Cambridge, MA, 2006. MIT Press. [ bib | .pdf ]
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Contextual Dependencies in Unsupervised Word Segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'06), pages 673-680, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]
John Hale, Izhak Shafran, Lisa Yung, Bonnie J. Dorr, Mary Harper, Anna Krasnyanskaya, Matthew Lease, Yang Liu, Brian Roark, Matthew Snover, and Robin Stewart. PCFGs with Syntactic and Prosodic Indicators of Speech Repairs. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'06), pages 161-168, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]
William P. Headden III, Eugene Charniak, and Mark Johnson. Learning Phrasal Categories. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 301-307, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf ]
Matthew Lease, Mark Johnson, and Eugene Charniak. Recognizing disfluencies in conversational speech. IEEE Transactions on Audio, Speech and Language Processing, 14(5):1566-1573, September 2006. [ bib | .pdf ]
We present a system for modeling disfluency in conversational speech: repairs, fillers, and self-interruption points (IPs). For each sentence, candidate repair analyses are generated by a stochastic tree adjoining grammar (TAG) noisy-channel model. A probabilistic syntactic language model scores the fluency of each analysis, and a maximum-entropy model selects the most likely analysis given the language model score and other features. Fillers are detected independently via a small set of deterministic rules, and IPs are detected by combining the output of repair and filler detection modules. In the recent Rich Transcription Fall 2004 (RT-04F) blind evaluation, systems competed to detect these three forms of disfluency under two input conditions: a best-case scenario of manually transcribed words and a fully automatic case of automatic speech recognition (ASR) output. For all three tasks and on both types of input, our system was the top performer in the evaluation.
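Written out schematically (my notation), the noisy-channel decomposition used here scores a candidate fluent analysis $X$ of the observed string $Y$ as

\[
\hat{X} \;=\; \arg\max_{X} P(X \mid Y) \;\propto\; P(Y \mid X)\,P(X),
\]

with the TAG channel model supplying $P(Y \mid X)$ (the insertion of repairs), the syntactic language model supplying $P(X)$, and the maximum-entropy model choosing among candidates using these scores plus other features.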
Keywords: Disfluency modeling, natural language processing, rich transcription, speech processing
Matthew Lease, Eugene Charniak, Mark Johnson, and David McClosky. A Look At Parsing and Its Applications. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 16-20 July 2006. [ bib | .pdf ]
Matthew Lease and Mark Johnson. Early Deletion of Fillers In Processing Conversational Speech. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL'06), Companion Volume: Short Papers, pages 73-76, New York City, USA, June 2006. Association for Computational Linguistics. Version here corrects Table 2 in published version. [ bib | .pdf ]
David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL'06), pages 337-344, Sydney, Australia, July 2006. Association for Computational Linguistics. [ bib | .pdf | .ps ]
David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152-159, New York City, USA, June 2006. Association for Computational Linguistics. [ bib | .pdf | slides | .ps ]
B. Roark, Yang Liu, M. Harper, R. Stewart, M. Lease, M. Snover, I. Shafran, B. Dorr, J. Hale, A. Krasnyanskaya, and L. Yung. Reranking for Sentence Boundary Detection in Conversational Speech. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'06), pages 545-548, May 14-19 2006. [ bib | .pdf ]
We present a reranking approach to sentence-like unit (SU) boundary detection, one of the EARS metadata extraction tasks. Techniques for generating relatively small n-best lists with high oracle accuracy are presented. For each candidate, features are derived from a range of information sources, including the output of a number of parsers. Our approach yields significant improvements over the best performing system from the NIST RT-04F community evaluation.
Brian Roark, Mary Harper, Eugene Charniak, Bonnie Dorr, Mark Johnson, Jeremy G. Kahn, Yang Liu, Mari Ostendorf, John Hale, Anna Krasnyanskaya, Matthew Lease, Izhak Shafran, Matthew Snover, Robin Stewart, and Lisa Yung. SParseval: Evaluation Metrics for Parsing Speech. In Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy, 2006. [ bib | .pdf ]
2005
Eugene Charniak and Mark Johnson. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 173-180, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]
Micha Elsner, Mary Swift, James Allen, and Daniel Gildea. Online Statistics for a Unification-Based Dialogue Parser. In Proceedings of the Ninth International Workshop on Parsing Technology (IWPT'05), pages 198-199, Vancouver, British Columbia, October 2005. Association for Computational Linguistics. [ bib | .pdf | poster ]
Heidi Fox. Dependency-Based Statistical Machine Translation. In Proceedings of the ACL Student Research Workshop, pages 91-96, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]
Dmitriy Genzel. Inducing a Multilingual Dictionary from a Parallel Multitext in Related Languages. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 875-882, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf ]
Sharon Goldwater and David McClosky. Improving Statistical MT through Morphological Analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP'05), pages 676-683, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf | .ps ]
Sharon Goldwater and Mark Johnson. Representational Bias in Unsupervised Learning of Syllable Structure. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 112-119, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]
Jeremy G. Kahn, Matthew Lease, Eugene Charniak, Mark Johnson, and Mari Ostendorf. Effective Use of Prosody in Parsing Conversational Speech. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP'05), pages 233-240, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. [ bib | .pdf ]
Matthew Lease. Parsing and Disfluency Modeling. Technical Report CS-05-15, Brown University Department of Computer Science, 2005. [ bib | tech-report ]
Matthew Lease, Eugene Charniak, and Mark Johnson. Parsing and its applications for conversational speech. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 5, pages 961-964, March 18 - March 23 2005. [ bib | .pdf ]
This paper provides an introduction to recent work in statistical parsing and its applications for conversational speech, with particular emphasis on the relationship between parsing and detecting speech repairs. While historically parsing and repair detection have been studied independently, we present a line of research which has spanned the boundary between the two and demonstrated the efficacy of this synergistic approach. Our presentation highlights successes to date, remaining challenges, and promising future work.
Matthew Lease and Eugene Charniak. Parsing Biomedical Literature. In R. Dale, K.-F. Wong, J. Su, and O. Kwong, editors, Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP'05), volume 3651 of Lecture Notes in Computer Science, pages 58-69, Jeju Island, Korea, October 11-13 2005. Springer-Verlag. [ bib | .pdf ]
We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB-training: part-of-speech tags, dictionary collocations, and named-entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically-adapted parser achieves a 14.2% reduction in error. With oracle-knowledge of named-entities, this error reduction improves to 21.2%.
Heng Lian. Chinese Language Parsing with Maximum-Entropy-Inspired Parser. Master's thesis, Brown University, Providence, RI, 2005. [ bib | .pdf ]
The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art parsers on Chinese is much worse than on English, with an f-score about 10% below that for English. We present the results of a maximum-entropy-inspired parser [3] on Penn Chinese Treebank 1.0 and 4.0, achieving precision/recall of 78.6/75.6 on CTB 1.0 and 79.1/75.0 on CTB 4.0. We also apply the MaxEnt reranker [4] to the 50 best parses and obtain about a 6% error reduction. When applied directly to unsegmented sentences, the parser likewise achieves state-of-the-art performance.
Jenine Turner and Eugene Charniak. Supervised and Unsupervised Learning for Sentence Compression. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 290-297, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [ bib | .pdf ]
2004
Massimiliano Ciaramita and Mark Johnson. Multi-component Word Sense Disambiguation. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 97-100, Barcelona, Spain, July 2004. Association for Computational Linguistics. [ bib | .pdf ]
Sharon Goldwater and Mark Johnson. Priors in Bayesian Learning of Phonological Rules. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology, pages 35-42, Barcelona, Spain, July 2004. Association for Computational Linguistics. [ bib | .pdf ]
Michelle Gregory and Yasemin Altun. Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 677-683, Barcelona, Spain, July 2004. [ bib | .pdf ]
Michelle Gregory, Mark Johnson, and Eugene Charniak. Sentence-Internal Prosody Does not Help Parsing the Way Punctuation Does. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 81-88, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. [ bib | .pdf ]
Keith B. Hall and Mark Johnson. Attention Shifting for Parsing Speech. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 40-46, Barcelona, Spain, July 2004. [ bib | .pdf ]
Mark Johnson and Eugene Charniak. A TAG-based noisy-channel model of speech repairs. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 33-39, Barcelona, Spain, July 2004. [ bib | .pdf ]
Mark Johnson, Eugene Charniak, and Matthew Lease. An Improved Model For Recognizing Disfluencies in Conversational Speech. In Rich Transcription 2004 Fall Workshop (RT-04F), 2004. [ bib | .pdf ]
Ron Kaplan, Stefan Riezler, Tracy H King, John T Maxwell III, Alex Vasserman, and Richard Crouch. Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 97-104, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. [ bib | .ps | .pdf ]
Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm. In ACL, pages 47-54, 2004. [ bib ]
2003
Yasemin Altun, Mark Johnson, and Thomas Hofmann. Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 145-152, 2003. [ bib | .pdf ]
Yasemin Altun and Thomas Hofmann. Large Margin Methods for Label Sequence Learning. In Proceedings of the Eighth European Conference on Speech Communication and Technology (EuroSpeech'03), 2003. [ bib | .pdf ]
Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields.
Eugene Charniak, Kevin Knight, and Kenji Yamada. Syntax-based Language Models for Statistical Machine Translation. In Proceedings of the Ninth Machine Translation Summit of the International Association for Machine Translation, New Orleans, Louisiana, September 2003. [ bib | .pdf ]
Massimiliano Ciaramita, Thomas Hofmann, and Mark Johnson. Hierarchical Semantic Classification: Word Sense Disambiguation with World Knowledge. In Georg Gottlob and Toby Walsh, editors, IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, pages 817-822. Morgan Kaufmann, 2003. [ bib | .pdf | .ps ]
Massimiliano Ciaramita and Mark Johnson. Supersense Tagging of Unknown Nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), pages 168-175, 2003. [ bib | .pdf ]
Stuart Geman and Mark Johnson. Probability and statistics in computational linguistics, a brief review. Mathematical foundations of speech and language processing, 138:1-26, 2003. [ bib | .pdf ]
Dmitriy Genzel and Eugene Charniak. Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP'03), pages 65-72, 2003. [ bib | .pdf ]
Sharon Goldwater and Mark Johnson. Learning OT Constraint Rankings Using a Maximum Entropy Model. In Proceedings of the Workshop on Variation within Optimality Theory, Stockholm University, 2003. [ bib | .pdf | .ps ]
Keith Hall and Mark Johnson. Language modelling using efficient best-first bottom-up parsing. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2003), 2003. [ bib | .pdf ]
Thomas Hofmann, Lijuan Cai, and Massimiliano Ciaramita. Learning with taxonomies: Classifying documents and words. In Workshop on Syntax, Semantics and Statistics (NIPS-03), 2003. [ bib | .pdf ]
Mark Johnson. Learning and Parsing Stochastic Unification-Based Grammars. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pages 671-683. Springer, 2003. [ bib | .pdf ]
2002
Yasemin Altun, Thomas Hofmann, and Mark Johnson. Discriminative Learning for Label Sequences via Boosting. In Proceedings of Neural Information Processing Systems (NIPS02), 2002. [ bib | .pdf ]
Don Blaheta. Handling noisy training and testing data. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, Pennsylvania, July 2002. [ bib | .pdf ]
Massimiliano Ciaramita. Boosting automatic lexical acquisition with morphological information. In Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pages 17-25, Philadelphia, July 2002. Association for Computational Linguistics. [ bib | .ps | .pdf ]
Donald Engel, Eugene Charniak, and Mark Johnson. Parsing and Disfluency Placement. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 49-54, 2002. [ bib | .pdf ]
Heidi Fox. Phrasal Cohesion and Statistical Machine Translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 304-311, Philadelphia, Pennsylvania, July 2002. Association for Computational Linguistics. [ bib | .pdf ]
Stuart Geman and Mark Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL'02), pages 279-286, Morristown, NJ, USA, 2002. Association for Computational Linguistics. [ bib | .pdf ]
Stuart Geman and Mark Johnson. Probabilistic Grammars and their Applications. In N.J. Smelser and P.B. Baltes, editors, International Encyclopedia of the Social & Behavioral Sciences, pages 12075-12082, Pergamon, Oxford, 2002. [ bib | .pdf ]
Dmitriy Genzel and Eugene Charniak. Entropy Rate Constancy in Text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), 2002. [ bib | .pdf ]
Mark Johnson. The DOP Estimation Method is Biased and Inconsistent. Computational Linguistics, 28(1):71-76, 2002. [ bib | .pdf ]
Mark Johnson. A Simple Pattern-matching Algorithm for Recovering Empty Nodes and their Antecedents. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 136-143, 2002. [ bib | .pdf | .ps ]
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell III, and Mark Johnson. Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 271-278, 2002. [ bib | .pdf ]
2001
Yasemin Altun and Mark Johnson. Inducing SFA with Epsilon-Transitions Using Minimum Description Length. In Finite State Methods in Natural Language Processing Workshop, ESSLLI 2001, 2001. [ bib | .pdf ]
Don Blaheta and Mark Johnson. Unsupervised learning of multi-word verbs. In Proceedings of the 2001 ACL Workshop on Collocation, 2001. [ bib | .pdf ]
Eugene Charniak and Mark Johnson. Edit Detection and Parsing for Transcribed Speech. In Proceedings of the Second Conference of the North American chapter of the Association for Computational Linguistics (NAACL '01), 2001. [ bib | .pdf ]
Eugene Charniak. Immediate-Head Parsing for Language Models. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, pages 124-131, 2001. [ bib | .pdf | .ps ]
Eugene Charniak. Unsupervised Learning of Name Structure From Coreference Data. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), 2001. [ bib | .pdf ]
Keith Hall. A Statistical Model of Nominal Anaphora. Master's thesis, Brown University, Providence, RI, 2001. [ bib | .pdf ]
Mark Johnson. Joint and Conditional Estimation of Tagging and Parsing Models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01), 2001. [ bib | .pdf ]
Brian Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276, 2001. [ bib | .pdf ]
2000
Don Blaheta and Eugene Charniak. Assigning function tags to parsed text. In Proceedings of the First Conference of the North American chapter of the Association for Computational Linguistics (NAACL '00), pages 234-240, 2000. [ bib | .pdf ]
Eugene Charniak. Parsing to Meaning, Statistically. In Canadian Conference on AI, page 442, 2000. [ bib | .pdf ]
Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of the first conference on North American chapter of the Association for Computational Linguistics, pages 132-139, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [ bib | .pdf | tech-report ]
Eugene Charniak, Yasemin Altun, Rodrigo de Salvo Braz, Benjamin Garrett, Margaret Kosmala, Tomer Moscovich, Lixin Pang, Changbee Pyo, Ye Sun, Wei Wy, Z. Yang, S. Zeller, and L. Zorn. Reading Comprehension Programs in a Statistical-Language-Processing Class. In ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems (ANLP/NAACL-00), 2000. [ bib | .pdf ]
Massimiliano Ciaramita and Mark Johnson. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics, 2000. [ bib | .pdf ]
Keith Hall and Thomas Hofmann. Learning Curved Multinomial Subfamilies for Natural Language Processing and Information Retrieval. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 351-358. Morgan Kaufmann, 2000. [ bib | .pdf ]
Mark Johnson and Brian Roark. Compact non-left-recursive grammars using the selective left-corner transform and factoring. In Proceedings of the 18th conference on Computational linguistics (COLING '00), pages 355-361, 2000. [ bib | .pdf ]
Mark Johnson and Stefan Riezler. Exploiting auxiliary distributions in stochastic unification-based grammars. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), pages 154-161, 2000. [ bib | .pdf ]
Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. A novel use of statistical parsing to extract information from text. In Proceedings of the first conference on North American chapter of the Association for Computational Linguistics (NAACL'00), pages 226-233, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [ bib | .pdf ]
Stefan Riezler, Detlef Prescher, Jonas Kuhn, and Mark Johnson. Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-00), 2000. [ bib | .pdf ]
Brian Roark and Eugene Charniak. Measuring efficiency in high-accuracy, broad-coverage statistical parsing. In Proceedings of the COLING'00 Workshop on Efficiency in Large-scale Parsing Systems, pages 29-36, 2000. [ bib | .pdf ]
1999
Matthew Berland and Eugene Charniak. Finding parts in very large corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), pages 57-64, 1999. [ bib | .pdf | tech-report ]
Don Blaheta and Eugene Charniak. Automatic compensation for parser figure-of-merit flaws. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL'99), pages 513-518, Morristown, NJ, 1999. Association for Computational Linguistics. [ bib | .pdf ]
Sharon A. Caraballo and Eugene Charniak. Determining the Specificity of Nouns from Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-99), 1999. [ bib | .ps ]
Mark Johnson. Type-driven semantic interpretation and Feature dependencies in R-LFG. Semantics and Syntax in Lexical Functional Grammar, pages 359-388, 1999. [ bib | .pdf ]
Mark Johnson. A Resource Sensitive Interpretation of Lexical Functional Grammar. Journal of Logic, Language and Information, 8(1):45-81, 1999. [ bib | .pdf | .ps ]
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. Estimators for Stochastic Unification-Based Grammars. In 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 535-541, 1999. [ bib | .pdf ]
Brian Roark and Mark Johnson. Efficient probabilistic top-down and left-corner parsing. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99), pages 421-428, 1999. [ bib | .pdf ]
1998
Sharon Caraballo and Eugene Charniak. New Figures of Merit for Best-First Probabilistic Chart Parsing. Computational Linguistics, 24(2):275-298, 1998. [ bib | .pdf ]
Eugene Charniak, Sharon Goldwater, and Mark Johnson. Edge-Based Best-First Chart Parsing. In Sixth Workshop on Very Large Corpora, pages 127-133, 1998. [ bib | .pdf ]
Zhiyi Chi and Stuart Geman. Estimation of probabilistic context-free grammars. Computational Linguistics, 24(2):299-305, 1998. [ bib | .pdf ]
Niyu Ge, John Hale, and Eugene Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, Orlando, Florida, 1998. Harcourt Brace. [ bib | .pdf ]
This paper presents an algorithm for identifying pronominal anaphora and two experiments based upon this algorithm. We incorporate multiple anaphora resolution factors into a statistical framework - specifically the distance between the pronoun and the proposed antecedent, gender/number/animaticity of the proposed antecedent, governing head information and noun phrase repetition. We combine them into a single probability that enables us to identify the referent. Our first experiment shows the relative contribution of each source of information and demonstrates a success rate of 82.9% for all sources combined. The second experiment investigates a method for unsupervised learning of gender/number/animaticity information. We present some experiments illustrating the accuracy of the method and note that with this information added, our pronoun resolution method achieves 84.2% accuracy.
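Schematically (my notation), the combination amounts to scoring each candidate antecedent $a$ for a pronoun $p$ by a product of the factor probabilities listed above and choosing the highest-scoring candidate:

\[
\hat{a} \;=\; \arg\max_{a}\; P(\mathrm{dist}(p,a)) \; P(\mathrm{gen/num/anim}(a) \mid p) \; P(a \mid \mathrm{head}(p)) \; P(\mathrm{mentions}(a)).
\]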
John Hale and Eugene Charniak. Getting Useful Gender Statistics from English Text. Technical Report CS-98-06, Brown University, Providence, RI, 1998. [ bib | .ps.Z | .html ]
Gender, understood as a lexical feature, is important for anaphora because it narrows down the number of possible referents involved in a typical pronoun resolution situation. This work describes an automatic method for obtaining reliable guesses about the gender of entities in a corpus using free text. By using a simple but unreliable anaphora algorithm repeatedly over a large corpus, the probable genders of referenced entities can be compiled and given a salience ranking. These statistics are an inexpensive way to add on gender-feature information to a statistical anaphora resolution algorithm.
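A minimal sketch of the counting idea (hypothetical data and structures, not the authors' code): tally which gendered pronouns a noisy anaphora pass links to each referent, then keep the majority gender with a crude confidence score.

    from collections import defaultdict

    # (referent head word, linked pronoun) pairs produced by a simple,
    # possibly unreliable anaphora heuristic run over a large corpus.
    matches = [("doctor", "she"), ("doctor", "he"), ("doctor", "he"),
               ("ship", "she"), ("committee", "it")]

    PRONOUN_GENDER = {"he": "masc", "she": "fem", "it": "neut"}

    counts = defaultdict(lambda: defaultdict(int))
    for head, pronoun in matches:
        counts[head][PRONOUN_GENDER[pronoun]] += 1

    # Majority gender per head word, with its relative frequency
    # serving as a rough confidence/salience score.
    for head, genders in counts.items():
        gender, votes = max(genders.items(), key=lambda kv: kv[1])
        print(head, gender, votes / sum(genders.values()))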
Mark Johnson. Proof Nets and the Complexity of Processing Center Embedded Constructions. Journal of Logic, Language and Information, 7(4):433-447, 1998. [ bib | .pdf ]
Mark Johnson. The Effect of Alternative Tree Representations on Tree Bank Grammars. In David M. W. Powers, editor, Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning: (NeMLaP3/CoNLL98), pages 39-48, Somerset, New Jersey, 1998. Association for Computational Linguistics. [ bib | .pdf ]
Mark Johnson. PCFG Models of Linguistic Tree Representations. Computational Linguistics, 24(4):613-632, 1998. [ bib | .pdf | .ps.gz ]
Mark Johnson. Finite-state Approximation of Constraint-based Grammars using Left-corner Grammar Transforms. In COLING-ACL, pages 619-623, 1998. [ bib | .pdf | .ps ]
1997
Eugene Charniak. Statistical Techniques for Natural Language Parsing. AI Magazine, 18(4):33-44, 1997. [ bib | .pdf | .ps ]
Eugene Charniak. Statistical Parsing with a Context-Free Grammar and Word Statistics. In Proceedings of AAAI, pages 598-603, 1997. [ bib | .pdf | tech-report | .ps ]
We describe a parsing system based upon a language model for English that is, in turn, based upon assigning probabilities to possible parses for a sentence. This model is used in a parsing system by finding the parse for the sentence with the highest probability. This system outperforms previous schemes. As this is the third in a series of parsers by different authors that are similar enough to invite detailed comparisons but different enough to give rise to different levels of performance, we also report on some experiments designed to identify what aspects of these systems best explain their relative performance.
Mark Johnson. Features as resources in R-LFG. In Proceedings of the 1997 LFG Conference, 1997. [ bib | .ps ]
1996
Sharon Caraballo and Eugene Charniak. Figures of Merit for Best-First Probabilistic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'96), pages 127-132, 1996. [ bib | tech-report | .pdf ]
Best-first parsing methods for natural language try to parse efficiently by considering the most likely constituents first. Some figure of merit is needed by which to compare the likelihood of constituents, and the choice of this figure has a substantial impact on the efficiency of the parser. While several parsers described in the literature have used such techniques, there is no published data on their efficacy, much less attempts to judge their relative merits. We propose and evaluate several figures of merit for best-first parsing.
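Abstractly (my notation), each figure of merit is an approximation to the probability of a proposed constituent given the whole sentence,

\[
\mathrm{FOM}(N^{i}_{j,k}) \;\approx\; P(N^{i}_{j,k} \mid w_1 \cdots w_n),
\]

where $N^{i}_{j,k}$ is a constituent of type $i$ spanning words $j$ through $k$; the candidates evaluated differ mainly in how cheaply and how accurately they estimate the contextual part of this quantity.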
Keywords: parsing, nlp
Eugene Charniak. Tree-bank Grammars. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996. [ bib | tech-report ]
By a “tree-bank grammar” we mean a context-free grammar created by reading the production rules directly from hand-parsed sentences in a tree bank. Common wisdom has it that such grammars do not perform well, though we know of no published data on the issue. The primary purpose of this paper is to show that the common wisdom is wrong. In particular we present results on a tree-bank grammar based on the Penn Wall Street Journal tree bank. To the best of our knowledge, this grammar out-performs all other non-word-based statistical parsers/grammars on this corpus. That is, it out-performs parsers that consider the input as a string of tags and ignore the actual words of the corpus.
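A minimal sketch of what "reading the production rules directly from hand-parsed sentences" looks like in practice, using toy trees and NLTK for convenience (not the paper's code):

    from collections import Counter
    from nltk import Tree

    # A toy "tree bank": hand-parsed sentences in bracketed form.
    trees = [Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"),
             Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD slept)))")]

    # Read the production rules directly off the trees and count them.
    rule_counts = Counter(prod for t in trees for prod in t.productions())

    # Relative-frequency estimates: P(rule) = count(rule) / count(LHS).
    lhs_totals = Counter()
    for rule, c in rule_counts.items():
        lhs_totals[rule.lhs()] += c
    for rule, c in rule_counts.items():
        print(rule, c / lhs_totals[rule.lhs()])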
Eugene Charniak, Glenn Carroll, John Adcock, Anthony R. Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael L. Littman, and John McCann. Taggers for Parsers. Artificial Intelligence, 85(1-2):45-57, 1996. [ bib | tech-report | .ps ]
We consider what tagging models are most appropriate as front ends for probabilistic context-free-grammar parsers. In particular we ask if using a tagger that returns more than one tag, a “multiple tagger,” improves parsing performance. Our conclusion is somewhat surprising: single-tag Markov-model taggers are quite adequate for the task. First of all, parsing accuracy, as measured by the correct assignment of parts of speech to words, does not increase significantly when parsers select the tags themselves. In addition, the work required to parse a sentence goes up with increasing tag ambiguity, though not as much as one might expect. Thus, for the moment, single taggers are the best taggers.
Eugene Charniak. Expected-Frequency Interpolation. Technical Report CS-96-37, Brown University, Providence, RI, 1996. [ bib | .html ]
Expected-frequency interpolation is a technique for improving the performance of deleted interpolation smoothing. It allows a system to make finer-grained estimates of how often one would expect to see a particular combination of events than is possible with traditional frequency interpolation. This allows the system to better weigh the emphasis given to the various probability distributions being mixed. We show that more traditional frequency interpolation, based solely on the frequency of conditioning events, can lead to some anomalous results. We then show that while the equations for expected-frequency interpolation are not exact, they are close, depending on how well some seemingly reasonable assumptions hold. We then present an experiment in which the introduction of expected-frequency interpolation to a statistical parsing system improved performance by .4% with essentially no extra work, and essentially no change in the workings of the system. We also note that even before the change, the system in question was the top performer at its task, so a .4% improvement was well worth obtaining.
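For orientation (my notation, simplified to one back-off level), deleted interpolation mixes a specific maximum-likelihood estimate with a more general smoothed one,

\[
\hat{P}(x \mid y, z) \;=\; \lambda\, P_{\mathrm{ML}}(x \mid y, z) \;+\; (1 - \lambda)\, \hat{P}(x \mid z),
\]

where traditional schemes tie $\lambda$ to the observed frequency of the conditioning events alone; the proposal here is to set it from the expected frequency of the particular combination of events instead.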
Mark Johnson. Resource-sensitivity in Lexical-Functional Grammar. Proceedings of the 1996 Roma Workshop, 1996. [ bib ]
1995
Sam Bayer and Mark Johnson. Features and Agreement. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pages 70-76, 1995. [ bib | .pdf ]
This paper compares the consistency-based account of agreement phenomena in `unification-based' grammars with an implication-based account based on a simple feature extension to Lambek Categorial Grammar (LCG). We show that the LCG treatment accounts for constructions that have been recognized as problematic for `unification-based' treatments.
Eugene Charniak. Parsing with context-free grammars and word statistics. Technical Report CS-95-28, Brown University, Providence, RI, 1995. [ bib | .ps.Z ]
We present a language model in which the probability of a sentence is the sum of the individual parse probabilities, and these are calculated using a probabilistic context-free grammar (PCFG) plus statistics on individual words and how they fit into parses. We have used the model to improve syntactic disambiguation. After training on Wall Street Journal (WSJ) text we tested on about 200 WSJ sentences restricted to the 5400 most common words from our training data. We observed a 41% reduction in bracket-crossing errors compared to the performance of our PCFG without the use of the word statistics.
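In symbols (my notation), the model described above sets

\[
P(s) \;=\; \sum_{\pi \in \mathrm{parses}(s)} P(\pi),
\]

with each parse probability $P(\pi)$ computed from the PCFG rule probabilities augmented by statistics on the individual words they cover; disambiguation selects the parse maximizing $P(\pi)$.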
Murat Ersan and Eugene Charniak. A statistical syntactic disambiguation program and what it learns. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, 1995. [ bib | tech-report ]
Mark Johnson. Memoization in Top-Down Parsing. Computational Linguistics, 21(3):405-415, 1995. [ bib | .pdf ]
Mark Johnson and Sam Bayer. Features and Agreement in Lambek Categorial Grammar. In Proceedings of the 1995 ESSLLI Formal Grammar Workshop, pages 123-137, 1995. [ bib | .ps.Z ]
Mark Johnson and Jochen Dorre. Memoization of coroutined constraints. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pages 100-107, Morristown, NJ, USA, 1995. Association for Computational Linguistics. [ bib | .pdf ]
1994
Glenn Carroll and Eugene Charniak. Combining Grammars For Improved Learning. Technical Report CS-94-08, Department of Computer Science, Brown University, February 1994. [ bib | .pdf | .ps | .html ]
We report experimental work on improving learning methods for probabilistic context-free grammars (PCFGs). From stacked regression we borrow the basic idea of combining grammars. Smoothing, a domain-independent method for combining grammars, does not offer noticeable performance gains. However, PCFGs allow much tighter, domain-dependent coupling, and we show that this may be exploited for significant performance gains. Finally, we compare two strategies for acquiring the varying grammars needed for any combining method. We suggest that an unorthodox strategy, “leave-one-in” learning, is more effective than the more familiar “leave-one-out”.
Eugene Charniak, Glenn Carroll, John Adcock, Antony Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael Littman, and John McCann. Expected-Frequency Interpolation. Technical Report CS-94-06, Brown University, Providence, RI, 1994. [ bib | .ps.Z ]
Eugene Charniak and Glenn Carroll. Context-Sensitive Statistics for Improved Grammatical Language Models. Technical Report CS-94-07, Brown University, Providence, RI, 1994. [ bib | .ps.Z | .html ]
We develop a language model using probabilistic context-free grammars (PCFGs) that is “pseudo context-sensitive” in that the probability that a non-terminal $N$ expands using a rule $r$ depends on $N$'s parent. We derive the equations for estimating the necessary probabilities using a variant of the inside-outside algorithm. We give experimental results showing that, beginning with a high-performance PCFG, one can develop a pseudo PCSG that yields significant performance gains. Analysis shows that the benefits from the context-sensitive statistics are localized, suggesting that we can use them to extend the original PCFG. Experimental results confirm that this is both feasible and the resulting grammar retains the performance gains. This implies that our scheme may be useful as a novel method for PCFG induction.
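In the report's own terms, the change can be summarized as replacing the usual PCFG rule probability with a parent-conditioned one (schematic notation):

\[
P(N \rightarrow \alpha \mid N) \;\longrightarrow\; P(N \rightarrow \alpha \mid N, \,\mathrm{parent}(N)),
\]

estimated with the inside-outside variant mentioned above.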
Mark Johnson. Computing with Features as Formulae. Computational Linguistics, 20(1):1-25, 1994. [ bib | .pdf ]
1993
Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. Equations for Part-of-Speech Tagging. In National Conference on Artificial Intelligence, pages 784-789, 1993. [ bib | .ps ]
We derive from first principles the basic equations for a few of the basic hidden-Markov-model word taggers as well as equations for other models which may be novel (the descriptions in previous papers being too spare to be sure). We give performance results for all of the models. The results from our best model (96.45% on an unused test sample from the Brown corpus with 181 distinct tags) is on the upper edge of reported results. We also hope these results clear up some confusion in the literature about the best equations to use. However, the major purpose of this paper is to show how the equations for a variety of models may be derived and thus encourage future authors to give the equations for their model and the derivations thereof.
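For reference, the simplest of the tagging models the paper derives has the familiar bigram HMM form (my notation; the paper works through several variants):

\[
\hat{t}_{1..n} \;=\; \arg\max_{t_{1..n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i).
\]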
Eugene Charniak. Statistical Language Learning. The MIT Press, Cambridge, Massachusetts, 1993. [ bib | http ]
1992
Glenn Carroll and Eugene Charniak. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. Technical Report CS-92-16, Brown University, Providence, RI, USA, 1992. [ bib | .pdf ]
We present a scheme for learning probabilistic dependency grammars from positive training examples plus constraints on rules. In particular, we present the results of two experiments. The first, in which the constraints were minimal, was unsuccessful. The second, with significant constraints, was successful within the bounds of the task we had set.