GUCL: Computational Linguistics @ Georgetown
We are a group of Georgetown University faculty, student, and staff researchers at the intersection of language and computation. Our areas of expertise include natural language processing, corpus linguistics, information retrieval, text mining, and more. Members belong to the Linguistics and/or Computer Science departments.
- 9/10/18: #MeToo Movement on Twitter (Lisa Singh)
- 8/29/18: Cliches in baseball (Nathan Schneider)
- 1/20/18: The Coptic Scriptorium project (Amir Zeldes)
- Congratulations to Arman Cohan, Nazli Goharian, and Georgetown alum Andrew Yates for winning a Best Long Paper award at EMNLP 2017! The paper is entitled "Depression and Self-Harm Risk Assessment in Online Forums."
- Congratulations to Ophir Frieder, who has been named to the European Academy of Sciences and Arts (EASA)!
- 9/19/16: "Email" Dominates What Americans Have Heard About Clinton (Lisa Singh)
- 7/12/16: Searching Harsh Environments (Ophir Frieder)
Mailing list: Contact Nathan Schneider to subscribe!
- Adam Poliak (JHU): GUCL, 1/10/20, 1:00 in St. Mary’s 326
- Lisa Singh (GU): CS Faculty Seminar, 10:00 in St. Mary’s 326
- Aylin Caliskan (GWU): CS, 1/24/20, 1:30
- Noah Smith (UW): CS, 2/28/20, 2:00
- Matt Gardner (AI2): CS, Wednesday 3/4/20, 12:30
- MASC-SLL at UMD College Park: 3/6/20
- Postponed due to COVID-19 precautions:
- CS Graduate Research Presentations: 4/3/20, 1:30 via Zoom
- Franco Maria Nardini (ISTI-CNR): CS, Wednesday 4/22/20, 12:30
- Nicola Tonellotto (Pisa): CS, 4/24/20 via Zoom
- Fabrizio Silvestri (Facebook): CS, Wednesday 4/29/20, 12:30
Hal Daumé (UMD)
CS Colloquium 10/14/16, 11:00 in St. Mary’s 326
Learning Language through Interaction
Machine learning-based natural language processing systems are amazingly effective when plentiful labeled training data exists for the task/domain of interest. Unfortunately, for broad-coverage (both in task and domain) language understanding, we're unlikely to ever have sufficient labeled data, and systems must find some other way to learn. I'll describe a novel algorithm for learning from interactions, and several problems of interest, most notably machine simultaneous interpretation (translation while someone is still speaking).
This is all joint work with some amazing (former) students He He, Alvin Grissom II, John Morgan, Mohit Iyyer, Sudha Rao and Leonardo Claudino, as well as colleagues Jordan Boyd-Graber, Kai-Wei Chang, John Langford, Akshay Krishnamurthy, Alekh Agarwal, Stéphane Ross, Alina Beygelzimer and Paul Mineiro.
Hal Daumé III is an associate professor in Computer Science at the University of Maryland, College Park. He holds joint appointments in UMIACS and Linguistics. He was previously an assistant professor in the School of Computing at the University of Utah. His primary research interest is in developing new learning algorithms for prototypical problems that arise in the context of language processing and artificial intelligence. This includes topics like structured prediction, domain adaptation and unsupervised learning, as well as multilingual modeling and affect analysis. He associates himself most with conferences like ACL, ICML, NIPS and EMNLP. He earned his PhD at the University of Southern California with a thesis on structured prediction for language (his advisor was Daniel Marcu). He spent the summer of 2003 working with Eric Brill in the machine learning and applied statistics group at Microsoft Research. Prior to that, he studied math (mostly logic) at Carnegie Mellon University. He still likes math and doesn't like to use C (instead he uses O'Caml or Haskell).
Yulia Tsvetkov (CMU/Stanford)
Linguistics Speaker Series 11/11/16, 3:30 in Poulton 230
On the Synergy of Linguistics and Machine Learning in Natural Language Processing
One way to provide deeper insight into data and to build more powerful, robust models is bridging between linguistic knowledge and statistical learning. I’ll present model-based approaches that incorporate linguistic knowledge in novel ways.
First, I’ll show how linguistic knowledge comes to the rescue in processing languages which lack large data resources. I’ll describe a new approach to cross-lingual knowledge transfer that models the historical process of lexical borrowing between languages, and I will show how its predictions can be used to improve statistical machine translation systems.
In the second part of my talk, I’ll argue that linguistic insight also helps improve learning in resource-rich conditions. I’ll present three methods to integrate linguistic knowledge into training data, neural network architectures, and the evaluation of word representations. The first method uses features quantifying linguistic coherence, prototypicality, simplicity, and diversity to find a better curriculum for learning distributed representations of words. Distributed representations of words capture which words have similar meanings and occur in similar contexts. With improved word representations, we improve part-of-speech tagging, parsing, named entity recognition, and sentiment analysis. The second method is polyglot language models: neural network architectures trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted. Finally, the third is an intrinsic evaluation measure of the quality of distributed representations of words. It is based on correlations of learned vectors with features extracted from manually crafted lexical resources. This computationally inexpensive method obtains strong correlation with performance of the vectors in a battery of downstream semantic and syntactic evaluation tasks. I’ll conclude with future research questions.
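As a rough illustration of the third idea (scoring word vectors by how well they correlate with features from a lexical resource), the sketch below sums, over embedding dimensions, the best Pearson correlation with any linguistic feature. This is only a toy under stated assumptions, not the measure presented in the talk: the function names `pearson` and `alignment_score`, the max-per-dimension rule, and all data values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def alignment_score(vectors, features):
    """Sum over embedding dimensions of the best correlation with any feature.

    vectors:  {word: [float, ...]} learned embeddings (toy values below)
    features: {word: [float, ...]} values derived from a lexical resource (toy values below)
    Only words present in both dictionaries are used.
    """
    words = sorted(set(vectors) & set(features))
    n_dims = len(vectors[words[0]])
    n_feats = len(features[words[0]])
    total = 0.0
    for d in range(n_dims):
        column = [vectors[w][d] for w in words]
        total += max(
            pearson(column, [features[w][f] for w in words])
            for f in range(n_feats)
        )
    return total


# Hypothetical 2-dimensional embeddings and two made-up features
# (think "is a noun" and "is a verb"); none of this comes from the talk.
vectors = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "run": [0.1, 0.9], "eat": [0.2, 0.7]}
features = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "run": [0.0, 1.0], "eat": [0.0, 1.0]}

score = alignment_score(vectors, features)
print(f"alignment score: {score:.3f} (maximum possible here is 2.0)")
```

A higher score means each embedding dimension lines up more closely with some hand-crafted feature; the appeal of this kind of measure is that it needs no downstream model training.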
Yulia Tsvetkov is a postdoc in the Stanford NLP Group, where she works on computational social science with Professor Dan Jurafsky. During her PhD in the Language Technologies Institute at Carnegie Mellon University, she worked on advancing machine learning techniques to tackle cross-lingual and cross-domain problems in natural language processing, focusing on computational phonology and morphology, distributional and lexical semantics, and statistical machine translation of both text and speech. In 2017, Yulia will join the Language Technologies Institute at CMU as an assistant professor.
Marine Carpuat (UMD)
Linguistics Speaker Series 11/18/16, 3:30 in Poulton 230
Toward Natural Language Inference Across Languages
Natural language processing tasks as diverse as automatically extracting information from text, answering questions, and translating or summarizing documents all require the ability to compare and contrast the meaning of words and sentences. State-of-the-art techniques rely on dense vector representations which capture the distributional properties of words in large amounts of text in a single language. We seek to improve these representations to capture not only similarity in meaning between words or sentences, but also inference relations such as entailment and contradiction, and enable comparisons not only within, but also across languages.
In this talk, we will present novel approaches to inducing word representations from multilingual text corpora. First, we will show that translations in e.g. Chinese can be used as distant supervision to induce English word representations that can be composed into better representations of English sentences (Elgohary and Carpuat, ACL 2016). Then we will show how sparsity constraints can further improve word representations, and enable the detection not only of semantic similarity (do "cure" and "remedy" have the same meaning?), but also of entailment (does "antidote" entail "cure"?) between words in different languages (Vyas and Carpuat, NAACL 2016).
Marine Carpuat is an Assistant Professor in Computer Science at the University of Maryland, with a joint appointment at UMIACS. Her research interests are in natural language processing, with a focus on multilinguality. Marine was previously a Research Scientist at the National Research Council of Canada, and a postdoctoral researcher at the Columbia University Center for Computational Learning Systems. She received a PhD in Computer Science from the Hong Kong University of Science & Technology (HKUST) in 2008. She also earned an MPhil in Electrical Engineering from HKUST and an engineering degree from the French Grande Ecole Supélec.
Shomir Wilson (UC)
CS Colloquium 11/21/16, 11:00 in St. Mary’s 326
Text Analysis to Support the Privacy of Internet Users
Shomir Wilson is an Assistant Professor of Computer Science in the Department of Electrical Engineering and Computing Systems at the University of Cincinnati. His professional interests span pure and applied research in natural language processing, privacy, and artificial intelligence. Previously he held postdoctoral and lecturer positions in Carnegie Mellon University's School of Computer Science, and he spent a year as an NSF International Research Fellow in the University of Edinburgh's School of Informatics. He received his Ph.D. in Computer Science from the University of Maryland in 2011.
Mark Dredze (JHU)
CS Colloquium 11/29/16, 11:00 in St. Mary’s 326
Topic Models for Identifying Public Health Trends
Twitter and other social media sites contain a wealth of information about populations and have been used to track sentiment towards products, measure political attitudes, and study social linguistics. In this talk, we investigate the potential for Twitter and social media to impact public health research. Broadly, we explore a range of applications for which social media may hold relevant data. To uncover these trends, we develop new topic models that can reveal trends and patterns of interest to public health from vast quantities of data.
Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He is also affiliated with the Center for Language and Speech Processing and the Center for Population Health Information Technology. His research in natural language processing and machine learning has focused on graphical models, semi-supervised learning, information extraction, large-scale learning, and speech processing. He focuses on public health informatics applications, including information extraction from social media and from biomedical and clinical texts. He obtained his PhD from the University of Pennsylvania in 2009.
Mona Diab (GW)
CS Colloquium 12/2/16, 2:30 in St. Mary’s 414
Processing Arabic Social Media: Challenges and Opportunities
We recently witnessed an exponential growth in Arabic social media usage. Processing such media is of great utility for all kinds of applications ranging from information extraction to social media analytics for political and commercial purposes to building decision support systems. Compared to other languages, Arabic, especially the informal variety, poses a significant challenge to natural language processing algorithms since it comprises multiple dialects, linguistic code switching, and a lack of standardized orthographies, on top of its relatively complex morphology. Inherently, the problem of processing Arabic in the context of social media is the problem of how to handle resource-poor languages. In this talk I will go over some of our insights into some of these problems and show how there is a silver lining where we can generalize some of our solutions to other low-resource language contexts.
Mona Diab is an Associate Professor in the Department of Computer Science, George Washington University (GW). She is the founder and Director of the GW NLP lab (CARE4Lang). Before joining GW, she was a Research Scientist (Principal Investigator) at the Center for Computational Learning Systems (CCLS), Columbia University in New York. She is also co-founder of the CADIM group with Nizar Habash and Owen Rambow, which is one of the leading places and reference points on computational processing of Arabic and its dialects. Her research interests span several areas in computational linguistics/natural language processing: computational lexical semantics, multilingual processing, social media processing, information extraction & text analytics, machine translation, and computational socio-pragmatics. She has a special interest in low resource language processing with a focus on Arabic dialects.
Joel Tetreault (Grammarly)
CS Colloquium 1/27/17, 11:00 in St. Mary’s 326
Analyzing Formality in Online Communication
Full natural language understanding requires comprehending not only the content or meaning of a piece of text or speech, but also the stylistic way in which it is conveyed. To enable real advancements in dialog systems, information extraction, and human-computer interaction, computers need to understand the entirety of what humans say, both the literal and the non-literal. This talk presents an in-depth investigation of one particular stylistic aspect, formality. First, we provide an analysis of humans' subjective perceptions of formality in four different genres of online communication. We highlight areas of high and low agreement and extract patterns that consistently differentiate formal from informal text. Next, we develop a statistical model for predicting formality at the sentence level, using rich NLP and deep learning features, and then evaluate the model's performance against human judgments across genres. Finally, we apply our model to analyze language use in online debate forums. Our results provide new evidence in support of theories of linguistic coordination, underlining the importance of formality for language generation systems.
This work was done with Ellie Pavlick (UPenn) during her summer internship at Yahoo Labs.
Joel Tetreault is Director of Research at Grammarly. His research focus is Natural Language Processing with specific interests in anaphora, dialogue and discourse processing, machine learning, and applying these techniques to the analysis of English language learning and automated essay scoring, among others. Currently he works on the research and development of NLP tools and components for the next generation of intelligent writing assistance systems. Prior to joining Grammarly, he was a Senior Research Scientist at Yahoo Labs, Senior Principal Manager of the Core Natural Language group at Nuance Communications, Inc., and worked at Educational Testing Service for six years as a managing research scientist where he researched automated methods for essay scoring, detecting grammatical errors by non-native speakers, plagiarism detection, and content scoring. Tetreault received his B.A. in Computer Science from Harvard University and his M.S. and Ph.D. in Computer Science from the University of Rochester. He was also a postdoctoral research scientist at the University of Pittsburgh's Learning Research and Development Center, where he worked on developing spoken dialogue tutoring systems. In addition, he has co-organized the Building Educational Applications workshop series for 8 years, several shared tasks, and is currently NAACL Treasurer.
Kenneth Heafield (Edinburgh)
CS Colloquium 2/2/17, 11:00 in St. Mary’s 326
Machine Translation is Too Slow
We're trying to make machine translation output less terrible, but we're impatient. A neural translation system took two weeks to train in 1996 and two weeks to train in 2016 because the field used twenty years of computing advances to build bigger and better models subject to the same patience limit. I'll talk about multiple efforts to make things faster: coarse-to-fine search algorithms and sparse gradient updates to reduce network communication.
Kenneth Heafield is a Lecturer (~Assistant Professor) in computer science at the University of Edinburgh. Motivated by machine translation problems, he takes a systems-heavy approach to improving quality and speed of neural systems. He is the creator of the widely used KenLM library for efficient language modeling.
Margaret Mitchell (Google Research)
CS Colloquium 2/16/17, 11:00 in St. Mary’s 326
Algorithmic Bias in Artificial Intelligence: The Seen and Unseen Factors Influencing Machine Perception of Images and Language
The success of machine learning has recently surged, with similar algorithmic approaches effectively solving a variety of human-defined tasks. Tasks testing how well machines can perceive images and communicate about them have exposed strong effects of different types of bias, such as selection bias and dataset bias. In this talk, I will unpack some of these biases, and how they affect machine perception today. I will introduce and detail the first computational model to leverage human Reporting Bias—what people mention—in order to learn ground-truth facts about the visual world.
I am a Senior Research Scientist in Google's Research & Machine Intelligence group, working on advancing artificial intelligence towards positive goals, as well as ethics in AI and demographic diversity of researchers. My research is on vision-language and grounded language generation, focusing on how to help computers communicate based on what they can process. My work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science. Before Google, I was a founding member of Microsoft Research's "Cognition" group, focused on advancing vision-language artificial intelligence. Before MSR, I was a postdoctoral researcher at The Johns Hopkins University Center of Excellence, where I mainly focused on semantic role labeling and sentiment analysis using graphical models, working under Benjamin Van Durme. Before that, I was a postgraduate (PhD) student in the natural language generation (NLG) group at the University of Aberdeen, where I focused on how to naturally refer to visible, everyday objects. I primarily worked with Kees van Deemter and Ehud Reiter. I spent a good chunk of 2008 getting a Master's in Computational Linguistics at the University of Washington, studying under Emily Bender and Fei Xia. Simultaneously (2005 - 2012), I worked on and off at the Center for Spoken Language Understanding, part of OHSU, in Portland, Oregon. My title changed with time (research assistant/associate/visiting scholar), but throughout, I worked on technology that leverages syntactic and phonetic characteristics to aid those with neurological disorders under Brian Roark. I continue to balance my time between language generation, applications for clinical domains, and core AI research.
Glen Coppersmith (Qntfy & JHU)
CS Colloquium 2/24/17, 11:00 in St. Mary’s 326
Quantifying the White Space
Behavioral assessment and measurement today are typically invasive and human intensive (for both patient and clinician). Moreover, by their nature, they focus on retrospective analysis by the patient (or the patient’s loved ones) about emotionally charged situations, a process that is rife with biases, not repeatable, and expensive. We examine all the data in the “white space” between interactions with the healthcare system (social media data, wearables, activities, nutrition, mood, etc.), and have shown that quantified signals relevant to mental health can be extracted from them. These methods to gather and analyze disparate data unobtrusively and in real time enable a range of new scientific questions, diagnostic capabilities, assessment of novel treatments, and quantified key performance measures for behavioral health. These techniques hold special promise for suicide risk, given the dearth of unbiased accounts of a person’s behavior leading up to a suicide attempt. We are beginning to see the promise of using these disparate data for a revolution in mental health.
Glen is the founder and CEO of Qntfy (pronounced “quantify”), a company devoted to scaling therapeutic impact by empowering mental health clinicians and patients with data science and technology. Qntfy brings a deep understanding of the underlying technology and an appreciation for the human processes these technologies need to fit into in order to make an impact. Qntfy, in addition to providing analytic and software solutions, considers it a core mission to push the fundamental and applied research at the intersection of mental health and technology. Qntfy built the data donation site OurDataHelps.org to gather and curate the datasets needed to drive mental health research, working closely with the suicide prevention community. Qntfy was also the grand prize winner of the 2015 Foundry Cup, a design competition seeking innovative approaches to diagnosing and treating PTSD.
Prior to starting Qntfy, Glen was the first full-time research scientist at the Human Language Technology Center of Excellence at Johns Hopkins University, which he joined in 2008. His research has focused on the creation and application of statistical pattern recognition techniques to large disparate data sets for addressing challenges of national importance. Oftentimes, the data of interest was human language content and associated metadata. Glen has shown particular acumen for enabling inference tasks that bring together diverse and noisy data. His work spans principled exploratory data analysis, anomaly detection, graph theory, statistical inference and visualization.
Glen earned his Bachelor's in Computer Science and Cognitive Psychology in 2003, a Master's in Psycholinguistics in 2005, and his Doctorate in Neuroscience in 2008, all from Northeastern University. As this suggests, his interests and knowledge are broad, from computer science and statistics to biology and psychology.
Jeniya Tabassum (OSU)
GUCL 4/6/17, 2:00 in St. Mary’s 326
Large Scale Learning for Temporal Expressions
Temporal expressions are words or phrases that refer to dates, times or durations. Social media especially contains time-sensitive information about various events and requires accurate temporal analysis. In this talk, I will present our work on TweeTIME, a minimally supervised time resolver that learns from large quantities of unlabeled data and does not require any hand-engineered rules or hand-annotated training corpora. This is the first successful application of distant supervision for end-to-end temporal recognition and normalization. Our proposed system outperforms all previous supervised and rule-based systems in the social media domain. I will also present ongoing work applying deep learning methods for resolving time expressions and discuss opportunities and challenges that a deep learning system faces when extracting time-sensitive information from text.
Jeniya Tabassum is a third-year PhD student in the Department of CSE at the Ohio State University, advised by Prof. Alan Ritter. Her research focuses on developing machine learning techniques that can effectively extract relevant and meaningful information from social media data. Prior to OSU, she received a B.S. in Computer Science and Engineering from Bangladesh University of Engineering and Technology.
Jacob Eisenstein (GA Tech)
Linguistics Speaker Series 4/21/17, 3:30 in Poulton 230
Social Networks, Social Meaning
Language is socially situated: both what we say and what we mean depend on our identities, our interlocutors, and the communicative setting. The first generation of research in computational sociolinguistics focused on large-scale social categories, such as gender. However, many of the most socially salient distinctions are locally defined. Rather than attempt to annotate these social properties or extract them from metadata, we turn to social network analysis, which has been only lightly explored in traditional sociolinguistics. I will describe three projects at the intersection of language and social networks. First, I will show how unsupervised learning over social network labelings and text enables the induction of social meanings for address terms, such as “Ms” and “dude”. Next, I will describe recent research that uses social network embeddings to induce personalized natural language processing systems for individual authors, improving performance on sentiment analysis and entity linking even for authors for whom no labeled data is available. Finally, I will describe how the spread of linguistic innovations can serve as evidence for sociocultural influence, using a parametric Hawkes process to model the features that make dyads especially likely or unlikely to be conduits for language change.
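For readers unfamiliar with Hawkes processes, the following minimal sketch shows the self-exciting intensity that defines one: each past event temporarily raises the rate of future events. It assumes an exponential decay kernel and a single scalar excitation weight, and the names `hawkes_intensity`, `base_rate`, `excitation`, and `decay` are hypothetical. The abstract does not specify the kernel or parameterization used in the talk, where excitation would depend on features of each dyad.

```python
from math import exp


def hawkes_intensity(t, event_times, base_rate, excitation, decay):
    """Event rate at time t for a univariate Hawkes process.

    Assumed form (exponential kernel):
        base_rate + sum over past events s < t of excitation * exp(-decay * (t - s))
    """
    return base_rate + sum(
        excitation * exp(-decay * (t - s)) for s in event_times if s < t
    )


# Toy example: one earlier use of a new word at time 0.0 (made-up numbers).
past_events = [0.0]
soon_after = hawkes_intensity(0.1, past_events, base_rate=0.1, excitation=0.5, decay=1.0)
much_later = hawkes_intensity(5.0, past_events, base_rate=0.1, excitation=0.5, decay=1.0)
print(soon_after, much_later)  # the rate is elevated shortly after the event, then decays
```

In an influence setting, a larger fitted excitation weight between two people would suggest that one person's usage makes the other's subsequent usage more likely.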
Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning. He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh. His work has also been supported by the National Institutes of Health, the National Endowment for the Humanities, and Google. Jacob was a postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.
Christo Kirov (JHU)
GUCL 4/28/17, 2:00 in St. Mary’s 250
Rich Morphological Modeling for Multi-lingual HLT Applications
In this talk, I will discuss a number of projects aimed at improving HLT applications across a broad range of typologically diverse languages by modeling morphological structure. These include the creation of a very large, normalized morphological paradigm database derived from Wiktionary, consensus-based morphology transfer via cross-lingual projection, and approaches to lemmatization and morphological analysis and generation based on recurrent neural network architectures. Much of this work falls under the umbrella of the UniMorph project at CLSP, led by David Yarowsky and supported by DARPA LORELEI, and was developed in close collaboration with John Sylak-Glassman.
Dr. Christo Kirov is a Postdoctoral Research Fellow at the Center for Language and Speech Processing at JHU, working with David Yarowsky. His current research combines novel machine learning approaches with traditional linguistics to represent and learn morphological systems across the world’s languages, and to leverage this level of language structure in Machine Translation, Information Extraction, and other HLT tasks. Prior to joining CLSP, he was a Visiting Professor at the Georgetown University Linguistics Department. He received his PhD in Cognitive Science from Johns Hopkins University, studying under Colin Wilson, with dissertation work focusing on Bayesian approaches to phonology and phonetic expression.
Bill Croft (UNM)
Linguistics 5/18/17, 1:00 in Poulton 230
Linguistic Typology Meets Universal Dependencies
Current work on universal dependency schemes in NLP does not make reference to the extensive typological research on language universals, but could benefit since many principles are shared between the two enterprises. We propose a revision of the syntactic dependencies in the Universal Dependencies scheme (Nivre et al. 2015, 2016) based on four principles derived from contemporary typological theory: dependencies should be based primarily on universal construction types over language-specific strategies; syntactic dependency labels should match lexical feature names for the same function; dependencies should be based on the information packaging function of constructions, not lexical semantic types; and dependencies should keep distinct the “ranks” of the functional dependency tree.
William Croft received his Ph.D. in 1986 at Stanford University under Joseph Greenberg. He has taught at the Universities of Michigan, Manchester (UK) and New Mexico, and has been a visiting scholar at the Max Planck Institutes of Psycholinguistics and Evolutionary Anthropology, and at the Center for Advanced Study in the Behavioral Sciences. He has written several books, including Typology and Universals, Explaining Language Change, Radical Construction Grammar, Cognitive Linguistics [with D. Alan Cruse] and Verbs: Aspect and Causal Structure. His primary research areas are typology, semantics, construction grammar and language change. He has argued that grammatical structure can only be understood in terms of the variety of constructions used to express functions across languages; that both qualitative and quantitative methods are necessary for grammatical analysis; and that the study of language structure must be situated in the dynamics of evolving conventions of language use in social interaction.
Spencer Whitehead (RPI)
GUCL 8/15/17, 11:00 in St. Mary’s 326
Multimedia Integration: Event Extraction and Beyond
Multimedia research is becoming increasingly important, as we are immersed in an ever-growing ocean of noisy, unstructured data of various modalities, such as text and images. A major thrust of multimedia research is to leverage multimodal data to better extract information, including the use of visual information to post-process or re-rank natural language processing results, or vice versa. In our work, we seek to tightly integrate multimodal information into a flexible, unified approach that jointly utilizes text and images. Here we focus on one application: improving event extraction by incorporating visual knowledge with words and phrases from text documents. Such visual knowledge provides a means to overcome the challenges that the ambiguities of language introduce. We first discover named visual patterns in a weakly-supervised manner in order to avoid the requirement of parallel/well-aligned annotations. Then, we propose a multimodal event extraction algorithm where the event extractor is jointly trained with textual features and visual patterns. We find absolute F-score gains of 7.1% and 8.5% on event trigger labeling and argument labeling, respectively. Moving forward, we intend to extend the idea of tight integration of multimodal information to other tasks, namely image and video captioning.
Spencer Whitehead is a PhD student in the Computer Science Department at Rensselaer Polytechnic Institute, where he is advised by Dr. Heng Ji. His interests broadly span Natural Language Processing, Machine Learning, and Computer Vision, but mainly lie in the intersection of these fields: multimedia information extraction and natural language generation from multimedia data. A primary goal of his work is to develop intelligent systems that can utilize structured, unstructured, and multimodal data to extract information as well as generate coherent, accurate, and focused text. Central to his research is the creation of novel architectures, deep learning or otherwise, which can properly incorporate such heterogeneous data. He received his Bachelor of Science degree in Mathematics and Computer Science from Rensselaer Polytechnic Institute with highest honors.
Cristian Danescu-Niculescu-Mizil (Cornell)
Linguistics Speaker Series 10/13/17, 3:30 in Poulton 230
Conversational markers of social dynamics
Can conversational dynamics, the nature of the back and forth between people, predict the outcomes of social interactions? In this talk I will introduce a computational framework for modeling conversational dynamics and for extracting the social signals they encode, and apply it in a variety of different settings. First, I will show how these signals can be predictive of the future evolution of a dyadic relationship. In particular, I will characterize friendships that are unlikely to last and examine temporal patterns that foretell betrayal in the context of the Diplomacy strategy game. Second, I will discuss conversational patterns that emerge in problem-solving group discussions, and show how these patterns can be indicative of how (in)effective the collaboration is. I will conclude by focusing on the effects of under- and over-confidence on the dynamics and outcomes of decision-making discussions.
This talk includes joint work with Jordan Boyd-Graber, Liye Fu, Dan Jurafsky, Srijan Kumar, Lillian Lee, Jure Leskovec, Vlad Niculae, Chris Potts and Justine Zhang.
Cristian Danescu-Niculescu-Mizil is an assistant professor in the information science department at Cornell University. His research aims at developing computational frameworks that can lead to a better understanding of human social behavior, by unlocking the unprecedented potential of the large amounts of natural language data generated online. He is the recipient of several awards—including the WWW 2013 Best Paper Award, a CSCW 2017 Best Paper Award, and a Google Faculty Research Award—and his work has been featured in popular-media outlets such as the Wall Street Journal, NBC's The Today Show, NPR and the New York Times.
Antonios Anastasopoulos (Notre Dame)
GUCL 10/20/17, 1:00 in Poulton 255
Speech translation for documentation of endangered languages
Most of the world's languages do not have a writing system, so recent documentation efforts for endangered languages have switched focus to annotating corpora with translations. This talk will present work on modelling parallel speech without access to transcriptions, both using a neural attentional model (Long et al, NAACL 2016) and an unsupervised probability model (Anastasopoulos et al, EMNLP 2016), as well as some recent work on using translations for term discovery (Anastasopoulos et al, SCNLP 2017).
Antonis Anastasopoulos is a fourth-year PhD student at the University of Notre Dame, working with Prof. David Chiang. His research lies at the intersection of low-resource speech recognition and machine translation, focusing on developing technologies for endangered language documentation.
Katherine Waldock (GU MSB)
GUCL 10/27/17, 1:00 in Poulton 230
NLP Applications to a Corpus of Corporate Bankruptcy Documents
Data extraction from legal text presents a number of challenges that can be addressed using Natural Language Processing (NLP) methods. I discuss several applications that arise from a corpus of approximately 50 million pages of bankruptcy documents. These constitute substantially all documents from the universe of Chapter 11 cases filed between 2004 and 2014 that involved firms with over $10 million in assets. Examples of NLP applications include various classification issues (nested-phrase docket entries, financial reports, and legal writing), Part-of-Speech tagging, Optical Character Recognition, and quasi-tabular text.
Katherine Waldock is an Assistant Professor of Finance at the McDonough School of Business and holds a courtesy joint appointment with the Georgetown Law Center. She received a Ph.D. in Finance from the NYU Stern School of Business and a B.A. in Economics from Harvard University. Her primary research interests are in corporate bankruptcy, law and finance, small businesses, and financial institutions.
Tim Finin (UMBC)
CS Colloquium 10/27/17, 11:00 in St. Mary’s 326
From Strings to Things: Populating Knowledge Graphs from Text
The Web is the greatest source of general knowledge available today but its current form suffers from two limitations. The first is that text and multimedia objects on the Web are easy for people to understand but difficult for machines to interpret and use. The second is the Web's access paradigm, which remains dominated by information retrieval, where keyword queries produce a ranked list of documents that must be read to find the desired information. I'll discuss research in natural language understanding and semantic web technologies that addresses both problems by extracting information from text to produce and populate Web-compatible knowledge graphs. The resulting knowledge bases have multiple uses, including (1) moving the Web's access paradigm from retrieving documents to answering questions, (2) embedding semi-structured knowledge in Web pages in formats designed for computers to understand, (3) providing intelligent computer systems with information they need to perform their tasks, (4) allowing the extracted data and knowledge to be more easily integrated, enabling inference and advanced analytics and (5) serving as background knowledge to improve text and speech understanding systems. I will also cover current work on applying the techniques to extract and use cybersecurity-related information from documents, the Web and social media.
Tim Finin is the Willard and Lillian Hackerman Chair in Engineering and a Professor of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County (UMBC). He has over 35 years of experience in applications of artificial intelligence to problems in information systems and language understanding. His current research is focused on the Semantic Web, analyzing and extracting information from text, and on enhancing security and privacy in computing systems. He is a fellow of the Association for the Advancement of Artificial Intelligence, an IEEE technical achievement award recipient and was selected as the UMBC Presidential Research Professor in 2012. He received an S.B. degree from MIT and a Ph.D. from the University of Illinois at Urbana-Champaign. He has held full-time positions at UMBC, Unisys, the University of Pennsylvania and the MIT AI Laboratory. He served as an editor-in-chief of the Journal of Web Semantics and is a co-editor of the Viewpoints section of the Communications of the ACM.
Matthew Marge (ARL)
CS Colloquium 11/3/17, 1:00 in St. Mary’s 326
Towards Natural Dialogue with Robots
Robots can be more effective teammates with people if they can engage in natural language dialogue. In this talk, I will address one fundamental research problem in achieving this goal: understanding how people will talk to robots in collaborative tasks, and how robots could respond in natural language to maintain an effective dialogue that stays on track. The unique contribution of this research is the adoption of a multi-phased approach to building spoken dialogue systems that starts with exploratory data collection of human-robot dialogue with a human “wizard” standing in for the robot’s language processing behind the scenes, and ends with training a dialogue system that automates away the wizard.
With the ultimate goal of an autonomous conversational robot in mind, I will focus on the initial experiments that aim to collect computationally tractable human-robot dialogue without sacrificing naturalness. I will show how this approach can efficiently collect dialogue in the navigation domain, and in a form suitable for training a conversational robot. I will also present a novel annotation scheme for dialogue semantics and structure that captures the types of instructions that people gave to the robot, showing that over time these can change as people better assess the robot’s capabilities. Finally, I’ll place this research effort in the broader context of enabling better teaming between people and robots.
This is joint work with colleagues at ARL and at the USC Institute for Creative Technologies.
Matthew Marge is a Research Scientist at the Army Research Lab (ARL). His research focuses on improving how robots and other artificial agents can build common ground with people via natural language. His current interests lie at the intersection of computational linguistics and human-robot interaction, specializing in dialogue systems. He received the Ph.D. and M.S. degrees in Language and Information Technologies from the School of Computer Science at Carnegie Mellon University, and the M.S. degree in Artificial Intelligence from the University of Edinburgh.
Ben Carterette (Delaware)
CS Colloquium 11/10/17, 11:00 in St. Mary’s 326
Offline Evaluation of Search Systems Using Online Data
Evaluation of search effectiveness is very important for being able to iteratively develop improved algorithms, but it is not always easy to do. Batch experimentation using test collections—the traditional approach dating back to the 1950s—is fast but has high start-up costs and requires strong assumptions about users and their information needs. User studies are slow and have high variance, making them difficult to generalize and certainly not possible to apply during iterative development. Online experimentation using A/B tests, pioneered and refined by companies such as Google and Microsoft, can be fast but is limited in other ways.
In this talk I present work we have done and work in progress on using logged online user data to do evaluation offline. I will discuss some of the user simulation work I have done with my students in the context of evaluating system effectiveness over user search sessions (in the context of the TREC Session track), based on training models on logged data for use offline. I will also discuss work on using historical logged data to re-weight search outputs for evaluation, focusing on how to collect that data to arrive at unbiased conclusions. The latter is work I am doing while on sabbatical at Spotify, which provides many motivating examples.
Ben Carterette is an Associate Professor in the Department of Computer and Information Sciences at the University of Delaware, and currently on sabbatical as a Research Scientist at Spotify in New York City. He primarily researches search evaluation, including everything from designing search experiments to building test collections to obtaining relevance judgments to using them in evaluation measures to statistical testing of results. He completed his PhD with James Allan at the University of Massachusetts Amherst on low-cost methods for acquiring relevance judgments for IR evaluation. He has published over 80 papers, won 4 Best Paper Awards, and co-organized two ACM SIGIR-sponsored conferences—WSDM 2014 and ICTIR 2016—in addition to nearly a decade's worth of TREC tracks and several workshops on topics related to new test collections and evaluation. He was also elected SIGIR Treasurer in 2016.
Laura Dietz (UNH)
CS Colloquium 11/14/17, 11:00 in St. Mary’s 326
Retrieving Complex Answers through Knowledge Graph and Text
We all turn towards Wikipedia with questions we want to know more about, but eventually find ourselves at the limit of its coverage. Instead of providing "ten blue links" as common in Web search, why not answer any web query with something that looks and feels like Wikipedia? This talk is about algorithms that automatically retrieve and identify relevant entities and relevant relations and can identify text to explain this relevance to the user. The trick is to model the duality between structured knowledge and unstructured text. This leads to supervised retrieval models that can jointly identify relevant Web documents, Wikipedia entities, and extract support passages to populate knowledge articles.
Laura Dietz is an Assistant Professor at the University of New Hampshire, where she teaches "Information Retrieval" and "Data Science for Knowledge Graphs and Text". She coordinates the TREC Complex Answer Retrieval Track and runs a tutorial/workshop series on Utilizing Knowledge Graphs in Text-centric Retrieval. Previously, she was a research scientist in the Data and Web Science group at Mannheim University, and a research scientist with Bruce Croft and Andrew McCallum at the Center for Intelligent Information Retrieval (CIIR) at UMass Amherst. She obtained her doctoral degree with a thesis on topic models for networked data from Max Planck Institute for Informatics, supervised by Tobias Scheffer and Gerhard Weikum.
Ben Van Durme (JHU)
CS Colloquium 11/17/17, 11:00 in St. Mary’s 326
Universal Decompositional Semantics
The dominant strategy for capturing a symbolic representation of natural language has focussed on categorical annotations that lend themselves to structured multi-class classification: for example, predicting whether a given syntactic subject satisfies the definition of the AGENT thematic role. These annotations typically result from professionals coming to mutual agreement on semantic ontologies. The JHU Decompositional Semantics Initiative (decomp.net) is exploring a framework for semantic representation utilizing simple statements confirmed by everyday people, e.g., "The [highlighted syntactic subject] was aware of the [eventuality characterized by the salient verb]". This is conducive to a piece-wise, incremental, exploratory approach to developing a meaning representation. The resulting data relates to recent work in natural language inference and common sense, two topics of increasing interest within computational linguistics.
Benjamin Van Durme is an Assistant Professor in the departments of Computer Science and Cognitive Science at Johns Hopkins University, a member of the Center for Language and Speech Processing (CLSP), and the lead of Natural Language Understanding research at the JHU Human Language Technology Center of Excellence (HLTCOE). His research group in CLSP consists of over a dozen graduate students, with additional post-docs, research staff, and a variety of close collaborations with fellow faculty at JHU and universities in the mid-Atlantic. His research covers a spectrum from computational semantics to applied frameworks for knowledge discovery on large, possibly streaming collections of text and recently photos. He is currently the PI for projects under DARPA DEFT, DARPA LORELEI, DARPA AIDA, and coPI on IARPA MATERIAL. His work has been supported by the NSF and companies including Google, Microsoft, Bloomberg, and Spoken Communications. Benjamin has worked previously at Google, Lockheed Martin, and BAE Systems. He received an MS in Language Technologies from the CMU Language Technologies Institute, followed by a PhD in Computer Science and Linguistics at the University of Rochester, working with Lenhart Schubert, Daniel Gildea and Gregory Carlson.
Jordan Boyd-Graber (UMD)
CS Colloquium 12/1/17, 1:00 in St. Mary’s 326
Cooperative and Competitive Machine Learning through Question Answering
My research goal is to create machine learning algorithms that are interpretable to humans, that can understand human strengths and weaknesses, and can help humans improve themselves. In this talk, I'll discuss how we accomplish this through a trivia game called quiz bowl. These questions are written so that they can be interrupted by someone who knows more about the answer; that is, harder clues are at the start of the question and easier clues are at the end of the question: a player must decide when it has enough information to "buzz in". Our system to answer quiz bowl questions depends on two parts: one to identify the answer to questions and one to determine when to buzz. We discuss how deep averaging networks (fast neural bag-of-words models) can help us answer questions quickly using diverse training data (previous questions, raw text of novels, Wikipedia pages) to determine the right answer and how deep reinforcement learning can help us determine when to buzz.
More importantly, however, this setting also helps us build systems to adapt in cooperation and competition with humans. In competition, we are also able to understand the skill sets of our competitors to adjust our strategy to optimize our performance against players using a deep mixture of experts opponent model. The game of quiz bowl also allows opportunities to better understand interpretability in deep learning models to help human players perform better with machine cooperation. This cooperation helps us with a related task, simultaneous machine translation.
Finally, I'll discuss opportunities for broader participation through open human-computer competitions: http://hcqa.boydgraber.org/
Jordan Boyd-Graber is an associate professor in the University of Maryland's Computer Science Department, iSchool, UMIACS, and Language Science Center. Jordan's research focus is in applying machine learning and Bayesian probabilistic models to problems that help us better understand social interaction or the human cognitive process. He and his students have won "best of" awards at NIPS (2009, 2015), NAACL (2016), and CoNLL (2015), and Jordan won the British Computer Society's 2015 Karen Spärck Jones Award and a 2017 NSF CAREER award. His research has been funded by DARPA, IARPA, NSF, NCSES, ARL, NIH, and Lockheed Martin and has been featured by CNN, Huffington Post, New York Magazine, and the Wall Street Journal.
Ellie Pavlick (Google/Brown)
CS Colloquium 1/19/18, 11:00 in St. Mary’s 204
Compositional Lexical Entailment for Natural Language Inference
In this talk, I will discuss my thesis work on training computers to make inferences about what is true or false based on information expressed in natural language. My approach combines machine learning with insights from formal linguistics in order to build data-driven models of semantics which are more precise and interpretable than would be possible using linguistically naive approaches. I will begin with my work on automatically adding semantic annotations to the 100 million phrase pairs in the Paraphrase Database (PPDB). These annotations provide the type of information necessary for carrying out precise inferences in natural language, transforming the database into the largest available lexical semantics resource for natural language processing. I will then turn to the problem of compositional entailment, and present an algorithm for performing inferences about long phrases which are unlikely to have been observed in data. Finally, I will discuss my current work on pragmatic reasoning: when and how humans derive meaning from a sentence beyond what is literally contained in the words. I will describe the difficulties that such "common-sense" inference poses for automatic language understanding, and present my on-going work on models for overcoming these challenges.
Ellie Pavlick is currently a Postdoctoral Fellow at Google Research in NY. She will join Brown University as an Assistant Professor of Computer Science in July. Ellie received her PhD from the University of Pennsylvania, where her dissertation focused on natural language inference and entailment. Outside of her dissertation research, Ellie has published work on stylistic variation in paraphrase (e.g., how paraphrases can affect the formality or the complexity of language) and on applications of crowdsourcing to natural language processing and social science problems.
Burr Settles (Duolingo)
CS Colloquium 1/26/18, 11:00 in St. Mary’s 111
Duolingo: Improving Language Learning and Assessment with Data
Student learning data can and should be analyzed to develop new instructional technologies, such as personalized practice schedules and data-driven proficiency assessments. I will describe several projects at Duolingo—the world's most popular language education platform with more than 200 million students worldwide—where we combine vast amounts of learner data with machine learning, computational linguistics, and psychometrics to improve learning, testing, and engagement.
Burr Settles leads the research group at Duolingo, developing statistical machine learning systems to improve learning, engagement, and assessment. He also runs FAWM.ORG (a global collaborative songwriting experiment) and is the author of Active Learning—a text on AI algorithms that are curious and exploratory (if you will). His research has been published in numerous journals and conferences, and has been featured in The New York Times, Slate, Forbes, and WIRED. In past lives, he was a postdoc at Carnegie Mellon and earned his PhD from UW-Madison. He lives in Pittsburgh, where he gets around by bike and plays guitar in the pop band Delicious Pastries.
Sam Han (Washington Post)
CS Colloquium 2/1/18, 10:00 in St. Mary’s 326
Data Science @WaPo: How data science can help publishers succeed
The data science team at the Post has built a Big Data platform that helps to develop applications for personalization, newsroom productivity improvement and targeted advertising. They ingest data from the digital side (washingtonpost.com and apps), paper, and external sources into the platform. They build various applications using the data stored in the platform. These applications are built to enhance user experience, perform targeted advertising and improve newsroom work productivity. They also build tools to help the newsroom adapt to the demands of digital journalism.
Sam will cover some of these applications in detail and share challenges and insights learned from the projects:
- Clavis is an audience targeting platform and is at the root of personalization efforts. Clavis analyzes everything they publish, builds user profiles, and provides personalized content and brand messaging.
- Virality is a machine learning based system that predicts the popularity of articles based on content, metadata, site traffic and social chatter.
- ModBot is an automatic comment moderation system that helps us to maintain a high-quality comment section.
- Heliograf is an automated storytelling agent that automates the writing of data-driven articles and frees up reporters to focus on the high-quality stories.
- Bandito allows the newsroom to test different headlines and other UX elements, determines the best performing experience and serves it to as much traffic as fast as possible.
- Headliner is an automated system that proposes several headlines for an article.
Eui-Hong (Sam) Han is Director, Data Science & AI at The Washington Post. Sam is an experienced practitioner of data mining and machine learning. He has an in-depth understanding of analytics technologies and has experience successfully applying these technologies to solve real business problems. At the Washington Post, he is leading a team to build an integrated Big Data platform to store all aspects of customer profiles and activities from both digital and print circulation, metadata of content, and business data. His team builds infrastructure, tools, and services to provide a personalized experience to customers, to empower the newsroom with data for better decisions, and to provide targeted advertising capability. Prior to joining The Washington Post, he led the Big Data practice at Persistent Systems, started the Machine Learning Group at Sears Holdings Online Business Unit, and worked for a data mining startup company. His expertise includes data mining, machine learning, information retrieval, and high performance computing. He holds a PhD in Computer Science from the University of Minnesota.
Shuly Wintner (Haifa)
CS Colloquium 2/2/18, 1:30 in St. Mary’s 326
Computational Approaches to the Study of Translation (and other Crosslingual Language Varieties)
Translated texts, in any language, have unique characteristics that set them apart from texts originally written in the same language; to some extent, they form a sub-language, called "translationese". Some of the properties of translationese stem from interference from the source language (the so-called "fingerprints" of the source language on the translation product); others are source-language-independent, and are presumably universal. These include phenomena resulting from three main processes: simplification, standardization and explicitation.
I will describe research that uses standard (supervised and unsupervised) text classification techniques to distinguish between translations and originals. I will focus on the features that best separate the two classes, and how these features corroborate some (but not all) of the hypotheses set forth by Translation Studies scholars. More generally, I will show how computational methodologies shed light on pertinent Translation Studies questions.
Translation is only one instance of language that is affected by the interaction of more than one linguistic system. Another instance is the language of advanced, highly-fluent non-native speakers. Are translations and non-native language similar? In what respects? And are such similarities the result of interference or of more "universal" properties? I will discuss recent work that uses text classification to address these questions. In particular, I will describe work that addresses the identification of the source language of translations, and relate it to the task of Native Language Identification.
Shuly Wintner is professor of computer science at the University of Haifa, Israel. His research spans various areas of computational linguistics and natural language processing, including formal grammars, morphology, syntax, language resources, and translation. He served as the editor-in-chief of Springer's Research on Language and Computation, a program co-chair of EACL-2006, and the general chair of EACL-2014. He was among the founders, and twice (6 years) the chair, of ACL SIG Semitic. He is currently the Chair of the Faculty Union at the University of Haifa.
Rebecca Hwa (Pittsburgh)
Linguistics Speaker Series 2/16/18, 3:30 in Poulton 230
Separating the Sheep from the Goats: On recognizing the Literal and Figurative Usages of Idioms
Typically, we think of idioms as colorful expressions whose literal interpretations don’t match their underlying meaning. However, many idiomatic expressions can be used either figuratively or literally, depending on their contexts. In this talk, we survey both supervised and unsupervised methods for training a classifier to automatically distinguish usages of idiomatic expressions. We will conclude with a discussion about some potential applications.
Rebecca Hwa is an Associate Professor in the Department of Computer Science at the University of Pittsburgh. Her recent research focuses on understanding persuasion from a computational linguistics perspective. Some of her recent projects include: modeling student behaviors in revising argumentative essays, identifying symbolism in visual rhetoric, and understanding idiomatic expressions. Dr. Hwa is a recipient of the NSF CAREER Award. Her work has also been supported by NIH and DARPA. She has been the Chair of the North American Chapter of the Association for Computational Linguistics.
Mark Steedman (Edinburgh)
Linguistics Speaker Series 2/23/18, 3:30 in Poulton 230
Bootstrapping Language Acquisition
Recent work with Abend, Kwiatkowski, Smith, and Goldwater (2016) has shown that a general-purpose program for inducing parsers incrementally from sequences of paired strings (in any language) and meanings (in any convenient language of logical form) can be applied to real English child-directed utterances from the CHILDES corpus to successfully learn the child's ("Eve's") grammar, combining lexical and syntactic learning in a single pass through the data.
While the earliest stages of learning necessarily proceed by pure "semantic bootstrapping", building a probabilistic model of all possible pairings of all possible words and derivations with all possible decompositions of logical form, the later stages of learning show emergent effects of "syntactic bootstrapping" (Gleitman 1990), where the program's increasing knowledge of the grammar of the language allows it to identify the syntactic type and meaning of unseen words in one trial, as has been shown to be characteristic of real children in experiments with nonce-word learning. The concluding section of the talk considers the extension of the learner to a more realistic semantics including information structure and conversational dynamics.
Mark Steedman is Professor of Cognitive Science in the School of Informatics at the University of Edinburgh. Previously, he taught as Professor in the Department of Computer and Information Science at the University of Pennsylvania, which he joined as Associate Professor in 1988. His PhD in Artificial Intelligence is from the University of Edinburgh. He is a Fellow of the Association for the Advancement of Artificial Intelligence, the British Academy, the Royal Society of Edinburgh, the Association for Computational Linguistics, and the Cognitive Science Society, and a Member of the European Academy. His research interests cover issues in computational linguistics, artificial intelligence, computer science and cognitive science, including syntax and semantics of natural language, wide-coverage parsing and open-domain question-answering, comprehension of natural language discourse by humans and by machine, grammar-based language modeling, natural language generation, and the semantics of intonation in spoken discourse. Much of his current NLP research is addressed to probabilistic parsing and robust semantics for question-answering using the CCG grammar formalism, including the acquisition of language from paired sentences and meanings by child and machine. He sometimes works with colleagues in computer animation using these theories to guide the graphical animation of speaking virtual or simulated autonomous human agents, for which he recently shared the 2017 IFAAMAS Influential Paper Award for a 1994 paper with Justine Cassell and others. Some of his research concerns the analysis of music by humans and machines.
Claire Bonial (ARL)
Linguistics Speaker Series 3/23/18, 3:30 in Poulton 230
Event Semantics in Text Constructions, Vision, and Human-Robot Dialogue
“Ok, robot, make a right and take a picture” – a simple instruction like this exemplifies some of the obstacles in our research on human-robot dialogue: how are make and take to be interpreted? What precise actions should be executed? In this presentation, I explore three challenges: 1) interpreting the semantics of constructions in which verb meanings are extended in novel usages, 2) recognizing activities and events in images/video by employing information about the objects and participants typically involved, and 3) mapping natural language instructions to the physically situated actions executed by a robot. Throughout these distinct research areas, I leverage both Neo-Davidsonian styles of event representation and the principles of Construction Grammar in addressing these challenges for interpretation and execution.
Claire Bonial is a computational linguist specializing in the murky world of event semantics. In her efforts to make this world computationally tractable, she has collaborated on a variety of Natural Language Processing semantic role labeling projects, including PropBank, VerbNet, and Abstract Meaning Representation. A focused contribution to these projects has been her theoretical and psycholinguistic research on both the syntax and semantics of English light verb constructions (e.g., take a walk, make a mistake). Bonial received her Ph.D. in Linguistics and Cognitive Science in 2014 from the University of Colorado Boulder. Bonial began her current position in the Computational and Information Sciences Directorate of the Army Research Laboratory (ARL) in 2015. Since joining ARL, she has expanded her research portfolio to include multi-modal representations of events (text and imagery/video), as well as human-robot dialogue.
CS Ph.D. dissertation defense 3/19/18, 10:00 in St. Mary’s 326
Text summarization and categorization for scientific and health-related data
The increasing amount of unstructured health-related data has created a need for intelligent processing to extract meaningful knowledge. This knowledge can be utilized to promote healthcare and wellbeing of individuals. My research goal in this dissertation is to develop Natural Language Processing (NLP) and Information Retrieval (IR) methods for better understanding, summarizing and categorizing scientific literature and other health-related information.
First, I focus on scientific literature as the main source of knowledge distribution in scientific fields. It has become a challenge for researchers to keep up with the increasing rate at which scientific findings are published. As an attempt to address this problem, I propose summarization methods using citation texts and discourse structure of the papers to provide a concise representation of important contributions of the papers. I also investigate methods to address the problem of citation inaccuracy by linking the citations to their related parts in the target paper, capturing their relevant context. In addition, I raise the problem of the inadequacy of current summarization evaluation metrics for summarization in the scientific domain and present a method based on semantic relevance for evaluating the summaries.
In the second part, I focus on other significant sources of health-related information including clinical narratives and social media. I investigate categorization methods to address the critical problem of medical errors, which are among the leading causes of death worldwide. I demonstrate how we can effectively identify significant reporting errors and harmful cases through medical narratives, which could help prevent similar future problems. These approaches include both carefully designed feature-rich methods and more generalizable neural networks. Mental health is another significant dimension of health and wellbeing. Suicide, the most serious challenge in mental health, accounts for approximately 1.4% of all deaths worldwide, and approximately one person dies by suicide every 40 seconds. I present both feature-rich and data-driven methods to capture mental health conditions, such as depression, self-harm, and suicide, based on the general language expressed on social media. These methods have clear clinical and scientific applications and can help individuals with mental health conditions.
Advisor: Nazli Goharian
CS Ph.D. dissertation defense 3/26/18, 10:00 in St. Mary’s 326
The Knowledge and Language Gap in Medical Information Seeking
Interest in medical information retrieval has risen significantly in the last few years. The Internet has become a primary source for consumers looking for health information and advice; however, their lack of expertise causes a language and knowledge gap that affects their ability to properly formulate their information needs. Health experts also struggle to efficiently search the large amount of medical literature available to them, which impacts their ability to integrate the latest research findings into clinical practice. In this dissertation, I propose several methods to overcome these challenges, thus improving search outcomes. For queries issued by lay users, I introduce query clarification, a technique to identify the most appropriate expert expression that describes their information need; such an expression is then used to expand the query. I experiment with three existing synonym mappings, and show that the best one leads to a 7.3% improvement over non-clarified queries. When a classifier that predicts the most appropriate mapping for each query is used, an additional 5.2% improvement over non-clarified queries is achieved. Furthermore, I introduce a set of features to capture semantic similarity between consumer queries and retrieved documents, which are then exploited by a learning-to-rank framework. This approach yields a 26.6% improvement over the best known results on a dataset designed to evaluate medical information retrieval for lay users. To improve literature search for medical professionals, I propose and evaluate two query reformulation techniques that expand complex medical queries with relevant latent and explicit medical concepts. The first is an unsupervised system that combines statistical query expansion with a medical terms filter, while the second is a supervised neural convolutional model that predicts which terms to add to medical queries. Both approaches are competitive with the state of the art, achieving up to 8% improvement in inferred nDCG. Finally, I conclude my dissertation by showing how the convolutional model can be adapted to reduce clinical notes that contain significant noise, such as medical abbreviations, incomplete sentences, and redundant information. This approach outperforms the best query reformulation system for this task by 27% in inferred nDCG.
Advisor: Nazli Goharian
Paul Smolensky (JHU)
Linguistics Speaker Series 4/13/18, 11:00 in Poulton 230
Gradient Symbolic Representations in Grammar: The case of French Liaison (a co-authored study by Paul Smolensky, Matthew Goldrick & Eric Rosen)
In Gradient Symbolic Computation, representations are structures in which a given position hosts a blend of symbols each with its own numerical activity level. A goal of Gradient Symbolic Computation for linguistic theory is to enable resolution of a ubiquitous type of impasse: two conflicting structures for analyzing the same linguistic phenomenon persist indefinitely because each can handle data that the other cannot. The hypothesis is that a single theory deploying a numerically weighted blend of the two proposed structures can simultaneously account for both bodies of data. I will describe an attempt at testing this hypothesis: blending two theories of French liaison, a phenomenon that has been argued to be regulated by prosodic boundaries, and one in which child data play a provocative role.
Dr. Smolensky is a professor of Cognitive Science at Johns Hopkins University. His research focuses on integrating symbolic and neural network computation for modeling reasoning and, especially, grammar in the human mind/brain. The work is formal and computational, with emerging applications to neuroscience and applied natural language processing. His research has primarily addressed issues of representation and processing rather than learning. Principal contributions are to linguistic theory, the theory of vectorial neural network computation, and the philosophical foundations of cognitive science.
John Conroy (IDA Center for Computing Sciences)
CS Colloquium 4/20/18, 11:00 in St. Mary’s
Multilingual Summarization and Evaluation Using Wikipedia Featured Articles
Multilingual text summarization is a challenging task and an active area of research within the natural language processing community. In this talk I will give an overview of how Wikipedia featured articles are used to create datasets comprising about 40 languages for the training and testing of automatic single-document summarization methods. These datasets were used in the single-document summarization tasks of the 2015 and 2017 MultiLing Workshops. I will give an overview of the methods used both to generate and to evaluate the summaries submitted for the tasks. Overall system performance is measured using automatic and human evaluations, and these data are analyzed to evaluate the effectiveness of the automatic methods for multilingual summarization evaluation. Thus, the results suggest not only which approaches to automatic text summarization generalize across a wide range of languages but also which evaluation metrics are best at predicting human judgments in the multilingual summarization task. This talk is based on a soon-to-appear book chapter to be published by World Scientific Press. The chapter is jointly written with Jeff Kubina, Peter Rankel, and Julia Yang.
John M. Conroy is a graduate of Central High School of Philadelphia, with a BA, and of St. Joseph's University of Philadelphia, where he received a BS in Mathematics. Conroy then studied Applied Mathematics with a concentration in Computer Science at the University of Maryland, where he received an MS and PhD in Mathematics. He has been a research staff member at the IDA Center for Computing Sciences for over 30 years. Conroy is the co-developer of the CLASSY and the OCCAMS text summarization systems. He has published widely in text summarization and evaluation and serves on numerous program committees for summarization. He is also a co-inventor of patented summarization methods. Other publications by Conroy address high-performance matrix computations, graph matching, and anomaly detection, with application to neural science and network security. He is a member of the Association for Computational Linguistics, a life member of the Society for Industrial and Applied Mathematics, and a member of the Institute of Electrical and Electronics Engineers.
CS Ph.D. dissertation defense 4/17/18, 11:00 in Regents 551
Dynamic Search Models and Applications
Dynamic search is an information retrieval task that involves a sequence of queries for a complex information need (e.g. searching for one-week tour plans in Italy). It is characterized by rich user-system interactions and temporal dependency between queries and between consecutive user behaviors. Dynamic search is a trial-and-error process, which matches well with the family of Reinforcement Learning (RL) algorithms: the RL algorithms learn from repeated, varied attempts which are continued until success. The learner/agent learns from its dynamic interactions with the world. These similarities inspire me to model dynamic search using RL frameworks.
In particular, I model dynamic search as a dual-agent stochastic game, one of the standard variants of the Partially Observable Markov Decision Process (POMDP), where the user agent and the search engine agent work together to jointly maximize their long-term rewards. In the framework, users’ search statuses, such as exploitation, exploration, struggle, etc., are modeled as hidden states, which can only be estimated through user interaction data. In each search iteration, one search algorithm is picked from a set of candidates to maximize a reward function. My work provides a general framework to model dynamic search. It enables the use of Reinforcement Learning algorithms for optimizing retrieval results. The experiments on the Text REtrieval Conference (TREC) 2011–2014 Session Track datasets show a statistically significant improvement over the state-of-the-art dynamic search algorithms.
The design of states, actions, and rewards is quite flexible when using POMDPs in dynamic search. In the thesis, I also examine all available design choices from related work and compare their retrieval accuracy and efficiency. My analysis reveals the effects of these choices, which enables me to recommend practical design choices. The finding again shows that modeling dynamic search using POMDPs is promising; however, it also shows that this approach is computationally demanding.
To improve the efficiency of the above dynamic search models, I propose another RL framework, direct policy learning, which finds optimal policies for the best search engine actions directly from what is observed in the user and search engine interactions via gradient descent. The proposed framework greatly reduces model complexity compared with the POMDP framework. It is also a flexible design, which includes a wide range of features describing the rich interactions in dynamic search. The framework is shown to be highly effective and efficient on the TREC Session Tracks.
In addition to the dynamic search frameworks, I propose predictive models to detect user struggling states in search. Most prior work uses effort-based features to detect user struggle. However, recent studies suggest that there might be a gap between user effort and user struggle. In this work, I take a psychological perspective to bridge this gap. I demonstrate that after removing the biases introduced by different searcher motivations, user struggle can be tightly connected to user effort.
Finally, I implement a dynamic search tool to support the annotation task for the TREC Dynamic Domain Tracks. This serves as my first trial of implementing dynamic search algorithms and struggle detection in practice. The results show that my algorithm is also effective in real-life settings.
The research in this thesis is among the first in this field and serves as one step towards solving dynamic search and developing real-world applications.
Advisor: Grace Hui Yang
Maite Taboada (Simon Fraser)
Linguistics Speaker Series 4/20/18, 3:30 in Poulton 230
Fantastic online comments and how to find them
I provide an overview of my current research on discourse and computational methods to analyze social media language. The first part of the talk will be devoted to outlining the two frameworks for this research: rhetorical relations and sentiment analysis. Rhetorical relations are the fundamental building blocks of discourse, connecting propositions to make coherent text. I will describe existing research on rhetorical relations, and present a study on how relations are signalled by discourse markers and other linguistic devices. Then I introduce my work on sentiment analysis, and on the role that rhetorical relations and other contextual factors play in the interpretation of sentiment and opinion.
The second part of the talk is devoted to describing a current project, analyzing online news comments in terms of constructiveness and toxicity. Using a large corpus of comments, I describe how we have modelled constructiveness in terms of rhetorical and argumentation relations, and toxicity as a type of extreme negative sentiment.
Maite Taboada is Professor in the Department of Linguistics at SFU. Her research combines discourse analysis and computational linguistics, with an emphasis on discourse relations and sentiment analysis. Current work focuses on the analysis of online comments, drawing insights from corpus linguistics, computational linguistics and big data. She is the director of the Discourse Processing Lab at SFU.
Alexander Rush (Harvard)
CS Colloquium 4/26/18, 12:30 in St. Mary’s 326
Challenges in End-to-End Generation
Progress in NMT has led to optimism for text generation tasks such as summarization and dialogue, but it has been more difficult to quantify the successes and challenges in this space. In this talk, I will survey some of the recent advances in neural NLG, and present a successful implementation of these techniques for the 2017 E2E NLG challenge (Gehrmann et al, 2018). Despite success on these small scale examples, though, we see that similar models fail to scale to a more realistic data-to-document corpus. Analysis shows systems will need further improvements in discourse modeling, reference, and referring expression generation (Wiseman et al, 2017). Finally, I will end by presenting recent work in unsupervised NLG that shows promising results in neural style transfer using a continuous GAN-based text autoencoder (Zhao et al 2017).
Alexander "Sasha" Rush is an assistant professor at Harvard University. His research interest is in ML methods for NLP with recent focus on deep learning for text generation including applications in machine translation, data and document summarization, and diagram-to-text generation, as well as the development of the OpenNMT translation system. His past work focused on structured prediction and combinatorial optimization for NLP. Sasha received his PhD from MIT supervised by Michael Collins and was a postdoc at Facebook NY under Yann LeCun. His work has received four research awards at major NLP conferences.
Alexander Rush (Harvard)
Statistical Machine Translation course 4/26/18, 3:30 in White-Gravenor 206
Towards Easier Machine Translation
Much has been made about the accuracy improvements from neural systems, but the transition to NMT will potentially make the technology of translation more accessible in transformative ways. I will present three projects from Harvard with this goal: 1) OpenNMT, a collaborative open-source project to provide benchmark NMT components; 2) sequence distillation, a research project to build smaller, faster, on-device NMT systems; and 3) Sequence Vis, a visualization framework designed to support inspection and debugging of translation output. These projects aim to make translation systems easier to extend, easier to run, and easier to understand.
Alexander "Sasha" Rush is an assistant professor at Harvard University. His research interest is in ML methods for NLP with recent focus on deep learning for text generation including applications in machine translation, data and document summarization, and diagram-to-text generation, as well as the development of the OpenNMT translation system. His past work focused on structured prediction and combinatorial optimization for NLP. Sasha received his PhD from MIT supervised by Michael Collins and was a postdoc at Facebook NY under Yann LeCun. His work has received four research awards at major NLP conferences.
William Croft (UNM)
Linguistics Speaker Series Extra 5/14/18, 3:30 in Poulton 230
A Mental Space Analysis of Tense and Modality
This talk presents a progress report on developing a mental space analysis of tense and modality, based on prior cognitive semantic and typological research in those domains. Mental spaces are a model to represent "alternative realities", including past and future times and unrealized events, the focus of our interest. Mental spaces are evoked by specific grammatical constructions including tense, mood and modality, complement-taking predicates and conditional constructions. Following Cutrer (1994), we analyze these constructions as providing access paths for the hearer from one space (viewpoint) to another space (focus), which may be a new space or one already established in the discourse. We propose a revised (and simplified) version of Cutrer's analysis of tense, based on Comrie's (1981) analyses. We also adopt Fillmore's (1990) analysis of conditionals, treating his concept of epistemic stance as a relation between the hypothetical space and the "reality" (speaker's belief) space. We extend epistemic stance to model epistemic modality, following Boye (2012), and use a mental space analysis of Clark's (1996) theory of common ground (shared knowledge) extended to individual knowledge to model Boye's theory of the relationship between epistemic modality and evidentiality. Finally, we argue that hypothetical and "reality" (speaker belief) spaces are of the same kind, differing only in epistemic stance, and that they should ultimately be embedded in an interactional model of the negotiation of shared knowledge, as indicated by the grouping of epistemic, evidential and (knowledge) interactional categories in a single grammatical category (see Palmer 2001).
William Croft received his Ph.D. in 1986 at Stanford University under Joseph Greenberg. He has taught at the Universities of Michigan, Manchester (UK) and New Mexico, and has been a visiting scholar at the Max Planck Institute for Psycholinguistics (Nijmegen, the Netherlands), the Max Planck Institute for Evolutionary Anthropology (Leipzig, Germany), and the Center for Advanced Study in the Behavioral Sciences. He has written several books, including Typology and Universals, Explaining Language Change, Radical Construction Grammar, Cognitive Linguistics [with D. Alan Cruse] and Verbs: Aspect and Causal Structure. He is currently working on his next book, Morphosyntax: Constructions of the World’s Languages. His primary research areas are typology, semantics, construction grammar and language change.
Adam Lopez (Edinburgh)
CS Colloquium 6/18/18, 11:00 in St. Mary’s 326
What do neural networks learn about language?
Neural network models have redefined the state-of-the-art in many areas of natural language processing. Much of this success is attributed to their ability to learn representations of their input, and this has invited bold claims that these representations encode important semantic, syntactic, and morphological properties of language. For example, when one research group recently suggested that "prior information regarding morphology ... among others, should be incorporated" into neural models, a prominent deep learning group retorted that it is "unnecessary to consider these prior information" when using neural networks. In this talk I’ll try to tease apart the hype from the reality, focusing on two questions: what do character-level neural models really learn about morphology? And what do LSTMs learn about negation?
This is work with Clara Vania, Federico Fancellu, Yova Kementchedjhieva, Andreas Grivas, and Bonnie Webber.
Adam Lopez is a Reader in the School of Informatics at the University of Edinburgh ("Reader" is a peculiar British title meaning "Associate Professor"). His research group develops computational models of natural language learning, understanding and generation in people and machines, and their research focuses on basic scientific, mathematical, and engineering problems related to these models. He's especially interested in models that handle diverse linguistic phenomena across languages.
Brendan O’Connor (UMass Amherst)
Linguistics Speaker Series 9/7/18, 3:30 in Poulton 230
Demographic bias in social media language analysis: a case study of African-American English
We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter, through a demographically supervised model to identify AAE-like language associated with geo-located messages. We verify that this language follows well-known AAE linguistic phenomena -- and furthermore, existing tools like language identification, part-of-speech tagging, and dependency parsing fail on this AAE-like language more often than on text associated with white speakers. We leverage our model to fix racial bias in some of these tools, and discuss future implications for fairness and artificial intelligence.
Brendan O'Connor is an assistant professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, and works at the intersection of computational social science and natural language processing – studying how social factors influence language technologies, and how to better understand social trends with text analysis. For example, his research investigates racial bias in NLP technologies, political events reported in news, and opinions and slang on Twitter; his work has been featured in the New York Times and the Wall Street Journal. He received his PhD in 2014 from Carnegie Mellon University's Machine Learning Department, advised by Noah Smith, and has previously been a Visiting Fellow at the Harvard Institute for Quantitative Social Science, and an intern with the Facebook Data Science team. He holds a BS/MS in Symbolic Systems from Stanford University.
Rebecca Krawiec (Canisius),
Christine Luckritz Marquis (Union Presbyterian),
Beth Platte (Reed),
Caroline Schroeder (Pacific),
Amir Zeldes (GU)
Linguistics Speaker Series 9/14/18, 3:30 in Poulton 230
Coptic Scriptorium: A Linked Digital Environment for Coptic Studies
The Coptic language represents the last phase of the Ancient Egyptian phylum of the Afro-Asiatic language family, forming part of the longest continuously documented human language on Earth. Despite its high value for historical, comparative and typological linguistics, as well as its cultural importance as the heritage language of Copts in Egypt and in the diaspora, digital resources for the study of Coptic have only recently become available.
Coptic Scriptorium is an interdisciplinary Digital Humanities project dedicated to providing language documentation resources for Coptic, including linguistically and philologically annotated corpora, online lexical resources and automatic tools for processing the language. In this talk, we will present some of our work in building and using Coptic corpora to explore the world of first millennium Egypt and its language. We will feature the technologies and infrastructures that can enhance research in under-resourced languages as well as the benefits of collaboration across institutions and humanities disciplines, including some examples of our research using the digital research environment. The talk will discuss some of the challenges and solutions we have found and are working on for creating tools such as an online dictionary for Coptic, formulating annotation guidelines, building the first syntactically annotated datasets for the language and linking information about entities such as people and places to work in other projects using Linked Open Data standards.
Omri Abend (HUJI)
Linguistics Speaker Series 9/21/18, 3:30 in Poulton 230
UCCA: A computational approach to cross-linguistic semantic representation
Analyzing large volumes of text in various languages and mapping them into a representation that reflects content, rather than form, is one of the great challenges NLP must face. The talk will present UCCA (Universal Conceptual Cognitive Annotation), an approach to structural semantic representation that emphasizes cross-linguistic applicability and accessibility to non-expert annotators, which represents a step towards meeting this challenge. The first part of the talk will introduce the UCCA scheme and show how it provides a typologically motivated characterization of abstract semantic structure. The second part will discuss a transition-based approach to UCCA parsing, and experiments on leveraging annotated data in other formalisms (such as AMR and SDP) to improve UCCA parsing through multi-task learning. The talk will conclude with an overview of how UCCA is being applied to text-to-text generation tasks, such as machine translation and text simplification, and their evaluation.
All of UCCA's resources are freely available at http://www.cs.huji.ac.il/~oabend/ucca.html.
Joint work with Leshem Choshen, Daniel Hershcovich, Ari Rappoport and Elior Sulem.
Omri is a faculty member in the Hebrew University's departments of Computer Science and Cognitive Science. Previously he was a post-doc in Mark Steedman's lab at the University of Edinburgh. Omri earned his PhD from the Hebrew University of Jerusalem, where he was supervised by Ari Rappoport. Before that, he studied mathematics and cognitive sciences at the Hebrew University. Omri's research focuses on semantics, including semantic representation and parsing, as well as corpus annotation and evaluation.
Daniel Khashabi (UPenn)
GUCL 9/28/18, 2:00 in St. Mary’s 326
Reasoning-Driven Question Answering
Most current methods for Question Answering (QA) treat it as yet another machine learning problem. I argue that this leads to only shallow understanding of text, which is reflected in poor generalization. In this talk, I will show how techniques involving Knowledge Representation, Reasoning, and Logic can be combined with machine learning to improve QA systems. First, I will introduce a formulation for abductive reasoning in natural language and show its effectiveness in two different domains. Next, I will show a few recent theoretical results on limitations of multi-step reasoning. Finally, I will introduce MultiRC, a reading comprehension playground where questions can only be answered based on information from multiple sentences.
Daniel Khashabi is a PhD candidate with Prof. Dan Roth at the University of Pennsylvania. His interests lie at the intersection of computational intelligence and natural language processing. Daniel obtained his B.Sc. from Amirkabir University of Technology (Tehran Polytechnic) in 2012, and spent a few years as a graduate researcher at the University of Illinois, Urbana-Champaign, before moving to UPenn in 2017.
Özlem Uzuner (GMU)
CS Colloquium 10/5/18, 1:30 in St. Mary’s 326
De-identification of Clinical Narratives
De-identification is the task of finding and removing patient-identifying private health information from electronic health records. This is a sensitive and complicated task: failure to remove all private health information from records results in violations of privacy, whereas overzealous de-identification can remove medically salient text from the records and render the data unusable for future research. Ambiguities between private health information and medical concepts can exacerbate this problem. Misspellings and foreign names render dictionaries inadequate. The short, telegraphic style of clinical narratives and the density of jargon therein limit the transfer of methods from the open domain.
In this talk, we will present two natural language processing approaches to de-identification. First, we will define and use "local context". We will show that local context goes a long way towards effective de-identification, even when the concepts are presented in ungrammatical and fragmented narrative text. Next, we will take a deep learning approach to learning long-distance contextual features for de-identification. We will show that featureless recurrent neural networks (RNNs) can adequately capture long-distance contextual information, giving comparable de-identification performance to methods with manually designed features. Addition of manually-designed features to featureless RNNs gives them the boost they need in order to outperform the state of the art.
Dr. Özlem Uzuner is an associate professor at the Information Sciences and Technology Department of George Mason University. She also holds a visiting associate professor position at Harvard Medical School and is a research affiliate at the Computer Science and Artificial Intelligence Laboratory of MIT. Dr. Uzuner specializes in Natural Language Processing and its applications to real-world problems, including healthcare and policy. Her current research interests include information extraction from fragmented and ungrammatical narratives for capturing meaning, studies of consumer-generated text such as social media and electronic petitions, and semantic representation development for phenotype prediction, fraud detection, and topic modeling. Her research has been funded by the National Institutes of Health, the National Library of Medicine, the National Institute of Mental Health, the Office of the National Coordinator, and by industry.
Tal Linzen (JHU)
Linguistics Speaker Series 10/19/18, 3:30 in Poulton 230
On the Syntactic Abilities of Recurrent Neural Networks
Recent technological advances have made it possible to train recurrent neural networks (RNNs) on a much larger scale than before. These networks have proved effective in applications such as machine translation and speech recognition. These engineering advances are surprising from a cognitive point of view: RNNs do not have the kind of explicit structural representations that are typically thought to be necessary for syntactic processing. In this talk, I will discuss studies that go beyond standard engineering benchmarks and examine the syntactic capabilities of contemporary RNNs using established cognitive and linguistic diagnostics. These studies show that RNNs are able to compute agreement relations with considerable success across languages, although their error rate increases in complex sentences. A comparison of the detailed pattern of agreement errors made by RNNs to those made by humans in a behavioral experiment reveals some similarities (attraction errors, number asymmetry) but also some differences (relative clause modifiers increase the probability of attraction errors in RNNs but decrease it in humans). Overall, RNNs can learn to exhibit sophisticated syntactic behavior despite the lack of an explicit hierarchical bias, but their behavior differs from that of humans in important ways.
Tal Linzen is an Assistant Professor of Cognitive Science at Johns Hopkins University. Before moving to Johns Hopkins in 2017, he was a postdoctoral researcher at the École Normale Supérieure in Paris, where he worked with Emmanuel Dupoux and Benjamin Spector; before that he obtained his PhD from the Department of Linguistics at New York University in 2015, under the supervision of Alec Marantz. Dr. Linzen directs the Computational Psycholinguistics Lab, which develops computational models of human language comprehension and acquisition, as well as methods for interpreting and evaluating neural network models for natural language processing.
Tzuf Argaman-Paz (Open U Israel)
GUCL 10/26/18, 2:00 in St. Mary’s 326
Learning to Navigate in Real Urban Environments Using Natural Language Directions
According to the Universal Postal Union (UPU), the majority of people in many developing countries do not have a set address. With very few alternatives, they often rely on natural language (NL) description of the path to their house. E.g., "Turn right after bar and it will be the first house after the school". What if we could automate the interpretation of such directions, allowing robots or autonomous vehicles to automatically navigate based on free NL descriptions of such routes?
The task of following NL navigation instructions requires composition of language and domain knowledge, and it raises several challenges including: grounding language to physical objects; resolving ambiguity; avoiding cascading errors; and many more. The main datasets collected for the NL navigation task so far (SAIL and HCRC) present a very simplistic, unrealistic depiction of the world, with a small fixed set of entities that are known to the navigator in advance. Such representations bypass the great complexity of navigation based on real urban maps, where an abundance of previously ungrounded and unseen entities are observed at test time.
In this work, we redefine the task of NL navigation by endorsing the complexity and challenge presented by real urban environments and we present OpenStreetSet - a novel data-set with large real urban maps and richer information layers than ever before. We present an effective baseline architecture for the NL navigation task, which augments a standard encoder-decoder model with an entity abstraction layer, attention over words and worlds, and a constantly updating world-state. Our experiments show that this architecture is indeed better-equipped to treat the grounding challenge in realistic urban settings than standard sequence-to-sequence architectures.
Tzuf Argaman-Paz is a graduate student and a research assistant at the ONLP lab headed by Dr. Reut Tsarfaty at the Open University of Israel. Tzuf's research is focused on grounded and executable semantic parsing. Formerly, Tzuf was a data analyst, product owner, programmer and algorithm developer in the Israeli government.
Binyam Ephrem Siyoum (Addis Ababa)
GUCL 11/16/18, 2:00 in Poulton 230
Resource Building for Parsing Amharic Text: Morphological-rich language with less resource
In recent years, a range of language processing applications have come to demand state-of-the-art parsers. Specifically, high-quality parsers are required for applications like question answering, machine translation and information summarization. To train or develop an efficient parser, it has become a trend to create a high-quality treebank, a linguistically annotated corpus that includes morphological and syntactic annotations. Treebanks not only play a role in promoting research in parsing natural languages but also contribute to linguistic theory and corpus-based language analysis. Furthermore, treebanks are important resources for building and testing data-driven tools for the respective language, serving as a gold standard. Such resources have been developed for highly resourced languages. However, building a resource for a morphologically rich and less-resourced language is difficult. In this presentation I will talk about the general tasks involved in the development of a treebank for an Amharic parser. I will explain the problems at each stage and the solutions we proposed.
Binyam Ephrem obtained an M.Phil. in Linguistics from the Norwegian University of Science and Technology (NTNU) and is expected to complete his Ph.D. in Language Technology at Addis Ababa University. He has over ten years of professional experience in teaching and research in higher education at Addis Ababa University, Ethiopia. Mr. Seyoum has been involved in international as well as local projects related to language technology for Ethiopian languages. He has presented academic papers, including recently at the LREC 2018 and COLING 2018 conferences. He has led projects composed of dynamic team members, which boosted his enthusiasm for diligence and hard work in communicating with various IT and language experts. Mr. Binyam is skilled in Natural Language Processing, Text Processing, and Machine Learning.
Alek Kołcz (Pushd)
CS Colloquium 11/20/18, 11:00 in St. Mary’s 326
Effective deep near-take detection
Deep learning models produce image representations that excel both at semantic and perceptual similarity. Networks trained on the classic ImageNet benchmark are commonly fine-tuned over task-specific data and tend to produce highly accurate solutions for a variety of problems, ranging from image classification to content-based search. It has been observed, however, that the deepest architectures such as ResNet, while superior at semantic classification, are not as effective at providing perceptually sensitive representations when compared to older architectures, such as AlexNet or VGG. This has implications for complex solutions for which both of these modalities are important. In this work we investigate the problem of improving the perceptual sensitivity of deeper architectures with application to near-take detection, where the objective is to identify clusters of images with strong visual similarity. We show that with an appropriate modification of the learning objective, deeper architectures such as ResNet substantially improve in their perceptual representations, while maintaining good semantic representations.
Alek Kołcz is currently the Chief Scientist of Pushd Inc., where he works on a variety of Machine Learning and Computer Vision problems. His past research includes email/social-media spam detection, document/query classification and clustering, user modeling and personalization at Twitter, Microsoft, AOL and University of Colorado. He holds a PhD in EE from the University of Manchester (formerly UMIST).
Liz Merkhofer (MITRE)
Analytics 11/27/18, 12:30 in Car Barn 203
Commonsense Reasoning without Commonsense Knowledge
Reading comprehension tasks measure the ability to answer questions that require inference from a free text story. This talk explores the two machine learning approaches included in MITRE’s submission to a shared task, SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge. The first system is a logistic regression over simple features like question length and lexicalized patterns. The second is a recurrent neural network with attention, which uses a pretrained semantic space and learns to align words of the story, passage, and answers. The resulting ensemble system answers reading comprehension questions with 82.27% accuracy and ranked second in the task. This strong performance, despite limited use of external knowledge sources or explicit reasoning, raises questions about “commonsense knowledge” in this task.
Liz Merkhofer is a lead computational linguist at the MITRE Corporation, where she focuses on neural models/deep learning. Her work especially focuses on representation and transfer learning, for applications ranging from monitoring mental health-related messages in social media and judicial adjudication support to textual similarity and content-based recommendation. She holds a BA in Philosophy and Spanish from the University of Arizona and an MA in Linguistics from Georgetown University.
Rachel Rudinger (JHU)
GUCL 1/11/19, 1:30 in St. Mary’s 326
Decompositional Event Semantics
In natural language understanding, traditional approaches to mapping observed linguistic forms (e.g., a sentence) to symbolic representations of meaning typically require the construction of an underlying ontology of abstract semantic categories. Such ontologies can be challenging to establish, expensive to annotate training data for, and may still suffer from issues of label ambiguity, incomplete coverage, or sparsity. The Decompositional Semantics Initiative is a collection of efforts to provide light-weight alternatives to ontology-backed semantic representations of text; these decompositional representations instead consist of layers of independent, rapidly-annotated properties (entailments) based on common-sense questions that can be posed to crowdsource workers. In this talk, I will present my work on two particular efforts under this initiative: Semantic Proto-Role Labeling and Event Factuality Prediction. This work involves each stage of the decompositional semantics life-cycle: from the formulation of semantic targets, to large-scale data collection, and finally to the development of state-of-the-art, linguistically-informed predictive models. I will also discuss recent follow-up work on analytically probing the capabilities of these models, as well as a newly-developed cross-lingual application.
Rachel Rudinger is a senior Ph.D. student at Johns Hopkins University in Computer Science, affiliated with the Center for Language and Speech Processing. Her work focuses on problems in natural language understanding, including event factuality prediction, semantic (proto-)role labeling, and acquisition of common-sense knowledge from text. During her Ph.D., Rachel interned with the Allen Institute for Artificial Intelligence and was a visiting student at Saarland University. She is the author of many peer-reviewed papers in top Computational Linguistics conferences and journals, and has been interviewed about her work on the popular NLP Highlights podcast. Rachel is a recipient of the NSF Graduate Research Fellowship and a member of the MIT EECS Rising Stars cohort of 2018.
Elad Yom-Tov (Microsoft Research)
CS Colloquium 3/12/19, 2:00 in St. Mary’s 326
Screening for cancer using a learning internet advertising system
Studies have shown that the traces people leave when browsing the internet may indicate the onset of diseases such as cancer. Specifically, queries to search engines were found to be indicative of future cancer diagnosis for several types of cancer.
In my talk I will discuss two studies showing that the adaptive engines of advertising systems working in conjunction with clinically verified questionnaires can identify people who are suspected of having one of three types of solid tumor cancers. First, a classifier trained to predict suspected cancer inferred from questionnaire response using past queries reached an Area Under the Curve of 0.64. Second, using a conversion optimization mechanism, both Bing and Google advertisement systems learned to identify people who were likely to have symptoms consistent with suspected cancer, such that after a training period of approximately 10 days, 11% of people it selected for showing of targeted campaign ads were found to have suspected cancer. People who received information that their symptoms were consistent with suspected cancer increased their searches for healthcare utilization and maintained it for longer than people whose symptoms were not associated with suspected cancer, indicating that the questionnaires provided useful information to people who completed them.
These results demonstrate the utility of using search engine queries to screen for possible cancer and the application of modern advertising systems to identify people who are likely suffering from serious medical conditions.
Elad Yom-Tov is a Principal Researcher at Microsoft Research. Before joining Microsoft he was with Yahoo Research, IBM Research, and Rafael. His primary research interests are in applying large-scale Machine Learning and Information Retrieval methods to medicine. Dr. Yom-Tov studied at Tel-Aviv University and the Technion, Israel. He has published four books, over 100 papers (of which 3 were awarded prizes), and was awarded more than 20 patents. His latest book is “Crowdsourced Health: How What You Do on the Internet Will Improve Medicine” (MIT Press, 2016).
Naomi Feldman (UMD)
Linguistics Speaker Series 3/15/19, 3:30 in Poulton 230
Modeling early phonetic learning from spontaneous speech
Most models of language acquisition have used idealized data. Phonetic category learning models, in particular, have been trained on inputs whose "acoustics" are artificially constructed to follow Gaussian distributions and abstract away from the time-varying nature of the speech signal. Furthermore, the acoustic dimensions used in phonetic learning models are typically hand-selected to be linguistically relevant.
This talk describes research that builds toward a theory of phonetic learning from naturalistic speech. The first half of the talk considers ways in which context might constrain phonetic category learning from speech that is highly variable, and finds that context can still be very helpful for learning from spontaneous speech -- but only when it is used in very specific ways. The second half of the talk raises the question of whether we should be thinking in terms of early phonetic category learning at all.
This is joint work with Kasia Hitczenko, Thomas Schatz, Stephanie Antetomaso, Emmanuel Dupoux, Micha Elsner, Sharon Goldwater, Reiko Mazuka, and Kouki Miyazawa.
Naomi Feldman is an associate professor in the Department of Linguistics and the Institute for Advanced Computer Studies at the University of Maryland, and a member of the Computational Linguistics and Information Processing Lab. She received her PhD in Cognitive Science from Brown University in 2011. She uses methods from machine learning to create formal models of how people learn and represent the structure of their language, and has been developing methods that take advantage of naturalistic speech corpora to study how listeners encode information from their linguistic environment.
Tatsunori Hashimoto (Stanford)
CS Colloquium 3/18/19, 11:00 in St. Mary’s 326
Beyond the average case: machine learning for atypical examples
Although machine learning systems have improved dramatically over the last decade, it has been widely observed that even the best systems fail on atypical examples. For example, prediction models such as image classifiers have low accuracy on images from minority cultures, and generative models such as dialogue systems are often incapable of generating diverse, atypical responses. In this talk, I will discuss two domains where high performance on typical examples is insufficient.
The first is learning prediction models that perform well on minority groups, such as non-native English speakers using a speech recognition system. We demonstrate that models with low average loss can still assign high losses to minority groups, and this gap can amplify over time as minority users that suffer high losses stop using the model. We develop an approach using distributionally robust optimization that learns models that perform well over all groups and mitigate the feedback loop.
The second domain is learning natural language generation (NLG) systems, such as a dialogue system. It has been frequently observed that existing NLG systems which produce high-quality samples rely heavily on typical responses such as "I don't know" and fail to generate the full diversity of atypical but valid human responses. We carefully quantify this problem through a new evaluation metric based on the optimal classification error between human- and model-generated text and propose a new, edit-based generative model of text whose outputs are both diverse and high-quality.
Tatsunori (Tatsu) Hashimoto is a 3rd year post-doc in the Statistics and Computer Science departments at Stanford, supervised by Professors Percy Liang and John Duchi. He holds a Ph.D. from MIT, where he studied random walks and computational biology under Professors Tommi Jaakkola and David Gifford, and a B.S. from Harvard in Statistics and Math. His work has been recognized in NeurIPS 2018 (Oral), ICML 2018 (Best paper runner-up), and NeurIPS 2014 Workshop on Networks (Best student paper).
Narges Razavian (NYU)
CS Colloquium 4/5/19, 11:00 in St. Mary’s 326
Machine Learning in Medicine: Disease Prediction and Biomarker Discovery
Machine learning has seen great progress in the past decade. In parallel, Electronic Health Records (EHR) systems are accumulating clinical and medical data at unprecedented scales. The intersection of the two phenomena has enabled a multitude of machine-learning-assisted medical care applications, with potential to impact and improve healthcare research and delivery for millions of individuals. In this talk, we first briefly review the data landscape of healthcare, including modalities and quantities of data available to various machine learning tasks, and discuss the implications of this data on different research areas. We will then focus on a number of recent projects at my research lab at NYU Langone Medical Center on the topics of biomarker discovery and disease classification. Our discussion includes classification of lung cancer genomic mutation and subtype using histopathology images; deep learning on clinical notes for disease prediction; and biomarker discovery using EHR time series.
Narges Razavian is an assistant professor at NYU Langone Medical Center, with a joint appointment in the departments of Radiology and Population Health. Before that, she was a postdoc at the NYU Courant CILVR lab, as a member of David Sontag's team. Her lab is currently focusing on machine learning and deep learning applications in healthcare, and is actively working on medical notes, EHR time series, and medical imaging.
Jason Eisner (JHU)
CS Colloquium DISTINGUISHED SPEAKER 4/11/19, 2:00 in St. Mary’s 326
Recovering Syntactic Structure from Surface Features
We show how to predict the basic word-order facts of a novel language, and even obtain approximate syntactic parses, given only a corpus of part-of-speech (POS) sequences. We are motivated by the longstanding challenge of determining the structure of a language from its superficial features. While this is usually regarded as an unsupervised learning problem, there are good reasons that generic unsupervised learners are not up to the challenge. We do much better with a supervised approach where we train a system – a kind of language acquisition device – to predict how linguists will annotate a language. Our system uses a neural network to extract predictions from a large collection of numerical measurements. We train it on a mixture of real treebanks and synthetic treebanks obtained by systematically permuting the real trees, which we can motivate as sampling from an approximate prior over possible human languages.
Jason Eisner is Professor of Computer Science at Johns Hopkins University. He is a Fellow of the Association for Computational Linguistics (and an action editor of its TACL journal). At Johns Hopkins, he is also affiliated with the Center for Language and Speech Processing, the Machine Learning Group, the Cognitive Science Department, and the national Center of Excellence in Human Language Technology. His goal is to develop the probabilistic modeling, inference, and learning techniques needed for a unified model of all kinds of linguistic structure. His 125+ papers have presented various algorithms for parsing, machine translation, and weighted finite-state machines; formalizations, algorithms, theorems, and empirical results in computational phonology; and unsupervised or semi-supervised learning methods for syntax, morphology, and word-sense disambiguation. He is also the lead designer of Dyna, a new declarative programming language that provides an infrastructure for AI research. He has received two school-wide awards for excellence in teaching.
Daniel Hershcovich (Copenhagen)
GUCL 6/11/19, 11:00 in St. Mary’s 326
Universal Semantic Parsing with Neural Networks
Natural language understanding requires the ability to comprehend text, reason about it, and act upon it intelligently. While simplistic frameworks such as end-to-end sequence-to-sequence architectures or even bag-of-words models can go a long way, symbolic meaning representation is inevitably needed for some applications, and may provide an invaluable inductive bias for others. We construct such graphical meaning representations from text, with a focus on a particular semantic representation scheme called Universal Conceptual Cognitive Annotation (UCCA), whose main design principles are support for all linguistic semantic phenomena, ease of annotation, cross-linguistic applicability and stability, and a modular architecture of different layers of meaning. We develop a general directed acyclic parser supporting the graph structures UCCA exhibits. Subsequently, we apply the parser to three other representation schemes in three languages, demonstrating its flexibility and universality. We show that training the parser in a multitask setting on all of these schemes improves its UCCA parsing accuracy, by effectively learning generalizations across the different representations. Finally, in an empirical comparison of the content of semantic and syntactic representations, we discover several aspects of divergence. These have a profound impact on the potential contribution of syntax to semantic parsing, and on the usefulness of each of the approaches for semantic tasks in natural language processing.
Daniel is a postdoctoral researcher at the CoAStaL research group at the University of Copenhagen, working on semantic representations and semantic parsing. He completed his Ph.D. at the Hebrew University of Jerusalem with Ari Rappoport and Omri Abend, and his B.Sc. in Mathematics and Computer Science at the Open University of Israel. During 2008-2019, Daniel was a software engineer at IBM Research, where he was a member of Project Debater.
Linguistics Ph.D. dissertation defense 9/5/19, 2:30 in Poulton 230
A Multifactorial, Multitask Approach to Automated Speaker Profiling
Automated Speaker Profiling (ASP) refers broadly to the computational prediction of speaker traits based on cues mined from the speech signal. Accurate prediction of such traits can have a wide variety of applications such as automating the collection of customer metadata, improving smart-speaker/voice-assistant interactions, narrowing down suspect pools in forensic situations, etc.
Approaches to ASP to date have primarily focused on single-task computational models—i.e. models which each predict one speaker trait in isolation. Recent work however has suggested that using a multi-task learning framework, in which a system learns to predict multiple related traits simultaneously, each trait-prediction task having access to the training signals of all other trait-prediction tasks, can increase classification accuracy along all trait axes considered.
Likewise, most work on ASP to date has focused primarily on acoustic cues as predictive features for speaker profiling. However, there is a wide range of evidence from the sociolinguistic literature that lexical and phonological cues may also be of use in predicting social characteristics of a given speaker. Recent work in the field of author profiling has also demonstrated the utility of lexical features in predicting social information about authors of textual data, though few studies have investigated whether this carries over to spoken data.
In this dissertation I focus on prediction of five different social traits: sex, ethnicity, age, region, and education. Linguistic features from the acoustic, phonetic, and lexical realms are extracted from 60-second chunks of speech taken from the 2008 NIST SRE corpus and used to train several types of predictive models. Naive (majority class prediction) and informed (single-task neural network) models are trained to provide baseline predictions against which multi-task neural network models are evaluated. Feature importance experiments are performed in order to investigate which features and feature types are most useful for predicting which social traits.
Results presented in chapters 5-7 of this dissertation demonstrate that multi-task models consistently outperform single-task models, that models are most accurate when provided information from all three linguistic levels considered, and that lexical features as a group contribute substantially more predictive power than either phonetic or acoustic features.
Vivek Srikumar (Utah)
Linguistics Speaker Series & CS Colloquium 9/20/19, 3:30 in Poulton 230
Natural Language Processing in Therapy: What Is and What Can Be
Natural language processing (NLP) can potentially transform a broad array of applications where human interactions are largely mediated by conversation. For example, automatically analyzing dialogue in counseling sessions can not only help understand and guide behavior, but also improve patient outcomes.
In this talk, I will first present how language technology can guide the course of therapy sessions. I will describe recent work on automated helpers that can provide realtime feedback and guidance to therapists, and also alert them to potentially important forthcoming cues from a patient.
Following this introduction, I will focus on some of the technical challenges that need to be addressed before the potential of NLP can be realized. Specifically, I will focus on the question of how we can mitigate the need for massive annotated datasets using knowledge stated in the form of logic. To this end, I will present a new learning framework that can exploit domain knowledge in logic to improve the accuracy and self-consistency of neural models and demonstrate promising results on several tasks that call for reasoning about language.
Vivek Srikumar is an assistant professor in the School of Computing at the University of Utah. His research lies in the areas of natural language processing and machine learning and has primarily been driven by questions arising from the need to reason about textual data with limited explicit supervision and to scale NLP to large problems. His work has been published in various AI, NLP and machine learning venues and received the best paper award at EMNLP 2014. His work has been supported by grants and awards from NSF and BSF and gifts from Google, Nvidia and Intel. He obtained his Ph.D. from the University of Illinois at Urbana-Champaign in 2013 and was a post-doctoral scholar at Stanford University.
Geoffrey K. Pullum (Edinburgh)
Linguistics Speaker Series 9/27/19, 3:30 in Poulton 230
The Humble Preposition and the Sins of Traditional Grammar
The general public has an insatiable appetite for books about English grammar and usage. But few of the people who buy such books appreciate how much they are being cheated. The available books, plagiarizing each other shamelessly, repeat inadequate analyses dating from the 18th century. Few topics illustrate this better than the treatment of prepositions. Traditional grammars undercount prepositions massively, distinguish prepositions from other categories irrationally, ignore cogent refutations spread across two centuries, and overlook most of the really interesting aspects of prepositions and the phrases they head. This talk presents an uncompromising critique of the state of the art, and an informal reanalysis distinct from both traditional and generative grammar.
Geoffrey K. Pullum is professor of general linguistics in the School of Philosophy, Psychology and Language Sciences at the University of Edinburgh. Previously (1981-2007) he was professor of linguistics at the University of California, Santa Cruz, and held the Gerard Visiting Professorship at Brown University in 2012-2013. He co-authored The Cambridge Grammar of the English Language with Rodney Huddleston in 2002, a book which won the 2004 Leonard Bloomfield Book Award from the Linguistic Society of America. He is a member of the Academia Europaea and an elected fellow of the British Academy, the Linguistic Society of America, and the American Academy of Arts and Sciences. He is well known for his satirical essays on linguistics (collected in The Great Eskimo Vocabulary Hoax, 1991), his entertaining Lingua Franca posts published by The Chronicle of Higher Education, and his fiery sermons preached against Strunk & White and other misguided self-ordained authorities on English grammar and usage. His latest book is Linguistics: Why It Matters (2018).
Kevin Duh (JHU)
CS Colloquium 10/18/19, 11:00 in St. Mary’s 326
Multi-objective Hyperparameter Optimization of Deep Neural Networks
Deep neural network models are full of hyperparameters. To obtain a good model, one must carefully experiment with hyperparameters such as the number of layers, the number of hidden nodes, the type of non-linearity, the learning rate, and the drop-out parameter, just to name a few. I will discuss general hyperparameter optimization algorithms, based on evolutionary strategies or Bayesian techniques, that can automate this laborious process. Further, I will argue for the necessity of a multi-objective approach: we desire models that are not only optimized for accuracy, but also respect practical computational constraints such as model size and run-time. We will present results for machine translation and language modeling.
This is joint work with Takahiro Shinozaki, Tomohiro Tanaka, and Takafumi Moriya at the Tokyo Institute of Technology, Shinji Watanabe and Xuan Zhang at JHU, and Michael Denkowski at Amazon.
Kevin Duh is a senior research scientist at the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLTCOE). He is also an assistant research professor in the Department of Computer Science and a member of the Center for Language and Speech Processing (CLSP). His research interests lie at the intersection of Natural Language Processing and Machine Learning, in particular on areas relating to machine translation, semantics, and deep learning. Previously, he was assistant professor at the Nara Institute of Science and Technology (2012-2015) and research associate at NTT CS Labs (2009-2012). He received his B.S. in 2003 from Rice University, and PhD in 2009 from the University of Washington, both in Electrical Engineering.
Vicente Ordóñez Román (UVA)
CS Colloquium 10/25/19, 11:00 in St. Mary’s 326
Building Compositional, Interpretable and Robust Visual Recognition through Language
The principle of compositionality of language states that complex meaning can be derived from simpler constituents: from a simple vocabulary, complex ideas can emerge. Models for visual recognition have seen remarkable success at finding isolated concepts but still struggle to deal with evolving vocabularies and compositionally derived concepts. In this talk I will argue that imbuing visual recognition models with compositionality borrowed from the language domain has the potential to address several shortcomings in current visual recognition models, including issues with fairness, robustness, interpretability, and generalization. I will present several recent works from the vision and language research group at the University of Virginia addressing some of these challenges.
Vicente Ordóñez is Assistant Professor in the Department of Computer Science at the University of Virginia, and Visiting Professor at Adobe Research. His research interests lie at the intersection of computer vision, natural language processing and machine learning. He is a recipient of a Best Long Paper Award at the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2017 and the Best Paper – IEEE Marr Prize - at the International Conference on Computer Vision (ICCV) 2013. He has also received an IBM Faculty Award and a Google Faculty Research Award. Vicente obtained a PhD in Computer Science at the University of North Carolina at Chapel Hill in 2015, and spent a year as Visiting Research Fellow at the Allen Institute for Artificial Intelligence (AI2).
Yangfeng Ji (UVA)
CS Colloquium 11/8/19, 11:00 in St. Mary’s 414
Enhancing Human-AI Collaboration in Language Technology
The goal of human-AI collaboration is to take advantage of the complementary nature of humans and AI systems, such that they can work together to achieve better application goals. In language technology, the questions of enhancing human-machine collaboration include (1) how can we build AI systems that will help humans acquire language skills effectively, and (2) how can we build explainable AI systems that can be trusted by human users in real-world NLP applications? In this talk, I present some recent work as examples to answer these two questions. The first part of this talk will demonstrate that recent progress on text generation can help enhance human writing skills via a collaborative writing system. Then, in the second part, I will show our recent work on interpreting predictions from neural text classifiers with human-understandable explanations.
Yangfeng Ji is the William Wulf Assistant Professor in the Department of Computer Science at the University of Virginia, where he leads the Natural Language Processing group. His research interests include building machine learning models for text understanding and generation. His work on entity-driven story generation won an Outstanding Paper Award at NAACL 2018. Yangfeng received his Ph.D. degree from the School of Interactive Computing at Georgia Institute of Technology in 2016 and was a Postdoctoral Researcher in the Paul G. Allen School of Computer Science & Engineering at the University of Washington from 2016 to 2018.
David Bamman (UC Berkeley)
Linguistics Speaker Series & CS Colloquium 11/22/19, 3:30 in Poulton 230
The Data-Driven Analysis of Literature
Literary novels push the limits of natural language processing. While much work in NLP has been heavily optimized toward the narrow domains of news and Wikipedia, literary novels are an entirely different animal—the long, complex sentences in novels strain the limits of syntactic parsers with super-linear computational complexity, their use of figurative language challenges representations of meaning based on neo-Davidsonian semantics, and their long length (ca. 100,000 words on average) rules out existing solutions for problems like coreference resolution that expect a small set of candidate antecedents.
At the same time, fiction drives computational research questions that are uniquely interesting to that domain. In this talk, I'll outline some of the opportunities that NLP presents for research in the quantitative analysis of culture—including measuring the disparity in attention given to characters as a function of their gender over two hundred years of literary history (Underwood et al. 2018)—and describe our progress to date on two problems essential to a more complex representation of plot: recognizing the entities in literary texts, such as the characters, locations, and spaces of interest (Bamman et al. 2019) and identifying the events that are depicted as having transpired (Sims et al. 2019). Both efforts involve the creation of a new dataset of 200,000 words evenly drawn from 100 different English-language literary texts and building computational models to automatically identify each phenomenon.
This is joint work with Matt Sims, Ted Underwood, Sabrina Lee, Jerry Park, Sejal Popat and Sheng Shen.
David Bamman is an assistant professor in the School of Information at UC Berkeley, where he works on applying natural language processing and machine learning to empirical questions in the humanities and social sciences. His research often involves adding linguistic structure (e.g., syntax, semantics, coreference) to statistical models of text, and focuses on improving NLP for a variety of languages and domains (such as literary text and social media). Before Berkeley, he received his PhD in the School of Computer Science at Carnegie Mellon University (LTI).
Adam Poliak (JHU)
GUCL 1/10/20, 1:00 in St. Mary’s 326
Sentence-level Semantic Inference: From Diverse Phenomena to Applications
Many NLP tasks involve understanding meaning at the sentence level. In order to analyze models developed for these tasks, we should decompose sentence-level semantic understanding into a diverse array of smaller, more-focused, fine-grained types of reasoning. This will help improve our understanding of the sentence-level reasoning capabilities of our NLP systems. In this talk, we will focus on Natural Language Inference (NLI), the task of determining if one sentence (hypothesis) can likely be inferred from another (context/premise). NLI has traditionally been used to evaluate how well different models understand language and the relationship between texts. We investigate whether 10 recent NLI datasets require models to reason about both texts, or if the datasets contain biases or statistical irregularities that allow a model to correctly label a context-hypothesis pair by only looking at a hypothesis. In the most popular dataset that we consider, a hypothesis-only model outperforms the majority baseline by over 2x. We will also discuss our recently released dataset, the Diverse NLI Collection (DNC), that can be used to shed light on a model’s ability to capture or understand a diverse array of semantic phenomena that are important to Natural Language Understanding. We will demonstrate how a variant of the DNC has been used to evaluate whether a Neural Machine Translation encoder captures semantic phenomena related to translation. With the remaining time, we will discuss how lessons from these studies can be applied to real-world use cases of sentence-level semantic inference. This talk is based on work that has appeared at NAACL, ACL, StarSem, and EMNLP.
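As a rough illustration of the comparison between a hypothesis-only model and a majority baseline, the following Python sketch trains a toy word-cue classifier that never reads the premise. Everything in it is an assumption made for illustration: the sentences and labels are invented, and the word-voting classifier is a stand-in, not the model or datasets used in the work described above.

```python
from collections import Counter, defaultdict

# Invented (premise, hypothesis, label) triples; not drawn from any real NLI dataset.
TRAIN = [
    ("A man plays guitar.", "Nobody is playing music.", "contradiction"),
    ("A dog runs in a park.", "The dog is not moving.", "contradiction"),
    ("A woman reads a book.", "No one is reading.", "contradiction"),
    ("Kids play soccer in a field.", "Some people are outside.", "entailment"),
    ("A girl rides a bike down a street.", "A person is outside.", "entailment"),
    ("A man plays guitar on stage.", "The man is a famous musician.", "neutral"),
    ("A woman reads a book.", "The woman is reading for a competition.", "neutral"),
]
TEST = [
    ("A boy jumps into a pool.", "Nobody is jumping.", "contradiction"),
    ("A cat sleeps on a couch.", "The cat is not asleep.", "contradiction"),
    ("Two dogs chase a ball in a yard.", "Some animals are outside.", "entailment"),
    ("A man eats a sandwich.", "A person is eating.", "entailment"),
    ("A girl sings into a microphone.", "The girl is a famous singer.", "neutral"),
    ("A boy lifts weights.", "The boy is training for a competition.", "neutral"),
]


def tokenize(text):
    """Lowercase, drop the final period, and split on whitespace."""
    return text.lower().strip(".").split()


def majority_label(examples):
    """Most frequent training label: the majority baseline's only prediction."""
    return Counter(label for _, _, label in examples).most_common(1)[0][0]


def train_hypothesis_only(examples):
    """Count how often each hypothesis word co-occurs with each label.

    The premise is deliberately ignored."""
    word_label_counts = defaultdict(Counter)
    for _premise, hypothesis, label in examples:
        for word in set(tokenize(hypothesis)):
            word_label_counts[word][label] += 1
    return word_label_counts


def predict_hypothesis_only(model, hypothesis, fallback):
    """Each known word votes with its label distribution from training."""
    scores = Counter()
    for word in set(tokenize(hypothesis)):
        counts = model.get(word)
        if counts:
            total = sum(counts.values())
            for label, count in counts.items():
                scores[label] += count / total
    if not scores:
        return fallback
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(scores), key=lambda label: scores[label])


def accuracy(predictions, examples):
    correct = sum(p == gold for p, (_, _, gold) in zip(predictions, examples))
    return correct / len(examples)


fallback = majority_label(TRAIN)
model = train_hypothesis_only(TRAIN)
majority_acc = accuracy([fallback] * len(TEST), TEST)
hyp_only_acc = accuracy(
    [predict_hypothesis_only(model, h, fallback) for _, h, _ in TEST], TEST
)
print(f"majority baseline: {majority_acc:.2f}  hypothesis-only: {hyp_only_acc:.2f}")
```

On this toy data the hypothesis words alone (for example "nobody" or "not") give away the label, which is the kind of dataset artifact such a diagnostic is meant to expose.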
Adam Poliak is currently a 4th-year Computer Science Ph.D. candidate at Johns Hopkins University's Center for Language & Speech Processing. Adam graduated with a Bachelor of Arts in Computer Science from Johns Hopkins University in 2016 and is advised by Dr. Benjamin Van Durme. Adam spent time as a research intern at Lincoln Labs and Bloomberg L.P.
Lisa Singh (GU)
CS Faculty Seminar 1/17/20, 10:00 in St. Mary’s 326
The Promise of Big Data: Blending Noisy Organic Signals with Traditional Movement Variables to Predict Forced Migration in Iraq
Worldwide displacement due to war and conflict is at an all-time high. Unfortunately, determining if, when, and where people will move is a complex problem. This talk will describe a multi-university project that develops methods for blending variables constructed from publicly available organic data (social media and newspapers) with more traditional indicators of forced migration to better understand when and where people will move. I will demonstrate our approach using a case study involving displacement in Iraq, and show that incorporating open-source generated conversation and event variables maintains or improves predictive accuracy over traditional variables alone. I will conclude with a discussion on strengths and limitations of leveraging organic big data for societal-scale problems.
Lisa Singh is a Professor in the Department of Computer Science at Georgetown University. Broadly, her research interests are in data-centric computing – data mining, data privacy, data science, data visualization, and databases. She has authored/co-authored over 50 peer-reviewed publications and book chapters. Her research has been supported by the National Science Foundation, the Office of Naval Research, the Social Sciences and Humanities Research Council, and the Department of Defense. With ISIM, LLNL, York University and other NGOs she works on understanding different ways to use big data to better understand movement patterns of forced migration. Some of her other interdisciplinary projects include studying privacy on the web (adversarial inference), dolphin social structures with the Shark Bay Dolphin Research project (graph inference and social mining using incomplete and uncertain data), and learning from open source big data for social science research related to public opinion, election dynamics, and child behavior. Dr. Singh's research related to the 2016 election has been cited by the Huffington Post, CNN, and the Hill. Dr. Singh has also helped organize three workshops involving future directions of big data research, has served on numerous organizing and program committees, and is currently involved in different organizations working on increasing participation of women in computing and integrating computational thinking into K-12 curricula. She received her B.S.E. from Duke University and her M.S. and Ph.D. from Northwestern University.
Aylin Caliskan (GWU)
CS Colloquium 1/24/20, 1:30 in St. Mary’s 326
Algorithmic measures of language mirror human biases
Following progress in computing and machine learning and the emergence of big data, artificial intelligence has become a reality. Reliance on machine learning can be seen in diverse areas ranging from cancer detection and job candidate selection to recidivism prediction. Machine learning can detect patterns in large swaths of data and suggest solutions to hard problems in ways that would elude the best human experts. However, machine learning is not the panacea it was expected to be. For example, Google Translate converts the gender-neutral Turkish sentences “O bir doktor. O bir hemşire” to the English sentences “He is a doctor. She is a nurse.” What is the reasoning behind making these biased decisions? What types of biased outcomes should we expect to observe in language-based AI technologies? This talk introduces a method, the Word Embedding Association Test (WEAT), an adaptation of the Implicit Association Test (IAT) to machine learning. WEAT operates by analyzing language models that are trained on billions of sentences collected from the internet to probe whether measures of individuals’ implicit attitudes and beliefs can be corroborated by measures of association in language as it resides in large datasets.
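To make the idea of measuring association in embeddings concrete, here is a minimal Python sketch of a WEAT-style effect size: each target word is scored by how much closer it lies (in cosine similarity) to one attribute set than to the other, and the two target sets are then compared. The two-dimensional vectors are invented for illustration rather than taken from a trained embedding model, and the use of the population standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much more similar w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association between the two target sets,
    normalized by the spread of associations over all target words."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / pstdev(s_x + s_y)


# Hypothetical toy vectors: target set X sits near attribute set A,
# and target set Y sits near attribute set B.
targets_x = [(1.0, 0.1), (0.9, 0.2)]
targets_y = [(0.1, 1.0), (0.2, 0.9)]
attrs_a = [(1.0, 0.0), (0.8, 0.1)]
attrs_b = [(0.0, 1.0), (0.1, 0.8)]

effect = weat_effect_size(targets_x, targets_y, attrs_a, attrs_b)
print(f"effect size: {effect:.3f}")
```

A positive value indicates that X is more strongly associated with A (and Y with B); swapping the two target sets flips the sign.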
Aylin Caliskan is an Assistant Professor of Computer Science at George Washington University. Her research interests include the emerging science of bias in artificial intelligence, fairness in machine learning, and privacy. Her work aims to characterize and quantify aspects of natural and artificial intelligence using a multitude of machine learning, language processing, and computer vision techniques. In her recent publication in Science, she demonstrated how semantics derived from language corpora contain human-like biases. Prior to that, she developed novel privacy attacks to de-anonymize programmers using code stylometry. Her presentations on both de-anonymization and bias in machine learning have received best talk awards. Her work on semi-automated anonymization of writing style furthermore received the Privacy Enhancing Technologies Symposium Best Paper Award. Her research has received extensive press coverage across the globe. Aylin holds a PhD in Computer Science from Drexel University and a Master of Science in Robotics from the University of Pennsylvania. Before joining the faculty at George Washington University, she was a Postdoctoral Researcher and a Fellow at Princeton University's Center for Information Technology Policy.
Noah Smith (UW)
CS Colloquium 2/28/20, 2:00 in St. Mary’s 326
Language and Context
I'll start this talk by summarizing the latest breakthrough in natural language processing: contextual word representations, giving a historical perspective and some intuitive explanation for their impact on NLP. I'll then turn to a discussion of some key challenges that remain—despite great excitement and the emergence of a new paradigm for tackling NLP problems, our systems are still missing some fundamental aspects of linguistic intelligence, resulting in inconsistent performance. I'll argue that the notion of "context" must be radically expanded if NLP is to become broadly useful and reliable. I'll present two recent projects: a method that adapts NLP models to the specific communicative context of their inputs (often called "domain"; examples include biomedical articles and web reviews),* and a new conceptual formalism and dataset to make NLP systems more "aware" of social context.**
* Collaborators: Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, and Doug Downey (AI2)
** Collaborators: Maarten Sap, Saadia Gabriel, Lianhui Qin (UW); Dan Jurafsky (Stanford); and Yejin Choi (UW/AI2)
Noah Smith is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, as well as a Senior Research Manager at the Allen Institute for Artificial Intelligence. Previously, he was an Associate Professor of Language Technologies and Machine Learning in the School of Computer Science at Carnegie Mellon University. He received his Ph.D. in Computer Science from Johns Hopkins University in 2006 and his B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. His research interests include statistical natural language processing, machine learning, and applications of natural language processing, especially to the social sciences. His book, Linguistic Structure Prediction, covers many of these topics. He has served on the editorial boards of the journals Computational Linguistics (2009–2011), Journal of Artificial Intelligence Research (2011–present), and Transactions of the Association for Computational Linguistics (2012–present), as the secretary-treasurer of SIGDAT (2012–2015 and 2018–present), and as program co-chair of ACL 2016. Alumni of his research group, Noah's ARK, are international leaders in NLP in academia and industry; in 2017 UW's Sounding Board team won the inaugural Amazon Alexa Prize. Smith's work has been recognized with a UW Innovation award (2016–2018), a Finmeccanica career development chair at CMU (2011–2014), an NSF CAREER award (2011–2016), a Hertz Foundation graduate fellowship (2001–2006), numerous best paper nominations and awards, and coverage by NPR, BBC, CBC, New York Times, Washington Post, and Time.
Matt Gardner (AI2)
CS Colloquium 3/4/20, 12:30 in St. Mary’s 326
NLP Evaluations That We Believe In
With all of the modeling advancements in recent years, NLP benchmarks have been falling over left and right: "human performance" has been reached on SQuAD 1 and 2, GLUE and SuperGLUE, and many commonsense datasets. Yet no serious researcher actually believes that these systems understand language, or even really solve the underlying tasks behind these datasets. To get benchmarks that we actually believe in, we need to both think more deeply about the language phenomena that our benchmarks are targeting, and make our evaluation sets more rigorous. I will first present ORB, an Open Reading Benchmark that collects many reading comprehension datasets that we (and others) have recently built, targeting various aspects of what it means to read. I will then present contrast sets, a way of creating non-iid test sets that more thoroughly evaluate a model's abilities on some task, decoupling training data artifacts from test labels.
Matt Gardner is a senior research scientist at the Allen Institute for AI on the AllenNLP team. His research focuses primarily on getting computers to read and answer questions, dealing both with open domain reading comprehension and with understanding question semantics in terms of some formal grounding (semantic parsing). He is particularly interested in cases where these two problems intersect, doing some kind of reasoning over open domain text. He is the original architect of the AllenNLP toolkit, and he co-hosts the NLP Highlights podcast with Waleed Ammar and Pradeep Dasigi.
Ella Rabinovich (Toronto)
Linguistics Speaker Series 3/16/20, 12:30 in Poulton 230 (postponed)
Aspects of Advanced Proficiency in Second Language Acquisition
In this talk I will present a study on two topics related to acquisition of English as a second language in advanced non-native speakers: semantic infelicities in second language stemming from the atypical nature of English indefinite pronouns, and native language cognate effects on second language lexical choice. First, drawing on studies in semantic typology, I will lay out empirical evidence for theoretical hypotheses on the nature of challenges posed by indefinite pronouns to English learners with varying degrees of proficiency. I will then suggest and evaluate an automatic approach for detection of infelicitous usage patterns, demonstrating encouraging initial results obtained with deep learning architectures on this task involving nuanced semantic anomalies. Second, I will show that utilizing a large corpus of highly competent non-native speakers and a set of carefully selected lexical items, it is possible to reveal that the lexical choices of non-natives are affected by cognates in their native language. This effect is sufficiently powerful to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages.
Ella Rabinovich is a postdoctoral fellow in the Computer Science department at the University of Toronto, Canada. She is a member of the Language, Cognition and Computation group, studying various aspects of the language production of multilingual speakers. Prior to moving to Toronto, she worked at IBM Research Labs on Project Debater, a flagship IBM research project developing a computational argumentation system. Ella completed her Ph.D. at the University of Haifa, Israel.
Nicola Tonellotto (Pisa)
CS Colloquium 4/24/20, 11:00 via Zoom
Using an Inverted Index Synopsis for Query Latency and Performance Prediction
Predicting the latency of a query on a search engine has important benefits: for instance, it allows the search engine to adjust its configuration to address long-running queries without unnecessarily sacrificing its effectiveness. However, for the dynamic pruning techniques that underlie many search engines, achieving accurate predictions of query latencies is difficult. In this talk I will discuss how index synopses, which are stochastic samples of the full index, can be used to obtain accurate timing predictions. Experiments using the TREC ClueWeb09 collection and a large set of user queries show that a small random sample suffices to estimate properties of the larger index very accurately, including the sizes of posting list unions and intersections. I will also show that index synopses facilitate two use cases: (i) accurately predicting query latencies on the full index and classifying long-running queries; and (ii) estimating the effectiveness of queries more accurately with a synopsis-based post-retrieval predictor than with a pre-retrieval predictor. This work is partially supported by the Italian Ministry of Education and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence).
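The idea of estimating index properties from a sample can be sketched in a few lines of Python. The sketch below assumes, purely for illustration, that a synopsis is built by keeping each document independently with a fixed probability and that sizes are scaled back up by that probability; the abstract does not specify the sampling scheme, and the toy index, term names, and sampling rate here are invented.

```python
import random

NUM_DOCS = 100_000
SAMPLE_RATE = 0.1  # assumed fraction of documents kept in the synopsis

# Toy inverted index: term -> set of document ids (a stand-in for posting lists).
full_index = {
    "alpha": {d for d in range(NUM_DOCS) if d % 3 == 0},
    "beta": {d for d in range(NUM_DOCS) if d % 5 == 0},
}

# Build the synopsis by keeping each document with probability SAMPLE_RATE.
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_docs = {d for d in range(NUM_DOCS) if rng.random() < SAMPLE_RATE}
synopsis = {term: postings & sampled_docs for term, postings in full_index.items()}


def estimate_size(sampled_postings, rate):
    """Scale a size measured on the synopsis up to the full index."""
    return len(sampled_postings) / rate


true_intersection = len(full_index["alpha"] & full_index["beta"])
true_union = len(full_index["alpha"] | full_index["beta"])
est_intersection = estimate_size(synopsis["alpha"] & synopsis["beta"], SAMPLE_RATE)
est_union = estimate_size(synopsis["alpha"] | synopsis["beta"], SAMPLE_RATE)

print(f"intersection: true={true_intersection}, estimated={est_intersection:.0f}")
print(f"union:        true={true_union}, estimated={est_union:.0f}")
```

The set operations run on roughly a tenth of the postings, which is what makes a synopsis attractive as a cheap input for predictors of full-index behavior.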
Dr. Nicola Tonellotto (male) has been an assistant professor at the Information Engineering Department of the University of Pisa since 2019. From 2002 to 2019 he was a researcher at the Information Science and Technologies Institute “A. Faedo” of the National Research Council of Italy. His main research interests include Cloud Computing, Web Search, Information Retrieval and Deep Learning. He has co-authored more than 60 papers on these topics in peer-reviewed international journals and conferences. He has participated in and coordinated activities in several European projects such as CoreGRID, NextGRID, GRIDComp, S-Cube, MIDAS, and BigDataGrapes. He is co-recipient of the ACM SIGIR 2015 Best Paper Award. He has taught or teaches BSc, MSc and PhD courses on Cloud computing, distributed enabling platforms and information retrieval.