GUCL: Computational Linguistics @ Georgetown
We are a group of Georgetown University faculty, student, and staff researchers at the intersection of language and computation. Our areas of expertise include natural language processing, corpus linguistics, information retrieval, text mining, and more. Members belong to the Linguistics and/or Computer Science departments.
GUCL holds monthly group meetings about research and maintains a mailing list for its members. (Contact Nathan Schneider to subscribe.) This website will also promote courses, talks, and other events on campus that relate to topics in computational linguistics.
NACLO First Round: 1/26/17
- Joel Tetreault (Grammarly): CS 1/27/17
- Kenneth Heafield (Edinburgh): CS 2/2/17
- Margaret Mitchell (Google Research): CS 2/16/17
- Glen Coppersmith (Qntfy & JHU): CS 2/24/17
- Jacob Eisenstein (GA Tech): Linguistics 4/21/17
Ph.D. student, Linguistics
CL & NLP, computational semantics, mathematical approaches to linguistics
Ph.D. student, CS
NLP, information retrieval, medical text processing
Ph.D. student, Spanish Linguistics
formal semantics, syntax/semantics interface, NLP
Professor, CS & Georgetown Medical Center
information retrieval, text mining, bioinformatics, wireless networks
Assistant Professor, Nursing & Human Science
CL & NLP, speech recognition & signal processing
Clinical Professor, CS
information retrieval, text mining, biomedical/health informatics
Undergraduate, CS major/Linguistics minor
text mining, data science, NLP
Ph.D. student, CS
information retrieval, session search, dynamic search
Ph.D. student, Linguistics
neural networks, computational & formal semantics, theory of computation
Ph.D. student, CS
information retrieval, NLP
Ph.D. student, Linguistics
CL & NLP
Ph.D. student, Linguistics
NLP, big data extraction, phonetics/phonology, speech recognition & signal processing
formal semantics, syntax/semantics interface, grammar formalisms, semantic representation, cognitive science
Assistant Professor, Linguistics & CS
CL & NLP, especially semantic representation, annotation, & analysis
Ph.D. student, Linguistics
CL & NLP, narrative, coreference, critical discourse analysis, modality, using NLP for real-world problems
Ph.D. student, Linguistics
CL & NLP, computational sociolinguistics
Associate Professor, CS
social graph mining, text mining, data science, visual analytics
Ph.D. student, CS
medical information retrieval
Ph.D. student, Linguistics
corpus linguistics, Bayesian stats, NLP, formal semantics
Undergraduate, CS major/Linguistics minor
information retrieval, dynamic search, signal processing
Ph.D. student, CS
unsupervised learning, NLP
Grace Hui Yang
Assistant Professor, CS
information retrieval, text mining, NLP, machine learning, privacy
Assistant Professor, Linguistics
corpus building, search and visualization; coreference resolution; digital humanities
Ph.D. student, Linguistics
CL & NLP, coreference resolution, speech prosody, time-series data mining, music information retrieval
For those wishing to study computational linguistics or related topics, the following graduate programs at Georgetown may be of interest:
- M.S., Ph.D. in Computational Linguistics
- Offered by the Linguistics department. Research in the department includes corpus linguistics & corpus annotation, digital humanities, natural language processing, and speech processing.
- M.S. in Computer Science, Ph.D. in Computer Science
- One of the main research areas in the department is Information Systems, including machine learning, data mining, information retrieval, natural language processing, and bio- and health informatics.
- M.S. in Analytics with a concentration in Data Science
Nazli Goharian Upperclass Undergraduate & Graduate
Information retrieval is the identification of textual components, be they web pages, blogs, microblogs, documents, medical transcriptions, mobile data, or other big data elements, relevant to the needs of the user. Relevancy is determined either as a global absolute or within a given context or viewpoint. The course teaches the practical yet theoretically grounded foundational and advanced algorithms needed to identify such relevant components.
The course covers information retrieval techniques and theory, addressing both the effectiveness and the run-time performance of information retrieval systems. The focus is on algorithms and heuristics used to find textual components relevant to the user request and to find them fast. The course covers the architecture and components of search engines, such as the parser, index builder, and query processor. In doing this, various retrieval models, relevance ranking, evaluation methodologies, and efficiency considerations will be covered. The students learn the material by building a prototype of such a search engine. These approaches are in daily use by all search and social media companies.
Nathan Schneider Graduate
Systems of communication that come naturally to humans are thoroughly unnatural for computers. For truly robust information technologies, we need to teach computers to unpack our language. Natural language processing (NLP) technologies facilitate semi-intelligent artificial processing of human language text. In particular, techniques for analyzing the grammar and meaning of words and sentences can be used as components within applications such as web search, question answering, and machine translation.
This course introduces fundamental NLP concepts and algorithms, emphasizing the marriage of linguistic corpus resources with statistical and machine learning methods. As such, the course combines elements of linguistics, computer science, and data science. Coursework will consist of lectures, programming assignments (in Python), and a final team project. The course is intended for students who are already comfortable with programming and have some familiarity with probability theory.
Amir Zeldes Upperclass Undergraduate & Graduate
Digital linguistic corpora, i.e. electronic collections of written, spoken or multimodal language data, have become an increasingly important source of empirical information for theoretical and applied linguistics in recent years. This course is meant as a theoretically founded, practical introduction to corpus work with a broad selection of data, including non-standardized varieties such as language on the Internet, learner corpora and historical corpora. We will discuss issues of corpus design, annotation and evaluation using quantitative methods and both manual and automatic annotation tools for different levels of linguistic analysis, from parts-of-speech, through syntax to discourse annotation. Students in this course participate in building the corpus described here: https://corpling.uis.georgetown.edu/gum/
Corey Miller Upperclass Undergraduate & Graduate
This course will survey speech processing technology from a computational linguistic perspective. Speech processing technology is a component of human language technology that focuses on the processing of audio data. The audio data can be either the input or output of speech processing. When speech serves as the input, the technology is known as speech recognition or speech-to-text (STT). When speech serves as the output, the technology is known as speech synthesis or text-to-speech (TTS). Additional technologies to be examined include spoken language identification (SLID), speaker verification and identification, and speech diarization, which is the parsing of audio data into individual speaker segments.
Particular attention will be paid to the linguistic components of speech technology. Phonetics and phonology play an important role in both TTS and STT. In addition, morphology, syntax and pragmatics are important both in authentic modeling of TTS and in constraining possible STT output. Semantics plays a role in the interpretation of STT output, which can feed into text-based natural language processing (NLP).
The algorithms underlying contemporary speech technology approaches will be discussed. Despite the focus on the linguistic aspects of the technology, it is important for students to have sufficient understanding of the algorithms used in order to grasp both where linguistics fits in and the possible constraints on its incorporation into larger systems.
The course will examine freely available TTS and STT packages so that students can build their own engines and experiment with the construction of the components. For assignments and projects, students will be encouraged to pick a language or dialect of their choice in order to build a synthesizer or recognizer for that variety. It would be most interesting to focus on languages or varieties that do not generally receive attention in commercial applications, such as African American or accented varieties of English.
Students from a variety of backgrounds are encouraged to take this course. Helpful background includes: natural language processing, phonetics, phonology and sociolinguistics. While not required, helpful technical background includes familiarity with speech analysis software such as PRAAT, Linux, shell scripting and coding/scripting in languages like Python, Java, C++, etc.
Amir Zeldes Graduate
Recent years have seen an explosion of computational work on higher level discourse representations, such as entity recognition, mention and coreference resolution and shallow discourse parsing. At the same time, the theoretical status of the underlying categories is not well understood, and despite progress, these tasks remain very much unsolved in practice. This graduate level seminar will concentrate on theoretical and practical models representing how referring expressions, such as mentions of people, things and events, are coded during language processing. We will begin by exploring the literature on human discourse processing in terms of information structure, discourse coherence and theories about anaphora, such as Centering Theory and Alternative Semantics. We will then look at computational linguistics implementations of systems for entity recognition and coreference resolution and explore their relationship with linguistic theory. Over the course of the semester, participants will implement their own coding project exploring some phenomenon within the domain of entity recognition, coreference, discourse modeling or a related area.
COSC-285 | Introduction to Data Mining
Nazli Goharian Upperclass Undergraduate
This course covers concepts and techniques in the field of data mining. This includes both supervised and unsupervised algorithms, such as naive Bayes, neural networks, decision trees, rule-based classifiers, distance-based learners, clustering, and association rule mining. Various issues in the pre-processing of the data are addressed. Text classification, social media mining, and recommender systems will be addressed. The students learn the material by building various data mining models, using various data pre-processing techniques, performing experiments, and providing analysis of the results.
COSC-545 | Theory of Computation
Calvin Newport Graduate
Topics covered are drawn from the following: finite automata, formal languages, machine models for formal languages, computability and recursion theory, computational complexity, and mathematical logic applied to computer science.
COSC-550 | Information Theory
Bala Kalyanasundaram Graduate
This course introduces a beautiful mathematical theory that captures the essence of the information content of a process, computation or communication. It will explore the connection between this theory and various fundamental topics in Computer Science such as coding theory, communication complexity, and description complexity. Subject to time limitations, applications in combinatorics, graph theory, lower bounds, data compression, data communication and coding will be covered. A major goal of the course is to understand information-theoretic techniques and intuition that play a prominent role in various parts of science.
COSC-575 | Machine Learning
Mark Maloof Graduate
This course surveys the major research areas of machine learning, concentrating on inductive learning. The course will also compare and contrast machine learning with related endeavors, such as statistical learning, pattern classification, data mining, and information retrieval. Topics will include rule induction, decision trees, Bayesian methods, density estimation, linear classifiers, neural networks, instance-based approaches, genetic algorithms, evaluation, and applications. In addition to programming projects and homework, students will complete a semester project.
COSC-589 | Web Search and Sense-making
Grace Hui Yang Graduate
The Web provides abundant information which allows us to live more conveniently and make quicker decisions. At the same time, the growth of the Web and the improvements in data creation, collection, and use have led to a tremendous increase in the amount and complexity of the data that a search engine needs to handle. The increase in the magnitude and complexity of the data has become a major driver for new data analytics algorithms and technologies that are scalable, highly interactive, and able to handle complex and dynamic information seeking tasks in the big data era. How to effectively and efficiently search for the documents relevant to our information needs and how to extract valuable information and make sense of “big data” are the subjects of this course.
The course will cover Web search theory and techniques, including basic probabilistic theory, representations of documents and information needs, various retrieval models, link analysis, classification and recommender systems. The course will also cover programming models that allow us to easily distribute computations across large computer clusters. In particular, we will teach Apache Spark, an open-source cluster computing framework that has quickly become the state of the art for big data programming. The course features small step-by-step weekly or bi-weekly assignments that together compose a large big data project, such as building Google’s PageRank on the entire Wikipedia. Students will gain knowledge of Spark, Scala, Web search engines, and Web recommender systems, with a focus on search engine design and “thinking at scale”.
COSC-592 | Health Search and Mining
Nazli Goharian Graduate
This course will be a combination of lectures and students' presentations. After providing a review of information retrieval and data mining, the lectures will cover health text processing on scientific literature, clinical notes, and social media, among others. Students will present and discuss research literature. This includes a review of current literature on a specific topic, as well as experimental results and evaluation of a proposed approach. Students are expected to have knowledge of data structures.
COSC/LING-672 | Advanced Semantic Representation
Nathan Schneider Graduate
Natural language is an imperfect vehicle for meaning. On the one hand, some expressions can be interpreted in multiple ways; on the other hand, there are often many superficially divergent ways to express very similar meanings. Semantic representations attempt to disentangle these two effects by exposing similarities and differences in how a word or sentence is interpreted. Such representations, and algorithms for working with them, constitute a major research area in natural language processing.
This course will examine semantic representations for natural language from a computational/NLP perspective. Through readings, presentations, discussions, and hands-on exercises, we will put a semantic representation under the microscope to assess its strengths and weaknesses. For each representation we will confront questions such as: What aspects of meaning are and are not captured? How well does the representation scale to the large vocabulary of a language? What assumptions does it make about grammar? How language-specific is it? In what ways does it facilitate manual annotation and automatic analysis? What datasets and algorithms have been developed for the representation? What has it been used for? In Spring 2017 the focus will be on the Abstract Meaning Representation (http://amr.isi.edu/); its relationship to other representations in the literature will also be considered. Term projects will consist of (i) innovating on the representation's design, datasets, or analysis algorithms, or (ii) applying it to questions in linguistics or downstream NLP tasks.
LING-264 | Multilingual and Parallel Corpora
Amir Zeldes Upperclass Undergraduate
Parallel and multilingual corpora are collections of natural language data in several languages, constructed using principled design criteria, which contain either aligned translations of the same texts, or distinct but comparable texts. As such, they are vital resources for comparative linguistics, translation studies, multilingual lexicography and machine translation. This course sets out to explore the theoretical problems raised by translated language, as well as practical issues and empirical patterns found in multilingual data. This includes questions such as exploring similarities and differences between closely related languages, such as different Romance or Slavic languages, or very distant ones, such as English and Japanese. The focus of the course is on the study of actual examples of parallel and comparable corpora using computational methods. Students will have access to search engines indexing aligned translations and will learn some of the basics of building a parallel corpus. The course introduces some fundamental computational methodology, but does not have a programming requirement. However, familiarity with at least one language other than English is required to complete coursework.
LING-469 | Analyzing language data with R
Amir Zeldes Upperclass Undergraduate & Graduate
This course will teach statistical analysis of language data with a focus on corpus materials, using the freely available statistics software 'R'. The course will begin with foundational notions and methods for statistical evaluation, hypothesis testing and visualization of linguistic data which are necessary for both the practice and the understanding of current quantitative research. As we progress we will learn exploratory methods to chart out meaningful structures in language data, such as agglomerative clustering, principal component analysis and multifactorial regression analysis. The course assumes basic mathematical skills and familiarity with linguistic methodology, but does not require a background in statistics or R.
Hal Daumé (UMD)
CS Colloquium 10/14/16, 11:00 in St. Mary’s 326
Learning Language through Interaction
Machine learning-based natural language processing systems are amazingly effective, when plentiful labeled training data exists for the task/domain of interest. Unfortunately, for broad coverage (both in task and domain) language understanding, we're unlikely to ever have sufficient labeled data, and systems must find some other way to learn. I'll describe a novel algorithm for learning from interactions, and several problems of interest, most notably machine simultaneous interpretation (translation while someone is still speaking).
This is all joint work with some amazing (former) students He He, Alvin Grissom II, John Morgan, Mohit Iyyer, Sudha Rao and Leonardo Claudino, as well as colleagues Jordan Boyd-Graber, Kai-Wei Chang, John Langford, Akshay Krishnamurthy, Alekh Agarwal, Stéphane Ross, Alina Beygelzimer and Paul Mineiro.
Hal Daumé III is an associate professor in Computer Science at the University of Maryland, College Park. He holds joint appointments in UMIACS and Linguistics. He was previously an assistant professor in the School of Computing at the University of Utah. His primary research interest is in developing new learning algorithms for prototypical problems that arise in the context of language processing and artificial intelligence. This includes topics like structured prediction, domain adaptation and unsupervised learning; as well as multilingual modeling and affect analysis. He associates himself most with conferences like ACL, ICML, NIPS and EMNLP. He earned his PhD at the University of Southern California with a thesis on structured prediction for language (his advisor was Daniel Marcu). He spent the summer of 2003 working with Eric Brill in the machine learning and applied statistics group at Microsoft Research. Prior to that, he studied math (mostly logic) at Carnegie Mellon University. He still likes math and doesn't like to use C (instead he uses O'Caml or Haskell).
Yulia Tsvetkov (CMU/Stanford)
Linguistics Speaker Series 11/11/16, 3:30 in Poulton 230
On the Synergy of Linguistics and Machine Learning in Natural Language Processing
One way to provide deeper insight into data and to build more powerful, robust models is bridging between linguistic knowledge and statistical learning. I’ll present model-based approaches that incorporate linguistic knowledge in novel ways.
First, I’ll show how linguistic knowledge comes to the rescue in processing languages which lack large data resources. I’ll describe a new approach to cross-lingual knowledge transfer that models the historical process of lexical borrowing between languages, and I will show how its predictions can be used to improve statistical machine translation systems.
In the second part of my talk, I’ll argue that linguistic insight helps improve learning also in resource-rich conditions. I’ll present three methods to integrate linguistic knowledge in training data, neural network architectures, and into evaluation of word representations. The first method uses features quantifying linguistic coherence, prototypicality, simplicity, and diversity to find a better curriculum for learning distributed representations of words. Distributed representations of words capture which words have similar meanings and occur in similar contexts. With improved word representations, we improve part-of-speech tagging, parsing, named entity recognition, and sentiment analysis. The second describes polyglot language models, neural network architectures trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted. Finally, the third is an intrinsic evaluation measure of the quality of distributed representations of words. It is based on correlations of learned vectors with features extracted from manually crafted lexical resources. This computationally inexpensive method obtains strong correlation with performance of the vectors in a battery of downstream semantic and syntactic evaluation tasks. I’ll conclude with future research questions.
Yulia Tsvetkov is a postdoc in the Stanford NLP Group, where she works on computational social science with professor Dan Jurafsky. During her PhD in the Language Technologies Institute at Carnegie Mellon University, she worked on advancing machine learning techniques to tackle cross-lingual and cross-domain problems in natural language processing, focusing on computational phonology and morphology, distributional and lexical semantics, and statistical machine translation of both text and speech. In 2017, Yulia will join the Language Technologies Institute at CMU as an assistant professor.
Marine Carpuat (UMD)
Linguistics Speaker Series 11/18/16, 3:30 in Poulton 230
Toward Natural Language Inference Across Languages
Natural language processing tasks as diverse as automatically extracting information from text, answering questions, and translating or summarizing documents all require the ability to compare and contrast the meaning of words and sentences. State-of-the-art techniques rely on dense vector representations which capture the distributional properties of words in large amounts of text in a single language. We seek to improve these representations to capture not only similarity in meaning between words or sentences, but also inference relations such as entailment and contradiction, and enable comparisons not only within, but also across languages.
In this talk, we will present novel approaches to inducing word representations from multilingual text corpora. First, we will show that translations in e.g. Chinese can be used as distant supervision to induce English word representations that can be composed into better representations of English sentences (Elgohary and Carpuat, ACL 2016). Then we will show how sparsity constraints can further improve word representations, and enable the detection of not only semantic similarity (do "cure" and "remedy" have the same meaning?), but also entailment (does "antidote" entail "cure"?) between words in different languages (Vyas and Carpuat, NAACL 2016).
Marine Carpuat is an Assistant Professor in Computer Science at the University of Maryland, with a joint appointment at UMIACS. Her research interests are in natural language processing, with a focus on multilinguality. Marine was previously a Research Scientist at the National Research Council of Canada, and a postdoctoral researcher at the Columbia University Center for Computational Learning Systems. She received a PhD in Computer Science from the Hong Kong University of Science & Technology (HKUST) in 2008. She also earned a MPhil in Electrical Engineering from HKUST and an engineering degree from the French Grande Ecole Supélec.
Shomir Wilson (UC)
CS Colloquium 11/21/16, 11:00 in St. Mary’s 326
Text Analysis to Support the Privacy of Internet Users
Shomir Wilson is an Assistant Professor of Computer Science in the Department of Electrical Engineering and Computing Systems at the University of Cincinnati. His professional interests span pure and applied research in natural language processing, privacy, and artificial intelligence. Previously he held postdoctoral and lecturer positions in Carnegie Mellon University's School of Computer Science, and he spent a year as an NSF International Research Fellow in the University of Edinburgh's School of Informatics. He received his Ph.D. in Computer Science from the University of Maryland in 2011.
Mark Dredze (JHU)
CS Colloquium 11/29/16, 11:00 in St. Mary’s 326
Topic Models for Identifying Public Health Trends
Twitter and other social media sites contain a wealth of information about populations and have been used to track sentiment towards products, measure political attitudes, and study social linguistics. In this talk, we investigate the potential for Twitter and social media to impact public health research. Broadly, we explore a range of applications for which social media may hold relevant data. To uncover these trends, we develop new topic models that can reveal trends and patterns of interest to public health from vast quantities of data.
Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He is also affiliated with the Center for Language and Speech Processing and the Center for Population Health Information Technology. His research in natural language processing and machine learning has focused on graphical models, semi-supervised learning, information extraction, large-scale learning, and speech processing. He focuses on public health informatics applications, including information extraction from social media, biomedical and clinical texts. He obtained his PhD from the University of Pennsylvania in 2009.
Mona Diab (GW)
CS Colloquium 12/2/16, 2:30 in St. Mary’s 414
Processing Arabic Social Media: Challenges and Opportunities
We recently witnessed an exponential growth in Arabic social media usage. Processing such media is of great utility for all kinds of applications ranging from information extraction to social media analytics for political and commercial purposes to building decision support systems. Compared to other languages, Arabic, especially the informal variety, poses a significant challenge to natural language processing algorithms since it comprises multiple dialects, linguistic code switching, and a lack of standardized orthographies, on top of its relatively complex morphology. Inherently, the problem of processing Arabic in the context of social media is the problem of how to handle resource poor languages. In this talk I will go over some of our insights into some of these problems and show how there is a silver lining where we can generalize some of our solutions to other low resource language contexts.
Mona Diab is an Associate Professor in the Department of Computer Science, George Washington University (GW). She is the founder and Director of the GW NLP lab (CARE4Lang). Before joining GW, she was a Research Scientist (Principal Investigator) at the Center for Computational Learning Systems (CCLS), Columbia University in New York. She is also co-founder of the CADIM group with Nizar Habash and Owen Rambow, which is one of the leading places and reference points on computational processing of Arabic and its dialects. Her research interests span several areas in computational linguistics/natural language processing: computational lexical semantics, multilingual processing, social media processing, information extraction & text analytics, machine translation, and computational socio-pragmatics. She has a special interest in low resource language processing with a focus on Arabic dialects.
Joel Tetreault (Grammarly)
CS Colloquium 1/27/17, 11:00 in St. Mary’s 326
Analyzing Formality in Online Communication
Full natural language understanding requires comprehending not only the content or meaning of a piece of text or speech, but also the stylistic way in which it is conveyed. To enable real advancements in dialog systems, information extraction, and human-computer interaction, computers need to understand the entirety of what humans say, both the literal and the non-literal. This talk presents an in-depth investigation of one particular stylistic aspect, formality. First, we provide an analysis of humans' subjective perceptions of formality in four different genres of online communication. We highlight areas of high and low agreement and extract patterns that consistently differentiate formal from informal text. Next, we develop a statistical model for predicting formality at the sentence level, using rich NLP and deep learning features, and then evaluate the model's performance against human judgments across genres. Finally, we apply our model to analyze language use in online debate forums. Our results provide new evidence in support of theories of linguistic coordination, underlining the importance of formality for language generation systems.
This work was done with Ellie Pavlick (UPenn) during her summer internship at Yahoo Labs.
Joel Tetreault is Director of Research at Grammarly. His research focus is Natural Language Processing with specific interests in anaphora, dialogue and discourse processing, machine learning, and applying these techniques to the analysis of English language learning and automated essay scoring, among others. Currently he works on the research and development of NLP tools and components for the next generation of intelligent writing assistance systems. Prior to joining Grammarly, he was a Senior Research Scientist at Yahoo Labs, Senior Principal Manager of the Core Natural Language group at Nuance Communications, Inc., and worked at Educational Testing Service for six years as a managing research scientist where he researched automated methods for essay scoring, detecting grammatical errors by non-native speakers, plagiarism detection, and content scoring. Tetreault received his B.A. in Computer Science from Harvard University and his M.S. and Ph.D. in Computer Science from the University of Rochester. He was also a postdoctoral research scientist at the University of Pittsburgh's Learning Research and Development Center, where he worked on developing spoken dialogue tutoring systems. In addition, he has co-organized the Building Educational Applications workshop series for 8 years, has co-organized several shared tasks, and is currently NAACL Treasurer.
Kenneth Heafield (Edinburgh)
CS Colloquium 2/2/17, 11:00 in St. Mary’s 326
Machine Translation is Too Slow
We're trying to make machine translation output less terrible, but we're impatient. A neural translation system took two weeks to train in 1996 and two weeks to train in 2016 because the field used twenty years of computing advances to build bigger and better models subject to the same patience limit. I'll talk about multiple efforts to make things faster: coarse-to-fine search algorithms and sparse gradient updates to reduce network communication.
Kenneth Heafield is a Lecturer (~Assistant Professor) in computer science at the University of Edinburgh. Motivated by machine translation problems, he takes a systems-heavy approach to improving the quality and speed of neural systems. He is the creator of the widely used KenLM library for efficient language modeling.
Margaret Mitchell (Google Research)
CS Colloquium 2/16/17, 11:00 in St. Mary’s 326
Algorithmic Bias in Artificial Intelligence: The Seen and Unseen Factors Influencing Machine Perception of Images and Language
The success of machine learning has recently surged, with similar algorithmic approaches effectively solving a variety of human-defined tasks. Tasks testing how well machines can perceive images and communicate about them have exposed strong effects of different types of bias, such as selection bias and dataset bias. In this talk, I will unpack some of these biases, and how they affect machine perception today. I will introduce and detail the first computational model to leverage human Reporting Bias—what people mention—in order to learn ground-truth facts about the visual world.
I am a Senior Research Scientist in Google's Research & Machine Intelligence group, working on advancing artificial intelligence towards positive goals, as well as ethics in AI and demographic diversity of researchers. My research is on vision-language and grounded language generation, focusing on how to help computers communicate based on what they can process. My work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science. Before Google, I was a founding member of Microsoft Research's "Cognition" group, focused on advancing vision-language artificial intelligence. Before MSR, I was a postdoctoral researcher at the Johns Hopkins University Human Language Technology Center of Excellence, where I mainly focused on semantic role labeling and sentiment analysis using graphical models, working under Benjamin Van Durme. Before that, I was a postgraduate (PhD) student in the natural language generation (NLG) group at the University of Aberdeen, where I focused on how to naturally refer to visible, everyday objects. I primarily worked with Kees van Deemter and Ehud Reiter. I spent a good chunk of 2008 getting a Master's in Computational Linguistics at the University of Washington, studying under Emily Bender and Fei Xia. Simultaneously (2005–2012), I worked on and off at the Center for Spoken Language Understanding, part of OHSU, in Portland, Oregon. My title changed with time (research assistant/associate/visiting scholar), but throughout, I worked under Brian Roark on technology that leverages syntactic and phonetic characteristics to aid those with neurological disorders. I continue to balance my time between language generation, applications for clinical domains, and core AI research.
Glen Coppersmith (Qntfy & JHU)
CS Colloquium 2/24/17, 11:00 in St. Mary’s 326
Quantifying the White Space
Behavioral assessment and measurement today are typically invasive and human-intensive (for both patient and clinician). Moreover, by their nature, they focus on retrospective analysis by the patient (or the patient’s loved ones) about emotionally charged situations, a process that is rife with biases, not repeatable, and expensive. We examine all the data in the “white space” between interactions with the healthcare system (social media data, wearables, activities, nutrition, mood, etc.), and have shown that quantified signals relevant to mental health can be extracted from them. These methods to gather and analyze disparate data unobtrusively and in real time enable a range of new scientific questions, diagnostic capabilities, assessment of novel treatments, and quantified key performance measures for behavioral health. These techniques hold special promise for suicide risk, given the dearth of unbiased accounts of a person’s behavior leading up to a suicide attempt. We are beginning to see the promise of using these disparate data for a revolution in mental health.
Glen is the founder and CEO of Qntfy (pronounced “quantify”), a company devoted to scaling therapeutic impact by empowering mental health clinicians and patients with data science and technology. Qntfy brings a deep understanding of the underlying technology and an appreciation for the human processes these technologies need to fit into in order to make an impact. In addition to providing analytic and software solutions, Qntfy considers it a core mission to advance fundamental and applied research at the intersection of mental health and technology. Qntfy built the data donation site OurDataHelps.org to gather and curate the datasets needed to drive mental health research, working closely with the suicide prevention community. Qntfy was also the grand prize winner of the 2015 Foundry Cup, a design competition seeking innovative approaches to diagnosing and treating PTSD.
Prior to starting Qntfy, Glen was the first full-time research scientist at the Human Language Technology Center of Excellence at Johns Hopkins University, which he joined in 2008. His research has focused on the creation and application of statistical pattern recognition techniques to large disparate data sets for addressing challenges of national importance. Oftentimes, the data of interest was human language content and associated metadata. Glen has shown particular acumen for enabling inference tasks that bring together diverse and noisy data. His work spans principled exploratory data analysis, anomaly detection, graph theory, statistical inference, and visualization.
Glen earned his Bachelor's in Computer Science and Cognitive Psychology in 2003, a Master's in Psycholinguistics in 2005, and his Doctorate in Neuroscience in 2008, all from Northeastern University. As this suggests, his interests and knowledge are broad, from computer science and statistics to biology and psychology.