GUCL: Computational Linguistics @ Georgetown
We are a group of Georgetown University faculty, student, and staff researchers at the intersection of language and computation. Our areas of expertise include natural language processing, corpus linguistics, information retrieval, text mining, and more. Members belong to the Linguistics and/or Computer Science departments.
- 9/8/21: Congratulations to the Corpling lab on winning the DISRPT 2021 shared task on discourse processing!
- 8/27/20: First-Year Student Presented Paper at Prestigious Computational Linguistics Conference (Aryaman Arora)
- 9/10/18: #MeToo Movement on Twitter (Lisa Singh)
- 8/29/18: Cliches in baseball (Nathan Schneider)
- 1/20/18: The Coptic Scriptorium project (Amir Zeldes)
- Congratulations to Arman Cohan, Nazli Goharian, and Georgetown alum Andrew Yates for winning a Best Long Paper award at EMNLP 2017! The paper is entitled "Depression and Self-Harm Risk Assessment in Online Forums."
- Congratulations to Ophir Frieder, who has been named to the European Academy of Sciences and Arts (EASA)!
- 9/19/16: "Email" Dominates What Americans Have Heard About Clinton (Lisa Singh)
- 7/12/16: Searching Harsh Environments (Ophir Frieder)
Mailing list: Contact Nathan Schneider to subscribe!
- Aynat Rubinstein (HUJI): Linguistics, 11/5/21, 3:30 in Poulton 326/hybrid
- Ryan Lepic (Gallaudet): Linguistics, 11/12/21, 3:30 in Poulton 326/hybrid
- Emine Yilmaz (UCL): CS, 11/12/21, 11:00 via Zoom
- Chris Potts (Stanford): Linguistics, 12/3/21, 3:30 via Zoom
- Rediet Abebe (Berkeley): CS, 3/25/22, time TBA, via Zoom
- Laura Michaelis (CU Boulder): Linguistics, 4/29/22, 3:30 in Poulton 326/hybrid
Nazli Goharian Upperclass Undergraduate & Graduate
Information retrieval is the identification of textual components, be they web pages, blogs, microblogs, documents, medical transcriptions, mobile data, or other big data elements, relevant to the needs of the user. Relevancy is determined either as a global absolute or within a given context or viewpoint. The course teaches practical yet theoretically grounded foundational and advanced algorithms needed to identify such relevant components.
Information-retrieval techniques and theory are covered, with attention to both the effectiveness and the run-time performance of information-retrieval systems. The focus is on algorithms and heuristics used to find textual components relevant to the user request and to find them fast. The course covers the architecture and components of search engines, such as the parser, index builder, and query processor. In doing so, it covers various retrieval models, relevance ranking, evaluation methodologies, and efficiency considerations. Students learn the material by building a prototype of such a search engine. These approaches are in daily use at search and social media companies.
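To illustrate the three components named above (parser, index builder, query processor), here is a minimal sketch in Python. It is not taken from the course materials: the function names `tokenize`, `build_index`, and `search` and the toy documents are hypothetical, and it performs only Boolean AND matching with no relevance ranking.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Parser step: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index builder: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query processor: return sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Toy collection (hypothetical data, for illustration only).
docs = {
    1: "Information retrieval finds relevant documents.",
    2: "Search engines rank documents by relevance.",
    3: "Data mining finds patterns in data.",
}
index = build_index(docs)
print(search(index, "finds documents"))  # ids of documents containing both terms
```

A real search engine would add a ranking model on top of this lookup and would store postings in a compressed, disk-friendly form for speed.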
Nathan Schneider Graduate
Systems of communication that come naturally to humans are thoroughly unnatural for computers. For truly robust information technologies, we need to teach computers to unpack our language. Natural language processing (NLP) technologies facilitate semi-intelligent artificial processing of human language text. In particular, techniques for analyzing the grammar and meaning of words and sentences can be used as components within applications such as web search, question answering, and machine translation.
This course introduces fundamental NLP concepts and algorithms, emphasizing the marriage of linguistic corpus resources with statistical and machine learning methods. As such, the course combines elements of linguistics, computer science, and data science. Coursework will consist of lectures, programming assignments (in Python), and a final team project. The course is intended for students who are already comfortable with programming and have some familiarity with probability theory.
Amir Zeldes Upperclass Undergraduate & Graduate
Digital linguistic corpora, i.e., electronic collections of written, spoken or multimodal language data, have become an increasingly important source of empirical information for theoretical and applied linguistics in recent years. This course is meant as a theoretically grounded, practical introduction to corpus work with a broad selection of data, including non-standardized varieties such as language on the Internet, learner corpora and historical corpora. We will discuss issues of corpus design, annotation and evaluation using quantitative methods and both manual and automatic annotation tools for different levels of linguistic analysis, from parts of speech, through syntax, to discourse annotation. Students in this course participate in building the corpus described here: https://corpling.uis.georgetown.edu/gum/
Corey Miller Upperclass Undergraduate & Graduate
This course will survey speech processing technology from a computational linguistic perspective. Speech processing technology is a component of human language technology that focuses on the processing of audio data. The audio data can be either the input or output of speech processing. When speech serves as the input, the technology is known as speech recognition or speech-to-text (STT). When speech serves as the output, the technology is known as speech synthesis or text-to-speech (TTS). Additional technologies to be examined include spoken language identification (SLID), speaker verification and identification, and speech diarization, which is the parsing of audio data into individual speaker segments.
Particular attention will be paid to the linguistic components of speech technology. Phonetics and phonology play an important role in both TTS and STT. In addition, morphology, syntax and pragmatics are important both in authentic modeling of TTS and in constraining possible STT output. Semantics plays a role in the interpretation of STT output, which can feed into text-based natural language processing (NLP).
The algorithms underlying contemporary speech technology approaches will be discussed. Despite the focus on the linguistic aspects of the technology, it is important for students to have sufficient understanding of the algorithms used in order to grasp both where linguistics fits in and the possible constraints on its incorporation into larger systems.
The course will examine freely available TTS and STT packages so that students can build their own engines and experiment with the construction of the components. For assignments and projects, students will be encouraged to pick a language or dialect of their choice in order to build a synthesizer or recognizer for that variety. It would be most interesting to focus on languages or varieties that do not generally receive attention in commercial applications, such as African American or accented varieties of English.
Students from a variety of backgrounds are encouraged to take this course. Helpful background includes: natural language processing, phonetics, phonology and sociolinguistics. While not required, helpful technical background includes familiarity with speech analysis software such as PRAAT, Linux, shell scripting and coding/scripting in languages like Python, Java, C++, etc.
Amir Zeldes Graduate
Recent years have seen an explosion of computational work on higher level discourse representations, such as entity recognition, mention and coreference resolution and shallow discourse parsing. At the same time, the theoretical status of the underlying categories is not well understood, and despite progress, these tasks remain very much unsolved in practice. This graduate level seminar will concentrate on theoretical and practical models representing how referring expressions, such as mentions of people, things and events, are coded during language processing. We will begin by exploring the literature on human discourse processing in terms of information structure, discourse coherence and theories about anaphora, such as Centering Theory and Alternative Semantics. We will then look at computational linguistics implementations of systems for entity recognition and coreference resolution and explore their relationship with linguistic theory. Over the course of the semester, participants will implement their own coding project exploring some phenomenon within the domain of entity recognition, coreference, discourse modeling or a related area.
COSC-285 | Introduction to Data Mining
Nazli Goharian Upperclass Undergraduate
This course covers concepts and techniques in the field of data mining. This includes both supervised and unsupervised algorithms, such as naive Bayes, neural networks, decision trees, rule-based classifiers, distance-based learners, clustering, and association rule mining. Various issues in the pre-processing of the data are addressed. Text classification, social media mining, and recommender systems will also be addressed. Students learn the material by building various data mining models, using various data pre-processing techniques, performing experiments, and analyzing the results.
COSC-545 | Theory of Computation
Calvin Newport Graduate
Topics covered are drawn from the following: finite automata, formal languages, machine models for formal languages, computability and recursion theory, computational complexity, and mathematical logic applied to computer science.
COSC-550 | Information Theory
Bala Kalyanasundaram Graduate
This course introduces a beautiful mathematical theory that captures the essence of the information content of a process, computation or communication. It will explore the connection between this theory and various fundamental topics in Computer Science such as coding theory, communication complexity, and description complexity. Subject to time limitations, applications in combinatorics, graph theory, lower bounds, data compression, data communication and coding will be covered. A major goal of the course is to understand information-theoretic techniques and intuition that play a prominent role in various parts of science.
COSC-575 | Machine Learning
Mark Maloof Graduate
This course surveys the major research areas of machine learning, concentrating on inductive learning. The course will also compare and contrast machine learning with related endeavors, such as statistical learning, pattern classification, data mining, and information retrieval. Topics will include rule induction, decision trees, Bayesian methods, density estimation, linear classifiers, neural networks, instance-based approaches, genetic algorithms, evaluation, and applications. In addition to programming projects and homework, students will complete a semester project.
COSC-589 | Web Search and Sense-making
Grace Hui Yang Graduate
The Web provides abundant information which allows us to live more conveniently and make quicker decisions. At the same time, the growth of the Web and the improvements in data creation, collection, and use have led to a tremendous increase in the amount and complexity of the data that a search engine needs to handle. The increase in the magnitude and complexity of the data has become a major driver for new data analytics algorithms and technologies that are scalable, highly interactive, and able to handle complex and dynamic information seeking tasks in the big data era. How to effectively and efficiently search for the documents relevant to our information needs and how to extract the valuable information and make sense of “big data” are the subjects of this course.
The course will cover Web search theory and techniques, including basic probabilistic theory, representations of documents and information needs, various retrieval models, link analysis, classification and recommender systems. The course will also cover programming models that allow us to easily distribute computations across large computer clusters. In particular, we will teach Apache Spark, which is an open-source cluster computing framework that has quickly become the state of the art for big data programming. The course features step-by-step weekly or bi-weekly small assignments that together make up a large big data project, such as running Google’s PageRank on the entire Wikipedia. Students will gain knowledge of Spark, Scala, Web search engines, and Web recommender systems, with a focus on search engine design and “thinking at scale”.
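As a rough illustration of the link analysis mentioned above, here is a minimal single-machine sketch of PageRank by power iteration. It is written in plain Python rather than Spark or Scala and is not the course's assignment code. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page toy graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its damped rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for t in targets:
                    new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Toy link graph (hypothetical): A links to B and C, B to C, C to A, D to C.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_ranks = pagerank(toy_links)
for page in sorted(toy_ranks, key=toy_ranks.get, reverse=True):
    print(page, round(toy_ranks[page], 4))
```

On a cluster framework, the inner loop would typically become a join between the link table and the rank table followed by a per-page sum, which is what makes the computation feasible on a graph the size of Wikipedia.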
COSC-592 | Health Search and Mining
Nazli Goharian Graduate
This course will be a combination of lectures and student presentations. After providing a review of information retrieval and data mining, the lectures will cover health text processing on scientific literature, clinical notes, and social media, among others. Students will present and discuss research literature. This includes a review of current literature on a specific topic, as well as the experimental results and evaluation of a proposed approach. Students are expected to have knowledge of data structures.
COSC/LING-672 | Advanced Semantic Representation
Nathan Schneider Graduate
Natural language is an imperfect vehicle for meaning. On the one hand, some expressions can be interpreted in multiple ways; on the other hand, there are often many superficially divergent ways to express very similar meanings. Semantic representations attempt to disentangle these two effects by exposing similarities and differences in how a word or sentence is interpreted. Such representations, and algorithms for working with them, constitute a major research area in natural language processing.
This course will examine semantic representations for natural language from a computational/NLP perspective. Through readings, presentations, discussions, and hands-on exercises, we will put a semantic representation under the microscope to assess its strengths and weaknesses. For each representation we will confront questions such as: What aspects of meaning are and are not captured? How well does the representation scale to the large vocabulary of a language? What assumptions does it make about grammar? How language-specific is it? In what ways does it facilitate manual annotation and automatic analysis? What datasets and algorithms have been developed for the representation? What has it been used for? In Spring 2017 the focus will be on the Abstract Meaning Representation (http://amr.isi.edu/); its relationship to other representations in the literature will also be considered. Term projects will consist of (i) innovating on the representation's design, datasets, or analysis algorithms, or (ii) applying it to questions in linguistics or downstream NLP tasks.
LING-264 | Multilingual and Parallel Corpora
Amir Zeldes Upperclass Undergraduate
Parallel and multilingual corpora are collections of natural language data in several languages, constructed using principled design criteria, which contain either aligned translations of the same texts, or distinct but comparable texts. As such, they are vital resources for comparative linguistics, translation studies, multilingual lexicography and machine translation. This course sets out to explore the theoretical problems raised by translated language, as well as practical issues and empirical patterns found in multilingual data. This includes questions such as exploring similarities and differences between closely related languages, such as different Romance or Slavic languages, or very distant ones, such as English and Japanese. The focus of the course is on the study of actual examples of parallel and comparable corpora using computational methods. Students will have access to search engines indexing aligned translations and will learn some of the basics of building a parallel corpus. The course introduces some fundamental computational methodology, but does not have a programming requirement. However, familiarity with at least one language other than English is required to complete coursework.
LING-469 | Analyzing language data with R
Amir Zeldes Upperclass Undergraduate & Graduate
This course will teach statistical analysis of language data with a focus on corpus materials, using the freely available statistics software 'R'. The course will begin with foundational notions and methods for statistical evaluation, hypothesis testing and visualization of linguistic data which are necessary for both the practice and the understanding of current quantitative research. As we progress we will learn exploratory methods to chart out meaningful structures in language data, such as agglomerative clustering, principal component analysis and multifactorial regression analysis. The course assumes basic mathematical skills and familiarity with linguistic methodology, but does not require a background in statistics or R.