GUCL: Computational Linguistics @ Georgetown
We are a group of Georgetown University faculty, student, and staff researchers at the intersection of language and computation. Our areas of expertise include natural language processing, corpus linguistics, information retrieval, text mining, and more. Members belong to the Linguistics and/or Computer Science departments.
- 8/27/20: First-Year Student Presented Paper at Prestigious Computational Linguistics Conference (Aryaman Arora)
- 9/10/18: #MeToo Movement on Twitter (Lisa Singh)
- 8/29/18: Cliches in baseball (Nathan Schneider)
- 1/20/18: The Coptic Scriptorium project (Amir Zeldes)
- Congratulations to Arman Cohan, Nazli Goharian, and Georgetown alum Andrew Yates for winning a Best Long Paper award at EMNLP 2017! The paper is entitled "Depression and Self-Harm Risk Assessment in Online Forums."
- Congratulations to Ophir Frieder, who has been named to the European Academy of Sciences and Arts (EASA)!
- 9/19/16: "Email" Dominates What Americans Have Heard About Clinton (Lisa Singh)
- 7/12/16: Searching Harsh Environments (Ophir Frieder)
Mailing list: Contact Nathan Schneider to subscribe!
COSC-270 | Artificial Intelligence
Mark Maloof Undergraduate
Artificial Intelligence (AI) is the branch of computer science that studies how to program computers to reason, learn, see, and understand. The lecture portion of this class surveys basic and advanced concepts and techniques of artificial intelligence, including search, knowledge representation, automated reasoning, uncertain reasoning, and machine learning. Additional topics include the Lisp programming language, theorem proving, game playing, rule-based systems, and philosophical issues. Applications of artificial intelligence will also be discussed and will include domains such as medicine, computer security, and face detection. Students must complete midterm and final exams, and five projects using the Lisp programming language.
COSC/LING-272 | Algorithms for NLP
Nathan Schneider Undergraduate
Human language technologies increasingly help us to communicate with computers and with each other. But every human language is extraordinarily complex, and the diversity seen in languages of the world is massive. Natural language processing (NLP) seeks to formalize and unpack different aspects of a language so computers can approximate human-like language abilities. In this course, we will examine the building blocks that underlie a human language such as English (or Japanese, Arabic, Tamil, or Navajo), and fundamental algorithms for analyzing those building blocks in text data, with an emphasis on the structure and meaning of words and sentences. Students will implement a variety of core algorithms for both rule-based and machine learning methods, and learn how to use computational linguistic datasets such as lexicons and treebanks. Text processing applications such as machine translation, information retrieval, and dialogue systems will be introduced as well.
This course is designed for undergraduates who are comfortable with the basics of discrete probability and possess solid programming skills, including the ability to use basic data structures and familiarity with regular expressions. COSC-160: Data Structures is the prerequisite for CS students, and LING-001 is the prerequisite for Linguistics students. Students that are new to programming or need a refresher are directed to LING-362: Introduction to NLP. The languages of instruction will be English and Python.
COSC-574 | Automated Reasoning
Mark Maloof Graduate
This graduate lecture surveys methods of automated deductive reasoning. Through traditional lectures, programming projects, paper presentations, and research projects, students learn (1) to understand the foundations of logical and probabilistic methods of automated reasoning. (2) to implement algorithms for logical and probabilistic reasoning, (3) to comprehend, analyze, and critique papers from the primary literature, (4) to replicate studies described in the primary literature, and (5) to design, conduct, and present their own studies. Topics include propositional logic, predicate logic, resolution proof, production systems, Prolog, uncertain reasoning, certainty factors, Bayesian decision theory, Bayesian networks, exact inference, approximate inference, first-order probabilistic models, probabilistic programming languages, and applications.
COSC-586 | Text Mining & Analysis
Nazli Goharian Graduate
This course covers various aspects and research areas in text mining and analysis. Text may be a document, query, blog, tag description, etc. The structure of the course is a combination of lectures & students' presentations. The lectures will cover Text/Web/query classification, information extraction, word sense disambiguation, opinion mining & sentiment analysis, query log analysis, ontology extraction and integration, and more. The students are assigned a related topic in the field for further study and presentation in the class.
LING-362 | Introduction to Natural Language Processing
Amir Zeldes Upperclass Undergraduate & Graduate
This course will introduce students to the basics of Natural Language Processing (NLP), a field which combines insights from linguistics and computer science to produce applications such as machine translation, information retrieval, and spell checking. We will cover a range of topics that will help students understand how current NLP technology works and will provide students with a platform for future study and research. We will learn to implement simple representations such as finite-state techniques, n-gram models and basic parsing in the Python programming language. Previous knowledge of Python is not required, but students should be prepared to invest the necessary time and effort to become proficient over the course of the semester. Students who take this course will gain a thorough understanding of the fundamental methods used in natural language understanding, along with an ability to assess the strengths and weaknesses of natural language technologies based on these methods.
LING-367 | Computational Corpus Linguistics
Amir Zeldes Upperclass Undergraduate & Graduate
Digital linguistic corpora, i.e. electronic collections of written, spoken or multimodal language data, have become an increasingly important source of empirical information for theoretical and applied linguistics in recent years. This course is meant as a theoretically founded, practical introduction to corpus work with a broad selection of data, including non-standardized varieties such as language on the Internet, learner corpora and historical corpora. We will discuss issues of corpus design, annotation and evaluation using quantitative methods and both manual and automatic annotation tools for different levels of linguistic analysis, from parts-of-speech, through syntax to discourse annotation. Students in this course participate in building the corpus described here: https://corpling.uis.georgetown.edu/gum/
Corey Miller Upperclass Undergraduate & Graduate
This course will survey speech processing technology from a computational linguistic perspective. Speech processing technology is a component of human language technology that focuses on the processing of audio data. The audio data can be either the input or output of speech processing. When speech serves as the output, the technology is known as speech synthesis or text-to-speech (TTS). Additional technologies to be examined include spoken language identification (SLID), speaker verification and identification and speech diarization, which is the parsing of audio data into individual speaker segments.
Particular attention will be paid to the linguistic components of speech technology. Phonetics and phonology play an important role in both TTS and STT. In addition, morphology, syntax and pragmatics are important both in authentic modeling of TTS and in constraining possible STT output. Semantics plays a role in the interpretation of STT output, which can feed into text-based natural language processing (NLP).
The algorithms underlying contemporary speech technology approaches will be discussed. Despite the focus on the linguistic aspects of the technology, it is important for students to have sufficient understanding of the algorithms used in order to grasp both where linguistics fits in and the possible constraints on its incorporation into larger systems.
The course will examine freely available TTS and STT packages so that students can build their own engines and experiment with the construction of the components. For assignments and projects, students will be encouraged to pick a language or dialect of their choice in order to build a synthesizer or recognizer for that variety. It would be most interesting to focus on languages or varieties that do not generally receive attention in commercial applications, such as African American or accented varieties of English.
Students from a variety of backgrounds are encouraged to take this course. Helpful background includes: natural language processing, phonetics, phonology and sociolinguistics. While not required, helpful technical background includes familiarity with speech analysis software such as PRAAT, Linux, shell scripting and coding/scripting in languages like Python, Java, C++, etc.
MATH-656 | Data Mining
George Wilson Graduate
This course presents an introduction to computational and statistical methods for exploring large data sets and discovering patterns in them. Visualization and other exploratory methods will be used throughout the course. The course surveys methods in predictive modeling (classification) including decision trees, Naïve Bayes and nearest neighbor methods. In the process, we will study discretization, data normalization and attribute selection as well as sampling methods like cross-validation, bagging and boosting. Other topics will include cluster analysis, association analysis, anomaly detection and text mining. For all topics studied, students will work with various real and constructed data sets to see the impact of different distributions on the performance of the algorithms. A variety of performance metrics will be studied.
The software Weka, R and Excel will be used in the course, although only basic knowledge of R and Excel will be assumed.
COSC-488 | Information Retrieval
Nazli Goharian Upperclass Undergraduate & Graduate
Information retrieval is the identification of textual components, be them web pages, blogs, microblogs, documents, medical transcriptions, mobile data, or other big data elements, relevant to the needs of the user. Relevancy is determined either as a global absolute or within a given context or view point. Practical, but yet theoretically grounded, foundational and advanced algorithms needed to identify such relevant components are taught.
The Information-retrieval techniques and theory, covering both effectiveness and run-time performance of information-retrieval systems are covered. The focus is on algorithms and heuristics used to find textual components relevant to the user request and to find them fast. The course covers the architecture and components of the search engines such as parser, index builder, and query processor. In doing this, various retrieval models, relevance ranking, evaluation methodologies, and efficiency considerations will be covered. The students learn the material by building a prototype of such a search engine. These approaches are in daily use by all search and social media companies.
COSC/LING-572 | Empirical Methods in Natural Language Processing
Nathan Schneider Graduate
Systems of communication that come naturally to humans are thoroughly unnatural for computers. For truly robust information technologies, we need to teach computers to unpack our language. Natural language processing (NLP) technologies facilitate semi-intelligent artificial processing of human language text. In particular, techniques for analyzing the grammar and meaning of words and sentences can be used as components within applications such as web search, question answering, and machine translation.
This course introduces fundamental NLP concepts and algorithms, emphasizing the marriage of linguistic corpus resources with statistical and machine learning methods. As such, the course combines elements of linguistics, computer science, and data science. Coursework will consist of lectures, programming assignments (in Python), and a final team project. The course is intended for students who are already comfortable with programming and have some familiarity with probability theory.
COSC-589 | Web Search and Sense-making
Grace Hui Yang Graduate
The Web provides abundant information which allows us to live more conveniently and make quicker decisions. At the same time, the growth of the Web and the improvements in data creation, collection, and use have lead to tremendous increase in the amount and complexity of the data that a search engine needs to handle. The increase of the magnitude and complexity of the data has become a major drive for new data analytics algorithms and technologies that are scalable, highly interactive, and able to handle complex and dynamic information seeking tasks in the big data era. How to effectively and efficiently search for the documents relevant to our information needs and how to extract the valuable information and make sense out from “big data” are the subjects of this course.
The course will cover Web search theory and techniques, including basic probabilistic theory, representations of documents and information needs, various retrieval models, link analysis, classification and recommender systems. The course will also cover programming models that allow us to easily distribute computations across large computer clusters. In particular, we will teach Apache Spark, which is an open-source cluster computing framework that has soon become the state-of-the-art for big data programming. The course is featured in step-by-step weekly/bi-weekly small assignments which composes a large big data project, such as building Google’s PageRank on the entire Wikipedia. Students will be provided knowledge to Spark, Scala, Web search engines, and Web recommender systems with a focus on search engine design and "thinking at scale”.
LING-462/COSC-482 | Statistical Machine Translation
Achim Ruopp Upperclass Undergraduate & Graduate
After more than 60 years since Machine Translation (MT) research started at Georgetown, this area of Natural Language Processing (NLP) research is more active than ever. In this course we explore the data-driven approaches to translate human language with computers that supplanted rule-based approaches in the past quarter century. First, we lay foundations for the course with statistical NLP relevant to MT and corpus preparation. Next, we start exploring statistical MT (SMT) – from word-based models to phrase-based models to tree-based models. We will then cover domain-adaptation, incremental learning and how to integrate linguistic information. We will learn how to evaluate system output with automatic and human evaluation methods.
Recently, deep learning-based approaches have proven to produce superior translation quality compared to SMT. We will investigate the current state-of-the-art in neural MT (NMT) and contrast its strength and weaknesses with SMT.
Machine translation does not exist in a vacuum; it is now used to provide draft translations for human translators and is embedded in other NLP systems. With better quality, raw MT is increasingly used in in written and spoken human communication. We study the adaptation of MT for the most common applications.
Requirements: Basic Python programming skills are required (for example satisfied by LING-362, Intro to NLP)