This long-standing seminar series brings together faculty and students to discuss issues in the general field of text processing, as it applies to machine learning, natural language processing, information retrieval and digital libraries. Unless stated otherwise, meetings will be held biweekly in MR6 (most of the time), from 10-11 am on Tuesdays.

You can get announcements of the CHIME Text Processing Seminar by joining our mailing list: ChimeText.

Upcoming meetings

Upcoming meetings listed in chronological order.

Date Speaker / Title
2008
No talks currently scheduled

Past meetings

Past meetings listed in reverse chronological order.

Jump to: 2008 2007 2006 2005 2004

Date Speaker / Title Notes / Slides
2008
14 Aug, Thursday, 9:00am - 11:00am, MR6 (AS6 05-12)
Hendra Setiawan / Reordering in Statistical Machine Translation: A Function Word, Syntax-based Approach

ABSTRACT: In this thesis, we investigate a specific area within Statistical Machine Translation (SMT): the reordering task -- the task of arranging translated words from source to target language order. This task is crucial as well as challenging, as the failure to order words correctly leads to a disfluent discourse and it may require in-depth knowledge about the source and target language syntaxes, which are often not available to SMT systems.

In this thesis, we propose to address the reordering task by using knowledge of function words. In many languages, function words -- which include prepositions, determiners, articles, etc. -- are important in explaining the grammatical relationship among phrases within a sentence. Projecting them and their dependent arguments into another language often results in structural changes in the target sentence. Furthermore, function words have desirable empirical properties as they are enumerable and appear frequently in the text, making them highly amenable to statistical modeling.

We demonstrate the utility of this function word idea to the syntax-based approach, following the recent trend of using syntactic formalisms in modeling reordering. We also believe the idea brought forward and developed in this thesis is applicable to other SMT approaches. We implement this idea in a specific syntax-based approach: the formally syntax-based approach, which assumes a knowledge-poor environment where no linguistic annotation is available to the model. In particular, we demonstrate the benefit of our function words idea by proposing several statistical models that address the suboptimalities of the current formally syntax-based models.

We first argue that the current formally syntax-based models are still problematic, although they achieve state-of-the-art performance. More specifically, without access to linguistic knowledge, these models typically come with only one type of nonterminal symbol, which unfortunately introduces many structural ambiguities. In contrast, our idea, which is implemented as a Head-driven Synchronous Context Free Grammar, is better at addressing this problem since it introduces two types of nonterminals: one for function words, and one for their arguments. With this richer set of nonterminals, we develop novel statistical models to better resolve the structural ambiguities. Our experimental results suggest that our syntax-based approach performs well in the reordering task in perfect lexical choice scenarios, thanks to its stronger structural modeling with the advantage of being more compact. We also validate this approach in the full translation task where the training data contains noise, confirming the merit of our idea to both the reordering and the translation task.

BIODATA: Hendra Setiawan is a Doctoral Student at SoC, NUS, co-supervised by Dr. Min-Yen Kan and Dr. Haizhou Li. His main research interest is Statistical Machine Translation and Natural Language Processing (NLP) in general.

Slides (.htm)
28 July, Friday, 9:00am - 11:00am, MR6 (AS6 05-12)
Qiu Long / Context for Semantic Similarity Calculation in Scenario Template Creation

Abstract: Scenario Template Creation (STC) is a Natural Language Processing (NLP) task to detect the commonalities among articles on similar events and generalize them into an abstract representation -- a scenario template (ST). For this task, the estimation of verb-centric text span similarity is the key. Since text span similarity calculation plays an important role in many NLP applications, various approaches have been proposed. They range from bag-of-words to more complicated ones involving thesauri and features at different linguistic levels. However, there are still demands and opportunities for further improvement. Contextual information, for instance, is intuitively a source for enhancing text span similarity estimation, but it has yet to be exploited as thoroughly as the internal features have been.

In this talk, I first discuss an intrinsic similarity measure for predicate-argument tuples (PATs). It is applied to a Paraphrase Recognition (PR) task, demonstrating its feasibility. Then I show a context model to capture contexts that could be more informative compared to other surrounding tokens. With different contextual relations defined, I hypothesize that two PATs' semantic similarity can also be reflected by their extrinsic similarity, i.e., whether they are contextually similarly connected to similar contexts. I show experimental results that confirm the correlation between such an extrinsic similarity and the semantic similarity of PATs. To integrate intrinsic and extrinsic similarities for PAT clustering, I propose a graphical framework, using a novel core algorithm called Context Sensitive Clustering (CSC). This clustering process is guided by the Expectation-Maximization (EM) algorithm. I conduct experiments comparing this EM-based CSC algorithm with the standard K-means algorithm. Under the widely-used purity and inverse purity metrics, the proposed algorithm outperforms K-means over all the scenarios tested.
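The purity and inverse purity metrics mentioned above can be illustrated with a minimal sketch. It follows the commonly used definitions of these metrics and is not the speaker's evaluation code; the function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def purity(clusters, gold):
    """Fraction of items that belong to the majority gold class of their cluster.

    clusters: list of lists of item ids (system output).
    gold: dict mapping each item id to its gold class label.
    """
    total = sum(len(cluster) for cluster in clusters)
    matched = 0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        # Count how many items of each gold class fall in this cluster.
        counts = Counter(gold[item] for item in cluster)
        matched += max(counts.values())
    return matched / total


def inverse_purity(clusters, gold):
    """Purity with the roles of system clusters and gold classes swapped."""
    # Map every item to the index of the system cluster that contains it.
    assignment = {item: idx
                  for idx, cluster in enumerate(clusters)
                  for item in cluster}
    # Group items by gold class so the classes can be treated as clusters.
    classes = {}
    for item, label in gold.items():
        classes.setdefault(label, []).append(item)
    return purity(list(classes.values()), assignment)


# Toy data (hypothetical): five items, two gold classes, two system clusters.
toy_gold = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y"}
toy_clusters = [["a", "b", "c"], ["d", "e"]]
toy_purity = purity(toy_clusters, toy_gold)
toy_inverse_purity = inverse_purity(toy_clusters, toy_gold)
```

Purity rewards clusters that contain a single class, while inverse purity rewards classes that are kept together in one cluster, which is why the two are usually reported side by side: putting every item in one cluster maximizes inverse purity but lowers purity.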

Biodata: Long Qiu is a Doctoral Student at SoC, NUS, co-supervised by Professor Chua Tat-Seng and Dr. Min-Yen Kan. He got his Master of Science (SM) in Computer Science from Singapore-MIT Alliance in 2002. He is interested in Natural Language Processing (NLP) and the related machine learning techniques.

Slides (.pdf)
25 July, Friday, 2:30pm - 3:30pm, SR1 (COM1 02-06)
William Chang (Chief Scientist, Baidu) / The WWW in China and Three Generations of Intelligent Search

China has become the world's biggest online market in terms of users. What continues to drive this growth? What are its challenges and opportunities? In this survey we will outline the social and economic background, the key business models and competitive advantages, how media and multimedia interact, and how people use the Internet in their daily lives. The second part of this talk will present an overview and forward-looking synopsis of the principles and applications of search, from the perspective of a long-time search engineer.

Bio: Dr. William Chang has been the Chief Scientist at Baidu since January 2007. Prior to joining Baidu, Dr. Chang served as the CTO of Infoseek and the VP of Strategy of Go Network. He is also the creator of the highly successful Infoseek natural language search engine and Ultraseek enterprise search engine. Dr. Chang has extensive expertise in search technology, online community building and advertising business models. Dr. Chang earned an undergraduate degree in mathematics from Harvard and a PhD in computer science from the University of California, Berkeley for his breakthrough work in text search. At the renowned Cold Spring Harbor Laboratory, Dr. Chang mapped a genome and invented a protein sequence search methodology. More recently, he created a contextual advertising product at Sentius Corporation, and founded Affini, Inc., a social network technology company.

No slides available
25 July, Friday, 9:30am - 12:00noon, SR1 (COM1 02-06)
Yahoo! Research Labs talks / Recent Research in NLP / IR at YRL

Talk Overviews (times are approximate):
9:30-10:00 - Ricardo Baeza-Yates / Towards a Distributed Search Engine
10:00-10:30 - Evgeniy Gabrilovich / Overview of Computational Advertising
10:30-11:00 - Rosie Jones / Geography in Web Search
11:00-11:30 - Donald Metzler / Predicting when (not) to Advertise
11:30-12:00 - Vanessa Murdock / Diversifying Image Search with User Generated Content

  1. Ricardo Baeza-Yates

    Title: Towards a Distributed Search Engine

    Abstract: Distributed search engines are often more complex to implement compared to centralized engines. Distributing a search engine across multiple sites, however, has several advantages. In particular, it enables the utilization of fewer computer resources and the exploitation of data and user locality. In this presentation we show the feasibility of distributed Web search engines, by proposing a model for assessing the total cost of a distributed Web-search engine that includes the computational costs as well as the communication cost among all distributed sites. Using examples, we show that a distributed Web search engine can be more cost effective than a centralized one, if there is a large percentage of local queries, which is usually the case. We then present a query-processing algorithm that maximizes the number of queries answered locally, without sacrificing the quality of the results, by using caching and partial replication. We simulate our algorithm on real document collections and real query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive compared to a centralized architecture with respect to cost. This is joint work with Aris Gionis, Flavio Junqueira, Vassilis Plachouras and Luca Telloli.

    Bio: Ricardo Baeza-Yates is VP of Yahoo! Research for Europe and Latin America, leading the labs at Barcelona, Spain and Santiago, Chile. Until 2005 he was the director of the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile; and ICREA Professor at the Dept. of Technology of Univ. Pompeu Fabra in Barcelona, Spain. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 150 other publications. He has received the Organization of American States award for young researchers in exact sciences (1993) and with two Brazilian colleagues obtained the COMPAQ prize for the best CS Brazilian research article (1997). In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences. During 2007 he was awarded the Graham Medal for innovation in computing, given by the University of Waterloo to distinguished alumni.

  2. Evgeniy Gabrilovich

    Title: Overview of Computational Advertising

    Abstract: Web advertising is the primary driving force behind many Web activities, including Internet search as well as publishing of online content by third-party providers. A new discipline - Computational Advertising - has recently emerged, which studies the process of advertising on the Internet from a variety of angles. A successful advertising campaign should be relevant to the user's immediate information need as well as more generally to the user's background, be economically worthwhile to the advertiser and the intermediaries (e.g., the search engine), and not be detrimental to user experience. At first approximation, the process of obtaining relevant ads can be reduced to conventional information retrieval, where one constructs a query that describes the user's context, and then executes this query against a large inverted index of ads. We show how to augment the standard IR approach using query expansion and text classification techniques. We demonstrate how to employ a relevance feedback assumption and use Web search results retrieved by the query. We will also survey the numerous challenges and open research problems posed by computational advertising, such as text summarization, natural language generation, named entity extraction, handling geographic names, and others.

    Bio: Evgeniy Gabrilovich is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research. His research interests include information retrieval, machine learning, and computational linguistics. Recently, he co-organized a workshop on the synergy between Wikipedia and research in AI at AAAI 2008, as well as co-presented a tutorial on computational advertising at ACL 2008 and EC 2008. He served on the program committees of ACL-08:HLT, AAAI 2008, WWW 2008, CIKM 2008, JCDL 2008, AAAI 2007, EMNLP-CoNLL 2007, and COLING-ACL 2006. Evgeniy earned his MSc and PhD degrees in Computer Science from the Technion - Israel Institute of Technology. In his Ph.D. thesis, Evgeniy developed a methodology for using large scale repositories of world knowledge (e.g., all the knowledge available in Wikipedia) in order to enhance text representation beyond the bag of words. URL: http://research.yahoo.com/Evgeniy_Gabrilovich

  3. Rosie Jones

    Title: Geography in Web Search

    Abstract: Web search results are typically based on the user's search query, without taking other contextual information into account. However, we can see from user search behavior that for some search topics the user may prefer results which are geographically close to home. We will show topics which have a geographical dependence, as well as others which appear to be geographically independent. Based on these findings, we propose a more flexible approach to web search, in which we prefer a ranking with results close to the user location when this will best satisfy the user's information need.

    Bio: Rosie Jones is a Senior Research Scientist at Yahoo!. Her research interests include web search, geographic information retrieval and natural language processing. She received her PhD from the School of Computer Science at Carnegie Mellon University. In 2005 she co-organized the SIGIR workshop on lexical cohesion and information retrieval, and in 2003 she co-organized the ICML workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining. She served as a Senior PC member for SIGIR in 2007 and 2008. URL: http://research.yahoo.com/Rosie_Jones

  4. Donald Metzler

    Title: Predicting when (not) to Advertise

    Abstract: In this talk we discuss the problem of whether or not to show online advertisements. We propose two methods for addressing this problem, a simple thresholding approach and a machine learning approach, which collectively analyzes the set of candidate ads augmented with external knowledge. Our experimental evaluation, based on over 28,000 editorial judgments, shows that we are able to predict, with high accuracy, when to show ads for both content match and sponsored search advertising tasks.

    Bio: Donald Metzler is a Research Scientist at Yahoo! Research in Santa Clara, CA. He obtained his Ph.D. degree in Computer Science from the University of Massachusetts Amherst in 2007. His research interests include information retrieval, machine learning, and their intersection. He is the co-author of Search Engines: Information Retrieval in Practice, which will be published in the early part of 2009. URL: http://research.yahoo.com/Don_Metzler

  5. Vanessa Murdock

    Title: Diversifying Image Search with User Generated Content

    Abstract: Large-scale image retrieval on the Web relies on the availability of short snippets of text associated with the image. This user-generated content is a primary source of information about the content and context of an image. While traditional information retrieval models focus on finding the most relevant document without consideration for diversity, image search requires results that are both diverse and relevant. This is problematic for images because they are represented very sparsely by text, and as with all user-generated content the text for a given image can be extremely noisy.

    The contribution of this paper is twofold. We show that it is possible to minimize the trade-off between precision and diversity: relevance models offer a unified framework to afford the greatest diversity without harming precision. Furthermore, we show that estimating the query model from the distribution of tags favors the dominant sense of a query. Relevance models operating only on tags offer the highest level of diversity with no significant decrease in precision.

    Bio: Vanessa Murdock currently holds a Post Doc position at Yahoo! Research Barcelona. Her current work focuses on retrieval of short texts, such as for advertisements, and user-generated content for images and video. She completed her PhD in 2006 at the University of Massachusetts, working with W. Bruce Croft. Her thesis, focusing on sentence retrieval for applications such as Question Answering, novelty detection, and information provenance, was recently published as a book, "Exploring Sentence Retrieval". URL: http://research.yahoo.com/Vanessa_Murdock

2nd Talk: Slides (.pdf)
4th Talk: Slides (.pdf)
24 July, Thursday, 3:00pm - 5:00pm, SR7 (COM1 02-07)
Microsoft Research Asia Lab talks / Recent Research in NLP at MSRA

Talk Overviews:
3:00-4:00 - Ming Zhou / Generating Chinese Couplets using a Statistical MT Approach
4:00-5:00 - Chin-Yew Lin / Web Scale Question Answering -- SQuAD

ABSTRACTS:

  1. Ming Zhou

    Title: Generating Chinese Couplets using a Statistical MT Approach

    Part of the unique cultural heritage of China is the game of Chinese couplets (duìlián). One person challenges the other person with a sentence (first sentence). The other person then replies with a sentence (second sentence), in a way that corresponding words in the two sentences match each other by obeying certain constraints on semantic, syntactic, and lexical relatedness. This task is viewed as a difficult problem in AI and has not been explored in the research community.

    In this paper, we regard this task as a kind of machine translation process. We present a phrase-based SMT approach to generate the second sentence. First, the system takes as input the first sentence and generates as output an N-best list of proposed second sentences using a phrase-based SMT decoder. Then, a set of filters is used to remove candidates violating linguistic constraints. Finally, a Ranking SVM is applied to rerank the candidates. A comprehensive evaluation, using both human judgments and BLEU scores, has been conducted, and the results demonstrate that this approach is very successful.

    You can view this interesting AI gaming at http://duilian.msra.cn/ which has become very popular in China.

    Bio: Ming Zhou is research manager of the Natural Language Computing Group at Microsoft Research Asia (MSRA). As one of the first groups at MSRA, this group has been working on machine translation, information retrieval, question answering and language gaming and has contributed many technologies to MS products such as Chinese/Japanese IME, Chinese word breaker, English writing assistant, search engine speller, multi-language search and keyword bidding, text mining, etc.

    Ming developed China's first Chinese-English machine translation system, CEMT-I, in 1988, which laid the foundation for machine translation research at Harbin Institute of Technology. He is the inventor of the J-Beijing Chinese-Japanese machine translation system, a famous MT product in Japan which has held a 62% market share for 10 years since it was launched in 1998. Ming Zhou received his PhD degree from Harbin Institute of Technology in 1991. He then did his post-doc at Tsinghua University from 1991 to 1993, and became an associate professor at the same university until 1999, when he joined MSRA.

  2. Chin-Yew Lin

    Title: Web Scale Question Answering -- SQuAD

    Abstract: Question answering has been a very active research field in information retrieval and natural language processing. Despite the success of the TREC QA track, large scale robust QA systems are still yet to be found in the real world. In this talk, I will briefly introduce recent progress on SQuAD -- a question and answering project aiming to crawl, index, and serve all question and answer pairs existing on the web. I will address six main challenges of the project and then focus on the topic of question search and recommendation. Three demos will be shown to highlight how SQuAD technologies can be used in different scenarios.

    Bio: Dr. Chin-Yew LIN is a lead researcher and research manager at Microsoft Research Asia. Before joining Microsoft in 2006, he was a senior research scientist at the Information Sciences Institute at the University of Southern California (USC/ISI), where he had worked in the Natural Language Processing and Machine Translation group since 1997. His research interests are automated summarization, opinion analysis, question answering, computational advertising, community intelligence, machine translation, and machine learning.

    Recently, his main focus has been developing a scalable automatic question answering and distillation system -- SQuAD. He also developed automatic evaluation technologies for summarization, QA, and MT. In particular, he created the ROUGE automatic summarization evaluation package. It has become the de facto standard in summarization evaluations. More than 200 research sites worldwide have downloaded this package.

1st Talk: Slides (.pdf)
17 July, Thursday, 10:30am - 12nn, EC (SoC1 05-46)
Douglas Oard (University of Maryland) / Fourth-Generation Content Analysis: Supporting social science research using computational linguistics

ABSTRACT:

Babbie defines content analysis as "the study of recorded human communications such as books, Web sites, paintings and laws." We all practice what we might call "first generation" content analysis every time we read a paper. What we might call "second generation" content analysis involves social scientists who develop coding frames appropriate to their research question and then meticulously annotate a collection of moderate size in order to support their analysis. Third-generation content analysis leverages extensive automation in fairly straightforward ways, such as by counting words or preparing a concordance. We now find ourselves on the verge of a fourth generation of content analysis techniques in which computational linguistics holds promise for automated population of complex coding frames. This could enable sophisticated Web-scale studies, potentially fostering emergence of research methods that go well beyond content to encompass many forms of evidence from human interaction with information. In this talk, I will describe some challenges that we must overcome as these two communities learn to work together. I'll illustrate my talk with examples from the PopIT project, a collaboration between social scientists and computational linguists at the University of Maryland in which we are developing automated tools for computational analysis of trends in the popularity of information technology innovations. I'll start with a sketch of our research design for working at the intersection of these two fields, and then I'll describe a few specific pieces of that puzzle that we have already started to build.

Finally, I'll conclude with a few remarks about where we see potential for collaboration with others who share similar interests.

BIODATA:

Douglas Oard is Associate Dean for Research at the College of Information Studies of the University of Maryland, College Park, where he holds joint appointments as Associate Professor in the College of Information Studies and in the Institute for Advanced Computer Studies. He earned his Ph.D. in Electrical Engineering from the University of Maryland. Dr. Oard's research interests center around the use of emerging technologies to support information seeking by end users, with recent work focusing on interactive techniques for cross-language information retrieval, searching conversational media, and leveraging observable behavior to improve user modeling. Together with Ping Wang and Ken Fleischmann, he helps to lead the NSF-funded PopIT project. Additional information is available at http://www.glue.umd.edu/~oard/

Slides (.htm)
16 July, Wed, 3-4pm, SR3 (COM1 #02-12)
Xiong Deyi (I2R) / Linguistically Annotated BTG for Statistical Machine Translation

ABSTRACT:

Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this talk, we introduce a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated data, we learn annotated BTG rules and train a linguistically motivated phrase translation model and reordering model. We also present an annotation algorithm that captures syntactic information for BTG nodes. The experiments show that the LABTG approach significantly outperforms a baseline BTG-based system and a state-of-the-art phrase-based system on the NIST MT-05 Chinese-to-English translation task. Moreover, we empirically demonstrate that the proposed method achieves better translation selection and phrase reordering.

BIODATA:

Xiong Deyi received his Ph.D. from the Institute of Computing Technology of the Chinese Academy of Sciences. His research interests include statistical machine translation, Chinese language processing, information extraction, and statistical parsing. He is currently a research fellow at the Institute for Infocomm Research of the Agency for Science, Technology and Research (I2R, A-STAR).

Slides (.pdf)
9 Jul, Wed, 2-3pm / SR7 (COM1 #02-07)
Mstislav Maslennikov (NUS) / Relation Extraction for Information Extraction from Free Text

ABSTRACT:

Information Extraction (IE) is the task of identifying information (e.g. entities, relations or events) from free text. Numerous context-, ontology-, rule- and classification-based methods have been actively explored during the decades of research on this task. However, the challenging open question of effectively handling the flexibility of natural language has remained unresolved. In IE, this implies the problem of sparseness of data instances, which in turn causes the problems of paraphrasing and misalignment of context features of the extracted information. In this thesis, we hypothesize that such problems can be alleviated by combining relations between entities at the phrasal, dependency, semantic and inter-clausal discourse levels. To validate our hypothesis, we develop a 2-level multi-resolution framework ARE (Anchors and Relations). The first level of ARE extracts candidate phrases (anchors), while the second level evaluates the relations among the anchors and composes possible candidate templates.

The relations between the anchors are combined in several ways. First, we evaluate dependency relations between anchors. We classify dependency relation paths between the anchors into the Simple, Average and Hard categories according to the path length and develop different techniques to handle them. The category-specific strategies resulted in improvements of 3% and 4% on the MUC4 (Terrorism) and MUC6 (Management Succession) domains, respectively. The increased performance demonstrates that dependency relations are important to handle paraphrases at the syntactic level. Second, we incorporate the discourse relation analysis in a multi-resolution framework for IE to handle long distance dependency relations and possible paraphrasings at the intra-clausal level. This leads to a further improvement of 3%, 7%, 3% and 4% on MUC4, MUC6 and ACE RDC 2003 (general and specific types) domains, respectively. Third, we explore 2 supplementary strategies to combine relation paths between anchors. Since the number of negative paths between the anchors is many times greater than that of positive paths, we apply a filtering strategy to eliminate negative paths. Also, we support the learning process of our dependency relation classifier by cascading the features from the discourse classifier. These 2 strategies further improve the IE performance on the MUC4, MUC6 and ACE RDC 2003 (general and specific types) corpora.

Overall, our results affirm the hypothesis that the extraction of candidate phrases (anchors) and the combination of different relation types between anchors in a multi-resolution framework is important to tackle the key problems of paraphrasing and misalignment in Information Extraction.

BIODATA:

Mr. Maslennikov Mstislav is a Doctoral Student at SOC, NUS. He received his 5-year diploma (equivalent to M.Sc.) degree from the Moscow State University, Russia. Since 2002, he has been studying in the internship and PhD programs under the supervision of Prof. Chua Tat-Seng and Dr. Tian Qi. His research is on the theme of improving Information Extraction through relation-based analysis of free text.

Slides (.pdf)
12 June, Thursday, 10:00am - 12:00noon, MR6 (AS6 05-12)
JCDL/LREC Practice Session

AGENDA:

  1. 10:00-10:30 Zhao Jin, "Math Information Retrieval: User Requirements and Prototype Implementation" (JCDL)
  2. 10:30-10:50 Kan Min-Yen, "Slide Image Retrieval: A Preliminary Study" (JCDL, Short paper)
  3. 10:50-11:20 Michael Brown, "User-Assisted Ink-Bleed Correction for Handwritten Documents" (JCDL)
  4. 11:20-11:40 Kan Min-Yen, "The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics" (LREC)
  5. 11:40-12:00 Kan Min-Yen, "ParsCit: An open-source CRF reference string parsing package" (LREC)

ABSTRACTS:

Talk #1: We report on the user requirements study and preliminary implementation phases in creating a digital library that indexes and retrieves educational materials on math. We first review the current approaches and resources for math retrieval, then report on the interviews of a small group of potential users to properly ascertain their needs. While preliminary, the results suggest that meta-search and resource categorization are two basic requirements for a math search engine. In addition, we implement a prototype categorization system and show that the generic features work well in identifying the math contents from the webpage but perform less well at categorizing them. We discuss our long term goals, where we plan to investigate how math expressions and text search may be best integrated.

Talk #2: We consider the task of automatic slide image retrieval, in which slide images are ranked for relevance against a textual query. Our implemented system, SLIDIR, caters specifically to this task, using features designed for synthetic images embedded within slide presentations. We show promising results in both the ranking and binary relevance tasks and analyze the contribution of different features to task performance.

Talk #3: We describe a user-assisted framework for correcting ink-bleed in old handwritten documents housed at the National Archives of Singapore (NAS). Our approach departs from traditional correction techniques that strive for full automation. Fully automated approaches make assumptions about ink-bleed characteristics that are not valid for all inputs. Furthermore, fully automated approaches often have to set algorithmic parameters that have no meaning for the end-user. In our system, the user needs only to provide simple examples of ink-bleed, foreground ink, and background. These training examples are used to classify the remaining pixels in the document to produce a computer-generated result that is equal to or better than existing fully automated approaches.

To offer a complete system, we provide additional tools to allow any remaining errors to be easily cleaned up by the user. The initial training markup, computer-generated results, and manual edits are all recorded with the final output, allowing subsequent viewers to see how a corrected document was created and to make changes or updates. While an on-going project, our feedback from the NAS staff has been overwhelmingly positive that this user-assisted approach is a practical and useful way to address the ink-bleed problem.

Talk #4: The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.

Talk #5: We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We compare ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

1st Talk: Slides (.htm) 2nd Talk: Slides (.htm) 4th Talk: Slides (.htm) 5th Talk: Slides (.htm)
4, June, Wed, 2:30pm - 3:30pm, SR8 (COM1 208)
Timothy Baldwin (University of Melbourne) / Enhanced Information Access to Troubleshooting-oriented Web User Forum Data

ABSTRACT:

The ILIAD (Improved Linux Information Access by Data Mining) Project is an attempt to apply language technology to the task of Linux troubleshooting by analysing the underlying information structure of a multi-document text discourse and improving information delivery through a combination of filtering, term identification and information extraction techniques. In this talk, I will outline the overall project design and present results for a variety of thread-level filtering tasks.

BIODATA:

Timothy Baldwin is a Senior Lecturer in the Department of Computer Science and Software Engineering, University of Melbourne. Since completing his PhD at the Tokyo Institute of Technology in 2001, he has been involved with research grants from sources including the NSF, NTT, ARC, NICTA and Google. His research interests include web mining, information extraction, deep linguistic processing, multiword expressions, deep lexical acquisition, and biomedical text mining. He is the author of over 130 journal and conference publications, and has held visiting appointments at NTT Communication Science Laboratories and Saarland University. He is the recipient of a number of awards for both teaching and research in the areas of computer science and natural language processing. He is currently on the editorial board of Computational Linguistics, a series editor for CSLI Publications, and a member of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN).

Slides (.pdf)
2, June, Monday, 2:30pm - 3:30pm, SR2 (COM1 02-04)
ACL/SIGIR/WebDB Practice Session

AGENDA:

  1. 2:00-2:30 Chan Yee Seng, "MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation"
  2. 2:30-3:00 Chia Tee Kiah, "Lattice-Based Approach to Query-by-Example Spoken Document Retrieval"
  3. 3:00-3:30 Tan Yee Fan, "Efficient Web-Based Linkage of Short to Long Forms"

ABSTRACTS:

Talk #1: We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. Unlike most metrics, we compute a similarity score between items across the two sentences. We then find a maximum weight matching between the items such that each item in one sentence is mapped to at most one item in the other sentence. This general framework allows us to use arbitrary similarity functions between items, and to incorporate different information in our comparison, such as n-grams, dependency relations, etc. When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
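The matching step described in Talk #1 can be illustrated with a minimal sketch. This is not the speakers' implementation: the function names, the exact-match similarity function, the use of plain tokens as items, and the combination of precision and recall through a harmonic mean are all assumptions made for illustration. The matching is found by brute force, so it only suits short sentences.

```python
from itertools import permutations


def exact_match(a, b):
    """Hypothetical similarity function: 1.0 for identical items, else 0.0."""
    return 1.0 if a == b else 0.0


def max_weight_matching_score(cand, ref, sim=exact_match):
    """Total weight of the best one-to-one matching between two item lists.

    Brute force over permutations, so only suitable for short sentences.
    """
    flip = len(cand) > len(ref)
    short, long_ = (ref, cand) if flip else (cand, ref)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        total = 0.0
        for i, j in enumerate(perm):
            # Always call sim(candidate_item, reference_item).
            total += sim(long_[j], short[i]) if flip else sim(short[i], long_[j])
        best = max(best, total)
    return best


def matching_f_score(cand, ref, sim=exact_match):
    """Precision and recall from the matching weight, combined as F1 (an assumption)."""
    if not cand or not ref:
        return 0.0
    weight = max_weight_matching_score(cand, ref, sim)
    precision = weight / len(cand)
    recall = weight / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because each item is matched at most once, a candidate that repeats a word does not get credit for it twice. Swapping in a graded similarity function (for example one based on synonyms) changes only the `sim` argument.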

Talk #2: Recent efforts on the task of spoken document retrieval (SDR) have made use of speech lattices: speech lattices contain information about alternative speech transcription hypotheses other than the 1-best transcripts, and this information can improve retrieval accuracy by overcoming recognition errors present in the 1-best transcription. In this paper, we look at using lattices for the query-by-example spoken document retrieval task -- retrieving documents from a speech corpus, where the queries are themselves in the form of complete spoken documents (query exemplars). We extend a previously proposed method for SDR with short queries to the query-by-example task. Specifically, we use a retrieval method based on statistical modeling: we compute expected word counts from document and query lattices, estimate statistical models from these counts, and compute relevance scores as divergences between these models. Experimental results on a speech corpus of conversational English show that the use of statistics from lattices for both documents and query exemplars results in better retrieval accuracy than using only 1-best transcripts for either documents, or queries, or both. In addition, we investigate the effect of stop word removal which further improves retrieval accuracy. To our knowledge, our work is the first to have used a lattice-based approach to query-by-example spoken document retrieval.
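The retrieval pipeline outlined in Talk #2 (expected counts, statistical models, divergence-based scores) can be sketched roughly as follows. This is a simplified illustration, not the speakers' system: a lattice is approximated here by a list of alternative transcriptions with posterior probabilities, and the unigram model, add-alpha smoothing, and the choice of negative KL divergence as the relevance score are assumptions.

```python
import math
from collections import Counter


def expected_counts(hypotheses):
    """Expected word counts from (posterior_probability, words) pairs."""
    counts = Counter()
    for prob, words in hypotheses:
        for word in words:
            counts[word] += prob
    return counts


def unigram_model(counts, vocab, alpha=0.1):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0.0) + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def relevance(query_hyps, doc_hyps, vocab):
    """Higher is more relevant: negative divergence from query to document model."""
    q = unigram_model(expected_counts(query_hyps), vocab)
    d = unigram_model(expected_counts(doc_hyps), vocab)
    return -kl_divergence(q, d)
```

With this representation, a word that appears only in a lower-ranked hypothesis still contributes a fractional count, which is how alternative hypotheses can compensate for errors in the 1-best transcript.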

Talk #3: Abbreviations, acronyms, initialisms, and shortenings frequently occur in many texts found on the Web, such as publication metadata, stock ticker codes, and biological articles. To connect these disparate forms together for knowledge discovery, short forms must be properly linked to their canonical long forms. In this paper, we demonstrate how a search engine can be efficiently utilized in mining the required contextual information, so that short forms can be effectively linked to long forms. We show that a count-based method consistently outperforms other methods, and that using the snippets is better than using the full web pages. We also consider adaptively combining a query probing algorithm together with our count-based method. This reduces running time and network bandwidth, while maintaining the strong linkage performance.

1st Talk: Slides (.htm) 2nd Talk: Slides (.htm) 3rd Talk: Slides (.pdf)
8, Apr, Tue, 2pm - 3pm, MR6 (AS6 #05-12)
Su Nam Kim (SoC) / Statistical Modeling of Multiword Expressions (2)

ABSTRACT:

In this work, we propose a novel method based on ellipsed predicates to automatically interpret compound nouns with a predefined set of semantic relations. First we map verb tokens in sentential contexts to a fixed set of seed verbs using WordNet::Similarity and Moby's Thesaurus. We then match the sentences with semantic relations based on the semantics of the seed verbs and grammatical roles of the head noun and modifier. Based on the semantics of the matched sentences, we then build a classifier using a memory-based classification tool, Timbl 5.1. The performance of our final system at interpreting NCs is 52.6%. We also compared our method with previous methods and confirmed better performance over the same dataset.

BIODATA:

Su Nam Kim is a postdoctoral research fellow at NUS. She received her BS and MS degrees from Pusan National University, South Korea, and an MS degree from State University of New York at Stony Brook, USA. She recently completed her Ph.D. study at University of Melbourne, Australia. She has a broad research interest in AI but primarily focuses on lexical semantics including multiword expressions, word sense disambiguation and cross-lingual lexical acquisition. She is also interested in multi-document/multilingual summarization and question-answering systems.

Slides (.pdf)
11 Mar, Tue, 10am - 11am, VIP Studio (AS6 #05-17) Gong Tianxia (SoC) / Automated Retrieval and Generation of Brain CT Radiology Reports

ABSTRACT:

With advances in medical techniques, large amounts of medical data are produced in hospitals every day. Radiology reports contain rich information about the corresponding medical images but are often under-mined. Therefore, our research topics focus on information extraction from brain CT radiology reports, radiology-report-assisted medical image content retrieval, and automatic generation of brain CT reports based on domain knowledge and associated images. Current medical record search systems will benefit from our research, making information search more efficient and convenient. Doctors and radiologists can also conduct their research in the area more efficiently using the improved system. The automatic generation of reports can provide a reference for radiologists. Our research will also help to facilitate an education system for junior doctors and researchers in the area.

BIODATA:

Gong Tianxia is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by A/P Tan Chew Lim. She received her bachelor's degree in Computer Engineering at SOC in 2006. Her research interests are in information retrieval and medical text processing.

Slides (.pdf)
Slides (.ppt)
26, Feb, Tue, 10am - 11am, MR6 (AS6 #05-12) Su Nam Kim (SoC) / Statistical Modeling of Multiword Expressions (1)

ABSTRACT:

This research focuses on multiword expressions (MWEs), that is, lexical items that are made up of two or more simplex words, such as "dog pound", "call up" or "red herring". My goals are: to shed light on the underlying linguistic processes giving rise to MWEs; to generalize techniques for identifying, extracting and analyzing MWEs; to compare pre-existing MWE classifications; and finally, to exemplify the utility of MWE interpretation within NLP tasks. This is aimed at improving the fluency, robustness and understanding of natural language.

The first of the three-part presentation on Feb. 26th will provide a brief background on MWEs including different research perspectives and linguistic foundations of MWEs. It will also cover the basic statistical approaches broadly used in MWE studies and will present a summary of recent advances. The second and third talks will present a more technical and detailed discussion on work done in the past two years. The schedule for the second and third talks will be announced later.

BIODATA:

Su Nam Kim is a postdoctoral research fellow at NUS. She received her BS and MS degrees from Pusan National University, South Korea, and an MS degree from State University of New York at Stony Brook, USA. She recently completed her Ph.D. study at University of Melbourne, Australia. She has a broad research interest in AI but primarily focuses on lexical semantics including multiword expressions, word sense disambiguation and cross-lingual lexical acquisition. She is also interested in multi-document/multilingual summarization and question-answering systems.

Slides (.pdf)
28, Jan, Mon, 2:00pm - 3:00pm, SR11(COM1 #02-11). Yee Whye Teh (UCL) / Bayesian Agglomerative Clustering with Coalescents

ABSTRACT:

Hierarchical clustering of data is one of the most widely used machine learning techniques. Traditional hierarchical clustering techniques construct a single tree in a greedy fashion, either in a top-down or a bottom-up agglomerative fashion. Sometimes we are interested in how reliable the constructed tree is, i.e. how much we believe that the structure of the tree reflects true underlying structure in the data rather than spurious effects due to noise. Such a question can be answered using a Bayesian approach where we define a prior over trees and compute a posterior distribution over trees which captures the uncertainty in the learned tree structure.

However, past Bayesian models for hierarchical clustering either do not give a posterior over trees (Heller and Ghahramani 2005, Friedman 2003), are not infinitely exchangeable (Williams 2000), or are simply too complex to have widespread appeal (Neal 2003). In this talk we present a model that
1) gives a posterior distribution over trees,
2) is easy to implement, and
3) has the additional nice property that it is infinitely exchangeable.

Our model is based upon a standard model in population genetics called Kingman's coalescent. We propose both greedy and sequential Monte Carlo inference algorithms for the model. We show that our model performs well compared to previous approaches on a number of small datasets, and apply it to document clustering and phylolinguistics.

BIODATA:

Dr Teh Yee Whye is a lecturer at the Gatsby Computational Neuroscience Unit, University College London in the United Kingdom. Prior to this appointment he worked with Prof Lee Wee Sun as Lee Kuan Yew Postdoctoral Fellow at the National University of Singapore, and with Prof. Michael I. Jordan as a postdoc at University of California at Berkeley. He obtained his PhD from the University of Toronto under Prof. Geoffrey E. Hinton. His research interests are in Bayesian machine learning and probabilistic graphical models.

Slides (.ppt) Slides (.htm)
8, Jan, Tue, 10:30am - 12:00pm, SR3A(COM1 #02-12). Jing Jiang (UIUC) / Domain Adaptation in Natural Language Processing

ABSTRACT:

With the explosion of the amount of textual data in the information age, natural language processing (NLP) has become increasingly important, with direct applications in areas such as Web mining and biomedical literature mining. Currently, the most effective approach to solving most NLP problems is supervised learning coupled with linguistic knowledge. However, standard supervised learning requires the training and the test corpora to be similar, and therefore falls apart in real NLP applications because obtaining labeled data for every new domain is expensive and thus infeasible. In this talk, I will present the major line of my PhD research on domain adaptation in NLP, which aims at adapting classifiers trained on one domain to another domain. We have proposed two frameworks to achieve domain adaptation, both having been evaluated on real NLP problems and outperformed standard learning methods. I will also briefly mention the future plan to incorporate knowledge bases and expert interactions into the domain adaptation process, with applications in large-scale information extraction from biomedical literature.

BIODATA:

Ms Jing Jiang is a final year PhD student in the Text Information Management Group in the Computer Science Department at the University of Illinois at Urbana-Champaign, working with Professor ChengXiang Zhai. Her research interests include natural language processing, information retrieval, machine learning, and biomedical literature mining. She received her B.S. degree and her M.S. degree in Computer Science from Stanford University in 2002 and 2003, respectively.

Slides (.ppt) Slides (.htm)

Jump to: 2008 2007 2006 2005 2004

Date Speaker / Title Notes / Slides
2007
18, Dec, Tue, 3 - 4pm, SR7(COM1). Simone Teufel (Cambridge University) / Citations and discourse structure: AZ and its use in large-scale intelligent search

ABSTRACT:

I will describe how one useful aspect of the structure of scientific articles can be discovered with reasonably shallow means, namely the prototypical argumentation for the validity of the current research. Reference to other people's work, and reasonably standardised statements about this work, are a staple part of the argumentation, and citation analysis can exploit this fact. AZ-discourse analysis is the robust machine-learning of this structure, based on the extraction of correlated, and often linguistically interesting, features. I will show results of AZ on two domains (computational linguistics and chemistry), and discuss several search and summarisation applications using AZ. I will also speculate on more sequence-based methods for recognising AZ-type structures in text.

BIODATA:

Simone Teufel is a senior lecturer in the Computer Laboratory at Cambridge University, where she has worked since 2001. Her main research interests are in corpus-linguistic approaches to discourse theory, and in the application of such information to summarisation, information retrieval and citation analysis. She has a background in computer science (1994 Diploma from University Stuttgart) and in cognitive science (2000 PhD from Edinburgh University), and also has experience in medical information processing and search, from a postdoctoral stay at Columbia University, and in collocation extraction, from a research post at Xerox Europe. Her latest research interests include lexical acquisition, and the visualisation and language generation of the analysis results of scientific articles.

Slides (.pdf)
13, Dec, Thu, 2:30 - 3:30pm, The Big One(I2R). Simone Teufel (Cambridge University) / Information extraction and intelligent search in the Chemical domain: SciBorg

ABSTRACT:

While bioinformatics has advanced far in the past years and recognisers for gene and protein names and interactions have been built, biochemistry is a new field for computational linguistics to move into. I will be talking about the recognition strategy for scientific papers in general which the NLIP group at Cambridge University is developing, while concentrating on the research done in the project SciBorg, on chemical name parsing, ontology discovery, and discourse-related search. I will also talk a bit about the role of citations in this recognition effort, and about the quite unusual infrastructure that our project is built on -- robust semantic representations, encoded as XML standoff.

BIODATA:

Simone Teufel is a senior lecturer in the Computer Laboratory at Cambridge University, where she has worked since 2001. Her main research interests are in corpus-linguistic approaches to discourse theory, and in the application of such information to summarisation, information retrieval and citation analysis. She has a background in computer science (1994 Diploma from University Stuttgart) and in cognitive science (2000 PhD from Edinburgh University), and also has experience in medical information processing and search, from a postdoctoral stay at Columbia University, and in collocation extraction, from a research post at Xerox Europe. Her latest research interests include lexical acquisition, and the visualisation and language generation of the analysis results of scientific articles.

Slides (.pdf)
3, Dec, Mon, 9:30am - 11:30am, Big One(I2R). Talk 1:

Prof. Junichi Tsujii (University of Tokyo) / Combining Statistical Models with Symbolic Grammar in Parsing

Talk 2:

Dr. Sophia Ananiadou (University of Manchester) / Text mining techniques for building a Biolexicon

No slides available
1, Nov, Thur, 3:00pm, Big One(I2R). Xiaofeng Yang (I2R) / Coreference Resolution with Knowledge-Rich Methods

ABSTRACT:

Coreference resolution is the task of finding different mentions of the same entity in the world. In the past decade, knowledge-lean approaches have been widely adopted, in which only simple morpho-syntactic cues are employed as knowledge sources in the resolution process. Although these approaches have achieved reasonable success, researchers have found that deeper syntactic or semantic knowledge is necessary in order to reach the next level of performance. In this talk, we will introduce our knowledge-rich approaches to coreference resolution, including a tree-kernel-based method for the syntactic knowledge, and web-based methods for the semantic knowledge. These sources of enriched knowledge are acquired automatically without much human effort, and have proved effective for the coreference resolution task.

No slides available
19, Oct, Fri, 10am - 11am, MR6(AS6 #05-12). QIU Long (NUS) / Scenario Template: Its Creation and Application to Open Domain Q&A

ABSTRACT:

A Scenario Template is a data structure that reflects the salient aspects shared by a set of similar events, which are considered as belonging to the same scenario. These salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes of them.

In this talk, I will first briefly describe our approach to scenario template creation and present the latest evaluation results. Then I will discuss one of the possible applications of scenario templates, namely, open-domain question answering. For Q&A systems, query expansion is a common strategy while sentence selection is an important process. I will show how scenario templates might help in these two aspects.

BIODATA:

Qiu Long is a Doctoral Student at SoC, NUS, co-supervised by Professor Chua Tat-Seng and Dr. Min-Yen Kan. He got his Master of Science (SM) in CS from Singapore - MIT Alliance. He is interested in Natural Language Processing (NLP) and the related machine learning techniques.

Slides (.pdf)
20, Sep, Thu, 3pm - 4pm, SR10(COM1, #02-10). **Note special time and venue Tanja Schultz (CMU) / Multilingual Speech Processing

ABSTRACT:

In recent years, speech processing products have been widely distributed all over the world, reflecting a general belief that speech technologies have a huge potential to overcome language barriers and to let everyone participate in today's information revolution. However, in spite of vast improvements in speech and language technologies, the development of speech processing systems still requires significant skills and resources to carry out. Consequently, with more than 6500 languages in the world, the current cost and effort of building speech support are prohibitive for all but the most economically viable languages.

In this talk I will discuss the challenges and limitations of rapidly developing automatic speech processing systems for a large number of languages and dialects. I will describe solutions to system development based on sharing data and system components across languages. Practical implementations and recent results are presented in the light of our SPICE project, which aims to bridge the gap between language and technology experts by providing innovative strategies and tools for non-expert users. These tools enable the user to easily collect appropriate text and speech data, to quickly develop acoustic models, pronunciation dictionaries, and language models based on very limited resources, and to monitor progress and performance allowing for iterative improvements with the user in the loop.

BIODATA:

Tanja Schultz received her Ph.D. and Masters in Computer Science from University Karlsruhe, Germany in 2000 and 1995 respectively and got a German Masters in Mathematics, Sports, and Education Science from the University of Heidelberg, Germany in 1990. She joined Carnegie Mellon University in 2000 and is a faculty member of the Language Technologies Institute as a Research Computer Scientist. Since 2007 she also holds a full professorship at Karlsruhe University, Germany.

Her research activities center around language independent and language adaptive speech recognition but also include large vocabulary continuous speech recognition systems, human-machine interfaces using speech and various biosignals, speech translation, as well as language and speaker identification approaches. With a particular area of expertise in multilingual approaches, she performs research on portability of speech processing systems to many different languages. In 2001 Tanja Schultz was awarded the FZI prize for her outstanding Ph.D. thesis on language independent and language adaptive speech recognition. In 2002 she received the Allen Newell Medal for Research Excellence from Carnegie Mellon for her contribution to Speech-to-Speech Translation and the ISCA best paper award for her publication on language independent acoustic modeling. In 2005 she was awarded the Carnegie Mellon Language Technologies Institute Junior Faculty Chair. Tanja Schultz is the author of more than 100 articles published in books, journals, and proceedings.

She is a member of the IEEE Computer Society, the European Language Resource Association, the Society of Computer Science (GI) in Germany, and currently serves on the ISCA board and several program and review panels.

No slides available
21 Aug, Tue, 2pm - 3pm, SR5 (COM1#02-01). **Note special time and venue Yu-Han Chang (USC ISI) / Toddler Machine Meets Pre-Teen Children: Concepts and Language from Combining Lots of Computing with Lots of Free Time

Abstract:

The idea of using humans to teach computers is not a new one, but it has been largely impractical and largely ignored. Modern-day computers tend to "learn" by either sifting through large amounts of data or by being programmed/endowed with expert knowledge. Typically there is little interaction between man and machine. Our recent project, called "Wubble World", capitalizes on the availability of free hands-on human teaching as a means for machine learning of language and concepts.

The basic premise of this work begins with an online game situated in a virtual 3D environment. Language is generated as children interact with their personal creature, called a wubble, or with other children. By virtue of the virtual environment, this language is situated and forms a rich corpus of matched scenes and sentences upon which to learn language and concepts. In one part of the environment, children interact with their wubble by teaching it to accomplish certain given tasks. The wubble, like a toddler, initially knows little about the world, and must acquire concepts and labels by interacting with the child. I'll describe this environment and the basic concept learning that happens inside the wubble. In another part of the world, children play a competitive team game against other children. The game is designed to require cooperation among team members, typically using spoken language. This language, combined with a log of the game state, generates a rich sentence-scene corpus. This richness could potentially enable natural language processing to move beyond current statistical techniques by incorporating data that reveals underlying meaning. I'll demonstrate the game, describe the data we have collected so far, and discuss some of the possible approaches for learning from this data.

BIODATA:

Dr. Yu-Han Chang is a Computer Scientist at the Information Sciences Institute of the University of Southern California (USC ISI). His current research interests span topics including reinforcement learning, game theory, natural language understanding, interactive technologies, and traditional AI. Recent and ongoing projects include harnessing the power of the Internet to train intelligent agents via human teaching, transfer learning, and the development of efficient no-regret algorithms for non-cooperative learning domains. Dr. Chang holds undergraduate degrees in Mathematics and Economics, as well as an S.M. in Computer Science, from Harvard University. He received his Ph.D. in Electrical Engineering and Computer Science from MIT, focusing his efforts on developing algorithms for multi-agent learning in the context of machine learning and game theory.

Slides (.pdf) (Internal to NUS only)
10 Aug, Fri, 10:30am - 11:30am, SR6 (COM1#02-03). **Note special time and venue Robert Dale (Macquarie University) / The Generation of Referring Expressions: Where We've Been, How We Got Here, and Where We're Going

Abstract:

The task of referring expression generation is concerned with determining what semantic content should be used in a reference to an intended referent so that the hearer will be able to identify that referent. The task has been a focus of interest within natural language generation at least since the early 1980s, in part because the problem appears relatively well-defined. Over the last 25 years, a range of algorithms and approaches have been proposed and explored, making this the most intensely studied problem in natural language generation; and yet, even a casual analysis of real human-authored texts suggests that we have a long way to go in terms of providing an explanation for the range of real linguistic behaviour that we find. In this talk, I'll review research in the area to date, try to characterise where we are now, and point to directions for future research in the area.

BIODATA:

Robert Dale received his PhD in Computational Linguistics from the University of Edinburgh in 1989. His research interests include low-cost approaches to intelligent text processing tasks; practical natural language generation; the engineering of habitable spoken language dialog systems; and computational, philosophical and linguistic issues in reference and anaphora. He is Director of the Centre for Language Technology at Macquarie University, Convenor of the Australian Research Council's Human Communication Science Network, and editor-in-chief of the Journal of Computational Linguistics.

Slides (.pdf)
30 Jul, Mon, 2-3pm, SR3A (COM1#02-12). **Note special time and venue Hari Sundaram (Arts Media and Engineering (AME), Arizona State University) / Contextual Wisdom: Social Relations and Correlations for Multimedia Event Annotation

Abstract:

This work deals with the problem of event annotation in social networks. The problem is made difficult due to variability of semantics and due to scarcity of labeled data. Events refer to real-world phenomena that occur at a specific time and place, and media and text tags are treated as facets of the event metadata. We are proposing a novel mechanism for event annotation by leveraging related sources (other annotators) in a social network. Our approach exploits event concept similarity, concept co-occurrence and annotator trust. We compute concept similarity measures across all facets. These measures are then used to compute event-event and user-user activity correlation. We compute inter-facet concept co-occurrence statistics from the annotations by each user. The annotator trust is determined by first requesting the trusted annotators (seeds) from each user and then propagating the trust amongst the social network using the biased PageRank algorithm. For a specific media instance to be annotated, we start the process from an initial query vector and the optimal recommendations are determined by using a coupling strategy between the global similarity matrix, and the trust weighted global co-occurrence matrix. The coupling links the common shared knowledge (similarity between concepts) that exists within the social network with trusted and personalized observations (concept co-occurrences). Our initial experiments on annotated everyday events are promising and show substantial gains against traditional SVM based techniques.
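The trust propagation step mentioned in the abstract can be sketched as a biased (personalized) PageRank, in which the teleport mass returns only to a user's chosen seed annotators. This is a rough illustration under stated assumptions, not the authors' code: the function name, the graph representation, the damping factor of 0.85, the fixed iteration count, and the handling of users with no outgoing trust links are all hypothetical choices.

```python
def biased_pagerank(trust_edges, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed annotators through a social network.

    trust_edges: dict mapping each user to the list of users they trust.
    seeds: set of trusted annotators that receive all teleport (bias) mass.
    """
    nodes = set(trust_edges) | set(seeds)
    for targets in trust_edges.values():
        nodes.update(targets)
    bias = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(bias)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * bias[n] for n in nodes}
        for n in nodes:
            out = trust_edges.get(n, [])
            if out:
                # Split this user's trust evenly among the users they trust.
                share = damping * rank[n] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # No outgoing links: hand the mass back to the seeds.
                for s in nodes:
                    nxt[s] += damping * rank[n] * bias[s]
        rank = nxt
    return rank
```

Users who are neither seeds nor reachable from a seed end up with zero trust, which is the property that distinguishes the biased variant from ordinary PageRank with uniform teleportation.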

Co-authors: Amit Zunjarwad (AME), Lexing Xie (IBM)

BIODATA:

Hari Sundaram is currently an assistant professor at Arizona State University. This is a joint appointment with the Department of Computer Science and the Arts Media and Engineering program. He received his Ph.D. from the Department of Electrical Engineering at Columbia University in 2002. He received his MS degree in Electrical Engineering from SUNY Stony Brook in 1995 and a B.Tech in Electrical Engineering from Indian Institute of Technology, Delhi in 1993.

Slides (.htm) Slides (.ppt)
24 Jul, Tue, 2-3pm, at Meeting Room "BigOne", I2R. **Note special time and venue Hari Sundaram (Arizona State University) / Rethinking media semantics: acquisition, representation and learnability

Jointly organized by CHIME, I2R and PREMIA.

This talk will examine some assumptions in media semantics under three broad categories - (a) aspects of meaning, (b) rethinking semantic construction, and (c) learnability contradictions. A re-examination of the assumptions behind media semantics is useful, as the mechanisms by which people create and consume media have changed significantly in the last decade. These changes offer fresh insight into the familiar problem of the semantic gap - how to go from sensory data to meaning. There are three aspects of meaning of interest - context, approximations, and variability. We need to examine the construction of meaning in a manner very different from the familiar Marr model - specifically, we shall examine embodiment and networked construction. A significant challenge to the learnability of semantics lies in re-examining, within the multimedia context, what Chomsky calls the "poverty of input" problem. How is it possible to learn a large number of concepts with very few, or even non-existent, training examples? We will examine the role of context and semantic approximation with an application to media retrieval. The issues of embodiment and its relation to semantics will be discussed with respect to an educational application. We hope to provide a partial answer to the issue of semantic construction and learnability in an application related to social networks.

BIODATA:

Hari Sundaram is currently an assistant professor at Arizona State University. This is a joint appointment with the Department of Computer Science and the Arts Media and Engineering program. He received his Ph.D. from the Department of Electrical Engineering at Columbia University in 2002. He received his MS degree in Electrical Engineering from SUNY Stony Brook in 1995 and a B.Tech in Electrical Engineering from the Indian Institute of Technology, Delhi in 1993.

His research group works on developing computational models and systems for situated communication. There are two complementary (but coupled) directions - (a) designing intelligent multimedia environments that exist as part of our physical world (e.g. an intelligent room) and (b) developing new algorithms and systems to understand the media artifacts resulting from human activity (e.g. emails, photos / video). Specific projects include context models for action, resource adaptation, interaction architectures, communication patterns in media sharing social networks, collaborative annotation, as well as analysis of online communities.

Prof. Sundaram's research has won several awards: the best student paper award at JCDL 2007, the best ACM Multimedia demo award in 2006, the best student paper award at the ACM Multimedia conference in 2002, and the 2002 Eliahu I. Jury Award for best Ph.D. dissertation. He has also received a best paper award on video retrieval from IEEE Trans. on Circuits and Systems for Video Technology for the year 1998. He is an active participant in the multimedia community: he is an associate editor for ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), as well as for the IEEE Signal Processing Magazine. He has co-organized workshops at ACM Multimedia on experiential telepresence (ETP 2003, ETP 2004) and archival of personal experiences (CARPE 2004, CARPE 2005), and a conference on image and video retrieval (CIVR 2006).

Slides (.pdf)
17 Jul, Tue, 10:30-11:30am **Note special time Jing Jiang (UIUC) / Two Perspectives on Domain Adaptation in Natural Language Processing

The problem of domain adaptation for statistical classifiers arises when our labeled training examples and unlabeled test examples come from different domains. This problem is commonly encountered in natural language processing (NLP) tasks. For example, we may train a named entity recognition (NER) system on news articles but apply the system to blog or email text. It is generally observed that the performance of a classifier tends to drop significantly when it is applied to a different domain.

In this talk, I will present our recent work addressing the domain adaptation problem. We have proposed two frameworks, corresponding to two different perspectives on this problem: feature selection and instance weighting. In the feature selection framework, we seek to identify "generalizable features" that behave similarly across domains; in the instance weighting framework, our idea is to re-weight the examples in order to minimize the expected loss on the test domain. In both frameworks, we have also incorporated semi-supervised learning to make use of the unlabeled test domain examples. Experimental results on a number of NLP tasks, including NER, part-of-speech (POS) tagging, and spam filtering, show the effectiveness of both frameworks. At the end of the talk, I will briefly mention our current effort to unify the two perspectives, as well as some future directions to pursue.

BIODATA: Jing Jiang is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. She is a member of the Information Retrieval Group led by Professor ChengXiang Zhai. Her research interests include information extraction, information retrieval, biomedical text mining, and machine learning. She received her B.S. degree and her M.S. degree in Computer Science from Stanford University in 2002 and 2003, respectively.

Slides (.htm) Slides (.ppt) (Internal to NUS only)
11 Jun, 10:30-11:30am **Note special time Eric Nyberg (CMU LTI) / JAVELIN: Multilingual Question Answering with Semantic Indexing, Retrieval and Inference
The JAVELIN question answering architecture has been used to build QA systems for monolingual English, Japanese and Chinese, as well as cross-lingual QA systems for English-Japanese and English-Chinese. This talk will present and discuss recent research results in structured retrieval, answer extraction and answer selection for QA, and summarize end-to-end system performance as evaluated in the recent NTCIR-6 competition.
No slides available
21 May
(note special time 1:00-3:30pm)
ACL/EMNLP/SIGIR Practice Session

  1. 1:00-1:25 Chia Tee Kiah, "A Statistical Language Modeling Approach to Lattice-Based Spoken Document Retrieval"
  2. 1:25-1:50 Zhao Shanheng, "Identification and Resolution of Chinese Zero Pronouns: A Machine Learning Approach"
  3. 1:50-2:15 Hendra Setiawan, "Ordering Phrases with Function Words"
  4. 2:15-2:30 15 minute break
  5. 2:30-2:55 Dave Kor, TBA
  6. 2:55-3:20 Yang Xiaofeng, "Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns"
  7. 3:20-3:35 Tan Yee Fan, "PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features"

ABSTRACTS:

Title: A Statistical Language Modeling Approach to Lattice-Based Spoken Document Retrieval

Abstract: Speech recognition transcripts are far from perfect; they are not of sufficient quality to be useful on their own for spoken document retrieval. This is especially the case for conversational speech. Recent efforts have tried to overcome this issue by using statistics from speech lattices instead of only the 1-best transcripts; however, these efforts have invariably used the classical vector space retrieval model. This paper presents a novel approach to lattice-based spoken document retrieval using statistical language models: a statistical model is estimated for each document, and probabilities derived from the document models are directly used to measure relevance. Experimental results show that the lattice-based language modeling method outperforms both the language modeling retrieval method using only the 1-best transcripts and a recently proposed lattice-based vector space retrieval method.

Title: Identification and Resolution of Chinese Zero Pronouns: A Machine Learning Approach

Abstract: In this paper, we present a machine learning approach to the identification and resolution of Chinese anaphoric zero pronouns. We perform both identification and resolution automatically, with two sets of easily computable features. Experimental results show that our proposed learning approach achieves anaphoric zero pronoun resolution accuracy comparable to a previous state-of-the-art, heuristic rule-based approach. To our knowledge, our work is the first to perform both identification and resolution of Chinese anaphoric zero pronouns using a machine learning approach.

Title: Ordering Phrases with Function Words

Abstract: Function words are a class of words with little intrinsic meaning that are nonetheless vital in expressing grammatical relationships among phrases within a sentence. Such encoded grammatical information, often implicit, makes function words pivotal in modeling structural divergences, as projecting them into different languages often results in long-range structural changes to the realized sentences. This distinctive feature has not been fully utilized to address the phrase ordering problem in the context of statistical machine translation (SMT). We observe that, just as foreign language learners often make mistakes in using function words, current SMT systems often perform poorly in ordering function words' arguments; lexically correct translations often end up reordered incorrectly. In this talk, I will present a Function Words centered, Syntax-based (FWS) solution to address the phrase ordering problem, including its statistical formalism, its implementation and experimental results.

Title: Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns

Abstract: Semantic relatedness is a very important factor for the coreference resolution task. To obtain this semantic information, corpus-based approaches commonly leverage patterns that can express a specific semantic relation. The patterns, however, are designed manually and thus are not necessarily the most effective ones in terms of accuracy and breadth. To deal with this problem, in this paper we propose an approach that can automatically find effective patterns for coreference resolution. We explore how to automatically discover and evaluate patterns, and how to exploit the patterns to obtain the semantic relatedness information. The evaluation on the ACE data set shows that the pattern-based semantic information is helpful for coreference resolution.

Title: PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features

Abstract: We describe the system of the PSNUS team for the SemEval-2007 Web People Search Task. The system is based on clustering the web pages using a variety of features extracted and generated from the data provided. This system achieves F_{alpha=0.5} = 0.75 and F_{alpha=0.2} = 0.78 on the final test data set of the task.

14 May
(at I2R)
ACL Practice Session

AGENDA:

  1. 1:00-1:25 Chan Yee Seng, "Word Sense Disambiguation Improves Statistical Machine Translation"
  2. 1:25-1:50 Chan Yee Seng, "Domain Adaptation with Active Learning for Word Sense Disambiguation"
  3. 1:50-2:15 Li Haizhou, "Semantic Transliteration of Personal Names"
  4. 2:15-2:30 15 minute break - refreshments to be served.
  5. 2:30-2:55 Min Zhang, "A Grammar-driven Convolution Tree Kernel for Semantic Role Classification"
  6. 2:55-3:20 Mstislav Maslennikov, "ARE&D: A Discourse-based Multi-resolution Framework for Information Extraction on Free Text"

ABSTRACTS:

Title: Word Sense Disambiguation Improves Statistical Machine Translation

Abstract: Recent research presents conflicting evidence on whether word sense disambiguation (WSD) systems can help to improve the performance of statistical machine translation (MT) systems. In this paper, we successfully integrate a state-of-the-art WSD system into a state-of-the-art hierarchical phrase-based MT system, Hiero. We show for the first time that integrating a WSD system improves the performance of a state-of-the-art statistical MT system on an actual translation task. Furthermore, the improvement is statistically significant.

Title: Domain Adaptation with Active Learning for Word Sense Disambiguation

Abstract: When a word sense disambiguation (WSD) system is trained on one domain but applied to a different domain, a drop in accuracy is frequently observed. This highlights the importance of domain adaptation for word sense disambiguation. In this paper, we first show that an active learning approach can be successfully used to perform domain adaptation of WSD systems. Then, by using the predominant sense predicted by expectation-maximization (EM) and adopting a count-merging technique, we improve the effectiveness of the original adaptation process achieved by the basic active learning approach.

Title: Semantic Transliteration of Personal Names

Abstract: Words of foreign origin are referred to as borrowed words or loanwords. A loanword is usually imported to Chinese by phonetic transliteration if a translation is not easily available. Semantic transliteration is seen as a good tradition in introducing foreign words to Chinese. Not only does it preserve how a word sounds in the source language, it also carries forward the word's original semantic attributes. This paper attempts to automate the semantic transliteration process for the first time. We conduct an inquiry into the feasibility of semantic transliteration and propose a probabilistic model for transliterating personal names in Latin script into Chinese. The results show that semantic transliteration substantially and consistently improves accuracy over phonetic transliteration in all the experiments.

Title: A Grammar-driven Convolution Tree Kernel for Semantic Role Classification

Abstract: The convolution tree kernel has shown very promising results in semantic role classification. However, this method considers less linguistic knowledge and only carries out hard matching between substructures, which may lead to over-fitting and a less accurate similarity measure. To remove these constraints, this paper proposes a grammar-driven convolution tree kernel for semantic role classification by introducing more linguistic grammar information into the standard convolution tree kernel. The proposed grammar-driven convolution tree kernel displays two advantages over the previous one: 1) grammar-driven approximate substructure matching and 2) grammar-driven approximate tree node matching. The two improvements enable the proposed grammar-driven tree kernel to explore more linguistically motivated substructure features than the previous one. Experiments on the CoNLL-2005 SRL shared task show that the proposed grammar-driven tree kernel significantly outperforms the previous non-grammar-driven one in semantic role classification. Moreover, we present a composite kernel to integrate feature-based and tree kernel-based methods. Experimental results show that the composite kernel outperforms the previous best-reported methods.

Title: ARE&D: A Discourse-based Multi-resolution Framework for Information Extraction on Free Text

Abstract: Extraction of relations between entities is an important part of Information Extraction on free text. Previous methods are mostly based on statistical correlation and dependency relations between entities. This paper re-examines the problem at the multi-resolution layers of phrase, clause and sentence using dependency and discourse relations. Our multi-resolution framework ARE&D (Anchor and Relation and Discourse analysis) uses clausal relations in 2 ways: 1) to filter noisy dependency paths; and 2) to increase the reliability of dependency path extraction. The resulting system outperforms the previous approaches by 3%, 7% and 4% on the MUC4, MUC6 and ACE RDC domains, respectively.

25 Apr Hendra Setiawan (NUS, Institute for Infocomm Research I2R) / Ordering Phrases with Function Words
Function words are a class of words with little intrinsic meaning that are nonetheless vital in expressing grammatical relationships among phrases within a sentence. Such encoded grammatical information, often implicit, makes function words pivotal in modeling structural divergences, as projecting them into different languages often results in long-range structural changes to the realized sentences. This distinctive feature has not been fully utilized to address the phrase ordering problem in the context of statistical machine translation (SMT). We observe that, just as foreign language learners often make mistakes in using function words, current SMT systems often perform poorly in ordering function words' arguments; lexically correct translations often end up reordered incorrectly.

In this talk, I will present a Function Words centered, Syntax-based (FWS) solution to address the phrase ordering problem, including its statistical formalism, its implementation and experimental results.

Slides (.htm)
18 Apr @ MR 1 (S16 Lvl 5) **Note special place. Bang Viet Nguyen (NUS) and Lin Ziheng (NUS) / Functional Faceted Web Query Analysis and Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
1st talk: We propose a faceted classification scheme for web queries. Unlike previous work, our functional scheme ties its classification to actionable strategies for search engines to take. Our scheme consists of four facets of ambiguity, authority sensitivity, temporal sensitivity and spatial sensitivity. We hypothesize that the classification of queries into such facets yields insight on user intent and information needs. To validate our classification scheme, we asked users to annotate queries with respect to our facets and obtained high agreement. We also assess the coverage of our faceted classification on a random sample of queries from logs. Finally, we discuss the algorithmic approaches we take in our current work to automate such faceted classification.

2nd talk: In this talk, I will present a new graph-based approach to text understanding and summarization. Current graph-based approaches to automatic text summarization, such as LexRank and TextRank, assume a static graph which does not model how the input texts emerge. A suitable evolutionary text graph model may impart a better understanding of the texts and improve the summarization process. We give simplified assumptions of human writing and reading processes, and then propose a timestamped graph (TSG) model that is motivated by these processes and show how text units in this model emerge over time. This model not only captures the evolving process of text within a document, but also the evolving process across documents. In our model, the graphs used by LexRank and TextRank are specific instances of our timestamped graph with particular parameter settings.

1st Talk: Slides (.htm)
2nd Talk: Slides (.htm)
16 Apr, 3-4pm, @ TR20 (S15 #02-07) **Note special time and place. Lan Man (NUS/I2R) / A New Term Weighting Method for Text Categorization
Text representation is the task of transforming the content of a textual document into a compact representation so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on the development of an effective and efficient term weighting method for the text categorization task. We selected the single token as the unit of feature because previous research showed that this simple type of feature outperformed more complicated types of features.

We have investigated several widely used unsupervised and supervised term weighting methods on several popular data collections in combination with SVM and kNN algorithms. Considering the distribution of relevant documents in the collection and analyzing each term's discriminating power, we have proposed a new term weighting scheme, namely tf.rf. The controlled experimental results showed that the term weighting methods perform inconsistently across data sets with different category distributions and across learning algorithms. Most of the supervised term weighting methods that are based on information theory have not shown satisfactory performance in our experiments. However, the newly proposed tf.rf method shows consistently better performance than the other term weighting methods. On the other hand, the popularly used tf.idf method has not shown uniformly good performance across data sets with different category distributions.

Slides (.htm) Set 1 Set 2
11 Apr Qiu Long (NUS) / A Graph Approach to Scenario Template Generation
A Scenario Template is a data structure that reflects the salient aspects shared by a set of events, which are similar enough to be considered as belonging to the same scenario. The salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes. Such a scenario template, once populated with respect to a particular event, serves as a concise overview of the event. It also provides valuable information for applications such as information extraction (IE), text summarization, etc.

Manually defining scenario templates is expensive, and we aim to automate this template generation process. We argue that context is valuable for identifying semantically similar text spans, from which template slots can be generalized. To leverage context, we convert news articles into a graphical representation and then apply a generic context-sensitive clustering (CSC) framework to get meaningful clusters of text spans by examining the intrinsic and extrinsic similarities between them. We use the Expectation-Maximization algorithm to guide the clustering process. The experiments show that: 1) our approach generates high quality clusters, and 2) information extracted from the clusters is adequate to build high coverage templates.

Slides (May not be available outside of NUS)
2 Apr (**note special date) Chen Jinxiu (NUS, Institute for Infocomm Research I2R) / Automatic Relation Extraction among Named Entities from Text Contents
This thesis studies the task of Relation Extraction, which has received more and more attention in recent years. The task of relation extraction is to identify various semantic relations between named entities from text contents. With the rapid increase of various textual data, relation extraction will play an important role in many areas, such as Question Answering, Ontology Construction, and Bioinformatics.

The goal of our research is to reduce the manual effort and automate the process of relation extraction. To this end, we investigate semi-supervised and unsupervised learning solutions that address relation extraction with minimal human cost while still achieving performance comparable to supervised learning methods.

First, we presented a Label Propagation (LP) based semi-supervised learning algorithm for the relation extraction problem that learns from both labeled and unlabeled data. It represents labeled and unlabeled examples as the nodes of a graph and their distances as the weights of its edges, then iteratively propagates the label information from any vertex to nearby vertices through the weighted edges, and finally infers the labels of the unlabeled examples after the propagation process converges.
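
The propagation step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the Gaussian edge weight, the `sigma` parameter, the clamping of labeled nodes at every iteration, and all function and variable names are assumptions borrowed from a common textbook formulation of label propagation.

```python
import math

def label_propagation(points, labels, sigma=1.0, tol=1e-6, max_iter=1000):
    """Propagate labels over a fully connected graph of examples.

    points: list of numeric feature vectors (one per example).
    labels: list of class labels, with None for unlabeled examples.
    Returns a list with one predicted label per example.
    """
    n = len(points)
    classes = sorted({lab for lab in labels if lab is not None})

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Edge weights: closer examples get larger weights (assumed Gaussian form).
    weights = [[math.exp(-dist2(points[i], points[j]) / sigma ** 2) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a distribution over neighbours.
    trans = [[w / sum(row) for w in row] for row in weights]

    # Labeled nodes start as one-hot rows; unlabeled nodes start uniform.
    uniform = [1.0 / len(classes)] * len(classes)
    scores = [[1.0 if labels[i] == c else 0.0 for c in classes]
              if labels[i] is not None else uniform[:] for i in range(n)]

    for _ in range(max_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i][:])  # clamp known labels
            else:
                updated.append([sum(trans[i][j] * scores[j][k] for j in range(n))
                                for k in range(len(classes))])
        delta = max(abs(a - b) for ra, rb in zip(updated, scores) for a, b in zip(ra, rb))
        scores = updated
        if delta < tol:  # stop once the label scores no longer change
            break

    # Each example takes the class with the highest propagated score.
    return [classes[max(range(len(classes)), key=lambda k: scores[i][k])]
            for i in range(n)]

# Two well-separated groups, one labeled seed in each.
toy_points = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
toy_labels = ["A", None, None, "B", None, None]
predicted = label_propagation(toy_points, toy_labels)
```

In this toy run each unlabeled point inherits the label of the seed in its own group, because the weights to the distant group are negligible.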

Secondly, we introduced an unsupervised learning algorithm based on model order identification for automatic relation extraction. The model order identification is achieved by resampling based stability analysis and used to infer the number of relation types between entity pairs automatically.

Thirdly, we further investigated an unsupervised learning solution for relation disambiguation using a graph-based strategy. We defined the unsupervised relation disambiguation task for entity mention pairs as a partition of a graph such that entity pairs that are more similar to each other belong to the same cluster. We apply spectral clustering, which is a relaxation of this NP-hard discrete graph partitioning problem, to resolve the problem. It works by calculating eigenvectors of an adjacency graph's Laplacian to recover a submanifold of the data from a high-dimensional space and then performing cluster number estimation on this spectral information.

The thesis evaluates the proposed methods for extracting relations among named entities automatically, using the ACE corpus. The experimental results indicate that our methods can overcome the problem of not having enough manually labeled relation instances for supervised relation extraction methods. The results show that when only a few labeled examples are available, our LP based relation extraction can achieve better performance than SVM and another bootstrapping method. Moreover, our unsupervised approaches can achieve order identification capabilities and outperform the previous unsupervised methods. The results also suggest that all of the four categories of lexical and syntactic features used in the study are useful for the relation extraction task.

28 Mar Che Wanxiang (Harbin Institute of Technology, Institute for Infocomm Research I2R) / A Hybrid Convolution Tree Kernel for Semantic Role Labeling
... and ...
Sun Chengjie (Harbin Institute of Technology, Institute for Infocomm Research I2R) / Using Maximum Entropy to Recognize Name Origin in Machine Transliteration
1st talk: As a kind of shallow semantic parsing, Semantic Role Labeling (SRL) is receiving increasing attention and shows good prospects for application to a wide range of natural language processing problems. I will first show a demo to explain what semantic role labeling is. Usually, feature-based methods with feature vectors are used for semantic role labeling as the state-of-the-art methods. However, these methods, which are widely used in the natural language processing field, have difficulty modeling structural features, e.g. the useful Path features for semantic role labeling. As an extension of the feature-based methods, kernel-based methods are able to do this efficiently in a much higher dimension. The convolution tree kernel, a special kind of kernel, has been used in semantic role labeling. The conventional convolution tree kernel, which selects the tree portion of a predicate and one of its arguments as the feature space, is named predicate-argument feature (PAF). However, the integral view of PAF is not suitable for semantic role labeling. A hybrid convolution tree kernel is proposed to model syntactic tree structure features more effectively. The hybrid kernel consists of two individual convolution kernels: a Path kernel, which captures predicate-argument link features, and a Constituent Structure kernel, which captures the syntactic structure features of arguments. Evaluation on the data sets of the CoNLL-2005 SRL shared task shows that our novel hybrid convolution tree kernel significantly outperforms the previous tree kernels. We further provide a composite kernel combining our hybrid tree kernel with the polynomial kernel using a standard flat feature vector. The experimental results show that the composite kernel achieves better performance than each of the individual methods.

and

2nd talk: Name origin recognition is the task of identifying the original source of a name. It is a necessary step for name translation/transliteration because different origins need different translation strategies. It is more important when translating across languages with different alphabets and sound inventories. Previous work used rule-based or statistics-based methods to solve this problem. In this work, we cast name origin recognition as a multi-class classification task and propose to use a Maximum Entropy model to solve it. Experiments show that our approach can achieve an overall accuracy of 98.35% for names written in English and 98.10% for names written in Chinese, which is much better than previous methods.

Slides (1st talk) Slides (2nd talk) (.pdf, open to all hosts in TLD .sg)
28 Feb, 3-4pm @ TR9 (S16 #03-09) **Note special time and place. Mstislav Maslennikov (NUS) / A Multi-resolution Framework for Information Extraction from Free Text
Extraction of relations between entities is an important part of IE on free text. Previous methods are mostly based on statistical correlation and dependency relations between entities. This paper re-examines the problem at the multi-resolution layers of phrase, clause and sentence using dependency and discourse relations. Our multi-resolution framework uses clausal relations in 2 ways: 1) to filter noisy dependency paths; and 2) to increase the reliability of dependency path extraction. The resulting system outperforms the previous approaches by 3%, 7% and 4% on the MUC4, MUC6 and ACE RDC domains, respectively.
Slides (.pdf)
22 Feb (Note special date, time and place (2-3pm, SR 5, S16 Lvl 4)) Graeme Hirst (University of Toronto) / Fine-grained differences and similarities in meanings
Writing or speaking requires making choices from words and syntactic constructions that have similar but not identical meanings. Are two parties "foes" or "enemies"? Did John meet Mary or was Mary met by John? An important component of language understanding is recognizing the implications of the nuances in the speaker's or writer's choices. I will describe our research on computational aspects of linguistic nuance, focusing on the differentiation of near-synonyms and on the consequences that arise for knowledge representation formalisms. In addition, I will discuss how contemporary views of meaning in computational linguistics need to be broadened to take into account the choices that the speaker or writer makes.
Slides (.pdf, Internal to NUS only)
5 Feb (**10:00-11:00am, note special time) Yin Xinyi (NUS) / Random Walk and Web Information Processing for Mobile Devices
Accessing web pages from a mobile device is becoming very valuable, especially for people constantly on the move. However, the small screen, limited memory, and slow wireless connection make the surfing experience on mobile devices unacceptable to most people. In this thesis, we aim to solve three fundamental challenges in the mobile Internet: web page content ranking, web content classification, and web article summarization. We propose a new method to solve these three challenges. As a web page is too complex to analyze as a whole, we first divide the entire web page into basic elements such as text blocks, pictures, etc. Next, based on the relationships between the elements, we connect the elements with edges to make a graph. Finally, we use random walk methods to provide solutions for the three challenges. The main contribution of this thesis is a graph and random walk based framework for Internet information processing. It is shown to be very simple and effective. For example, our experiments on web page ranking show that, for randomly selected websites, the system need only deliver 39% of the objects in a web page in order to fulfill 85% of a viewer's desired viewing content. In the experiments on web content classification, the system performs well, with the F value for main content and advertisement (A) as high as 0.93 and 0.82 respectively. In the experiments on text summarization, using the well-accepted dataset for single document summarization, the graph and random walk based text summarization system outperformed the results of all participants of the conference.
Slides (.htm)
30 Jan (10:00-11:00 am, note special time) Upali Kohomban (NUS) / Application of Generic Sense Classes in Word Sense Disambiguation
Word Sense Disambiguation (WSD) is a problem in Natural Language Processing concerned with identifying the correct meaning of a word used in a given context. Over time, supervised machine learning has consistently shown better performance in WSD, compared to unsupervised learning. However, the supervised approach to WSD faces the serious problem of the knowledge acquisition bottleneck, or the difficulty of acquiring enough labeled training data for learning classifiers. This problem is exacerbated by several facts, including the large number of fine-grained senses in contemporary lexicons, the need for training data for each individual polysemous word, and the high cost of manually sense-labeling training examples. Our research focuses on an approach to find a workaround to this problem, by exploiting the usage similarities of different words. We propose using a generalized and coarse-grained set of senses at classifier level, and then using lexicon-induced heuristics to convert the resulting classes into fine-grained senses. The generic nature of the sense classes allows labeled training examples from different words to be used for learning the classes, effectively increasing the amount of available training data. We discuss how the noise due to generalization can be reduced by using a semantic similarity based weighting strategy, and show, using WordNet lexicographer files as generic classes, that this approach can yield state-of-the-art WSD performance with sparse training data. Further, we argue that human-created, taxonomy-based class schemas such as WordNet lexicographer files are not ideal for supervised learning, as they are not necessarily coherent with the contextual usage patterns, which are available to the classifier as features. In addition, they have undesirable properties that result in high losses during the class to fine-grained sense conversion.
We propose using clustering techniques to automatically create generic sense classes that are aimed at better performance of WSD as an end-task, and show that such classes can improve the WSD performance over manually created classes.
Slides (.htm)

2006
27 Dec (10:30-11:30 am, note special time) Ng Hwee Tou (NUS) / One Class per Named Entity: Exploiting Unlabeled Text for Named Entity Recognition
In this talk, I will present a simple yet novel method of exploiting unlabeled text to further improve the accuracy of a high-performance, state-of-the-art named entity recognition (NER) system. The method utilizes the empirical property that many named entities occur in one name class only. Using only unlabeled text as the additional resource, our improved NER system achieves an F1 score of 87.13%, an improvement of 1.17% in F1 score and an 8.3% error reduction on the CoNLL 2003 English NER official test set. This accuracy places our NER system among the top 3 systems in the CoNLL 2003 English shared task. This work was done jointly with Wong Yingchuan.
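The two improvement figures in the abstract can be related with a short calculation. The sketch below assumes that "error" means 100 minus the F1 score and that the 1.17% improvement is an absolute difference in F1; the function name `error_reduction` is illustrative and not taken from the talk.

```python
def error_reduction(f1_before, f1_after):
    """Relative error reduction in percent, treating (100 - F1) as the error."""
    err_before = 100.0 - f1_before
    err_after = 100.0 - f1_after
    return 100.0 * (err_before - err_after) / err_before


f1_after = 87.13
# Baseline implied by the abstract, assuming an absolute gain of 1.17 F1 points.
f1_before = f1_after - 1.17

reduction = error_reduction(f1_before, f1_after)
print(round(reduction, 1))  # 8.3
```

Under these assumptions the implied baseline is an F1 of 85.96, and the two reported figures agree with each other.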
Slides (.pdf)
14 Dec (2:00-3:00 pm) Lee Dongwon (IST, PSU) / Name Disambiguation in Digital Libraries
When the names of people are used as unique identifiers, problems often arise -- different people may share the same name spelling, or one person may have several name spellings or variants in use. Because searching by a person's name is one of the most common query types in Digital Libraries and on the WWW (about 30%), it becomes increasingly important to have clean name data in such systems. In this talk, I will first present various types of ambiguous names drawn from real Digital Libraries. Then, I will discuss various approaches for identifying and fixing such ambiguous names -- syntactic, semantic, and Google-based approaches.

This talk borrows materials from my recent work in IQIS'05, JCDL'06, ICDM'06, and ICDE'07, which are the result of joint work with several students and collaborators:

Ergin Elmacioglu (Penn State), Min-Yen Kan (NUS), Jaewoo Kang (Korea U.), Nick Koudas (U. Toronto), Byung-Won On (Penn State), Jian Pei (Simon Fraser U.), Divesh Srivastava (AT&T Labs -- Research), Yee Fan Tan (NUS)

Slides (external link to .ppt)
20 Oct (2-2:30pm, **Special Date, Time and Place, SR 4: SoC 1 06-12) Chia Tee Kiah (NUS) / Probabilistic Lattice-Based Spoken Document Retrieval
Spoken Document Retrieval involves finding, within a collection of spoken documents (e.g. voice mails, news broadcasts), the documents which satisfy a given information need. One way to represent a spoken document for this task is the lattice -- a directed acyclic graph in which each path corresponds to a hypothesis of the words spoken in the document. In this talk I present a method for using word statistics derived from lattices in a probabilistic retrieval algorithm to perform spoken document retrieval. Results which compare the performance of this approach with using only the 1-best speech recognizer transcription are also presented.
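As a rough illustration of what "word statistics derived from lattices" can mean, the sketch below computes expected word counts over a toy lattice with a forward-backward pass. It is not the speaker's method: the edge format `(src, dst, word, prob)`, the function name `expected_word_counts`, the toy lattice, and the assumption that nodes are numbered in topological order are all hypothetical choices made for this example.

```python
from collections import defaultdict


def expected_word_counts(edges, start, end):
    """Expected count of each word over all paths of a lattice.

    edges: list of (src, dst, word, prob) with src < dst for every edge,
    i.e. nodes are assumed to be numbered in topological order.
    """
    # Forward pass: total probability of reaching each node from start.
    forward = defaultdict(float)
    forward[start] = 1.0
    for src, dst, _, prob in sorted(edges, key=lambda e: e[0]):
        forward[dst] += forward[src] * prob

    # Backward pass: total probability of reaching end from each node.
    backward = defaultdict(float)
    backward[end] = 1.0
    for src, dst, _, prob in sorted(edges, key=lambda e: e[0], reverse=True):
        backward[src] += prob * backward[dst]

    # Posterior of an edge = probability mass of all paths through it,
    # normalized by the total mass of all start-to-end paths.
    total = forward[end]
    counts = defaultdict(float)
    for src, dst, word, prob in edges:
        counts[word] += forward[src] * prob * backward[dst] / total
    return dict(counts)


# Toy lattice: two competing hypotheses for the first word.
toy_lattice = [
    (0, 1, "speech", 0.7),
    (0, 1, "beach", 0.3),
    (1, 2, "retrieval", 1.0),
]
counts = expected_word_counts(toy_lattice, start=0, end=2)
print(counts)
```

A 1-best transcription of this toy lattice would contain only "speech retrieval", whereas the expected counts keep some probability mass on "beach", which is the kind of information a lattice-based retrieval model can draw on.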
Slides (.pdf)
19 Oct (11am-12 noon, **Special Date and Time) Liu Ting (Harbin Institute of Technology) / Language Technology Platform (LTP) and WSD based on Equivalent Pseudoword
I will present the architecture of an XML-based Chinese processing platform for web applications, named the Language Technology Platform (LTP). It has five main components: a suite of DLL modules for the DOM tree, the Language Technology Markup Language (LTML), a suite of visualization tools, language corpora based on LTML, and a web service for LTP. The current LTP integrates ten key Chinese processing modules covering morphology, word sense, syntax, and document analysis. A systematic suite of tools is supplied for beginners in natural language processing and information retrieval. Based on it, they can study the relationships between all levels of analysis as well as some advanced topics. Currently, the platform has been shared with more than 60 research labs around the world. The other topic of my talk is WSD. I will present a new approach based on Equivalent Pseudowords (EPs) to tackle Word Sense Disambiguation (WSD) in the Chinese language. EPs are particular artificial ambiguous words, which can be used to realize unsupervised WSD. A Bayesian classifier is implemented to test the efficacy of the EP solution on the Senseval-3 Chinese test set. The performance is better than state-of-the-art results, with an average F-measure of 0.80. The experiment verifies the value of EPs for unsupervised WSD.
No slides available
19 Oct (10-11am, **Special Date and Time) Wong Kam-Fai (SEEM, CUHK) / A Phonetic-Based Approach to Chinese Chat Text Normalization
Web 2.0 is the latest trend in the World Wide Web. In the first part of my seminar, I shall review the social characteristics of this paradigm and how suitable it is for the Asian community. In the second part, I shall focus on a particular means of communication on Web 2.0, namely chatting, e.g. via ICQ, chat rooms, etc. A unique dialect is commonly used for chatting. I refer to it as the Chat Language (CL). CL is different from natural languages due to its anomalous and dynamic nature, which renders conventional NLP tools inapplicable for analyzing CL. The language changes frequently, so contemporary chat language corpora quickly become outdated. To address this dynamic language problem in Chinese, we propose a phonetic language model to map between chat terms and standard words via phonetic transcription, i.e. Chinese Pinyin in our case. Unlike grapheme mapping, phonetic mapping can be constructed from available standard Chinese corpora. For term normalization, i.e. translating a chat term to its natural language counterpart, we extend the source channel model by incorporating the phonetic mapping model. Experimental results show that this method is effective and robust.
No slides available
14 Aug (Mon, 3:00-4:30 pm, ** Special date and time) David Chiang (ISI/USC) / An introduction to synchronous grammars
Synchronous grammars are rapidly gaining importance for modeling machine translation and other complex language transformations. It has therefore become useful to understand their basic formal properties. Many advances in NLP in the 1990s exploited basic algorithms for probabilistic finite-state transducers, whose theory is well understood and widely taught. The analogous theory for trees is less widely known but well developed, with roots going back to the 1960s. In this tutorial, we aim to (1) cover the literature of synchronous grammars, (2) describe how they relate to current NLP applications, such as machine translation, and (3) discuss some new theoretical and algorithmic problems raised by these applications, and some recent solutions.

This talk is part of a tutorial given with Kevin Knight at ACL 2006.

No slides available
24 Jul (2:00-3:00 pm) (** Special date and time) John Prager (IBM T.J. Watson Labs) / Improving Question-Answering Precision by asking More and Better Questions
If we define a QA system as a system which takes a natural-language question, searches a text corpus and returns a ranked list of answers, then we can broadly discern two ways in which accuracy can be increased: intrinsically, by generating better candidate lists (by e.g. more accurate entity recognition, deeper parsing, better pattern-matching and/or more judicious choice of keywords in search), or extrinsically, by re-evaluating and re-shaping such answer lists by reference to other QA methods or other data sources. This talk is about approaches of each kind that we are using at IBM Research to improve the accuracy of our QA system. I will first describe the semantic information we build into the search-engine index from running text analytics on the corpus. In addition to text tokens, we index types, typed tokens and relations. I will present the results of several evaluations demonstrating how such "Semantic Search" can increase precision.

As far as extrinsic methods go, leading QA systems employ a variety of means to boost accuracy. Such methods include redundancy (getting the same answer from multiple documents/sources), inferencing (proving the answer from information in texts plus background knowledge) and sanity-checking (verifying that answers are consistent with known facts). To our knowledge, however, no other QA system deliberately asks additional questions in order to derive constraints on the answers to the original questions. We present two variations on this idea. The first is the method of QA-by-Dossier-with-Constraints (QDC), which is an extension of the simpler method of QA-by-Dossier, in which definitional questions ("Who/what is X?") are addressed by asking a set of questions about anticipated properties of X. In QDC, the collection of Dossier candidate answers must satisfy a set of naturally arising constraints. For example, for a "Who is X?" question, the system will ask about birth, accomplishment and death dates, which, if they exist, must occur in that order, and also obey other constraints such as lifespan. Temporal, spatial and kinship relationships seem to be parti