This
long-standing seminar series brings together faculty and students
to discuss issues in the general field of text processing, as it
applies to machine learning, natural language processing, information
retrieval and digital libraries. Unless stated otherwise, meetings
will be held biweekly in MR6 (most of the time), from 10-11 am
on Tuesdays.
You can get
announcements of the CHIME Text Processing Seminar by joining our
mailing list: ChimeText.
Upcoming meetings
Upcoming meetings
listed in chronological order.
| Date |
Speaker / Title |
|
2008
| | No talks currently scheduled |
Past meetings
Past meetings listed
in reverse chronological order.
| Date |
Speaker / Title |
Notes /
Slides |
|
2008
|
14 Aug, Thursday, 9:00am - 11:00am, MR6 (AS6 05-12)
|
Hendra Setiawan / Reordering in Statistical Machine Translation: A Function Word, Syntax-based Approach
ABSTRACT: In this thesis, we investigate a specific area within Statistical Machine Translation (SMT): the reordering task -- the task of arranging translated words from source to target language order. This task is crucial as well as challenging, as the failure to order words correctly leads to a disfluent discourse and it may require in-depth knowledge about the source and target language syntaxes, which are often not available to SMT systems.
In this thesis, we propose to address the reordering task by using knowledge of function words. In many languages, function words -- which include prepositions, determiners, articles, etc. -- are important in explaining the grammatical relationship among phrases within a sentence. Projecting them and their dependent arguments into another language often results in structural changes in the target sentence. Furthermore, function words have desirable empirical properties as they are enumerable and appear frequently in the text, making them highly amenable to statistical modeling.
We demonstrate the utility of this function word idea to the syntax-based approach, following the recent trend of using syntactic formalisms in modeling reordering. We also believe the idea brought forward and developed in this thesis is applicable to other SMT approaches. We implement this idea in a specific syntax-based approach: the formally syntax-based approach, which assumes a knowledge-poor environment where no linguistic annotation is available to the model. In particular, we demonstrate the benefit of our function words idea by proposing several statistical models that address the suboptimalities of the current formally syntax-based models.
We first argue that the current formally syntax-based models are still problematic, although they achieve state-of-the-art performance. More specifically, without access to linguistic knowledge, these models typically come with only one type of nonterminal symbol, which unfortunately introduces many structural ambiguities. In contrast, our idea, which is implemented as a Head-driven Synchronous Context Free Grammar, is better at addressing this problem since it introduces two types of nonterminals: one for function words, and one for their arguments. With this richer set of nonterminals, we develop novel statistical models to better resolve the structural ambiguities. Our experimental results suggest that our syntax-based approach performs well in the reordering task in perfect lexical choice scenarios, thanks to its stronger structural modeling with the advantage of being more compact. We also validate this approach in the full translation task where the training data contains noise, confirming the merit of our idea to both the reordering and the translation task.
BIODATA: Hendra Setiawan is a Doctoral Student at SoC, NUS, co-supervised by Dr. Min-Yen Kan and Dr. Haizhou Li. His main research interest is Statistical Machine Translation and Natural Language Processing (NLP) in general.
|
Slides (.htm)
|
28 July, Friday, 9:00am - 11:00am, MR6 (AS6 05-12)
|
Qiu Long / Context for Semantic Similarity Calculation in Scenario Template Creation
Abstract: Scenario Template Creation (STC) is a Natural Language Processing (NLP) task to detect the commonalities among articles on similar events and generalize them into an abstract representation -- a scenario template (ST). For this task, the estimation of verb-centric text span similarity is the key. Since text span similarity calculation plays an important role in many NLP applications, various approaches have been proposed. They range from bag-of-words to more complicated ones involving thesauri and features at different linguistic levels. However, there are still demands and opportunities for further improvement. Contextual information, for instance, is intuitively a source that could enhance text span similarity estimation, but it has yet to be exploited as thoroughly as the internal features have been.
In this talk, I first discuss an intrinsic similarity measure for predicate-argument tuples (PATs). It is applied to a Paraphrase Recognition (PR) task, demonstrating its feasibility. Then I show a context model to capture contexts that could be more informative compared to other surrounding tokens. With different contextual relations defined, I hypothesize that two PATs' semantic similarity can also be reflected by their extrinsic similarity, i.e., whether they are contextually similarly connected to similar contexts. I show experimental results that confirm the correlation between such an extrinsic similarity and the semantic similarity of PATs. To integrate intrinsic and extrinsic similarities for PAT clustering, I propose a graphical framework, using a novel core algorithm called Context Sensitive Clustering (CSC). This clustering process is guided by the Expectation-Maximization (EM) algorithm. I conduct experiments comparing this EM-based CSC algorithm with the standard K-means algorithm. Under the widely-used purity and inverse purity metrics, the proposed algorithm outperforms K-means over all the scenarios tested.
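For readers unfamiliar with the purity and inverse purity metrics mentioned above, the following is a minimal sketch of their standard definitions. It is not code from the talk: the function names and the toy data are illustrative assumptions, and it assumes every item appears in exactly one cluster and has exactly one gold label.

```python
from collections import Counter


def purity(clusters, gold):
    """Fraction of items that fall in the majority gold class of their cluster.

    clusters: list of lists of item ids (hypothetical representation)
    gold:     dict mapping item id -> gold class label
    """
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(gold[x] for x in c).values())
                   for c in clusters if c)
    return majority / total


def inverse_purity(clusters, gold):
    """Purity with the roles of system clusters and gold classes swapped."""
    # Map each item to the index of the cluster that contains it.
    assigned = {x: i for i, c in enumerate(clusters) for x in c}
    # Group the items by gold class.
    classes = {}
    for x, label in gold.items():
        classes.setdefault(label, []).append(x)
    return purity(list(classes.values()), assigned)


# Toy data: the gold class "X" is split across two system clusters.
toy_clusters = [["a", "b"], ["c"], ["d", "e"]]
toy_gold = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
print(purity(toy_clusters, toy_gold))
print(inverse_purity(toy_clusters, toy_gold))
```

Purity rewards clusters that contain a single gold class, while inverse purity rewards keeping each gold class together, so the two are usually reported side by side.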
Biodata: Long Qiu is a Doctoral Student at SoC, NUS, co-supervised by Professor Chua Tat-Seng and Dr. Min-Yen Kan. He got his Master of Science (SM) in Computer Science from Singapore-MIT Alliance in 2002. He is interested in Natural Language Processing (NLP) and the related machine learning techniques.
|
Slides (.pdf)
|
25 July, Friday, 2:30pm - 3:30pm, SR1 (COM1 02-06)
|
William Chang (Chief Scientist, Baidu) / The WWW in China and Three Generations of Intelligent Search
China has become the world's biggest online market in terms of
users. What continues to drive this growth? What are its challenges
and opportunities? In this survey we will outline the social and
economic background, the key business models and competitive
advantages, how media and multimedia interact, and how people use the
Internet in their daily lives. The second part of this talk will
present an overview and forward-looking synopsis of the principles and
applications of search, from the perspective of a long-time search
engineer.
Bio: Dr. William Chang has been the Chief Scientist at Baidu since
January 2007. Prior to joining Baidu, Dr. Chang served as the CTO of
Infoseek and the VP of Strategy of Go Network. He is also the creator
of the highly successful Infoseek natural language search engine and
Ultraseek enterprise search engine. Dr. Chang has extensive expertise
in search technology, online community building and advertising
business models. Dr. Chang earned an undergraduate degree in
mathematics from Harvard and a PhD in computer science from the
University of California, Berkeley for his breakthrough work in text
search. At the renowned Cold Spring Harbor Laboratory, Dr. Chang
mapped a genome and invented a protein sequence search
methodology. More recently, he created a contextual advertising
product at Sentius Corporation, and founded Affini, Inc., a social
network technology company.
|
No slides available |
25 July, Friday, 9:30am - 12:00noon, SR1 (COM1 02-06)
|
Yahoo! Research Labs talks / Recent Research in NLP / IR at YRL
Talk Overviews (times are approximate):
9:30-10:00 - Ricardo Baeza-Yates / Towards a Distributed Search Engine
10:00-10:30 - Evgeniy Gabrilovich / Overview of Computational Advertising
10:30-11:00 - Rosie Jones / Geography in Web Search
11:00-11:30 - Donald Metzler / Predicting when (not) to Advertise
11:30-12:00 - Vanessa Murdock / Diversifying Image Search with User Generated Content
- Ricardo Baeza-Yates
Title: Towards a Distributed Search Engine
Abstract: Distributed search engines are often more complex to implement compared to centralized engines. Distributing a search engine across multiple sites, however, has several advantages. In particular, it enables the utilization of fewer computer resources and the exploitation of data and user locality. In this presentation we show the feasibility of distributed Web search engines, by proposing a model for assessing the total cost of a distributed Web-search engine that includes the computational costs as well as the communication cost among all distributed sites. Using examples, we show that a distributed Web search engine can be more cost effective than a centralized one, if there is a large percentage of local queries, which is usually the case. We then present a query-processing algorithm that maximizes the number of queries answered locally, without sacrificing the quality of the results, by using caching and partial replication. We simulate our algorithm on real document collections and real query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive compared to a centralized architecture with respect to cost. This is joint work with Aris Gionis, Flavio Junqueira, Vassilis Plachouras and Luca Telloli.
Bio: Ricardo Baeza-Yates is VP of Yahoo! Research for Europe and Latin America, leading the labs at Barcelona, Spain and Santiago, Chile. Until 2005 he was the director of the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile; and ICREA Professor at the Dept. of Technology of Univ. Pompeu Fabra in Barcelona, Spain. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 150 other publications. He has received the Organization of American States award for young researchers in exact sciences (1993) and with two Brazilian colleagues obtained the COMPAQ prize for the best CS Brazilian research article (1997). In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences. During 2007 he was awarded the Graham Medal for innovation in computing, given by the University of Waterloo to distinguished alumni.
- Evgeniy Gabrilovich
Title: Overview of Computational Advertising
Abstract: Web advertising is the primary driving force behind many Web
activities, including Internet search as well as publishing of online
content by third-party providers. A new discipline - Computational
Advertising - has recently emerged, which studies the process of
advertising on the Internet from a variety of angles. A successful
advertising campaign should be relevant to the user's immediate
information need as well as more generally to the user's background, be
economically worthwhile to the advertiser and the intermediaries (e.g.,
the search engine), and not be detrimental to the user experience. At
first approximation, the process of obtaining relevant ads can be
reduced to conventional information retrieval, where one constructs a
query that describes the user's context, and then executes this query
against a large inverted index of ads. We show how to augment the
standard IR approach using query expansion and text classification
techniques. We demonstrate how to employ a relevance feedback assumption
and use Web search results retrieved by the query. We will also survey
the numerous challenges and open research problems posed by
computational advertising, such as text summarization, natural language
generation, named entity extraction, handling geographic names, and
others.
Bio: Evgeniy Gabrilovich is a Senior Research Scientist and Manager of the
NLP & IR Group at Yahoo! Research. His research interests include
information retrieval, machine learning, and computational linguistics.
Recently, he co-organized a workshop on the synergy between Wikipedia
and research in AI at AAAI 2008, as well as co-presented a tutorial on
computational advertising at ACL 2008 and EC 2008. He served on the
program committees of ACL-08:HLT, AAAI 2008, WWW 2008, CIKM 2008, JCDL
2008, AAAI 2007, EMNLP-CoNLL 2007, and COLING-ACL 2006. Evgeniy earned
his MSc and PhD degrees in Computer Science from the Technion - Israel
Institute of Technology. In his Ph.D. thesis, Evgeniy developed a
methodology for using large scale repositories of world knowledge (e.g.,
all the knowledge available in Wikipedia) in order to enhance text
representation beyond the bag of words. URL:
http://research.yahoo.com/Evgeniy_Gabrilovich
- Rosie Jones
Title: Geography in Web Search
Abstract: Web search results are typically based on the user's search query,
without taking other contextual information into account. However, we
can see from user search behavior that for some search topics the user
may prefer results which are geographically close to home. We will show
topics which have a geographical dependence, as well as others which
appear to be geographically independent. Based on these findings, we
propose a more flexible approach to web search, in which we prefer
a ranking with results close to the user location when this will best
satisfy the user's information need.
Bio: Rosie Jones is a Senior Research Scientist at Yahoo!. Her research
interests include web search, geographic information retrieval and
natural language processing. She received her PhD from the School of
Computer Science at Carnegie Mellon University. In 2005 she co-organized
the SIGIR workshop on lexical cohesion and information retrieval, and in
2003 she co-organized the ICML workshop on The Continuum from Labeled to
Unlabeled Data in Machine Learning and Data Mining. She served as a
Senior PC member for SIGIR in 2007 and 2008. URL:
http://research.yahoo.com/Rosie_Jones
- Donald Metzler
Title: Predicting when (not) to Advertise
Abstract: In this talk we discuss the problem of whether or not to show online
advertisements. We propose two methods for addressing this problem, a
simple thresholding approach and a machine learning approach, which
collectively analyzes the set of candidate ads augmented with external
knowledge. Our experimental evaluation, based on over 28,000 editorial
judgments, shows that we are able to predict, with high accuracy, when
to show ads for both content match and sponsored search advertising
tasks.
Bio: Donald Metzler is a Research Scientist at Yahoo! Research in Santa
Clara, CA. He obtained his Ph.D. degree in Computer Science from the
University of Massachusetts Amherst in 2007. His research interests
include information retrieval, machine learning, and their intersection.
He is the co-author of Search Engines: Information Retrieval in
Practice, which will be published in the early part of 2009. URL:
http://research.yahoo.com/Don_Metzler
- Vanessa Murdock
Title: Diversifying Image Search with User Generated Content
Abstract: Large-scale image retrieval on the Web relies on the availability of
short snippets of text associated with the image. This user-generated
content is a primary source of information about the content and context
of an image. While traditional information retrieval models focus on
finding the most relevant document without consideration for diversity,
image search requires results that are both diverse and relevant. This
is problematic for images because they are represented very sparsely by
text, and as with all user-generated content the text for a given image
can be extremely noisy.
The contribution of this paper is twofold. We show that it is possible
to minimize the trade-off between precision and diversity, and that relevance
models offer a unified framework to afford the greatest diversity
without harming precision. Furthermore, we show that estimating the
query model from the distribution of tags favors the dominant sense of a
query. Relevance models operating only on tags offer the highest level
of diversity with no significant decrease in precision.
Bio: Vanessa Murdock currently holds a Post Doc position at Yahoo! Research
Barcelona. Her current work focuses on retrieval of short texts, such as
for advertisements, and user-generated content for images and video. She
completed her PhD in 2006 at the University of Massachusetts, working
with W. Bruce Croft. Her thesis, focusing on sentence retrieval for
applications such as Question Answering, novelty detection, and
information provenance, was recently published as a book, "Exploring
Sentence Retrieval". URL: http://research.yahoo.com/Vanessa_Murdock.
|
2nd Talk: Slides (.pdf)
4th Talk: Slides (.pdf)
|
24 July, Thursday, 3:00pm - 5:00pm, SR7 (COM1 02-07)
|
Microsoft Research Asia Lab talks / Recent Research in NLP at MSRA
Talk Overviews:
3:00-4:00 - Ming Zhou / Generating Chinese Couplets using a Statistical MT Approach
4:00-5:00 - Chin-Yew Lin / Web Scale Question Answering -- SQuAD
ABSTRACTS:
- Ming Zhou
Title: Generating Chinese Couplets using a Statistical MT Approach
Part of the unique cultural heritage of China is the game of
Chinese couplets (duìlián). One person challenges the
other person with a sentence (first sentence). The other person then
replies with a sentence (second sentence), in a way that corresponding
words in the two sentences match each other by obeying certain
constraints on semantic, syntactic, and lexical relatedness. This task
is viewed as a difficult problem in AI and has not been explored in
the research community.
In this paper, we regard this task as a kind of machine translation
process. We present a phrase-based SMT approach to generate the second
sentence. First, the system takes as input the first sentence and
generates as output an N-best list of proposed second sentences using
a phrase-based SMT decoder. Then, a set of filters is used to remove
candidates violating linguistic constraints. Finally, a Ranking SVM is
applied to rerank the candidates. A comprehensive evaluation, using
both human judgments and BLEU scores, has been conducted, and the
results demonstrate that this approach is very successful.
You can view this interesting AI gaming at http://duilian.msra.cn/
which has become very popular in China.
Bio: Ming Zhou is the research manager of the Natural Language Computing
Group at Microsoft Research Asia (MSRA). As one of the first groups in MSRA,
this group has been working on machine translation, information
retrieval, question answering and language gaming and has contributed
many technologies to MS products such as Chinese/Japanese IME, Chinese
word breaker, English writing assistant, search engine speller,
multi-language search and keyword bidding, text mining, etc.
Ming developed China's first Chinese-English machine translation system,
CEMT-I, in 1988, which laid the foundation for machine translation research at
Harbin Institute of Technology. He is the inventor of the J-Beijing
Chinese-Japanese machine translation system, a famous MT product in
Japan which has held a 62% market share for 10 years since it was
launched in 1998. Ming Zhou received his PhD degree from Harbin Institute of
Technology in 1991. He then did his post-doc at Tsinghua University from
1991 to 1993, and became an associate professor at the same
university until 1999, when he joined MSRA.
- Chin-Yew Lin
Title: Web Scale Question Answering -- SQuAD
Abstract: Question answering has been a very active research field in
information retrieval and natural language processing. Despite the
success of the TREC QA track, large-scale robust QA systems are still yet
to be found in the real world. In this talk, I will briefly introduce
recent progress on SQuAD -- a question and answering project aiming to
crawl, index, and serve all question and answer pairs existing on the
web. I will address six main challenges of the project and then focus
on the topic of question search and recommendation. Three demos will
be shown to highlight how SQuAD technologies can be used in different
scenarios.
Bio: Dr. Chin-Yew LIN is a lead researcher and research manager at
Microsoft Research Asia. Before joining Microsoft in 2006, he was a
senior research scientist at the Information Sciences Institute at
University of Southern California (USC/ISI) where he worked in the
Natural Language Processing and Machine Translation group since 1997.
His research interests are automated summarization, opinion analysis,
question answering, computational advertising, community intelligence,
machine translation, and machine learning.
Recently, his main focus has been developing a scalable automatic question
answering and distillation system -- SQuAD. He also developed automatic
evaluation technologies for summarization, QA, and MT. In particular,
he created the ROUGE automatic summarization evaluation package, which
has become the de facto standard in summarization evaluations. More
than 200 research sites worldwide have downloaded this package.
|
1st Talk: Slides (.pdf)
|
17 July, Thursday, 10:30am - 12nn, EC (SoC1 05-46)
|
Douglas Oard (University of Maryland) / Fourth-Generation Content Analysis: Supporting social science research using computational linguistics
ABSTRACT:
Babbie defines content analysis as "the study of recorded human
communications such as books, Web sites, paintings and laws." We all
practice what we might call "first generation" content analysis every
time we read a paper. What we might call "second generation" content
analysis involves social scientists who develop coding frames
appropriate to their research question and then meticulously annotate
a collection of moderate size in order to support their analysis.
Third-generation content analysis leverages extensive automation in
fairly straightforward ways, such as by counting words or preparing a
concordance. We now find ourselves on the verge of a fourth
generation of content analysis techniques in which computational
linguistics holds promise for automated population of complex coding
frames. This could enable sophisticated Web-scale studies,
potentially fostering emergence of research methods that go well
beyond content to encompass many forms of evidence from human
interaction with information. In this talk, I will describe some
challenges that we must overcome as these two communities learn to
work together. I'll illustrate my talk with examples from the PopIT
project collaboration between social scientists and computational
linguists at the University of Maryland in which we are developing
automated tools for computational analysis of trends in the popularity
of information technology innovations. I'll start with a sketch of
our research design for working at the intersection of these two
fields, and then I'll describe a few specific pieces of that puzzle
that we have already started to build.
Finally, I'll conclude with a few remarks about where we see potential
for collaboration with others who share similar interests.
BIODATA:
Douglas Oard is Associate Dean for Research at the College of
Information Studies of the University of Maryland, College Park, where
he holds joint appointments as Associate Professor in the College of
Information Studies and in the Institute for Advanced Computer
Studies. He earned his Ph.D. in Electrical Engineering from the
University of Maryland. Dr. Oard's research interests center around
the use of emerging technologies to support information seeking by end
users, with recent work focusing on interactive techniques for
cross-language information retrieval, searching conversational media,
and leveraging observable behavior to improve user modeling. Together
with Ping Wang and Ken Fleischmann, he helps to lead the NSF-funded
PopIT project. Additional information is available at
http://www.glue.umd.edu/~oard/
|
Slides (.htm) |
16 July, Wed, 3-4pm, SR3 (COM1 #02-12)
|
Xiong Deyi (I2R) / Linguistically Annotated BTG for Statistical Machine Translation
ABSTRACT:
Bracketing Transduction Grammar (BTG) is a natural choice for
effective integration of desired linguistic knowledge into
statistical machine translation (SMT). In this talk, we introduce a
Linguistically Annotated BTG (LABTG) for SMT. It conveys
linguistic knowledge of source-side syntax structures to BTG
hierarchical structures through linguistic annotation. From the
linguistically annotated data, we learn annotated BTG rules and
train a linguistically motivated phrase translation model and
reordering model. We also present an annotation algorithm that
captures syntactic information for BTG nodes. The experiments show
that the LABTG approach significantly outperforms a baseline
BTG-based system and a state-of-the-art phrase-based system on the
NIST MT-05 Chinese-to-English translation task. Moreover, we
empirically demonstrate that the proposed method achieves better
translation selection and phrase reordering.
BIODATA:
Xiong Deyi received his Ph.D. from the Institute of Computing
Technology of Chinese Academy of Sciences. His research interests
include statistical machine translation, Chinese language processing,
information extraction, and statistical parsing. He is currently a
research fellow at the Institute for Infocomm Research of Agency for
Science, Technology and Research (I2R, A-STAR).
|
Slides (.pdf) |
9 Jul, Wed, 2-3pm / SR7 (COM1 #02-07)
|
Mstislav Maslennikov (NUS) / Relation Extraction for Information Extraction from Free Text
ABSTRACT:
Information Extraction (IE) is the task of identifying information (e.g. entities, relations or events) from free text. Numerous previous context-, ontology-,
rule- and classification-based methods were actively explored during the decades of research on this task. However, a challenging open question of effectively
handling the flexibility of natural language remains unresolved over the years. In IE, this implies the problem of sparseness of data instances, which
in turn causes the problems of paraphrasing and misalignment of context features of the extracted information. In this thesis, we hypothesize that such
problems can be alleviated by combining relations between entities at the phrasal, dependency, semantic and inter-clausal discourse levels. To validate
our hypothesis, we develop a 2-level multi-resolution framework ARE (Anchors and Relations). The first level of ARE extracts candidate phrases (anchors),
while the second level evaluates the relations among the anchors and composes possible candidate templates.
The relations between the anchors are combined in several ways. First, we evaluate dependency relations between anchors. We classify dependency
relation paths between the anchors into the Simple, Average and Hard categories according to the path length and develop different techniques to handle
them. The category-specific strategies yielded improvements of 3% and 4% on the MUC4 (Terrorism) and MUC6 (Management Succession) domains,
respectively. The increased performance demonstrates that dependency relations are important to handle paraphrases at the syntactic level. Second, we
incorporate the discourse relation analysis in a multi-resolution framework for IE to handle long distance dependency relations and possible
paraphrasings at the intra-clausal level. This leads to a further improvement of 3%, 7%, 3% and 4% on MUC4, MUC6 and ACE RDC 2003
(general and specific types) domains, respectively. Third, we explore 2 supplementary strategies to combine relation paths between anchors.
Since negative paths between the anchors outnumber positive paths many times over, we apply a filtering strategy
to eliminate negative paths. Also, we support the learning process of our dependency relation classifier by the cascading of the features from
the discourse classifier. These 2 strategies further improve the IE performance on the MUC4, MUC6 and ACE RDC 2003 (general and specific
types) corpora.
Overall, our results affirm the hypothesis that the extraction of candidate phrases (anchors) and the combination of different relation types
between anchors in a multi-resolution framework are important for tackling the key problems of paraphrasing and misalignment in Information Extraction.
BIODATA:
Mr. Maslennikov Mstislav is a Doctoral Student at SOC, NUS. He received his 5-year diploma (equivalent to M.Sc.) degree from the Moscow State University, Russia. Since 2002, he has been studying in the internship and PhD programs under the supervision of Prof. Chua Tat-Seng and Dr. Tian Qi. His research is on the theme of improving Information Extraction through relation-based analysis of free text. |
Slides (.pdf) |
12 June, Thursday, 10:00am - 12:00noon, MR6 (AS6 05-12)
|
JCDL/LREC Practice Session
AGENDA:
- 10:00-10:30 Zhao Jin, "Math Information Retrieval: User Requirements and Prototype Implementation" (JCDL)
- 10:30-10:50 Kan Min-Yen, "Slide Image Retrieval: A Preliminary Study" (JCDL, Short paper)
- 10:50-11:20 Michael Brown, "User-Assisted Ink-Bleed Correction for Handwritten Documents" (JCDL)
- 11:20-11:40 Kan Min-Yen, "The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics" (LREC)
- 11:40-12:00 Kan Min-Yen, "ParsCit: An open-source CRF reference string parsing package" (LREC)
ABSTRACTS:
Talk #1: We report on the user requirements study and preliminary
implementation phases in creating a digital library that indexes and
retrieves educational materials on math. We first review the current
approaches and resources for math retrieval, then report on the
interviews of a small group of potential users to properly ascertain
their needs. While preliminary, the results suggest that meta-search
and resource categorization are two basic requirements for a math
search engine. In addition, we implement a prototype categorization
system and show that the generic features work well in identifying the
math contents from the webpage but perform less well at categorizing
them. We discuss our long term goals, where we plan to investigate how
math expressions and text search may be best integrated.
Talk #2: We consider the task of automatic slide image retrieval,
in which slide images are ranked for relevance against a textual
query. Our implemented system, SLIDIR, caters specifically to this
task using features designed for synthetic images
embedded within slide presentations. We show promising results in both
the ranking and binary relevance task and analyze the contribution of
different features in the task performance.
Talk #3: We describe a user-assisted framework for correcting
ink-bleed in old handwritten documents housed at the National Archives
of Singapore (NAS). Our approach departs from traditional correction
techniques that strive for full automation. Fully automated approaches
make assumptions about ink-bleed characteristics that are not valid
for all inputs. Furthermore, fully-automated approaches often have to
set algorithmic parameters that have no meaning for the end-user. In
our system, the user needs only to provide simple examples of
ink-bleed, foreground ink, and background. These training examples
are used to classify the remaining pixels in the document to produce a
computer-generated result that is equal to or better than existing
fully-automated approaches.
To offer a complete system, we provide additional tools to allow any
remaining errors to be easily cleaned up by the user. The initial training
markup, computer-generated results, and manual edits are all recorded with
the final output, allowing subsequent viewers to see how a corrected
document was created and to make changes or updates. Although the project
is ongoing, feedback from the NAS staff has been overwhelmingly positive,
suggesting that this user-assisted approach is a practical and useful way to address
the ink-bleed problem.
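The example-driven classification step in Talk #3 can be illustrated with a minimal sketch. It assumes a nearest-centroid rule over RGB values, which is a stand-in chosen for brevity and not necessarily the classifier used in the talk; the function names and sample pixel values are hypothetical.

```python
def centroid(pixels):
    """Mean RGB value of the user-marked example pixels for one class."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def train(examples):
    """examples maps a class label to a list of (r, g, b) example pixels."""
    return {label: centroid(pixels) for label, pixels in examples.items()}


def classify(pixel, centroids):
    """Assign a pixel to the class whose centroid is closest (assumed rule)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(pixel, centroids[label]))
    return min(centroids, key=squared_distance)


# Hypothetical user markup: a few pixels per class.
examples = {
    "foreground": [(20, 15, 10), (35, 30, 25)],
    "ink-bleed": [(150, 120, 90), (160, 130, 100)],
    "background": [(230, 220, 200), (240, 230, 210)],
}
centroids = train(examples)
label = classify((155, 128, 95), centroids)
```

Pixels labelled as ink-bleed would then be replaced or faded, with any remaining errors left to the manual clean-up tools the abstract mentions.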
Talk #4: The ACL Anthology is a digital archive of conference and
journal papers in natural language processing and computational
linguistics. Its primary purpose is to serve as a reference
repository of research results, but we believe that it can also be an
object of study and a platform for research in its own right. We
describe an enriched and standardized reference corpus derived from
the ACL Anthology that can be used for research in scholarly document
processing. This corpus, which we call the ACL Anthology Reference
Corpus (ACL ARC), brings together the recent activities of a number of
research groups around the world. Our goal is to make the corpus
widely available, and to encourage other researchers to use it as a
standard testbed for experiments in both bibliographic and
bibliometric research.
Talk #5: We describe ParsCit, a freely available, open-source
implementation of a reference string parsing package. At the core of
ParsCit is a trained conditional random field (CRF) model used to
label the token sequences in the reference string. A heuristic model
wraps this core with added functionality to identify reference strings
from a plain text file, and to retrieve the citation contexts. The
package comes with utilities to run it as a web service or as a
standalone utility. We compare ParsCit on three distinct reference
string datasets and show that it compares well with other previously
published work.
|
1st Talk: Slides (.htm)
2nd Talk: Slides (.htm)
4th Talk: Slides (.htm)
5th Talk: Slides (.htm)
|
4, June, Wed, 2:30pm - 3:30pm, SR8 (COM1 208)
|
Timothy
Baldwin (University of Melbourne) / Enhanced Information Access to Troubleshooting-oriented Web
User Forum Data
ABSTRACT:
The ILIAD (Improved Linux Information Access
by Data Mining) Project
is an attempt to apply language technology to the task of Linux
troubleshooting by analysing the underlying information structure of a
multi-document text discourse and improving information delivery
through a combination of filtering, term identification and
information extraction techniques. In this talk, I will outline the
overall project design and present results for a variety of
thread-level filtering tasks.
BIODATA:
Timothy Baldwin is a Senior Lecturer in the Department of Computer
Science and Software Engineering, University of Melbourne. Since
completing his PhD at the Tokyo Institute of Technology in 2001, he
has been involved with research grants from sources including the NSF, NTT,
ARC, NICTA and Google. His research interests include web mining,
information extraction, deep linguistic processing, multiword
expressions, deep lexical acquisition, and biomedical text mining. He
is the author of over 130 journal and conference publications, and has
held visiting appointments at NTT Communication Science Laboratories
and Saarland University. He is the recipient of a number of awards for
both teaching and research in the areas of computer science and
natural language processing. He is currently on the editorial board of
Computational Linguistics, a series editor for CSLI Publications, and
a member of the Deep Linguistic Processing with HPSG Initiative
(DELPH-IN).
|
Slides (.pdf)
|
2, June, Monday, 2:30pm - 3:30pm, SR2 (COM1 02-04)
|
ACL/SIGIR/WebDB Practice Session
AGENDA:
- 2:00-2:30 Chan Yee Seng, "MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation"
- 2:30-3:00 Chia Tee Kiah, "Lattice-Based Approach to Query-by-Example Spoken Document Retrieval"
- 3:00-3:30 Tan Yee Fan, "Efficient Web-Based Linkage of Short to Long Forms"
ABSTRACTS:
Talk #1: We propose an automatic machine translation (MT) evaluation
metric that calculates a similarity score (based on precision and
recall) of a pair of sentences. Unlike most metrics, we compute a
similarity score between items across the two sentences. We then find
a maximum weight matching between the items such that each item in one
sentence is mapped to at most one item in the other sentence. This
general framework allows us to use arbitrary similarity functions
between items, and to incorporate different information in our
comparison, such as n-grams, dependency relations, etc. When evaluated
on data from the ACL-07 MT workshop, our proposed metric achieves
higher correlation with human judgements than all 11 automatic MT
evaluation metrics that were evaluated during the workshop.
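The matching idea in Talk #1 can be sketched as follows. This is an illustration under stated assumptions, not the MAXSIM implementation: items are plain tokens, the default similarity function is exact match, the matching is found by an exhaustive search that is only practical for short sentences, and precision and recall are combined as a balanced F-score. All function names are hypothetical.

```python
from functools import lru_cache


def max_weight_matching(cand, ref, sim):
    """Largest total similarity with each item matched to at most one item."""
    @lru_cache(maxsize=None)
    def best(i, used):
        if i == len(cand):
            return 0.0
        # Option 1: leave candidate item i unmatched.
        score = best(i + 1, used)
        # Option 2: match it to any reference item that is still free.
        for j in range(len(ref)):
            if not used & (1 << j):
                score = max(score,
                            sim(cand[i], ref[j]) + best(i + 1, used | (1 << j)))
        return score
    return best(0, 0)


def exact_match(a, b):
    """Placeholder similarity; any function returning a score in [0, 1] fits."""
    return 1.0 if a == b else 0.0


def similarity_f_score(cand, ref, sim=exact_match):
    """Balanced F-score from the matched similarity mass (assumed weighting)."""
    if not cand or not ref:
        return 0.0
    matched = max_weight_matching(tuple(cand), tuple(ref), sim)
    precision = matched / len(cand)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = similarity_f_score("the cat sat on the mat".split(),
                           "the cat is on the mat".split())
```

Swapping `exact_match` for a graded similarity (for example, one based on synonyms or dependency relations) changes the scores without changing the matching framework.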
Talk #2: Recent efforts on the task of spoken document retrieval (SDR) have
made use of speech lattices: speech lattices contain information about
alternative speech transcription hypotheses other than the 1-best
transcripts, and this information can improve retrieval accuracy by
overcoming recognition errors present in the 1-best transcription. In
this paper, we look at using lattices for the query-by-example spoken
document retrieval task -- retrieving documents from a speech corpus,
where the queries are themselves in the form of complete spoken
documents (query exemplars). We extend a previously proposed method
for SDR with short queries to the query-by-example task. Specifically,
we use a retrieval method based on statistical modeling: we compute
expected word counts from document and query lattices, estimate
statistical models from these counts, and compute relevance scores as
divergences between these models. Experimental results on a speech
corpus of conversational English show that the use of statistics from
lattices for both documents and query exemplars results in better
retrieval accuracy than using only 1-best transcripts for either
documents, or queries, or both. In addition, we investigate the effect
of stop word removal, which further improves retrieval accuracy. To our
knowledge, our work is the first to have used a lattice-based approach
to query-by-example spoken document retrieval.
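The scoring pipeline described in Talk #2 (expected counts, statistical models, divergence) might look like the sketch below. It assumes the expected word counts have already been computed from the lattices, uses additively smoothed unigram models, and takes KL divergence as the divergence; the abstract does not specify these choices, and the function names and counts are hypothetical.

```python
import math


def unigram_model(expected_counts, vocab, alpha=0.1):
    """Additively smoothed unigram model from (fractional) expected counts."""
    total = sum(expected_counts.get(w, 0.0) for w in vocab) + alpha * len(vocab)
    return {w: (expected_counts.get(w, 0.0) + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two models over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def rank_documents(query_counts, doc_counts_by_id, alpha=0.1):
    """Order documents by increasing divergence from the query model."""
    vocab = set(query_counts)
    for counts in doc_counts_by_id.values():
        vocab.update(counts)
    vocab = sorted(vocab)
    q_model = unigram_model(query_counts, vocab, alpha)
    scored = []
    for doc_id, counts in doc_counts_by_id.items():
        d_model = unigram_model(counts, vocab, alpha)
        scored.append((kl_divergence(q_model, d_model), doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


# Hypothetical expected counts, as if summed over lattice paths.
query = {"flight": 1.6, "boston": 0.9}
docs = {
    "d1": {"flight": 2.2, "boston": 1.1, "ticket": 0.4},
    "d2": {"weather": 3.0, "rain": 1.5},
}
ranking = rank_documents(query, docs)
```

Because lattice-derived counts are expectations over alternative transcriptions, they can be fractional, which is why the models above accept non-integer counts.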
Talk #3: Abbreviations, acronyms, initialisms, and shortenings frequently
occur in many texts found on the Web, such as publication metadata,
stock ticker codes, and biological articles. To connect these
disparate forms together for knowledge discovery, short forms must be
properly linked to their canonical long forms. In this paper, we
demonstrate how a search engine can be efficiently utilized in mining
the required contextual information, so that short forms can be
effectively linked to long forms. We show that a count-based method
consistently outperforms other methods, and that using the snippets is
better than using the full web pages. We also consider adaptively
combining a query probing algorithm together with our count-based
method. This reduces running time and network bandwidth, while
maintaining the strong linkage performance.
|
1st Talk: Slides (.htm)
2nd Talk: Slides (.htm)
3rd Talk: Slides (.pdf)
|
8, Apr, Tue, 2pm - 3pm, MR6 (AS6 #05-12)
|
Su Nam
Kim (SoC) / Statistical
Modeling of Multiword Expressions (2)
ABSTRACT:
In this work, we propose a novel method
based on ellipsed predicates to automatically interpret compound nouns
with a predefined set of semantic relations. First we map verb tokens
in sentential contexts to a fixed set of seed verbs using
WordNet::Similarity and Moby's Thesaurus. We then match the sentences
with semantic relations based on the semantics of the seed verbs and
grammatical roles of the head noun and modifier. Based on the semantics
of the matched sentences, we then build a classifier using a
memory-based classification tool, Timbl 5.1. The performance of our
final system at interpreting NCs is 52.6%. We also compared our method
with previous methods and confirmed better performance over the same
dataset.
BIODATA:
Su Nam Kim is a postdoctoral research fellow
at NUS. She received her
BS and MS degrees from Pusan National University, South Korea, and an MS
degree from the State University of New York at Stony Brook, USA. She
recently completed her Ph.D. study at the University of Melbourne,
Australia. She has a broad research interest in AI but primarily
focuses on lexical semantics including multiword expressions, word
sense disambiguation and cross-lingual lexical acquisition. She is
also interested in multi-document/multilingual summarization and
question-answering systems.
|
Slides
(.pdf)
|
| 11 Mar, Tue, 10am - 11am, VIP Studio (AS6
#05-17) |
Gong
Tianxia (SoC) / Automated Retrieval and Generation
of Brain CT Radiology Reports
ABSTRACT:
With advances in medical techniques,
large amounts of medical data are
produced in hospitals every day. Radiology reports contain
rich information about the corresponding medical images but are often
underexploited. Our research therefore focuses on information
extraction from brain CT radiology reports, radiology-report-assisted
medical
image content retrieval, and automatic generation of brain CT reports
based on domain knowledge and associated images. Current medical record
search systems will benefit from our research, as searching for
information
becomes more efficient and convenient. Doctors and radiologists can also
conduct their research in the area more efficiently using the improved
system.
The automatically generated reports can serve as a reference for
radiologists.
Our research will also help to build an education system for
junior
doctors and researchers in the area.
BIODATA:
Gong Tianxia is a PhD candidate in
computer science at the School of Computing (SOC), National University of
Singapore (NUS), supervised by A/P Tan Chew Lim. She received her
bachelor's degree in Computer Engineering at SOC in 2006. Her research
interests are in information retrieval and medical text processing.
|
Slides
(.pdf)
Slides
(.ppt) |
| 26,
Feb, Tue, 10am - 11am, MR6 (AS6 #05-12) |
Su Nam
Kim (SoC) / Statistical
Modeling of Multiword Expressions (1)
ABSTRACT:
This research focuses on multiword
expressions (MWEs), that is, lexical
items
that are made up of two or more simplex words, such as "dog pound",
"call up"
or "red herring". My goals are: to shed light on the underlying
linguistic
processes giving rise to MWEs; to generalize techniques for
identifying,
extracting and analyzing MWEs; to compare pre-existing MWE
classifications;
and finally, to exemplify the utility of MWE interpretation within NLP
tasks. This is aimed at improving the fluency, robustness and
understanding of
natural language.
The first talk of this three-part presentation, on
Feb. 26th, will
provide a brief background on MWEs including different research
perspectives and linguistic foundations of MWEs. It will also cover
the basic statistical approaches broadly used in MWE studies and will
present a summary of recent advances. The second and third talks will
present a more technical and detailed discussion on work done in the
past two years. The schedule for the second and third talks will be
announced later.
BIODATA:
Su Nam Kim is a postdoctoral research fellow
at NUS. She received her
BS and MS degrees from Pusan National University, South Korea, and an MS
degree from the State University of New York at Stony Brook, USA. She
recently completed her Ph.D. study at the University of Melbourne,
Australia. She has a broad research interest in AI but primarily
focuses on lexical semantics including multiword expressions, word
sense disambiguation and cross-lingual lexical acquisition. She is
also interested in multi-document/multilingual summarization and
question-answering systems.
|
Slides
(.pdf) |
| 28, Jan, Mon, 2:00pm - 3:00pm, SR11(COM1
#02-11). |
Yee Whye Teh
(UCL) / Bayesian Agglomerative Clustering with Coalescents
ABSTRACT:
Hierarchical clustering of data is one of
the most widely used machine
learning techniques. Traditional hierarchical clustering techniques
construct a single tree in a greedy fashion, either in a top-down or a
bottom-up agglomerative fashion. Sometimes we are interested in how
reliable the constructed tree is, i.e. how much we believe that the
structure of the tree reflects true underlying structure in the data
rather than spurious effects due to noise. Such a question can be
answered using a Bayesian approach where we define a prior over trees
and compute a posterior distribution over trees which captures the
uncertainty in the learned tree structure.
However, past Bayesian models for
hierarchical clustering either do not
give a posterior over trees (Heller and Ghahramani 2005, Friedman
2003), are not infinitely exchangeable (Williams 2000), or are simply too
complex to have widespread appeal (Neal 2003). In this talk we
present a model that
1) gives a posterior distribution over trees,
2) is easy to implement, and
3) has the additional nice property that it is infinitely exchangeable.
Our model is based upon a standard model in
population genetics called
Kingman's coalescent. We propose both greedy and sequential Monte
Carlo inference algorithms for the model. We show that our model
performs well compared to previous approaches on a number of small
datasets, and apply it to document clustering and phylolinguistics.
BIODATA:
Dr Teh Yee Whye is a lecturer at the Gatsby
Computational Neuroscience
Unit, University College London in the United Kingdom. Prior to this
appointment he worked with Prof Lee Wee Sun as Lee Kuan Yew
Postdoctoral Fellow at the National University of Singapore, and with
Prof. Michael I. Jordan as a postdoc at University of California at
Berkeley. He obtained his PhD from the University of Toronto under
Prof. Geoffrey E. Hinton. His research interests are in Bayesian
machine learning and probabilistic graphical models.
|
Slides
(.ppt) Slides
(.htm) |
| 8, Jan, Tue, 10:30am - 12:00pm, SR3A(COM1
#02-12). |
Jing Jiang
(UIUC) / Domain Adaptation in Natural Language Processing
ABSTRACT:
With the explosion of the amount of textual
data in the information
age, natural language processing (NLP) has become increasingly
important, with direct applications in areas such as Web mining and
biomedical literature mining. Currently, the most effective approach
to solving most NLP problems is supervised learning coupled with
linguistic knowledge. However, standard supervised learning requires
the training and the test corpora to be similar, and therefore falls
apart in real NLP applications because obtaining labeled data for
every new domain is expensive and thus infeasible. In this talk, I
will present the major line of my PhD research on domain adaptation in
NLP, which aims at adapting classifiers trained on one domain to
another domain. We have proposed two frameworks to achieve domain
adaptation, both of which have been evaluated on real NLP problems and
outperform standard learning methods. I will also briefly mention
our future plan to incorporate knowledge bases and expert interactions
into the domain adaptation process, with applications in large-scale
information extraction from biomedical literature.
BIODATA:
Ms Jing Jiang is a final year PhD student in
the Text Information
Management Group in the Computer Science Department at the University
of Illinois at Urbana-Champaign, working with Professor ChengXiang
Zhai. Her research interests include natural language processing,
information retrieval, machine learning, and biomedical literature
mining. She received her B.S. degree and her M.S. degree in Computer
Science from Stanford University in 2002 and 2003, respectively.
|
Slides
(.ppt) Slides
(.htm) |
| Date |
Speaker / Title |
Notes /
Slides |
|
2007
|
| 18, Dec, Tue, 3 - 4pm, SR7(COM1). |
Simone Teufel
(Cambridge University) / Citations and discourse structure:
AZ and its use in
large-scale intelligent search
ABSTRACT:
I will describe how one useful aspect of the
structure of
scientific articles
can be discovered with reasonably shallow means, namely the
prototypical
argumentation for the validity of the current research. Reference to
other
people's work, and reasonably standardised statements about this work,
are a
staple part of the argumentation, and citation analysis can exploit
this fact.
AZ-discourse analysis is the robust machine-learning of this structure,
based
on the extraction of correlated, and often linguistically interesting,
features. I will show results of AZ on two domains (computational
linguistics
and chemistry), and discuss several search and summarisation
applications
using AZ. I will also speculate on more sequence-based methods for
recognising
AZ-type structures in text.
BIODATA:
Simone Teufel is a senior lecturer in the
Computer Laboratory
at Cambridge University,
where she has worked since 2001. Her main research interests are in
corpus-linguistic approaches to discourse theory, and in the
application of
such information to summarisation, information retrieval and citation
analysis. She has a background in computer science (1994 Diploma from
the University of Stuttgart) and in cognitive science (2000 PhD from Edinburgh
University), and also has experience in medical information processing
and search, from a postdoctoral stay at Columbia University, and in
collocation extraction, from a research post at Xerox Europe. Her
latest research interests include lexical acquisition, and the
visualisation and language generation of the analysis results of
scientific articles.
|
Slides
(.pdf) |
| 13, Dec, Thu, 2:30 - 3:30pm, The Big
One(I2R). |
Simone Teufel
(Cambridge University) / Information extraction and
intelligent search in the Chemical domain: Sciborg
ABSTRACT:
While bioinformatics has advanced considerably in
recent years and recognisers for
gene and protein names and interactions have been built, biochemistry
is a new
field for computational linguistics to move into. I will be talking
about the
recognition strategy for scientific papers in general which the NLIP
group at
Cambridge University is developing, while concentrating on the research
done
in the project SciBorg, on chemical name parsing, ontology discovery,
and
discourse-related search. I will also talk a bit about the role of
citations
in this recognition effort, and about the quite unusual infrastructure that
our
project is built on -- robust semantic representations, encoded as XML
standoff.
BIODATA:
Simone Teufel is a senior lecturer in the
Computer Laboratory
at Cambridge University,
where she has worked since 2001. Her main research interests are in
corpus-linguistic approaches to discourse theory, and in the
application of
such information to summarisation, information retrieval and citation
analysis. She has a background in computer science (1994 Diploma from
the University of Stuttgart) and in cognitive science (2000 PhD from Edinburgh
University), and also has experience in medical information processing
and search, from a postdoctoral stay at Columbia University, and in
collocation extraction, from a research post at Xerox Europe. Her
latest research interests include lexical acquisition, and the
visualisation and language generation of the analysis results of
scientific articles.
|
Slides
(.pdf) |
| 3, Dec, Mon, 9:30am - 11:30am, Big One(I2R). |
Talk 1:
Prof.
Junichi Tsujii (University of Tokyo) / Combining
Statistical Models with Symbolic Grammar in Parsing
Talk 2:
Dr. Sophia Ananiadou
(University of Manchester) / Text mining techniques for
building a Biolexicon
|
No slides available |
| 1, Nov, Thur, 3:00pm, Big One(I2R). |
Xiaofeng
Yang (I2R) / Coreference Resolution with
Knowledge-Rich Methods
ABSTRACT:
Coreference resolution is the task of
finding different mentions of the same entity in the world. In the past
decade, knowledge-lean approaches have been widely adopted, in which only
simple morpho-syntactic cues are employed as knowledge sources in the
resolution process. Although these approaches have achieved reasonable
success, researchers have found that deeper syntactic or semantic
knowledge is necessary in order to reach the next level of performance. In
this talk, we will introduce our knowledge-rich approaches to
coreference resolution, including a tree-kernel-based method for the
syntactic knowledge, and web-based methods for the semantic knowledge.
These sources of enriched knowledge are acquired automatically without
much human effort, and have proved effective for the coreference
resolution task.
|
No slides available |
| 19, Oct, Fri, 10am - 11am, MR6(AS6 #05-12). |
QIU Long
(NUS) / Scenario Template: Its Creation and Application to
Open Domain Q&A
ABSTRACT:
A Scenario Template is a data structure that
reflects the salient aspects
shared by a set of similar events, which are considered as belonging to
the same scenario. These salient aspects are typically the scenario's
characteristic actions, the entities involved in these actions and the
related attributes of them.
In this talk, I will first briefly describe our
approach to scenario template
creation and present the latest evaluation results. Then I will discuss
one
of the possible applications of scenario templates, namely, open-domain
question answering. For Q&A systems, query expansion is a
common
strategy while sentence selection is an important process. I will show
how
scenario templates might help in these two aspects.
BIODATA:
Qiu Long is a Doctoral Student at SoC, NUS,
co-supervised by Professor
Chua Tat-Seng and Dr. Min-Yen Kan. He received his Master of Science (SM) in
CS
from the Singapore-MIT Alliance. He is interested in Natural Language
Processing (NLP) and the related machine learning techniques.
|
Slides
(.pdf) |
| 20, Sep, Thu, 3pm - 4pm, SR10(COM1,
#02-10). **Note special time and venue |
Tanja Schultz
(CMU) / Multilingual Speech Processing
ABSTRACT:
In recent years, speech processing products
have been widely
distributed all over the world, reflecting a general belief that
speech technologies have a huge potential to overcome language
barriers and to let everyone participate in today's information
revolution. However, in spite of vast improvements in speech and
language technologies, the development of speech processing systems
still requires significant skills and resources to carry out.
Consequently, with more than 6500 languages in the world, the current
cost and effort of building speech support are prohibitive for all but
the most economically viable languages.
In this talk I will discuss the challenges
and limitations of rapidly
developing automatic speech processing systems for a large number of
languages and dialects. I will describe solutions to system
development based on sharing data and system components across
languages. Practical implementations and recent results are presented
in the light of our SPICE project, which aims to bridge the gap
between language and technology experts by providing innovative
strategies and tools for non-expert users. These tools enable the user
to easily collect appropriate text and speech data, to quickly develop
acoustic models, pronunciation dictionaries, and language models based
on very limited resources, and to monitor progress and performance
allowing for iterative improvements with the user in the loop.
BIODATA:
Tanja Schultz received her Ph.D. and Masters
in Computer Science from
University Karlsruhe, Germany in 2000 and 1995 respectively and got a
German Masters in Mathematics, Sports, and Education Science from the
University of Heidelberg, Germany in 1990. She joined Carnegie Mellon
University in 2000 and is a faculty member of the Language
Technologies Institute as a Research Computer Scientist. Since 2007
she also holds a full professorship at Karlsruhe University, Germany.
Her research activities center around
language independent and
language adaptive speech recognition but also include large vocabulary
continuous speech recognition systems, human-machine interfaces using
speech and various biosignals, speech translation, as well as language
and speaker identification approaches. With a particular area of
expertise in multilingual approaches, she performs research on
portability of speech processing systems to many different languages.
In 2001 Tanja Schultz was awarded the FZI prize for her
outstanding Ph.D. thesis on language independent and language adaptive
speech recognition. In 2002 she received the Allen Newell Medal for
Research Excellence from Carnegie Mellon for her contribution to
Speech-to-Speech Translation and the ISCA best paper award for her
publication on language independent acoustic modeling. In 2005 she was
awarded the Carnegie Mellon Language Technologies Institute Junior
Faculty Chair. Tanja Schultz is the author of more than 100 articles
published in books, journals, and proceedings.
She is a member of the IEEE Computer
Society, the European Language
Resource Association, the Society of Computer Science (GI) in Germany,
and currently serves on the ISCA board and several program and review
panels.
|
No slides available |
| 21 Aug, Tue, 2pm - 3pm, SR5 (COM1#02-01).
**Note special time and venue |
Yu-Han
Chang (USC ISI) / Toddler Machine Meets Pre-Teen
Children: Concepts and Language from Combining Lots of Computing with
Lots of Free Time
ABSTRACT:
The idea of using humans to teach computers
is not a new one, but it
has been largely impractical and largely ignored. Modern-day
computers tend to "learn" by either sifting through large amounts of
data or by being programmed/endowed with expert knowledge. Typically
there is little interaction between man and machine. Our recent
project, called "Wubble World", capitalizes on the availability of
free hands-on human teaching as a means for machine learning of
language and concepts.
The basic premise of this work begins with
an online game situated in
a virtual 3D environment. Language is generated as children interact
with their personal creature, called a wubble, or with other children.
By virtue of the virtual environment, this language is situated and
forms a rich corpus of matched scenes and sentences upon which to learn
language and concepts. In one part of the environment, children
interact with their wubble by teaching it to accomplish certain given
tasks. The wubble, like a toddler, initially knows little about the
world, and must acquire concepts and labels by interacting with the
child. I'll describe this environment and the basic concept learning
that happens inside the wubble. In another part of the world, children
play a competitive team game against other children. The game is
designed to require cooperation among team members, typically using
spoken language. This language, combined with a log of the game state,
generates a rich sentence-scene corpus. This richness could potentially
enable natural language processing to move beyond current statistical
techniques by incorporating data that reveals underlying meaning. I'll
demonstrate the game, describe the data we have collected so far, and
discuss some of the possible approaches for learning from this data.
BIODATA:
Dr. Yu-Han Chang is a Computer Scientist at
the Information Sciences
Institute of the University of Southern California (USC ISI). His
current research interests include reinforcement learning,
game theory, natural language understanding, interactive technologies,
and traditional AI. Recent and ongoing projects include harnessing
the power of the Internet to train intelligent agents via human
teaching, transfer learning, and the development of efficient
no-regret algorithms for non-cooperative learning domains. Dr. Chang
holds undergraduate degrees in Mathematics and Economics, as well as an
S.M. in Computer Science, from Harvard University. He received his
Ph.D. in Electrical Engineering and Computer Science from MIT,
focusing his efforts on developing algorithms for multi-agent learning
in the context of machine learning and game theory.
|
Slides
(.pdf) (Internal to NUS only) |
| 10 Aug, Fri, 10:30am - 11:30am, SR6
(COM1#02-03). **Note special time and venue |
Robert Dale
(Macquarie University) / The Generation of Referring
Expressions: Where We've Been, How We Got Here, and Where We're Going
ABSTRACT:
The task of referring expression generation
is concerned with determining
what semantic content should be used in a reference to an intended
referent
so that the hearer will be able to identify that referent. The task has
been
a focus of interest within natural language generation at least since
the
early 1980s, in part because the problem appears relatively
well-defined.
Over the last 25 years, a range of algorithms and approaches have been
proposed and explored, making this the most intensely studied problem
in
natural language generation; and yet, even a casual analysis of real
human-authored texts suggests that we have a long way to go in terms of
providing an explanation for the range of real linguistic behaviour
that we
find. In this talk, I'll review research in the area to date, try to
characterise where we are now, and point to directions for future
research
in the area.
BIODATA:
Robert Dale received his PhD in
Computational Linguistics from the
University of Edinburgh in 1989. His research interests include
low-cost
approaches to intelligent text processing tasks; practical natural
language
generation; the engineering of habitable spoken language dialog
systems; and
computational, philosophical and linguistic issues in reference and
anaphora. He is Director of the Centre for Language Technology at
Macquarie
University, Convenor of the Australian Research Council's Human
Communication Science Network, and editor-in-chief of the Journal of
Computational Linguistics.
|
Slides
(.pdf) |
| 30 Jul, Mon, 2-3pm, SR3A (COM1#02-12).
**Note special time and venue |
Hari Sundaram
(Arts Media and Engineering (AME), Arizona
State University) / Contextual Wisdom: Social Relations and
Correlations for Multimedia Event Annotation
ABSTRACT:
This work deals with the problem of event
annotation in
social networks. The problem is made difficult due to variability of
semantics and due to scarcity of labeled data. Events refer to
real-world phenomena that occur at a specific time and place, and
media and text tags are treated as facets of the event metadata. We
propose a novel mechanism for event annotation by leveraging
related sources (other annotators) in a social network. Our approach
exploits event concept similarity, concept co-occurrence and annotator
trust. We compute concept similarity measures across all facets. These
measures are then used to compute event-event and user-user activity
correlation. We compute inter-facet concept co-occurrence statistics
from the annotations by each user. The annotator trust is determined
by first requesting the trusted annotators (seeds) from each user and
then propagating the trust amongst the social network using the biased
PageRank algorithm. For a specific media instance to be annotated, we
start the process from an initial query vector and the optimal
recommendations are determined by using a coupling strategy between
the global similarity matrix, and the trust weighted global
co-occurrence matrix. The coupling links the common shared knowledge
(similarity between concepts) that exists within the social network
with trusted and personalized observations (concept co-occurrences).
Our initial experiments on annotated everyday events are promising and
show substantial gains against traditional SVM based techniques.
Co-authors: Amit Zunjarwad (AME), Lexing Xie
(IBM)
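The trust propagation step mentioned in the abstract can be illustrated with a generic biased (personalized) PageRank sketch. It assumes a directed "trusts" graph and a teleport distribution concentrated on the seed annotators; the graph, the names, the damping factor and the iteration count are hypothetical and are not taken from the talk.

```python
def biased_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation biased to the seeds."""
    seeds = set(seeds)  # must be non-empty
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    # Teleport (bias) distribution: uniform over the seed annotators.
    bias = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(bias)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * bias[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    nxt[dst] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * bias[m]
        rank = nxt
    return rank


# Hypothetical trust graph: an edge (a, b) means "a trusts b".
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "ann")]
trust = biased_pagerank(edges, seeds={"ann"})
```

The resulting scores could then weight each annotator's co-occurrence statistics, in the spirit of the trust-weighted co-occurrence matrix the abstract describes.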
BIODATA:
Hari Sundaram is currently an assistant
professor at Arizona State
University. This is a joint appointment with the Department of Computer
Science and the Arts Media and Engineering program. He received his
Ph.D.
from the Department of Electrical Engineering at Columbia University in
2002. He received his MS degree in Electrical Engineering from SUNY
Stony
Brook in 1995 and a B.Tech in Electrical Engineering from the Indian Institute
of
Technology, Delhi in 1993.
|
Slides
(.htm) Slides
(.ppt) |
| 24 Jul, Tue, 2-3pm, at Meeting Room
"BigOne", I2R. **Note special time and venue |
Hari Sundaram
(Arizona State University) / Rethinking media semantics:
acquisition, representation and learnability
Jointly organized by CHIME, I2R
and PREMIA.
This talk will examine some assumptions in
media semantics under three
broad categories - (a) aspects of meaning (b) rethinking semantic
construction (c) learnability contradictions. A re-examination of the
assumptions behind media semantics is useful, as the mechanisms by
which
people create and consume media have changed significantly in the last
decade. These changes offer fresh insight into the familiar problem of
the
semantic gap - how to go from sensory data to meaning. There are three
aspects of meaning of interest - context, approximations, and
variability.
We need to examine the construction of meaning in a manner very
different
from the familiar Marr model - specifically we shall examine embodiment
and networked construction. A significant challenge to the learnability
of semantics lies in re-examining, within the multimedia context, what
Chomsky calls "the poverty of input" problem. How is it possible to
learn a large number of concepts with very few, or even non-existent,
training examples? We will examine the role of context and semantic
approximation
with an application to media retrieval. The issues of embodiment and
its
relation to semantics will be discussed with respect to an educational
application. We hope to provide a partial answer to the issue of
semantic
construction and learnability in an application related to social
networks.
BIODATA:
Hari Sundaram is currently an assistant professor at Arizona State
University. This is a joint appointment with the Department of Computer
Science and the Arts, Media and Engineering program. He received his
Ph.D. from the Department of Electrical Engineering at Columbia University in
2002. He received his MS degree in Electrical Engineering from SUNY
Stony Brook in 1995 and a B.Tech in Electrical Engineering from the Indian
Institute of Technology, Delhi in 1993.
His research group works on developing
computational models and systems
for situated communication. There are two complementary (but coupled)
directions - (a) designing intelligent multimedia environments that
exist
as part of our physical world (e.g. an intelligent room) (b) developing
new algorithms and systems to understand the media artifacts resulting
from human activity (e.g. emails, photos / video). Specific projects
include - context models for action, resource adaptation, interaction
architectures, communication patterns in media sharing social networks,
collaborative annotation, as well as analysis of online communities.
Prof. Sundaram's research has won several awards: the best student paper
award at JCDL 2007, the best ACM Multimedia demo award in 2006, the best
student paper award at the ACM conference on Multimedia 2002, and the 2002
Eliahu I. Jury Award for best Ph.D. dissertation. He has also received a best
paper award on video retrieval from IEEE Trans. on Circuits and Systems
for Video Technology for the year 1998. He is an active participant in
the multimedia community: he is an associate editor for ACM Transactions
on Multimedia Computing, Communications and Applications (TOMCCAP), as
well as the IEEE Signal Processing magazine. He has co-organized workshops
at ACM Multimedia on experiential telepresence (ETP 2003, ETP 2004) and
archival of personal experiences (CARPE 2004, CARPE 2005), and a conference
on image and video retrieval (CIVR 2006).
|
Slides
(.pdf) |
| 17 Jul, Tue, 10:30-11:30am **Note special
time |
Jing Jiang
(UIUC) /
Two Perspectives on Domain Adaptation in Natural Language Processing
The problem of domain adaptation for
statistical classifiers arises
when our labeled training examples and unlabeled test examples come
from different domains. This problem is commonly encountered in
natural language processing (NLP) tasks. For example, we may train a
named entity recognition (NER) system on news articles but apply the
system to blog or email text. It is generally observed that the
performance of a classifier tends to drop significantly when it is
applied to a different domain.
In this talk, I will present our recent work
addressing the domain
adaptation problem. We have proposed two frameworks, corresponding to
two different perspectives on this problem: feature selection and
instance weighting. In the feature selection framework, we seek to
identify "generalizable features" that behave similarly across
domains; in the instance weighting framework, our idea is to re-weight
the examples in order to minimize the expected loss on the test
domain. In both frameworks, we have also incorporated semi-supervised
learning to make use of the unlabeled test domain examples. Experimental
results on a number of NLP tasks, including NER, part-of-speech (POS)
tagging, and spam filtering, show the effectiveness of both
frameworks. At the end of the talk, I will briefly mention our current
effort to unify the two perspectives, as well as some future
directions to pursue.
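The instance weighting perspective can be illustrated with a small sketch. It is not the speaker's method: it assumes each example is summarized by a single discrete feature (here a hypothetical genre tag), estimates the ratio of target-domain to source-domain frequency with add-one smoothing, and uses that ratio to re-weight the per-example training loss. All function and variable names are made up for this example.

```python
from collections import Counter


def importance_weights(source_feats, target_feats, smoothing=1.0):
    """Weight each source example by an estimate of p_target(x) / p_source(x)."""
    vocab = set(source_feats) | set(target_feats)
    src, tgt = Counter(source_feats), Counter(target_feats)
    src_total = len(source_feats) + smoothing * len(vocab)
    tgt_total = len(target_feats) + smoothing * len(vocab)
    return [
        ((tgt[x] + smoothing) / tgt_total) / ((src[x] + smoothing) / src_total)
        for x in source_feats
    ]


def weighted_loss(weights, losses):
    """Average per-example loss, with each example scaled by its weight."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Labeled source data is mostly news; unlabeled target data is mostly blogs.
source = ["news", "news", "news", "blog"]
target = ["blog", "blog", "blog", "news"]
weights = importance_weights(source, target)

# Suppose the classifier errs only on the single blog-like training example.
losses = [0.0, 0.0, 0.0, 1.0]
unweighted = sum(losses) / len(losses)
reweighted = weighted_loss(weights, losses)
```

Because blog-like examples are rare in the source data but common in the target data, the single error counts for more after re-weighting, which is the intended effect of minimizing expected loss on the test domain.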
BIODATA:
Jing Jiang is a Ph.D. candidate in the Department of Computer Science
at the University of Illinois at Urbana-Champaign. She is a member of
the Information Retrieval Group led by Professor ChengXiang Zhai. Her
research interests include information extraction, information
retrieval, biomedical text mining, and machine learning. She received
her B.S. degree and her M.S. degree in Computer Science from Stanford
University in 2002 and 2003, respectively.
|
Slides
(.htm) Slides
(.ppt)
(Internal to NUS only) |
| 11 Jun, 10:30-11:30am **Note special time |
Eric
Nyberg (CMU LTI) /
JAVELIN: Multilingual Question Answering with Semantic Indexing,
Retrieval and Inference
The JAVELIN question
answering architecture has been used to build QA
systems for monolingual English, Japanese and Chinese, as well as
cross-lingual QA systems for English-Japanese and English-Chinese.
This talk will present and discuss recent research results in
structured retrieval, answer extraction and answer selection for QA,
and summarize end-to-end system performance as evaluated in the recent
NTCIR-6 competition. |
No slides available |
21 May
(note special time 1:00-3:30pm) |
ACL/EMNLP/SIGIR Practice Session
- 1:00-1:25 Chia Tee Kiah, "A Statistical
Language Modeling Approach to
Lattice-Based Spoken Document Retrieval"
- 1:25-1:50 Zhao Shanheng, "Identification
and Resolution of Chinese
Zero Pronouns: A Machine Learning Approach"
- 1:50-2:15 Hendra Setiawan, "Ordering
Phrases with Function Words"
- 2:15-2:30 15 minute break
- 2:30-2:55 Dave Kor, TBA
- 2:55-3:20 Yang Xiaofeng, "Coreference
Resolution Using Semantic
Relatedness Information from Automatically Discovered Patterns"
- 3:20-3:35 Tan Yee Fan, "PSNUS: Web
People Name Disambiguation by
Simple Clustering with Rich Features"
ABSTRACTS:
Title: A Statistical Language Modeling
Approach to Lattice-Based Spoken Document Retrieval
Abstract: Speech recognition transcripts are
far from perfect; they are not
of sufficient quality to be useful on their own for spoken document
retrieval. This is especially the case for conversational speech.
Recent efforts have tried to overcome this issue by using statistics
from speech lattices instead of only the 1-best transcripts; however,
these efforts have invariably used the classical vector space retrieval
model. This paper presents a novel approach to lattice-based spoken
document retrieval using statistical language models: a statistical
model is estimated for each document, and probabilities derived from
the document models are directly used to measure relevance.
Experimental
results show that the lattice-based language modeling method
outperforms
both the language modeling retrieval method using only the 1-best
transcripts, as well as a recently proposed lattice-based vector space
retrieval method.
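As a rough illustration of scoring documents with per-document language models, the sketch below ranks documents by query log-likelihood with linear interpolation smoothing. It is not the paper's model: the fractional word counts stand in for expected counts derived from speech lattices, and the counts, the interpolation weight and all names are invented for this example. It also assumes every query word occurs somewhere in the collection.

```python
import math


def query_log_likelihood(query, doc_counts, collection_counts, lam=0.5):
    """Log-probability of the query under a smoothed document language model.

    doc_counts: word -> (possibly fractional) expected count in the document.
    collection_counts: word -> expected count over the whole collection.
    lam: interpolation weight on the document model (assumed value).
    """
    doc_total = sum(doc_counts.values())
    coll_total = sum(collection_counts.values())
    score = 0.0
    for word in query:
        p_doc = doc_counts.get(word, 0.0) / doc_total if doc_total else 0.0
        p_coll = collection_counts.get(word, 0.0) / coll_total
        score += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return score


# Hypothetical expected counts, as might be accumulated over lattice paths.
docs = {
    "d1": {"budget": 1.6, "meeting": 0.9},
    "d2": {"weather": 2.0, "meeting": 0.4},
}
collection = {}
for counts in docs.values():
    for word, c in counts.items():
        collection[word] = collection.get(word, 0.0) + c

query = ["budget", "meeting"]
scores = {name: query_log_likelihood(query, counts, collection)
          for name, counts in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```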
Title: Identification and Resolution of
Chinese Zero Pronouns: A Machine
Learning Approach
Abstract: In this paper, we present a
machine learning approach to the
identification and resolution of Chinese anaphoric zero pronouns. We
perform
both identification and resolution automatically, with two sets of
easily
computable features. Experimental results show that our proposed
learning
approach achieves anaphoric zero pronoun resolution accuracy comparable
to a
previous state-of-the-art, heuristic rule-based approach. To our
knowledge,
our work is the first to perform both identification and resolution of
Chinese anaphoric zero pronouns using a machine learning approach.
Title: Ordering Phrases with Function Words
Abstract: Function words are a class of
words that have little intrinsic
meaning but are vital in expressing grammatical relationships among
phrases within a sentence. Such encoded grammatical information, often
implicit, makes function words pivotal in modeling structural
divergences, as projecting them into different languages often results in
long-range structural changes to the realized sentences. This
distinctive feature has not been fully utilized to address the phrase
ordering problem in the context of statistical machine translation
(SMT). We observe that, just as foreign language learners often make
mistakes in using function words, current SMT systems often perform
poorly in ordering function words' arguments; lexically correct
translations often end up reordered incorrectly. In this talk, I will
present a Function Words centered, Syntax-based (FWS) solution to
address the phrase ordering problem, including its statistical
formalism, its implementation and experimental results.
Title: Coreference Resolution Using Semantic
Relatedness Information
from Automatically Discovered Patterns
Abstract: Semantic relatedness is a very
important factor for the
coreference resolution task. To obtain this semantic information,
corpus-based approaches commonly leverage patterns that can express a
specific semantic relation. The patterns, however, are designed
manually and thus are
not necessarily the most effective ones in terms of accuracy and
breadth. To
deal with this problem, in this paper we propose an approach that
can automatically find the effective patterns for coreference
resolution. We
explore how to automatically discover and evaluate patterns, and
how to exploit the patterns to obtain the semantic relatedness
information.
Evaluation on the ACE data set shows that the pattern-based semantic
information is helpful for coreference resolution.
Title: PSNUS: Web People Name Disambiguation
by Simple Clustering with
Rich Features
Abstract: We describe the system of the PSNUS team
for the SemEval-2007 Web People Search Task. The system is based on
the clustering of the web pages by using a variety of features
extracted and generated from the data provided. This system achieves
F_alpha=0.5 = 0.75 and F_alpha=0.2 = 0.78 for the final test data set
of the task.
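For readers unfamiliar with the F_alpha notation, the sketch below assumes the usual weighted harmonic mean, F_alpha = 1 / (alpha / P + (1 - alpha) / R), where P and R are the two scores being combined (in this task, clustering purity and inverse purity). The input values in the example are made up and are not the team's results.

```python
def f_alpha(p, r, alpha=0.5):
    """Weighted harmonic mean of two scores; alpha is the weight on p."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


balanced = f_alpha(0.5, 1.0, alpha=0.5)       # both scores weighted equally
recall_heavy = f_alpha(0.5, 1.0, alpha=0.2)   # more weight on the second score
```

With alpha = 0.2 the second score dominates, so a system that is strong on that score is rewarded more than under the balanced alpha = 0.5 setting.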
|
14 May
(at I2R) |
ACL Practice Session
AGENDA:
- 1:00-1:25 Chan Yee Seng, "Word Sense
Disambiguation Improves
Statistical Machine Translation"
- 1:25-1:50 Chan Yee Seng, "Domain
Adaptation with Active Learning for
Word Sense Disambiguation"
- 1:50-2:15 Li Haizhou, "Semantic
Transliteration of Personal Names"
- 2:15-2:30 15 minute break - refreshments
to be served.
- 2:30-2:55 Min Zhang, "A Grammar-driven
Convolution Tree Kernel for Semantic Role Classification"
- 2:55-3:20 Mstislav Maslennikov,
"ARE&D: A Discourse-based
Multi-resolution Framework for Information Extraction on Free Text"
ABSTRACTS:
Title: Word Sense Disambiguation Improves
Statistical Machine Translation
Abstract: Recent research presents
conflicting evidence on whether
word sense disambiguation (WSD) systems can help to improve the
performance of statistical machine translation (MT) systems. In this
paper, we successfully integrate a state-of-the-art WSD system into a
state-of-the-art hierarchical phrase-based MT system, Hiero. We show
for the first time that integrating a WSD system improves the
performance of a state-of-the-art statistical MT system on an actual
translation task. Furthermore, the improvement is statistically
significant.
Title: Domain Adaptation with Active
Learning for Word Sense Disambiguation
Abstract: When a word sense disambiguation
(WSD) system is trained on
one domain but applied to a different domain, a drop in accuracy is
frequently observed. This highlights the importance of domain
adaptation for word sense disambiguation. In this paper, we first show
that an active learning approach can be successfully used to perform
domain adaptation of WSD systems. Then, by using the predominant sense
predicted by expectation-maximization (EM) and adopting a
count-merging technique, we improve the effectiveness of the original
adaptation process achieved by the basic active learning approach.
Title: Semantic Transliteration of Personal
Names
Abstract: Words of foreign origin are
referred to as borrowed words or
loanwords. A
loanword is usually imported to Chinese by phonetic transliteration if
a
translation is not easily available. Semantic transliteration is seen
as a
good tradition in introducing foreign words to Chinese. Not only does
it
preserve how a word sounds in the source language, it also carries
forward
the word's original semantic attributes. This paper attempts to
automate the
semantic transliteration process for the first time. We conduct an
inquiry
into the feasibility of semantic transliteration and propose a
probabilistic
model for transliterating personal names in Latin script into Chinese.
The
results show that semantic transliteration substantially and
consistently
improves accuracy over phonetic transliteration in all the experiments.
Title: A Grammar-driven Convolution Tree
Kernel for Semantic Role
Classification
Abstract: The convolution tree kernel has shown
very promising results in
semantic role
classification. However, this method considers less linguistic
knowledge and
only carries out hard matching between substructures, which may lead to
over-fitting and a less accurate similarity measure. To remove the
constraints, this paper proposes a grammar-driven convolution tree
kernel
for semantic role classification by introducing more linguistic grammar
information into the standard convolution tree kernel. The proposed
grammar-driven convolution tree kernel displays two advantages over the
previous one: 1) grammar-driven approximate substructure matching and
2)
grammar-driven approximate tree node matching. The two improvements
enable
the proposed grammar-driven tree kernel to explore more linguistically
motivated substructure features than the previous one. Experiments on
the
CoNLL-2005 SRL shared task show that the proposed grammar-driven tree
kernel
significantly outperforms the previous non-grammar-driven one in
semantic
role classification. Moreover, we present a composite kernel to
integrate
feature-based and tree kernel-based methods. Experimental results show
that
the composite kernel outperforms the previous best-reported methods.
Title: ARE&D: A Discourse-based
Multi-resolution Framework
for Information Extraction on Free Text
Abstract: Extraction of relations between
entities is an important
part of Information
Extraction on free text. Previous methods are mostly based on
statistical
correlation and dependency relations between entities. This paper
re-examines the problem at the multi-resolution layers of phrase,
clause and
sentence using dependency and discourse relations. Our multi-resolution
framework ARE&D (Anchor and Relation and Discourse analysis)
uses clausal
relations in 2 ways: 1) to filter noisy dependency paths; and 2) to
increase
reliability of dependency path extraction. The resulting system
outperforms
the previous approaches by 3%, 7%, 4% on MUC4, MUC6 and ACE RDC domains
respectively.
|
| 25 Apr |
Hendra Setiawan (NUS, Institute for
Infocomm
Research I2R) / Ordering Phrases with
Function Words
Function words are a
class of words that have little intrinsic meaning but
are vital in expressing grammatical relationships among phrases within
a sentence. Such encoded grammatical information, often implicit,
makes function words pivotal in modeling structural divergences, as
projecting them into different languages often results in long-range
structural changes to the realized sentences. This distinctive feature
has not been fully utilized to address the phrase ordering problem in the
context of statistical machine translation (SMT). We observe that, just
as foreign language learners often make mistakes in using function
words, current SMT systems often perform poorly in ordering function
words' arguments; lexically correct translations often end up
reordered incorrectly.
In this talk, I will present a Function
Words centered, Syntax-based
(FWS) solution to address the phrase ordering problem, including its
statistical formalism, its implementation and experimental results.
|
Slides
(.htm) |
| 18 Apr @ MR 1 (S16 Lvl 5) **Note special
place. |
Bang Viet Nguyen (NUS) and Lin Ziheng
(NUS) / Functional Faceted Web Query Analysis and Timestamped
Graphs: Evolutionary Models of Text for Multi-document Summarization
1st talk: We propose
a faceted classification
scheme for web queries.
Unlike previous work, our functional scheme ties its classification to
actionable strategies for search engines to take. Our scheme consists
of four facets: ambiguity, authority
sensitivity, temporal sensitivity and spatial sensitivity. We
hypothesize that the classification of queries into such facets yields
insight on user intent and information needs. To validate our
classification scheme, we asked users to annotate queries with respect
to our facets and obtained high agreement. We also assess the coverage
of our faceted classification on a random sample of queries from logs.
Finally, we discuss the algorithmic approaches we take in our current
work to automate such faceted classification.
2nd talk: In this talk, I will present a new
graph-based approach to text understanding and summarization. Current
graph-based approaches to automatic text summarization, such as LexRank
and TextRank, assume a static graph which does not model how the input
texts emerge. A suitable evolutionary text graph model may impart a
better understanding of the texts and improve the summarization
process. We give simplified assumptions of human writing and reading
processes, and then propose a timestamped graph (TSG) model that is
motivated by these processes and show how text units in this model
emerge over time. This model not only captures the evolving process of
text within a document, but also the evolving process across documents.
In our model, the graphs used by LexRank and TextRank are specific
instances of our timestamped graph with particular parameter settings.
|
1st Talk: Slides
(.htm)
2nd Talk: Slides
(.htm) |
| 16 Apr, 3-4pm, @ TR20 (S15 #02-07) **Note
special time and place. |
Lan Man
(NUS/I2R) /
A New Term Weighting Method for Text Categorization
Text representation
is the task of transforming
the content of a textual document into a compact representation of its
content so that the document could be recognized and classified by a
computer or a classifier. This thesis focuses on the development of an
effective and efficient term weighting method for the text categorization
task. We selected the single token as the unit of feature because
previous research showed that this simple type of feature
outperformed other, more complicated types of features.
We have investigated several widely-used
unsupervised and supervised term weighting methods on several popular
data collections in combination with SVM and kNN algorithms. In
consideration of the distribution of relevant documents in the
collection and analysis of the term's discriminating power, we have
proposed a new term weighting scheme, namely tf.rf. The controlled
experimental results show that the term weighting methods perform
inconsistently across data sets with different category distributions and
across different learning algorithms. Most of the supervised term weighting
methods which are based on information theory have not shown
satisfactory performance according to our experimental results.
However, the newly proposed tf.rf method shows a consistently better
performance than other term weighting methods. On the other hand, the
popularly used tf.idf method has not shown a uniformly good performance
with respect to different category distribution data sets.
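The abstract does not define tf.rf. The sketch below assumes the form usually given in published descriptions of the method, rf = log2(2 + a / max(1, c)), where a is the number of positive-category training documents containing the term and c is the number of negative-category documents containing it. Treat the formula, the function name and the parameter names as assumptions rather than as the thesis's exact definition.

```python
import math


def tf_rf(term_freq, pos_docs_with_term, neg_docs_with_term):
    """Term frequency times relevance frequency (assumed formula, see lead-in)."""
    rf = math.log2(2.0 + pos_docs_with_term / max(1, neg_docs_with_term))
    return term_freq * rf


# A term concentrated in the positive category gets a larger weight ...
discriminative = tf_rf(term_freq=3, pos_docs_with_term=2, neg_docs_with_term=1)
# ... than a term that appears only in negative documents.
uninformative = tf_rf(term_freq=3, pos_docs_with_term=0, neg_docs_with_term=5)
```

Unlike idf, which depends only on how many documents contain a term, this weight depends on how the term is distributed between the positive and negative categories, which matches the abstract's emphasis on the term's discriminating power.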
|
Slides (.htm) Set 1
Set 2
|
| 11 Apr |
Qiu
Long (NUS) / A Graph Approach to Scenario Template
Generation
A Scenario Template
is a data structure that
reflects the salient aspects shared by a set of events, which are
similar enough to be considered as belonging to the same scenario. The
salient aspects are typically the scenario's characteristic actions,
the entities involved in these actions and the related attributes. Such
a scenario template, once populated with respect to a particular event,
serves as a concise overview of the event. It also provides valuable
information for applications such as information extraction (IE), text
summarization, etc.
Manually defining scenario templates is
expensive,
and we aim to automate this template generation process. We argue
that context is valuable to identify semantically similar text spans,
from which template slots could be generalized. To leverage context, we
convert news articles into a graphical representation and then apply a
generic context-sensitive clustering (CSC) framework to get meaningful
clusters of text spans by examining the intrinsic and extrinsic
similarities between them. We use the Expectation-Maximization
algorithm to guide the clustering process. The experiments show that:
1) our approach generates high quality clusters, and 2) information
extracted from the clusters is adequate to build high coverage
templates.
|
Slides
(May not be available outside of NUS) |
| 2 Apr
(**note special date) |
Chen Jinxiu
(NUS,
Institute for Infocomm Research I2R) /
Automatic
Relation Extraction among Named Entities from Text Contents
This thesis studies
the task of Relation
Extraction, which has received more and more attention in recent
years. The task of relation extraction is to identify various semantic
relations between named entities from text contents. With the rapid
increase of various textual data, relation extraction will play an
important role in many areas, such as Question Answering, Ontology
Construction, and Bioinformatics.
The goal of our research is to reduce the
manual effort and
automate the process of relation extraction. To this end,
we investigate semi-supervised and unsupervised
learning solutions that rival supervised learning methods, solving the
problem of relation extraction with minimal human cost while still
achieving performance comparable to supervised learning methods.
First, we presented a Label Propagation
(LP) based semi-supervised
learning algorithm for the relation extraction problem that learns from both
labeled and unlabeled data. It represents labeled and unlabeled
examples and their distances as the nodes and the weights of edges of
a graph, then propagates the label information from any vertex to
nearby vertices through weighted edges iteratively, and finally infers
the labels of unlabeled examples after the propagation process
converges.
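The propagation procedure just described can be sketched on a toy graph. This is a generic label propagation loop, not the thesis implementation: the edge weights are given directly as similarities, labeled nodes are clamped after every step, and the fixed iteration count, class names and function name are assumptions made for illustration.

```python
def label_propagation(weights, seed_labels, classes, iterations=100):
    """Spread labels from labeled nodes to unlabeled ones over a weighted graph.

    weights: symmetric matrix (list of lists) of edge weights between examples.
    seed_labels: dict mapping the index of a labeled example to its class.
    """
    n = len(weights)
    # Row-normalize the weights into transition probabilities.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total if total else 0.0 for w in row])

    def one_hot(label):
        return [1.0 if c == label else 0.0 for c in classes]

    dist = [one_hot(seed_labels[i]) if i in seed_labels else [0.0] * len(classes)
            for i in range(n)]

    for _ in range(iterations):
        # Each node takes the weighted average of its neighbours' distributions.
        dist = [
            [sum(trans[i][j] * dist[j][k] for j in range(n))
             for k in range(len(classes))]
            for i in range(n)
        ]
        # Clamp the labeled examples so their labels never drift.
        for i, label in seed_labels.items():
            dist[i] = one_hot(label)

    return [classes[max(range(len(classes)), key=lambda k: d[k])] for d in dist]


# Four examples on a chain 0 - 1 - 2 - 3; only the two ends are labeled.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
labels = label_propagation(chain, {0: "EMPLOYER", 3: "LOCATED"},
                           classes=["EMPLOYER", "LOCATED"])
```

Each unlabeled example ends up with the label of the seed it is more strongly connected to, which is how a handful of labeled relation instances can label many unlabeled ones.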
Secondly, we introduced an unsupervised
learning algorithm based
on model order identification for automatic relation extraction. The
model order identification is achieved by resampling based stability
analysis and used to infer the number of relation types between entity
pairs automatically.
Thirdly, we further investigated
an unsupervised learning solution
for relation disambiguation using a graph-based strategy. We defined the
unsupervised relation disambiguation task for entity mention pairs as
a partition of a graph so that entity pairs that are more similar to
each other belong to the same cluster. We apply spectral clustering
to resolve the problem, which is a relaxation of this NP-hard discrete
graph partitioning problem. It works by calculating eigenvectors of an
adjacency graph's Laplacian to recover a submanifold of data from a
high-dimensional space and then performing cluster number
estimation on such spectral information.
The thesis evaluates the proposed methods
for extracting relations
among named entities automatically, using the ACE corpus. The
experimental results indicate that our methods can overcome the
problem of not having enough manually labeled relation instances for
supervised relation extraction methods. The results show that when
only a few labeled examples are available, our LP based relation
extraction can achieve better performance than SVM and another
bootstrapping method. Moreover, our unsupervised approaches can
achieve order identification capabilities and outperform the previous
unsupervised methods. The results also suggest that all four
categories of lexical and syntactic features used in the study are
useful for the relation extraction task.
|
| 28 Mar |
Che Wanxiang (Harbin Institute of
Technology,
Institute for Infocomm Research I2R) / A
Hybrid
Convolution Tree Kernel for Semantic Role Labeling
... and
...
Sun Chengjie (Harbin Institute of Technology, Institute for
Infocomm Research I2R) / Using Maximum
Entropy to
Recognize Name Origin in Machine Transliteration
1st talk: As a kind
of shallow semantic parsing, Semantic
Role Labeling (SRL) is receiving increasing attention and shows
good prospects for application to a wide range of natural language processing
problems. I will first show a demo to explain what
semantic role labeling is. Usually, feature-based methods using
feature vectors are the state-of-the-art methods for semantic role
labeling. However, these methods, which are widely used in the natural
language processing field, have difficulty modeling structural
features, e.g. the useful Path features for semantic role labeling. As
an extension of the feature-based methods, kernel-based methods are
able to do this efficiently in a much higher dimension. The convolution
tree kernel, a special kind of kernel, has been used in semantic role
labeling. The conventional convolution tree kernel, which selects the
tree portion of a predicate and one of its arguments as the feature space,
is named predicate-argument feature (PAF). However, the integral
view of PAF is not suitable for semantic role labeling. A hybrid
convolution tree kernel is proposed to model syntactic tree structure
features more effectively. The hybrid kernel consists of two
individual convolution kernels: a Path kernel, which captures
predicate-argument link features, and a Constituent Structure kernel,
which captures the syntactic structure features of arguments.
Evaluation on the data sets of CoNLL-2005 SRL shared task shows that
our novel hybrid convolution tree kernel significantly outperforms the
previous tree kernels. We further provide a composite kernel combining
our hybrid tree kernel with the polynomial kernel using standard flat
feature vector. The experimental results show that the composite
kernel achieves better performance than each of the individual
methods.
and
2nd talk: Name origin recognition is the task of
identifying
the original source of a name. It is a necessary step for name
translation/transliteration because different origins need
different translation strategies. It is more important when
translating across languages with different alphabets and sound
inventories. Previous work used rule-based or statistics-based
methods to solve this problem. In this work, we cast name origin
recognition as a multi-class classification task and propose to use
a Maximum Entropy model to solve it. Experiments show that our approach
can achieve an overall accuracy of 98.35% for names written in English and
98.10% for names written in Chinese, which is much better than
previous methods.
|
Slides (1st talk)
Slides (2nd talk)
(.pdf, open to all hosts in TLD .sg) |
| 28 Feb, 3-4pm @ TR9 (S16 #03-09) **Note
special time and place. |
Mstislav Maslennikov (NUS) / A
Multi-resolution Framework for Information Extraction from Free Text
Extraction of
relations between entities is an important
part of IE on free text. Previous methods are mostly based
on statistical correlation and dependency relations
between entities. This paper re-examines the problem
at the multi-resolution layers of phrase, clauses and
sentences using dependency and discourse relations.
Our multi-resolution framework uses clausal relations
in 2 ways: 1) to filter noisy dependency paths;
and 2) to increase reliability of dependency path
extraction. The resulting system outperforms the
previous approaches by 3%, 7%, 4% on MUC4,
MUC6 and ACE RDC domains respectively. |
Slides
(.pdf) |
| 22 Feb (Note special date, time and place
(2-3pm, SR 5, S16 Lvl 4)) |
Graeme Hirst
(University of Toronto) / Fine-grained differences and
similarities in meanings
Writing or speaking
requires making choices from words and syntactic
constructions that have similar but not identical meanings. Are two
parties "foes" or "enemies"? Did John meet Mary or was Mary met by
John? An important component of language understanding is recognizing
the implications of the nuances in the speaker's or writer's choices.
I will describe our research on computational aspects of linguistic
nuance, focusing on the differentiation of near-synonyms and on the
consequences that arise for knowledge representation formalisms. In
addition, I will discuss how contemporary views of meaning in
computational linguistics need to be broadened to take into account
the choices that the speaker or writer makes. |
Slides
(.pdf, Internal to NUS only) |
| 5 Feb (**10:00-11:00am, note special time) |
Yin Xinyi (NUS) / Random Walk and
Web Information Processing for Mobile Devices
Accessing web pages
from a mobile device is becoming very valuable, especially for
people constantly on the move. However, the small screen, limited
memory, and the slow
wireless connection make the surfing experience on mobile devices
unacceptable to most
people. In this thesis, we aim to solve three fundamental challenges in
the mobile Internet:
web page content ranking, web content classification, and web article
summarization.
We propose a new method to solve these three fundamental challenges. As
a
web page is too complex to analyze as a whole, we will first divide the
entire web page
into basic elements such as text blocks, pictures, etc. Next, based on
the relationship
between the elements, we will connect the elements with edges to make a
graph. Finally,
we will use random walk methods to provide solutions for the three
challenges.
The main contribution of this thesis is a graph and random walk based
framework for
Internet information processing. It is shown to be very simple and
effective. For example,
our experiments of web page ranking show that from randomly selected
websites, the
system need only deliver 39% of the objects in a web page in order to
fulfill 85% of a
viewer's desired viewing content. In the experiments of web content
classification, the
system generates good performance with the F value for main content and
advertisement
(A) as high as 0.93 and 0.82 respectively. In the experiments of text
summarization, on
the well-accepted dataset for single document summarization,
the graph and
random walk based text summarization system outperformed the results
of all
participants of the conference. |
Slides
(.htm) |
| 30 Jan (10:00-11:00 am, note special time) |
Upali Kohomban
(NUS) / Application of Generic Sense Classes in Word Sense
Disambiguation
Word Sense
Disambiguation (WSD)
is a problem in Natural Language Processing concerned with identifying
the correct meaning of a word used in a given context. Over time,
supervised machine learning has consistently shown better performance
in WSD, compared to unsupervised learning. However, the supervised approach
to WSD faces the serious problem of the knowledge acquisition
bottleneck, or the difficulty of acquiring enough labeled training data
for learning classifiers. This problem is exacerbated by several facts,
including the large number of fine-grained senses in contemporary
lexicons, the need for training data for each individual polysemous word, and the
high cost of manually sense-labeling training examples. Our research
focuses on an approach to find a workaround to this problem, by
exploiting the usage similarities of different words. We propose using
a generalized and coarse-grained set of senses at classifier level, and
then using lexicon-induced heuristics to convert the resulting classes
into fine-grained senses. The generic nature of the sense classes
allows labeled training examples from different words to be
used for learning the classes, effectively increasing the amount of
available training data. We discuss how the noise due to generalization
can be reduced by using a semantic-similarity-based weighting strategy,
and show, using WordNet lexicographer files as generic classes, that
this approach can yield state-of-the-art WSD performance with sparse
training data. Further, we argue that human-created, taxonomy-based
class schemas such as WordNet lexicographer files are not ideal for
supervised learning, as they are not necessarily coherent with the
contextual usage patterns, which are available for the classifier as
features. In addition, they have undesirable properties that result in
high losses during the class to fine-grained sense conversion. We
propose using clustering techniques to automatically create generic
sense classes that are aimed at better performance of WSD as an
end-task, and show that such classes can improve WSD performance
over manually created classes. |
Slides
(.htm) |
| Date |
Speaker / Title |
Notes /
Slides |
|
2006
|
| 27 Dec (10:30-11:30 am, note special time) |
Ng Hwee Tou
(NUS) / One Class per Named Entity: Exploiting Unlabeled Text
for Named Entity Recognition
In this talk, I will
present a simple yet novel method of exploiting
unlabeled text to further improve the accuracy of a high-performance
state-of-the-art named entity recognition (NER) system. The method
utilizes the empirical property that many named entities occur in one
name class only. Using only unlabeled text as the additional resource,
our improved NER system achieves an F1 score of 87.13%, an improvement
of 1.17% in F1 score and an 8.3% error reduction on the CoNLL 2003
English NER official test set. This accuracy places our NER system
among the top 3 systems in the CoNLL 2003 English shared task. This
work was done jointly with Wong Yingchuan. |
Slides
(.pdf) |
| 14 Dec (2:00-3:00 pm) |
Lee Dongwon
(IST, PSU) / Name Disambiguation in Digital Libraries
When the names of
people are used as unique identifiers, problems often
arise -- different people may share the same name spelling,
or one person may have several names or spellings in use. As searching
by a person's name is one of the most common query types in Digital
Libraries and the WWW (about 30%), it becomes increasingly important to
have clean name data in such systems. In this talk, I will first
present various types of ambiguous names drawn from real Digital
Libraries. Then, I will discuss various approaches for identifying and
fixing such ambiguous names -- syntactic, semantic, and Google-based
approaches.
This talk borrows material from my recent
work in IQIS'05, JCDL'06,
ICDM'06, and ICDE'07, which is the result of joint work with several
students and collaborators:
Ergin Elmacioglu (Penn State), Min-Yen Kan
(NUS), Jaewoo Kang (Korea
U.), Nick Koudas (U. Toronto), Byung-Won On (Penn State), Jian Pei
(Simon Fraser U.), Divesh Srivastava (AT&T Labs -- Research),
Yee Fan
Tan (NUS)
|
Slides
(external link to .ppt) |
| 20 Oct (2-2:30pm, **Special Date, Time and
Place, SR 4: SoC 1 06-12) |
Chia Tee Kiah
(NUS) / Probabilistic Lattice-Based Spoken Document Retrieval
Spoken Document
Retrieval involves finding from within a collection
of spoken documents (e.g. voice mails, news broadcasts) the
documents which satisfy a given information need. One way to
represent a spoken document for this task is the lattice -- a
directed acyclic graph in which each path corresponds to a hypothesis of the
words spoken in the document. In this talk I present a method for
using word statistics derived from lattices in a probabilistic
retrieval algorithm to perform spoken document retrieval. Results
comparing the performance of this approach with that of using only the
1-best speech recognizer transcription are also presented.
|
Slides
(.pdf) |
| 19 Oct (11am-12n, **Special Date and Time) |
Liu Ting (Harbin Institute of Technology) /
Language Technology Platform (LTP) and WSD
based on Equivalent
Pseudoword
I will present the
architecture of an XML-based Chinese processing
platform for web applications, named the Language Technology
Platform (LTP). It has five main components: a suite of DLL
modules for the DOM tree, the Language Technology Markup Language (LTML), a
suite of visualization tools, language corpora based on LTML, and a web
service for LTP. LTP currently integrates ten key Chinese processing
modules covering morphology, word sense, syntax, and document analysis. A
suite of systematic tools is supplied for beginners in natural
language processing and information retrieval. Using it, they can
study the relationships between all levels as well as some advanced topics.
Currently, the platform has been shared with more than 60 research labs
around the world. The other topic of my talk is WSD. I will present a
new approach based on Equivalent Pseudowords (EPs) to tackle Word
Sense Disambiguation (WSD) in Chinese. EPs are particular
artificial ambiguous words, which can be used to realize unsupervised
WSD. A Bayesian classifier is implemented to test the efficacy of the
EP solution on the Senseval-3 Chinese test set. The performance is better
than state-of-the-art results with an average F-measure of 0.80. The
experiment verifies the value of EP for unsupervised WSD.
|
No slides available |
| 19 Oct (10-11am, **Special Date and Time) |
Wong Kam-Fai
(SEEM, CUHK) / A Phonetic-Based Approach to Chinese Chat Text
Normalization
Web 2.0 is the
latest trend in the World Wide Web. In the first part of
my seminar, I shall review the social characteristics of this paradigm
and how suitable it is for the Asian community. In the second part, I
shall focus on a particular communication means on Web 2.0, namely
chatting, e.g. via ICQ, chat rooms, etc. A unique dialect is commonly
used for chatting. I refer to it as the Chat Language (CL). CL is
different from natural languages due to its anomalous and dynamic
nature, which renders conventional NLP tools inapplicable for
analyzing CL. The language changes frequently, quickly rendering contemporary
chat language corpora outdated. To address this dynamic
language problem in Chinese, we propose a phonetic language model to
map between chat terms and standard words via phonetic transcription,
i.e. Chinese Pinyin in our case. Unlike grapheme mapping,
phonetic mapping can be constructed from available standard Chinese
corpora. For term normalization, i.e. translating a chat term to its
natural language counterpart, we extend the source channel model by
incorporating the phonetic mapping model. Experimental results show
that this method is effective and robust. |
No slides available |
| 14 Aug (Mon, 3:00-4:30 pm, ** Special date
and time) |
David
Chiang (ISI/USC) / An introduction to synchronous
grammars
Synchronous grammars
are rapidly gaining importance for modeling
machine translation and other complex language transformations. It has
therefore become useful to understand their basic formal properties.
Many advances in NLP in the 1990s exploited basic algorithms for
probabilistic finite-state transducers, whose theory is well
understood and widely taught. The analogous theory for trees is less
widely known but well developed, with roots going back to the 1960s.
In this tutorial, we aim to (1) cover the literature of synchronous
grammars, (2) describe how they relate to current NLP applications,
such as machine translation, and (3) discuss some new theoretical and
algorithmic problems raised by these applications, and some recent
solutions.
This talk is part of a tutorial given with
Kevin Knight at ACL 2006
|
No slides available |
| 24 Jul (2:00-3:00 pm) (** Special
date and
time) |
John Prager (IBM T.J. Watson Labs) / Improving
Question-Answering Precision by asking More and Better
Questions
If we define a QA
system as a system which takes a natural-language
question, searches a text corpus and returns a ranked list of
answers,
then we can broadly discern two ways in which accuracy can be
increased:
intrinsically, by generating better candidate lists (by e.g. more
accurate
entity recognition, deeper parsing, better pattern-matching and/or more
judicious choice of keywords in search), or extrinsically, by
re-evaluating
and re-shaping such answer lists by reference to other QA methods or
other
data sources. This talk is about approaches of each kind that we are
using
at IBM Research to improve the accuracy of our QA system. I will first
describe the semantic information we build into the search-engine index
from
running text analytics on the corpus. In addition to text tokens, we
index
types, typed tokens and relations. I will present the results of
several
evaluations demonstrating how such "Semantic Search" can increase
precision.
As far as extrinsic methods go, leading QA
systems employ a variety of means
to boost accuracy. Such methods include redundancy (getting the same
answer
from multiple documents/sources), inferencing (proving the answer from
information in texts plus background knowledge) and sanity-checking
(verifying that answers are consistent with known facts). To our
knowledge,
however, no other QA system deliberately asks additional questions in
order
to derive constraints on the answers to the original questions. We
present
two variations on this idea. The first is the method of
QA-by-Dossier-with-Constraints (QDC), which is an extension of the
simpler
method of QA-by-Dossier, in which definitional questions ("Who/what is
X?")
are addressed by asking a set of questions about anticipated properties
of
X. In QDC, the collection of Dossier candidate answers is subjected to
satisfying a set of naturally-arising constraints. For example, for a
"Who
is X?" question, the system will ask about birth, accomplishment and
death
dates, which if they exist, must occur in that order, and also obey
other
constraints such as lifespan. Temporal, spatial and kinship
relationships
seem to be parti | |