Corpora
— written, spoken, transcribed data for natural language analysis and use
The twenty newsgroup collection is often used for machine learning benchmarks. It was installed locally at SoC to test the bow machine learning package.
Installed at
corpora/text-corpora/20_newsgroups/
by kanmy
on Jan 13, 2003.
Maintained by kanmy.
Language: English.
Four stoplists downloaded from the web. See the README.html file in the directory for more information.
Installed at
corpora/text-corpora/stopwordLists/
by kanmy
on May 28, 2003.
Maintained by kanmy.
Language: English.
Status: restricted.
Data for bootstrapping Information Extraction.
Installed at
corpora/learning-datasets/7sectors
from
source
on Mar 19, 2003.
Link structure of Spanish, U.K., Taiwanese and Australian Universities. See the local copy of the original description HTML file (
http://cybermetrics.wlv.ac.uk/database/) from University of Wolverhampton.
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development. Visit the
ANC site for more details.
Installed at
corpora/text-corpora/anc
from
ANC site
on Aug 24, 2004.
Status: Unknown.
TREC QA (AQUAINT) Data for 2002/2003. The corpus comprises data from the New York Times, the Xinhua news service and the Associated Press. See the index.html file in the directory for more details.
Installed at
corpora/text-corpora/aquaint
by kanmy
on May 06, 2003.
Maintained by kanmy.
Language: English.
Status: Access is restricted to TREC participants only.
This is a mostly cleaned corpus of 80 computational linguistics articles that have been marked up for argumentative zoning relations. You can learn more about this from
Simone's home page or from
Yee Seng Chan's (search for "zoning") Digital Library course project.
Installed at
corpora/text-corpora/zoning or corpora/metadata/zoning or tools/citationTools/zoning
from
Simone Teufel's site
by kanmy
on Apr 09, 2005.
Maintained by kanmy.
Language: English.
Status: this is a pre-distribution copy from Simone Teufel. It is not for public use. Contact the maintainer if you would like to use this resource.
A web document clustering dataset, provided free of charge from the University of Reading.
Installed at
corpora/text-corpora/banksearchdataset
from
University of Reading
by kanmy
on Aug 07, 2003.
Maintained by kanmy.
Language: Any.
Status: Freely downloadable from the web.
Tagged and untagged queries.
Installed at corpora/queries/bbcVideo
on Oct 13, 2004.
Status: Restricted.
BLOG06 consists of a crawl of 100,649 RSS and Atom feeds, over an 11 week period (a total of 77 days). The collection consists of one directory for each day of the collection. From the Information Retrieval Group - Test Collections, University of Glasgow.
Installed at corpora/text-corpora/trec/blogdata
on Nov 08, 2007.
Status: Unknown.
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. See the home page of the BNC at
http://www.natcorp.ox.ac.uk/ for more details. We have a five year license for this product.
Installed at
corpora/text-corpora/BNC-World
from
University of Oxford
by kanmy
on May 11, 2004.
Maintained by kanmy.
Language: English.
Status: Limited for research purposes, see the maintainers for details if you wish to utilize this corpus. The texts and documentation are installed but the SARA utility has not been compiled nor set up.
The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000; it was later corrected and released in 2001 as Chinese Treebank 2.0. More information about the project is available on the Penn Chinese Treebank website at:
http://www.cis.upenn.edu/%7Echinese/ .
Installed at
corpora/languages/chinese/text-corpora/treebank
from
Penn Chinese Treebank
by kanmy
on Apr 21, 2004.
Maintained by kanmy.
Language: Chinese.
Status: restricted access to researchers (as per LDC policy).
OAI records are in two formats: (1) oai_dc.tar.gz, which follows the Dublin Core metadata standard, and (2) oai_citeseer.tar.gz, which extends the Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.
Installed at corpora/text-corpora/citeseerOAI
on Jun 09, 2004.
Status: Unknown.
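As a rough illustration of reading Dublin Core metadata, the sketch below pulls titles and creators out of a single record with Python's standard library. The sample record and the function name `extract_dc_fields` are hypothetical; the actual layout of the files inside oai_dc.tar.gz has not been checked here, so treat the element nesting as an assumption.

```python
import xml.etree.ElementTree as ET

# Standard namespace URI for Dublin Core elements.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def extract_dc_fields(xml_text, fields=("title", "creator")):
    """Collect the text of selected Dublin Core elements from one record."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in fields:
        # ".//" searches the whole record, wherever the dc elements are nested.
        record[field] = [
            (el.text or "").strip()
            for el in root.iterfind(f".//dc:{field}", DC_NS)
        ]
    return record


# Hypothetical record, made up for illustration only.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:creator>First Author</dc:creator>
  <dc:creator>Second Author</dc:creator>
</oai_dc:dc>
"""

if __name__ == "__main__":
    print(extract_dc_fields(SAMPLE_RECORD))
```

The extra fields in the oai_citeseer variant would need their own namespace mapping, which is not shown because its schema is not described in this listing.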
This is the data from Andrew McCallum's home page on the scientific search engine CORA. It includes the citation matching, research paper classification and information extraction datasets.
This is a subset of the WebKB text classification corpus containing both the hyperlinks and the documents, with judgments classifying the web pages into two categories, course and non-course. The relevant web page has been downloaded into the root directory.
Installed at
corpora/learning-datasets/course-cotrain-data or corpora/text-corpora/course-cotrain-data
from
source
by kanmy
on Apr 14, 2003.
Maintained by kanmy.
Language: English.
Unknown MARC data.
Installed at corpora/metadata/cuMARC
on Jul 11, 2007.
Status: Restricted.
These are the XML records of the entire
DBLP database. The copy here is dated from Jul 18, 2005.
Installed at
corpora/metadata/dblp or corpora/text-corpora/DBLP
from
source
by kanmy
on Jul 19, 2005.
Maintained by kanmy.
Language: English.
Status: freely available for all to use.
Data (mostly testing data) from the Document Understanding Conference for the years 2001-2007. This is a summarization competition, held by NIST of the USA. You might also check out the DUC-processed files, see
localInstallations.html. See the
DUC web site for details.
Installed at
corpora/text-corpora/duc/
from
DUC
by qiul
on Oct 05, 2007.
Maintained by kanmy.
Language: English.
Status: restricted to academic research. You have to sign an individual agreement with NIST before the data can be released to you. See the maintainer for details.
The 2,477 million queries for Excite on Dec 20, 1999. For research purposes only. Anyone connected to corporate research may not use this data. Access is restricted.
Installed at
corpora/queries/excite/
by kanmy
on Feb 07, 2003.
Maintained by kanmy.
Language: English.
Status: Access is restricted.
This FTP publication contains the Hong Kong News Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T46, ISBN 1-58563-169-8. The Hong Kong News Parallel Text was created when the LDC collected parallel Cantonese - English news articles from the Information Services Department of Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
Installed at
corpora/text-corpora/hksar_news
by kanmy
on Dec 15, 2003.
Maintained by kanmy.
Language: English/Chinese.
Another subset of the WebKB text classification corpus as used in the ILP 98 paper. See the root directory README for more details.
Installed at
corpora/learning-datasets/ilp or corpora/text-corpora/ilp
by kanmy
on Apr 14, 2003.
Maintained by kanmy.
Language: English.
The ISL Meeting Corpus Part 1 is a first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes and averages at 34 minutes. The audio files are available as ISL Meeting Speech Part 1. See the home page for the corpus at:
http://wave.ldc.upenn.edu/Catalog/docs/LDC2004T10/.
Installed at
corpora/text-corpora/meeting-transcripts/isl_meeting_transcripts
from
source
by kanmy
on Jun 03, 2004.
Maintained by kanmy.
Language: English.
Status: An LDC corpus. Use restricted to LDC members.
Jansen search logs.
Installed at corpora/queries/jansenSearchLogs
on Mar 01, 2006.
Status: Restricted.
Recommendation Data Set from
the PhD thesis of Lawrence Kai Shih, November 17th 2003. All the files are SQL commands that can be imported directly into MySQL. The data was collected from the 176-person user study.
Installed at corpora/relevance-judgments/webpageSegmentation
on Nov 18, 2003.
Status: Unknown.
Installed at
corpora/text-corpora/gigaword
from
source
by kanmy
on Nov 21, 2003.
Maintained by kanmy.
Language: English.
Status: restricted access within the department only, as per LDC's policy.
This MARC data was captured from the NUS LINC training system in September-October 2004. See the README in the directory for more details.
Installed at corpora/metadata/nusMARC
on Sep 28, 2004.
Status: Restricted; not for use outside of NUS.
This dataset consists of 5801 pairs of sentences gleaned over a period of 18 months from thousands of news sources on the web. Accompanying each pair is a judgment reflecting whether multiple human annotators considered the two sentences to be close enough in meaning to be considered close paraphrases. For more information, please visit
Microsoft Research Paraphrase Corpus web site.
Installed at
corpora/text-corpora/MSRParaphraseCorpus
from
Microsoft Research Paraphrase Corpus
by qiul
on Sep 29, 2005.
Maintained by qiul.
Status: protected under Microsoft Research Shared Source license agreement ("MSR-SSLA").
Miscellaneous list of items from an unclear source. See the directory listing for details.
Installed at
corpora/gazetteers/miscLists
by qiul
on Mar 11, 2005.
Maintained by qiul.
Status: Limited to research purposes only.
MITRE's CBC4Kids corpus of online news stories for teenagers.
Installed at corpora/text-corpora/CBC4Kids
on Dec 09, 2003.
Status: Unknown.
Installed at
text-corpora/mobyShakespeare
from
source
by kanmy
on Jul 03, 2003.
Maintained by kanmy.
Language: English.
Status: in the public domain, do with it as you please.
Unknown details.
Installed at corpora/text-corpora/MovieReview
on Mar 01, 2004.
Two datasets used for collaborative filtering research. The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. Before using these datasets, please review the included readme files for the usage license. More information is available from the GroupLens webpage:
http://www.grouplens.org/.
Installed at
corpora/relevance-judgments/collab-filtering/movielens
from
source
by kanmy
on Jun 01, 2004.
Maintained by kanmy.
Status: Publicly available from their web site.
This corpus contains 530 news articles manually annotated using an annotation scheme for opinions and other private states (e.g., beliefs, emotions, sentiment, speculation, etc). The annotation of the corpus was performed by 5 trained annotators over a period of about 15 months.
Installed at
corpora/text-corpora/MPQA/
by cuihang
on Mar 05, 2004.
Maintained by cuihang.
Status: restricted access.
Message Understanding Conference 6 data, from the
Linguistic Data Consortium. See the README file in the source directory for details.
Installed at
corpora/text-corpora/muc6
by kanmy
on Oct 01, 2003.
Maintained by kanmy.
Language: English.
Status: restricted to research use only, as per LDC policy.
All the dataset files related to the MultiLing 2011 Pilot at TAC. This includes source texts, human summaries, system summaries, and evaluation data. The dataset is derived from publicly available WikiNews (http://www.wikinews.org/) English texts. The source texts were under CC Attribution Licence V2.5 (cf. http://creativecommons.org/licenses/by/2.5/). Texts in other languages have been translated by native speakers of each language.
Installed at corpora/text-corpora/tac/2011/summarization/Multi Lingual Summarization
by Praveen bysani
on Apr 10, 2012.
Language: Arabic, Czech, English, French, Greek, Hebrew, Hindi.
Contains text from the Wall Street Journal, Reuters, New York Times and the LA Times-Washington Post News Service.
Installed at
corpora/text-corpora/nantc
by kanmy
on Jan 21, 2003.
Maintained by kanmy.
Language: English.
Status: Only NUS members can access this corpus, as per LDC's policies.
NPIC is a research project that performs image classification, especially for synthetic (i.e., non-photographic) images. NPIC works by supervised machine learning on datasets noisily created from image search engine results. This is the image corpus built for NPIC.
Installed at
corpora/image/npic
from
NPIC site
on May 23, 2006.
This is a list of about 700K online public access catalog (OPAC) queries collected by the Nanyang Technological University (NTU) OPAC server in 2002.
Installed at
corpora/queries/ntuOPAC
by kanmy
on Jun 30, 2005.
Maintained by kanmy.
Language: mostly English.
Status: for research staff only. Not for re-distribution or commercial use. Contact the maintainer for details.
About 800K queries from the simple keyword interface for the LINC online catalog system of NUS. Collection of queries is likely ongoing. Provided by NUS Libraries.
Installed at
corpora/queries/nusInnopac/
by kanmy
on Apr 10, 2003.
Maintained by kanmy.
Language: English.
Status: For research purposes only.
The ODP is a large, open-source, human-edited directory similar to Yahoo!. The data is distributed under GNU GPL and is provided here for IR research purposes. See their
web page for more details.
Installed at
corpora/metadata/odp
by kanmy
on Jan 03, 2003.
Maintained by kanmy.
Language: English.
Status: data is distributed under GNU GPL and provided for IR research purposes.
OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic data, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus. (Manual corrections have not been made.) See the home page for more details and for their online search interface:
http://logos.uio.no/opus/
Installed at
corpora/text-corpora/parallel/opus-v0.2
by kanmy
on Feb 05, 2005.
Maintained by kanmy.
Language: Many.
Status: Openly available from the web page.
Installed at
corpora/text-corpora/Pascal
by qiul
on Sep 24, 2005.
Maintained by qiul.
Status: freely available for all to use.
The goal of the project is to develop a large scale corpus annotated with information related to discourse structure. Penn Discourse Treebank Version 2.0 contains annotations of discourse relations and their arguments on the one million word Wall Street Journal (WSJ) data in
Treebank-2 (LDC95T7).
Installed at corpora/text-corpora/PennDiscourseTreebank2.0
on Feb 29, 2008.
Status: Unknown.
The Penn Treebank contains Wall Street Journal text that has been tagged and parsed by both machines and linguists. It is a benchmark corpus for parsing and part-of-speech tagging tasks. The installation includes binaries for searching tree nodes (e.g., tgrep).
Installed at
corpora/text-corpora/treebank
by kanmy
on Jan 21, 2003.
Maintained by kanmy.
Language: English.
Status: Only NUS members can access this corpus, as per LDC's policies.
The PropBank project is creating a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations are being added to the syntactic trees of the
Penn Treebank. See
http://www.cis.upenn.edu/~ace/ for details.
Installed at
corpora/text-corpora/PropBank
by cuihang
on Aug 22, 2003.
Maintained by cuihang.
Language: English.
Status: restricted.
Unknown source.
Installed at corpora/queries/questionAnswering
on Aug 07, 2007.
Status: Restricted.
Unknown details.
Installed at corpora/remedia_release
on Jun 21, 2002.
Installed at
corpora/learning-datasets/reuters21578
from
source
by kanmy
on Jan 19, 2003.
Maintained by kanmy.
Language: English.
Installed at corpora/text-corpora/rcv1
on Sep 08, 2004.
Status: Unknown.
A corpus of about 10.1K SMS messages collected by How Yijue as part of her honors year thesis work. Please see How Yijue's thesis for more documentation.
Installed at
corpora/text-corpora/sms/
from
source
by kanmy
on Apr 28, 2004.
Maintained by kanmy.
Language: mostly English.
Status: open to all under a license similar to the Open Directory Project license.
Summary corpus linked to the HKSAR news corpus. Produced and studied extensively by one of the JHU Workshops in 2001. More information about the corpus is at:
http://www.summarization.com/summbank/.
Installed at
corpora/text-corpora/summbank
by kanmy
on Dec 15, 2003.
Maintained by kanmy.
Language: English/Chinese.
Status: Restricted to LDC members; open only for general academic research.
A list of 23K+ English surnames compiled from the rootsweb mailing list. See the local README file for more information.
Installed at
corpora/gazetteers/surnames/
by kanmy
on May 06, 2005.
Maintained by kanmy.
Language: English.
Status: Available on the web, locally post-processed for use.
Installed at
corpora/queries/trec*
by kanmy
on Jan 09, 2003.
Maintained by kanmy.
Language: English.
Status: Currently available for research purposes, cleared by TREC administrators.
The PH Corpus is a cleaned up, segmented version of the Mandarin Chinese corpus compiled by Guo Jin. It contains 2,447,7719 words of news text published by Xinhua News Agency between January 1990 and March 1991.
Installed at
corpora/languages/chinese/text-corpora/ph
from
source
on Oct 12, 2004.
Language: Chinese.
The TIPSTER Text research collections were used extensively for the Text Retrieval Conferences (TREC). Still a good source of text corpora for the research community.
Installed at
corpora/text-corpora/tipster
by kanmy
on Jan 21, 2003.
Maintained by kanmy.
Language: English.
Status: Only NUS members can access this corpus, as per LDC's policies.
The TDT dataset is used for Topic Detection & Tracking (TDT) research. Currently, TDT2, used for 1998 TDT test; TDT3, used for 1999 ~ 2001 TDT tests; and TDT4, used for 2002 ~ 2003 TDT tests are installed. Please refer to
http://www.nist.gov/speech/tests/tdt/index.htm for details of TDT research.
Installed at corpora/text-corpora/TDT
by zhangya
on Jun 22, 2005.
Maintained by zhangya.
Language: English & Chinese.
Status: Only NUS members can access this corpus, as per LDC's policies.
Questions used in TREC 2003 QA main task, including factoid, list and definition questions, as well as their judgments.
Installed at
corpora/queries/trec12.questions
by cuihang
on Nov 21, 2003.
Maintained by cuihang.
Unknown source.
Installed at corpora/relevance-judgments/trecWeb
on Feb 13, 2003.
Status: Unknown.
United Nations Code for Trade and Transport Locations
Installed at corpora/gazetteers/un-locode
on Oct 18, 2005.
These are two 10 GB and 2 GB corpora used by the TREC web track. Compiled by CSIRO. See the directory for more information. More details on the corpus can be found on the TREC website and at the CSIRO website.
Installed at
corpora/text-corpora/wt[10|2]g
by kanmy
on Aug 08, 2003.
Maintained by kanmy.
Language: English.
Status: Restricted access. Anyone wishing to use this corpus must sign an individual license agreement before proceeding.
Crawled web pages of biographies.
Installed at
corpora/text-corpora/biographies
by cuihang
on Jun 19, 2003.
Maintained by cuihang.
Status: restricted.
Web1T
—
#Corpora @Sunfire
Unknown details.
Installed at corpora/text-statistics/web1T
on Jul 20, 2007.
Statistics on the Stanford WebBase corpus as compiled by UC Berkeley. Scripts and files that compute the IDF value of words over 133 M web pages are included. Big file!
Installed at
corpora/text-statistics/webBase/
by kanmy
on Jun 06, 2003.
Maintained by kanmy.
Language: Any.
Status: open to all.
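For readers unfamiliar with the statistic, the sketch below computes inverse document frequency (IDF) over a toy collection. It assumes the common unsmoothed definition idf(t) = log(N / df(t)); the function name `compute_idf` and the sample pages are hypothetical, and the installed scripts may use a different formula or smoothing.

```python
import math
from collections import Counter


def compute_idf(documents):
    """Return {term: log(N / df)} for an iterable of tokenized documents.

    Assumption: plain (unsmoothed) IDF; the WebBase scripts may differ.
    """
    documents = list(documents)
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


if __name__ == "__main__":
    toy_pages = [
        ["web", "search", "engine"],
        ["web", "page", "ranking"],
        ["web", "crawler"],
    ]
    for term, value in sorted(compute_idf(toy_pages).items()):
        print(f"{term}\t{value:.3f}")
```

A term that occurs in every document gets an IDF of zero, which is why very common words contribute little to IDF-weighted scores.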
Installed at
corpora/learning-datasets/webkb or corpora/text-corpora/webkb
by kanmy
on Apr 14, 2003.
Maintained by kanmy.
Language: English.
Unknown details.
Installed at corpora/text-corpora/wikipedia
on Oct 10, 2006.
The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. For more information, consult the TREC data home page,
http://trec.nist.gov/data.html.
Installed at
corpora/text-corpora/trec/ohsu-trec/
by kanmy
on Nov 07, 2003.
Maintained by kanmy.
Status: open for all to use, as publicly available for download from NIST's web site.
The World Gazetteer provides a comprehensive set of population data and related statistics. See
http://world-gazetteer.com/ for details.
Installed at corpora/gazetteers/worldgazetteer
on Dec 21, 2004.
Proceedings
— proceedings and workshop notes from previous research congresses in IR and NLP
ACL 2003
—
#Proceedings @WING(cte) @Sunfire
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo Convention Center, Sapporo, Japan, 7-12 July 2003.
Installed at
proceedings/acl-2003
by kanmy
on Jul 21, 2003.
Maintained by kanmy.
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain.
Installed at proceedings/acl-2004
on Aug 07, 2004.
Unknown details.
Installed at proceedings/acl-anthology
on Aug 16, 2006.
Proceedings of the ACL-EACL Conference, Student Research Workshop, Workshops and local information.
Installed at
proceedings/aclEacl-2001
by kanmy
on Jan 03, 2003.
Maintained by kanmy.
Proceedings of the 10th ACM International Conference on Multimedia (MM2002) - Juan-les-Pins, France, December 1 - 6 2002.
Installed at
proceedings/ACM-Multimedia-2002
by cuihang
on May 26, 2003.
Maintained by kanmy.
Proceedings of the 12th ACM International Conference on Multimedia, October 10-16, 2004, New York, NY, USA.
Installed at proceedings/ACM-Multimedia-2004
on Oct 26, 2004.
Proceedings of the 13th ACM International Conference on Multimedia, November 6-11, 2005, Singapore.
Installed at proceedings/ACM-Multimedia-2005
on Dec 23, 2009.
CHI 2009
—
#Proceedings @WING(cte) @Sunfire
Installed at
proceedings/chi-2009
by kanmy
on May 07, 2009.
Maintained by kanmy.
Status: restricted to local use, copyrighted by ACM.
Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004.
Installed at proceedings/cikm04
on Nov 10, 2004.
Proceedings of the 20th International Conference on Computational Linguistics at the University of Geneva, Switzerland, on August 23rd-27th, 2004.
Installed at proceedings/COLING-2004
on Aug 31, 2004.
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia, from 17th-21st July 2006.
Installed at
proceedings/coling-acl-2006
by qiul
on Jul 26, 2006.
Maintained by qiul.
Proceedings of the 11th European Association for Computational Linguistics 2006 meeting and associated workshops. Trento, Italy, April 3-7, 2006.
Installed at
proceedings/eacl-2006
by kanmy
on Apr 11, 2006.
Maintained by kanmy.
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Sydney, Australia, from 22nd-23rd July 2006.
Installed at
proceedings/emnlp-2006
by qiul
on Jul 26, 2006.
Maintained by qiul.
Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing, held in Vancouver, B.C., Canada, October 6-8, 2005.
Installed at
proceedings/EMNLP_HLT-2005
by qiul
on Oct 16, 2005.
Maintained by qiul.
These are the proceedings of the HCI International conference held in Caesar's Palace, Las Vegas, USA on July 22-27, 2005. HCII is formed of 7 different meetings that are colocated:
* Symposium on Human Interface (Japan) 2005
* 6th International Conference on Engineering Psychology & Cognitive Ergonomics
* 3rd International Conference on Universal Access in Human-Computer Interaction
* 1st International Conference on Virtual Reality
* 1st International Conference on Usability and Internationalization
* 1st International Conference on Online Communities and Social Computing
* 1st International Conference on Augmented Cognition
Installed at
proceedings/HCII-2005
by kanmy
on Jul 29, 2005.
Maintained by kanmy.
The Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2004) - Boston, USA, 2-7 May 2004.
Installed at
proceedings/HLT-NAACL-2004
by kanmy
on May 31, 2004.
Maintained by kanmy.
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, April 22-27, 2007, Rochester, New York, USA.
Installed at proceedings/hlt-naacl-2007
on Apr 30, 2007.
Proceedings of the Third International Joint Conference on Natural Language Processing, January 7-12, 2008, Hyderabad, India.
Installed at proceedings/ijcnlp-2008
on Jan 10, 2008.
The proceedings for the Language Resources and Evaluation Conference, held in the Canary Islands, Spain, in May 2002. Contains workshop and poster session papers as well.
Installed at
proceedings/lrec-2002
by kanmy
on Jan 21, 2003.
Maintained by kanmy.
The proceedings of the Language Resources and Evaluation Conference, held in Lisbon, Portugal, in May 2004. Contains workshop and poster session papers as well.
Installed at
proceedings/lrec-2004
by qiul
on Jun 03, 2004.
Maintained by qiul.
Proceedings of the sixth international conference on Language Resources and Evaluation, 28-30 May 2008, in Marrakech.
Installed at proceedings/lrec-2008
on Jun 01, 2008.
Proceedings of the KDD 01 workshop.
Installed at
nowhere
by kanmy
on Aug 23, 2003.
Maintained by kanmy.
Proceedings of the KDD 02 workshop.
Installed at
nowhere
by kanmy
on Aug 23, 2003.
Maintained by kanmy.
The Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001) - Carnegie Mellon University - Pittsburgh, PA USA 2-7 June 2001.
Installed at
proceedings/naacl-2001
by kanmy
on Jan 03, 2003.
Maintained by kanmy.
Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-03), Seoul, Korea, April 30 - May 2, 2003.
Installed at proceedings/pakdd-2003
on May 16, 2003.
Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007.
Installed at proceedings/sigir-2007
on Oct 24, 2007.
Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009.
Installed at proceedings/sigir-2009
on Dec 23, 2009.
Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, June 17-18, 2004, Maison de la Chimie, Paris, France.
Installed at proceedings/webDB-2004
on Aug 04, 2004.
WWW 2003
—
#Proceedings @WING(cte) @Sunfire
The Twelfth International World Wide Web Conference (WWW-2003) - Budapest, Hungary, 20-24 May 2003. The proceedings contain 77 refereed papers, 207 posters and 38 alternate track papers.
Installed at
proceedings/WWW-2003
by cuihang
on May 26, 2003.
Maintained by kanmy.
WWW 2004
—
#Proceedings @WING(cte) @Sunfire
The Thirteenth International World Wide Web Conference (WWW-2004) - New York, USA, 17-22 May 2004.
Installed at
proceedings/WWW-2004
by kanmy
on May 31, 2004.
Maintained by kanmy.
Tools
— a large list of language analysis and generation tools, including parsers, chunkers, part-of-speech taggers, etc
The 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data (Linguistic Data Consortium (LDC) catalog number LDC2009T05, ISBN 1-58563-508-1):
NIST MetricsMATR is a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics that correlate highly with human assessments of MT quality. See index.html for more details.
Installed at tools/evalTools/metricsMATR08
on Mar 09, 2009.
Unknown details.
Installed at tools/internetTools/linkAnalysis
on May 14, 2003.
Alignment-Based Learning (ABL) is a symbolic grammar inference framework that
has successfully been applied to several unsupervised machine learning tasks
in Natural Language Processing (NLP). Given sequences of symbols only, a
system that implements ABL induces structure by aligning and comparing the
input sequences. As a result, the input sequences are augmented with the
induced structure. See README or
http://www.ics.mq.edu.au/~menno/research/software/abl/ for more details.
Installed at tools/frameworks/abl-1.0
on Dec 21, 2006.
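The alignment idea can be sketched in a few lines: align two sentences and treat the parts where they differ as candidate constituents. The sketch below uses Python's difflib as a stand-in aligner; it is not the ABL package's own alignment or selection code, and the function name `hypothesize_constituents` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def hypothesize_constituents(sentence_a, sentence_b):
    """Align two token lists and return the spans where they differ.

    Each result is (tokens_from_a, tokens_from_b); under the alignment
    idea, such interchangeable spans are candidate constituents.
    """
    matcher = SequenceMatcher(a=sentence_a, b=sentence_b, autojunk=False)
    hypotheses = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            hypotheses.append(
                (sentence_a[a_start:a_end], sentence_b[b_start:b_end])
            )
    return hypotheses


if __name__ == "__main__":
    first = "show me flights from boston".split()
    second = "show me fares from boston".split()
    for span_a, span_b in hypothesize_constituents(first, second):
        print(span_a, "<->", span_b)
```

A full system would also have to choose among overlapping hypotheses gathered from many sentence pairs; that selection step is omitted here.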
Ant
—
#Tools @WING(cte) @Sunfire
The build utility for Java projects. From
http://ant.apache.org/. You may need to unset your CLASSPATH to get this tool running properly.
Installed at
tools/buildTools/apache-ant/
by kanmy
on Dec 20, 2004.
Maintained by kanmy.
Language: Any.
Status: Open source available software.
Installed at tools/finiteState/fsm
on Dec 12, 2003.
Utility to draw finite state transducers, acceptors, and machines. See their homepage at
http://www.research.att.com/sw/tools/graphviz/. Installation notes: really a pain to install, requires gd library package and a working jpeg lib (had to install jpeg 6b patch).
Installed at
tools/drawingTools/graphviz/
by kanmy
on Nov 07, 2003.
Maintained by kanmy.
Status: installed, untested.
BoosTexter is a machine learning algorithm that computes a classifier from simple single level decision trees (a.k.a. decision stumps) via boosting.
Installed at
tools/leaners/BoosTexter
by kanmy
on Jan 19, 2003.
Maintained by kanmy.
Language: Any.
Status: installed, not tested. Use restricted to research only.
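To make "boosting over decision stumps" concrete, here is a minimal sketch of plain discrete AdaBoost with word-presence stumps for a two-class problem. It is not BoosTexter's own algorithm or interface; the function names, the toy documents and the +1/-1 label encoding are all assumptions made for illustration.

```python
import math


def stump_predict(word, polarity, doc):
    """Predict `polarity` if the word is present, `-polarity` otherwise."""
    return polarity if word in doc else -polarity


def train_adaboost_stumps(docs, labels, rounds=5):
    """Discrete AdaBoost; docs are sets of tokens, labels are +1 or -1.

    Returns a list of (alpha, word, polarity) weak hypotheses.
    """
    n = len(docs)
    weights = [1.0 / n] * n
    vocabulary = sorted(set().union(*docs))
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for word in vocabulary:
            for polarity in (1, -1):
                error = sum(
                    w for w, d, y in zip(weights, docs, labels)
                    if stump_predict(word, polarity, d) != y
                )
                if best is None or error < best[0]:
                    best = (error, word, polarity)
        error, word, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, word, polarity))
        # Re-weight so that misclassified documents count more next round.
        weights = [
            w * math.exp(-alpha * y * stump_predict(word, polarity, d))
            for w, d, y in zip(weights, docs, labels)
        ]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return ensemble


def classify(ensemble, doc):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(w, p, doc) for a, w, p in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    docs = [{"goal", "match"}, {"goal", "team"},
            {"vote", "election"}, {"vote", "party"}]
    labels = [1, 1, -1, -1]
    model = train_adaboost_stumps(docs, labels)
    print(classify(model, {"goal", "league"}))
```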
Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students. The library provides facilities for:
* Recursively descending directories, finding text files.
* Finding 'document' boundaries when there are multiple documents per file.
* Tokenizing a text file, according to several different methods.
* Including N-grams among the tokens.
* Mapping strings to integers and back again, very efficiently.
* Building a sparse matrix of document/token counts.
* Pruning vocabulary by word counts or by information gain.
* Building and manipulating word vectors.
* Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
* Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
* Scoring queries for retrieval or classification.
* Writing all data structures to disk in a compact format.
* Reading the document/token matrix from disk in an efficient, sparse fashion.
* Performing test/train splits, and automatic classification tests.
* Operating in server mode, receiving and answering queries over a socket.
The code conforms to the GNU coding standards. It is released under the Library GNU Public License (LGPL). Home Page:
http://www-2.cs.cmu.edu/~mccallum/bow.
Installed at
tools/learners/bow-20020213
by kanmy
on Dec 30, 2002.
Maintained by kanmy.
Language: Any.
Status: installed but currently broken on the local system.
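To make two of the facilities listed for Bow concrete (mapping strings to integers, building sparse document/token counts, and Laplace smoothing), here is a small sketch. It does not use libbow or mirror its C API; the function names and the whitespace tokenizer are assumptions.

```python
from collections import Counter

def build_counts(docs):
    """Map tokens to integer ids and build one sparse count row per document."""
    vocab = {}   # token -> integer id
    rows = []    # one {token_id: count} dict per document
    for text in docs:
        row = Counter()
        for token in text.lower().split():
            idx = vocab.setdefault(token, len(vocab))
            row[idx] += 1
        rows.append(dict(row))
    return vocab, rows

def laplace_probs(rows, vocab_size):
    """Laplace (add-one) estimate: P(w) = (count(w) + 1) / (N + V)."""
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds the per-document counts together
    n = sum(totals.values())
    return [(totals[i] + 1) / (n + vocab_size) for i in range(vocab_size)]

vocab, rows = build_counts(["the cat sat", "the dog"])
probs = laplace_probs(rows, len(vocab))
```

Only non-zero counts are stored per document, which is the point of the sparse representation; unseen words still receive non-zero probability after smoothing.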
A converter that colorizes C code and writes it out as HTML markup. From
Ashley Clark's Debian Linux package. Compiles fine on Solaris.
Installed at
tools/htmlTools/c2html/
by kanmy
on Aug 02, 2003.
Maintained by kanmy.
Language: Any.
Status: GNU GPL.
The classic decision tree learner by Quinlan. Superseded by his commercial C5.0 product. Handles numerical and categorical features. More information from
http://www.cse.unsw.edu.au/~quinlan/.
Installed at
tools/learners/c4.5
by kanmy
on Jan 19, 2003.
Maintained by kanmy.
Language: Any.
Status: installed and tested. Works fine.
CFUF is a graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter. Developed by Michael Elhadad and Mark Kharitonov.
Installed at tools/generators/cfuf
on Jun 13, 2004.
Installed at
nowhere
by kanmy
on Dec 30, 2002.
Maintained by kanmy.
Language: English.
Status: Currently installed and working.
Installed at
tools/parsers/COLLINS-PARSER
by kanmy
on Dec 30, 2002.
Maintained by kanmy.
Language: English.
Status: Currently installed and working. See also in this file the daemonized version of the Collins parser.
Coloring works by processing an input HTML file or a URL. The output is the original file with extra JavaScript added and its <A HREF> tags altered so that the text can be annotated. A user can then annotate this file in a JavaScript-enabled browser by simply highlighting spans (starting on a word and ending on a word) and selecting an appropriate
annotation from the annotation pane. The user can also annotate images with the same tags by clicking on them directly. See README in directory for more details.
Installed at tools/annotationTools/coloring
on Nov 08, 2008.
Described in Gupta et al.'s paper in WWW 2003.
Installed at
tools/htmlTools/proxy
by kanmy
on Jun 03, 2003.
Maintained by kanmy.
Status: Restricted license for research purposes only, contact the maintainer for access to this tool.
The modified Collins parser as made available by Min-Yen Kan of NUS. Modified so that the parser loads its hash tables once and stays resident (as a background daemon process), letting it parse multiple files without having to reload the hash tables each time. See the on-line
README for details.
Installed at
tools/parsers/daemonCollins
by kanmy
on Aug 04, 2003.
Maintained by kanmy.
Language: English.
Status: Currently installed and working. See also in this file the original version of the Collins parser.
The DUCView tool is pertinent to the creation of a model pyramid from multiple human summaries. It is not relevant if you are interested in peer annotation, that is, in evaluating a new summary against the pyramid. Specifically for DUC 2005, participants will receive already annotated pyramids and will do only peer annotation. See
DUCView site for more details.
Installed at tools/evalTools/DucView
on Jul 22, 2005.
The Autobib project proposes and implements a framework for extracting and integrating bibliographic information on the Web automatically using Hidden Markov Models. Here, you will find code and documentation related to this project, and you can also browse the experimental bibliographic data and check its quality. This project was done in the Computer Science Department at Duke University, under the supervision of Prof. Jun Yang.
Installed at
tools/internetTools/autobib
from
source
by kanmy
on Jul 13, 2005.
Maintained by kanmy.
Language: English.
Status: freely available data.
This is a toolkit of perl scripts to manipulate and (hopefully)
recover a hierarchy of headers/topics from HTML files. The resulting
output is a document topic tree (or variously called a document map,
or document structure tree). The toolkit here is an extraction-based
method that looks for what seem like stand-alone phrases that may be
headers. The toolkit is constructed in a serial pipeline fashion.
Installed at tools/htmlTools/extractDTT
on Jun 06, 2003.
FUF
Installed at
tools/generators/fuf
by kanmy
on Dec 28, 2002.
Maintained by kanmy.
Language: Any.
Status: untested on the local system. Runs in LISP.
GATE
The General Architecture for Text Engineering from the
University of Sheffield's NLP group. Provides a GUI for tools that do named entity tagging, part of speech tagging, co-reference, and other things. Is a bit slow; is implemented in Java. You will want to see the online documentation at their site. The information extraction system, ANNIE (A Nearly-New Information Extraction system), comes as part of the installation.
Installed at
tools/frameworks/gate
by kanmy
on Jul 03, 2003.
Maintained by kanmy.
Language: Any.
Status: Is under GPL, so it is free for all. Works fine.
Tools for working with Google search, developed for local deployment within NUS SoC only. For more
information, contact the authors.
Installed at tools/internetTools/googleTools
on May 30, 2003.
API for accessing the Google search results, preferable to screen / page scraping. You need to register with Google in order to use this service. They require individual registration. Home page at:
http://www.google.com/apis/
Installed at
tools/internetTools/googleapi
by kanmy
on Jan 10, 2003.
Maintained by kanmy.
Language: Any.
Status: tested, okay on the local system.
The Grok build system is based on Jakarta Ant, which is a Java
building tool originally developed for the Jakarta Tomcat project but
now used in many other Apache projects and extended by many
developers.
Installed at tools/frameworks/grok
on Oct 23, 2001.
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples. See
http://htk.eng.cam.ac.uk/ for more information.
Installed at
tools/frameworks/htk/
by kanmy
on Jan 22, 2005.
Maintained by kanmy.
Status: restricted use, you have to be a registered user on the HTK site in order to use this software. Please abide by the usage agreements before using this software.
Xerox part-of-speech tagger. XPOST is a hidden Markov model based part-of-speech tagger. Given a sentence, each token is assigned a part-of-speech ambiguity class from a lexicon (e.g. "package" is in the ambiguity class {noun,verb}). Words not in the lexicon are subjected to suffix analysis. A probabilistic model that assesses the likelihood of particular part-of-speech assignments based on word order is then applied to disambiguate the available choices. The final output is a sentence with each word tagged with the most likely part-of-speech tag. XPOST can process all the languages for which word order predicts part-of-speech tag. FTP site at:
ftp://ftp.parc.xerox.com/pub/tagger/. Use within Common LISP.
Installed at
tools/taggers/xpost-1.2
by kanmy
on Dec 30, 2002.
Maintained by kanmy.
Language: English.
Status: currently tested and working.
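The disambiguation step described for XPOST can be sketched as Viterbi decoding in which the lexicon restricts each word to its ambiguity class. This is a simplified illustration, not XPOST's Lisp code: it scores tag-to-tag transitions only, and the function name, the probability tables, the fallback tag set for unknown words (standing in for suffix analysis) and the small floor probability are all assumptions.

```python
def viterbi_tags(words, lexicon, trans, start, default_tags, floor=1e-6):
    """Pick the most likely tag sequence, limited to each word's ambiguity class."""
    if not words:
        return []
    # Candidate tags per word; unknown words fall back to default_tags.
    cands = [sorted(lexicon.get(w, default_tags)) for w in words]
    # best[i][tag] = (probability of best path ending in tag, previous tag)
    best = [{t: (start.get(t, floor), None) for t in cands[0]}]
    for i in range(1, len(words)):
        layer = {}
        for t in cands[i]:
            layer[t] = max(
                ((best[i - 1][p][0] * trans.get((p, t), floor), p)
                 for p in cands[i - 1]),
                key=lambda pair: pair[0])
        best.append(layer)
    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        tags.append(tag)
    return list(reversed(tags))

lexicon = {"the": {"det"}, "package": {"noun", "verb"}, "arrived": {"verb"}}
trans = {("det", "noun"): 0.9, ("det", "verb"): 0.1,
         ("noun", "verb"): 0.8, ("verb", "verb"): 0.1}
start = {"det": 0.8, "noun": 0.1, "verb": 0.1}
tags = viterbi_tags(["the", "package", "arrived"], lexicon, trans, start,
                    default_tags={"noun"})
```

Here "package" is in the ambiguity class {noun, verb}, and the transition scores resolve it to noun after a determiner.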
The Language detector is used to detect the language of an HTML webpage. See README for more details.
Installed at tools/languages/languageDetector
on Dec 04, 2003.
The chunker partitions plain text into sequences of semantically related words. The type of partition is also computed. The installed version is in perl. See README for more details.
Installed at
tools/chunkers/shallow-parser
from
CCG
on Jul 20, 2004.
JavaRAP is an implementation of the classic Resolution of Anaphora Procedure (RAP) given by
Lappin and Leass (1994). It resolves third person pronouns and lexical anaphors, and identifies pleonastic pronouns. The original purpose of the implementation was to provide anaphora resolution results to our
TREC 2003 Q&A system. See
the site for more details.
Installed at
tools/anaphoraResolvers/JavaRAP or tools/coreference/JavaRAP
from
JavaRAP site
on Mar 30, 2007.
The KEA keyphrase extractor. Extracts keyphrases from a document, much like the index terms assigned to scientific papers. Uses the Lovins stemmer. Described in more detail at
http://www.nzdl.org/Kea/.
Installed at
tools/chunkers/KEA-2.0
by kanmy
on Sep 18, 2003.
Maintained by kanmy.
Language: English.
Status: Installed but not tested. Distributed under GNU GPL by the New Zealand DL group.
Klex is a finite-state lexical transducer for the Korean language, with the lexical string on the upper side and the inflected surface string on the lower side. Klex was developed on the XFST (Xerox Finite State Tool) software platform. Developed by Na-Rae Han. Homepage at:
http://www.cis.upenn.edu/~nrh/klex.html.
Installed at
tools/languages/korean/morphologyTools/klex
by kanmy
on Apr 21, 2004.
Maintained by kanmy.
Language: Korean.
Status: restricted access to researchers (as per LDC policy).
Klex: Finite-State Lexical Transducer for Korean was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004L01, ISBN 1-58563-283-x. Klex is a finite-state lexical transducer for the Korean language,
with the lexical string on the upper side and the inflected surface
string on the lower side. Klex was developed on the XFST (Xerox Finite
State Tool) software platform, developed and distributed by the Xerox
Corporation. The most common application for such lexical transducers is
morphological analysis and generation.
Installed at tools/languages/korean/morphologyTools/klex
on May 11, 2004.
Language: Korean.
Installed at tools/frameworks/HIT_IRLab_LTP_Sharing_Package_Full_v1.1
on Nov 19, 2006.
This is a chunker and statistical tool for simplex noun phrases (SNPs). We present a linguistically-motivated technique for the recognition and grouping of simplex noun phrases, called LinkIT. Our system has two key features: (1) we efficiently gather minimal NPs, i.e. SNPs, as precisely and linguistically defined and motivated in our paper; (2) we apply a refined set of postprocessing rules to these SNPs to link them within a document. The identification of SNPs is performed using a finite state machine compiled from a regular expression grammar, and the process of ranking the candidate significant topics uses frequency information that is gathered in a single pass through the document. The paper
Document Processing with LinkIT, was published in RIAO 2000. Also mentioned in
Automatic identification and organization of index terms for interactive browsing.
Installed at
tools/chunkers/LinkIT
by kanmy
on Dec 06, 2003.
Maintained by kanmy.
Status: restricted to academic use.
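The idea of recognizing SNPs with a regular expression grammar over part-of-speech tags can be sketched as follows. The grammar here (optional determiner, any adjectives, one or more nouns) and the tag codes are assumptions chosen for illustration; LinkIT's actual grammar is defined in its paper and is not reproduced here.

```python
import re

# Hypothetical one-letter codes for a few Penn Treebank tags.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}
# Assumed toy grammar: optional determiner, adjectives, then nouns.
SNP_PATTERN = re.compile(r"D?J*N+")

def find_snps(tagged):
    """tagged: list of (word, tag) pairs; returns the matched noun phrases."""
    # Encode the tag sequence as a string so a compiled regex can scan it.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in SNP_PATTERN.finditer(codes)]

sentence = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
            ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS")]
phrases = find_snps(sentence)
```

Because every tag maps to exactly one character, match offsets in the code string line up with token positions in the sentence.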
Installed at
tools/stemmers/Lovins_Java
by kanmy
on Sep 18, 2003.
Maintained by kanmy.
Language: English.
Status: GNU GPL. Perl and Java versions installed and working; C version downloaded, but it does not currently compile.
Adwait Ratnaparkhi's Maximum-Entropy based tagger, as per his 1997 ACL paper. This tool outputs the format expected by Collins' parser (also locally installed). Note that you have to pass the input texts in on standard input.
Installed at
tools/taggers/mxTag
by kanmy
on Jul 03, 2003.
Maintained by kanmy.
Language: English.
Status: Restricted to research, educational and academic use only. Currently works without any problems.
A tool to help in the development of tagged corpora. Uses a Tcl interface. See the AWB home page for more details at
http://www.mitre.org/tech/alembic-workbench/. Usage notes: go to the directory and source the awb.cshrc or awb.bashrc file before running the awb utility.
Installed at
tools/frameworks/awb/
by kanmy
on Nov 07, 2003.
Maintained by kanmy.
Status: For research purposes only. Cannot be used for commercial development.
Tools for inflectional morphological analysis and generation, and for determining the orthography of the indefinite article are now available. Written by John Carroll of the University of Sussex. See the
home page for more information.
Installed at
tools/morphers/morph/
by kanmy
on Jun 15, 2004.
Maintained by kanmy.
Language: English.
Status: free for academic and research purposes from Carroll's tool home page.
A set of web crawlers: UrlBasedFocusedCrawler, BreadthFirstCrawler and PageTextBasedFocusedCrawler. See README for more details.
Installed at tools/internetTools/nguyent6Spider
on May 23, 2005.
Installed at
tools/parsers/nlparser
by tanyeefa
on Jun 17, 2005.
Maintained by tanyeefa.
Language: English.
Status: Currently installed and working. Free for use for any non-commercial purposes.
The OpenNLP build system is based on Jakarta Ant, which is a Java
building tool originally developed for the Jakarta Tomcat project but
now used in many other Apache projects and extended by many
developers.
Installed at tools/frameworks/opennlp
on Mar 25, 2002.
The opennlp.maxent package is a mature Java package for training and using maximum entropy models. The documentation has some details about maximum entropy and using the opennlp.maxent package. It is updated only periodically, so check out the Sourceforge page for Maxent for the latest news. You can also ask questions and join in discussions on the forums.
Installed at
tools/learners/maxent
from
Sourceforge
by kanmy
on Oct 14, 2010.
Maintained by kanmy.
Status: publicly available from sourceforge.
Citation parser tool; Part of the OpCit project. See
site for details.
Installed at tools/citationTools/opcit_modules
on Jul 30, 2003.
Status: Unknown.
Perl version 5.8.0. Was installed because I couldn't find it on sf3. Have downloaded and quickinstalled a slew of modules for NLP/IR research. See the
complete listings of installed modules here. See the documentation on installing new Perl modules at the end of this file; email the maintainer for more information on installing the files. Modules of particular interest to NLP/IR people include the WordNet::QueryData, WordNet::Similarity modules.
Installed at
tools/languages/programming/perl-5.8.0
by kanmy
on Jun 02, 2003.
Maintained by kanmy.
Perl version 5.8.2. Have downloaded and quickinstalled a slew of modules for NLP/IR research, mostly mirroring the 5.8.0 installation. See the
complete listings of installed modules here. See the documentation on installing new Perl modules at the end of this file; email the maintainer for more information on installing the files. See also notes for Perl 5.8.0 above.
Installed at
tools/languages/programming/perl-5.8.2
by kanmy
on Dec 23, 2003.
Maintained by kanmy.
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems. Detailed description and a host of downloadable versions of it in different languages can be found at
Porter Stemming Algorithm.
Installed at
tools/stemmers/Porter
by qiul
on Sep 19, 2003.
Maintained by qiul.
Language: English.
Status: ANSI C thread-safe version installed and working.
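As a small taste of how the Porter stemmer strips endings, here is a sketch of its first rule group only (step 1a, which handles plural endings). It is an illustration, not the installed ANSI C version; the full algorithm has several further steps that depend on a measure of the stem, which are omitted here.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm: plural suffix rules."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
```

The rules are tried longest suffix first, so "caress" is protected by the "ss" rule before the bare "s" rule can fire.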
Versions 0.1 and 2.2 are installed. This is a PostScript-to-text converter, developed by the NZDL group. I believe this is the converter used by Google for PDF files too.
Installed at
tools/formatTools/prescript
by kanmy
on Jun 25, 2003.
Maintained by kanmy.
Language: Any.
Status: Currently installed but NOT working.
Python
The Python programming language. An older version, installed centrally on sf3/sunfire, can be found at /opt/sfw/bin/python.
Installed at
tools/languages/programming/python
by kanmy
on Sep 08, 2003.
Maintained by kanmy.
Status: Public-domain, downloaded from Sourceforge.
Deprecated with version 2.5.2
Installed at
tools/languages/programming/python-2.3
by kanmy
on Sep 08, 2003.
Maintained by kanmy.
Status: Public-domain, downloaded from Sourceforge.
ROUGE is an automated summarization evaluation program used by NIST in the DUC conferences to evaluate summarization systems. It is based on the BLEU machine translation scoring metric. See
http://www.isi.edu/~cyl/ROUGE/ for more information.
Installed at
tools/evalTools/rouge
by kanmy
on Sep 21, 2005.
Maintained by kanmy.
Status: open to the research community.
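For orientation, the core n-gram overlap idea behind ROUGE can be sketched as a recall computation against a single reference summary. This is not the installed ROUGE script; the function names and whitespace tokenization are assumptions, and the real tool offers several more variants and options.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match at the number of times the candidate contains it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

r1 = rouge_n_recall("the cat sat on the mat", "the cat was on the mat", n=1)
r2 = rouge_n_recall("the cat sat on the mat", "the cat was on the mat", n=2)
```

Dividing by the reference count rather than the candidate count is what makes the score recall-oriented.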
The Ruby programming language.
Installed at
tools/languages/programming/ruby-1.8.7
by kanmy
on Oct 23, 2008.
Maintained by kanmy.
Status: Public-domain, downloaded from Sourceforge.
An open-source Java package containing implementations for approximate string-matching techniques, such as Jaccard, Jaro and TF-IDF. Home page:
http://secondstring.sourceforge.net/
Installed at
tools/citationTools/secondstring
by tanyeefa
on Aug 27, 2005.
Maintained by tanyeefa.
Status: released under the University of Illinois/NCSA Open Source License.
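To show what one of the named measures computes, here is a sketch of token-level Jaccard similarity. It is written in Python for illustration and is not SecondString's Java API; the function name and whitespace tokenization are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

same_name = jaccard("Min-Yen Kan", "Kan Min-Yen")
partial = jaccard("a b c", "b c d")
```

Because the measure works on sets, token order is ignored, which is useful for matching reordered names in citations.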
Min-Yen Kan's linear topical segmentation program, as described in Coling-ACL 1998.
Installed at
tools/segmenters/segmenter/
by kanmy
on Jul 21, 2003.
Maintained by kanmy.
Language: Any language with word delimiters.
Status: working, available for research use only.
Installed at
tools/parsers/SPADE
by cuihang
on Feb 17, 2003.
Maintained by cuihang.
Status: works well, but it must be run under the bash shell rather than the C shell.
SMART is an implementation of the vector-space model of
information retrieval proposed by Salton back in the 60's. The
primary purpose of SMART is to provide a framework in which to
conduct information retrieval research. Standard versions of
indexing, retrieval, and evaluation are provided.
Installed at tools/frameworks/ir/smart-11.0
on Nov 07, 2003.
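The vector-space model that SMART implements can be illustrated with a short ranking sketch: documents and the query become weighted term vectors and are compared by cosine similarity. This does not reflect SMART's own code or weighting schemes; the function names and the particular TF-IDF variant (raw term frequency times log(N/df) + 1) are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # assumed idf variant
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document indices ordered from most to least similar."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0)
         for t, c in Counter(query.lower().split()).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

docs = ["information retrieval research",
        "speech recognition toolkit",
        "retrieval of documents"]
order = rank("information retrieval", docs)
```

The document sharing both query terms ranks first, the one sharing a single term second, and the unrelated one last.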
Installed at
tools/taggers/SNOW_UIUC
by cuihang
on Jun 19, 2003.
Maintained by cuihang.
Language: English.
Status: Currently installed and working.
SOAP::Lite for Perl is a collection of Perl modules which provides a simple
and lightweight interface to the Simple Object Access Protocol (SOAP) both
on client and server side. To learn about SOAP, go to
http://www.soaplite.com/#LINKS for more
information.
Installed at tools/internetTools/perlModules/SOAP-Lite
on Apr 16, 2002.
SVMlight is an implementation of Vapnik's Support Vector Machine for the problem of pattern recognition, for the problem of regression, and for the problem of learning a ranking function. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently. Home page:
http://svmlight.joachims.org/
Installed at
tools/learners/svmLight-5.0
by kanmy
on Dec 30, 2002.
Maintained by kanmy.
Language: Any.
Status: Works.
Installed at
tools/languages/programming/pl
by kanmy
on Mar 10, 2005.
Maintained by kanmy.
Status: LGPL. Free for use.
A software system providing a simple command language, and a set of widgets for use in building GUIs. Home page:
http://www.tcl.tk/. Tcl/Tk was installed because WordNet 2.1 requires it in order to install, and only Tcl (not Tk) is found on sf3.
Installed at
tools/languages/programming/tcltk
by tanyeefa
on Jul 24, 2005.
Maintained by tanyeefa.
Status: Installed and untested. You may use Tcl/Tk in any way you wish, even in commercial applications.
A tool that converts non-conformant HTML into compliant HTML code. From Sourceforge, based on the original version from Dave Raggett.
Installed at
tools/htmlTools/tidy
by kanmy
on Jun 03, 2003.
Maintained by kanmy.
TinySVM is an implementation of Support Vector Machines (SVMs), for the problem of pattern recognition. This installation includes the shared library under the lib/ subdirectory.
See Taku Kudoh's web page (http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/) and the doc/index.html file for more information on his tool. Usage notes: because TinySVM's binaries have exactly the same names as those created by SVMlight, the executables are not included in the rpnlpir group account's path.
Installed at
tools/learners/TinySVM
by kanmy
on Nov 07, 2003.
Maintained by kanmy.
Status: installed, compiled, tested. For public use, under GNU LGPL.
Installed at
tools/taggers/RULE_BASED_TAGGER_V1.14
by kanmy
on Dec 28, 2002.
Maintained by kanmy.
Language: English.
An HMM tool from Tapas Kanungo's
software page. Implementation of Forward-Backward, Viterbi, and Baum-Welch algorithms.
Installed at
tools/learner/HMM/umdhmm-v1.02
by cuihang
on Sep 15, 2003.
Maintained by cuihang.
Status: works well.
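Of the algorithms this HMM tool implements, the forward algorithm is the shortest to sketch: it computes the probability of an observation sequence by summing over all state paths. The code below is an illustration in Python, not UMDHMM's C code; the function name and the dictionary-based model format are assumptions.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under the HMM, summed over all hidden state sequences."""
    if not obs:
        return 1.0
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
likelihood = forward(["x", "y"], states, start_p, trans_p, emit_p)
```

A quick sanity check on any such model is that the likelihoods of all observation sequences of a fixed length sum to one.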
Calculates Precision, Recall, F1-measure and the improved Pk measure (referred to in Pevzner and Hearst's paper "An Evaluation Metric for Text Segmentation" as the WindowDiff measure). See README for usage information.
Installed at tools/evalTools/segmentation/URLSegEval
on Mar 10, 2005.
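As a rough guide to what the WindowDiff measure computes, here is a sketch that slides a window of k units over the reference and hypothesis segmentations and counts the windows in which the number of boundaries differs. It is not the installed tool's code; the function name, the 0/1 list encoding (1 marks a boundary after that unit) and the window-counting convention are assumptions, and implementations vary slightly in such details.

```python
def window_diff(reference, hypothesis, k):
    """Fraction of length-k windows whose boundary counts disagree."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    if not 0 < k <= len(reference):
        raise ValueError("window size k must be between 1 and the length")
    windows = len(reference) - k + 1
    errors = sum(1 for i in range(windows)
                 if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]))
    return errors / windows

ref = [0, 0, 1, 0, 0, 1, 0, 0]
hyp = [0, 1, 0, 0, 0, 1, 0, 0]   # first boundary placed one unit too early
score = window_diff(ref, hyp, k=2)
```

A score of 0 means the segmentations agree in every window; near misses are penalized less than in exact boundary matching because only some windows see the difference.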
Weka
Installed at
tools/learners/weka
by tanyeefa
on Jul 05, 2005.
Maintained by tanyeefa.
Language: English.
Status: Currently installed and working. Released under GPL, free for public use.
This is a small utility that locates a given URL within the WT10G collection. See README for details.
Installed at
tools/corpusTools/wt
by kanmy
on Oct 12, 2004.
Maintained by kanmy.
xmlAbbrevCoref is a program that takes an XML file already annotated with part-of-speech and named-entity tags and further annotates it with simple abbreviation expansions and lemmatization of simplex NP entities. It was written expressly to patch a hole in the TREC 2003 run for coreference resolution; it is *not* meant to be state-of-the-art by any stretch of the imagination. See README for details.
Installed at
tools/coreference/xmlAbbrevCoref
by kanmy
on Jul 21, 2003.
Maintained by kanmy.
Status: copyright Min-Yen Kan and the School of Computing, NUS.
YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995. Installed from
http://cl.aist-nara.ac.jp/~taku-ku/software/yamcha/.
Installed at
tools/chunkers/yamcha/
by kanmy
on Nov 07, 2003.
Maintained by kanmy.
Status: installed, compiled, tested. For public use, under GNU LGPL.