WING.NUS Resources

Last updated: Apr 10, 2012 20:13:27 SGT

This directory and account hold centralized software and tools for natural language processing (NLP) and information retrieval (IR) research and teaching at the School of Computing at the National University of Singapore. The account is hosted on sf3 so that students and researchers can get at these tools. Access is granted to all; however, if you'd like to provide and/or install tools, you must first email the administrators.

The tools here are compiled for Solaris (5.8). Installers, please keep the list of tools up to date by following the guidelines. Thank you. This file will also be available from the web, so if you are checking to see whether a certain package is installed locally here, you can do a find in your browser window on this webpage.

If you're looking for other pages of this sort you might try the listing of related NLP/IR software sites. We are also considering making versions of these tools readily installable from a single CD, where licensing is not an issue. Please contact us if you are interested in the availability of this software.

This site and listing is supervised by Min-Yen Kan.

Corpora

written, spoken, transcribed data for natural language analysis and use

20 newsgroups

#Corpora @WING(cte) @Sunfire
The twenty newsgroup collection is often used for machine learning benchmarks. It was installed locally at SoC to test the bow machine learning package.
Installed at corpora/text-corpora/20_newsgroups/ by kanmy on Jan 13, 2003. Maintained by kanmy. Language: English.

4 Stopword lists

#Corpora @WING(cte) @Sunfire
Four downloaded stoplists available from the web. See the README.html file in the directory for more information.
Installed at corpora/text-corpora/stopwordLists/ by kanmy on May 28, 2003. Maintained by kanmy. Language: English. Status: restricted.

7 Sectors Corpus

#Corpora @Sunfire
Data for bootstrapping Information Extraction.
Installed at corpora/learning-datasets/7sectors from source on Mar 19, 2003.

Academic Web Link Databases

#Corpora @WING(cte) @Sunfire
Link structure of Spanish, U.K., Taiwanese and Australian Universities. See the local copy of the original description HTML file (http://cybermetrics.wlv.ac.uk/database/) from University of Wolverhampton.
Installed at corpora/link-databases/academicWebLinkDatabase/ from University of Wolverhampton by kanmy on May 14, 2003. Maintained by kanmy. Status: Free of charge, open to all.

ANC (American National Corpus)

#Corpora @Sunfire
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development. Visit the ANC site for more details.
Installed at corpora/text-corpora/anc from ANC site on Aug 24, 2004. Status: Unknown.

AQUAINT (TREC) QA evaluation corpus

#Corpora @WING(cte) @Sunfire
TREC QA (AQUAINT) Data for 2002/2003. A corpus comprising data from the New York Times, Xinhua news service and the Associated Press. See the index.html file in the directory for more details.
Installed at corpora/text-corpora/aquaint by kanmy on May 06, 2003. Maintained by kanmy. Language: English. Status: Access is restricted to TREC participants only.

Argumentative Zoning Corpus (pre-distribution)

#Corpora @WING(cte) @Sunfire
This is a mostly cleaned corpus of 80 computational linguistics articles that have been marked up for argumentative zoning relations. You can learn more about this from Simone's home page or from Yee Seng Chan's Digital Library course project (search for "zoning").
Installed at corpora/text-corpora/zoning or corpora/metadata/zoning or tools/citationTools/zoning from Simone Teufel's site by kanmy on Apr 09, 2005. Maintained by kanmy. Language: English. Status: this is a pre-distribution copy from Simone Teufel. It is not for public use. Contact the maintainer if you would like to use this resource.

Bank Search Dataset

#Corpora @WING(cte) @Sunfire
A web document clustering dataset, provided free of charge from the University of Reading.
Installed at corpora/text-corpora/banksearchdataset from University of Reading by kanmy on Aug 07, 2003. Maintained by kanmy. Language: Any. Status: Freely downloadable from the web.

BBCVideo

#Corpora @Sunfire
Tagged and untagged queries.
Installed at corpora/queries/bbcVideo on Oct 13, 2004. Status: Restricted.

BLOG06 Test Collection

#Corpora @Sunfire
BLOG06 consists of a crawl of 100,649 RSS and Atom feeds, over an 11 week period (a total of 77 days). The collection consists of one directory for each day of the collection. From the Information Retrieval Group - Test Collections, University of Glasgow.
Installed at corpora/text-corpora/trec/blogdata on Nov 08, 2007. Status: Unknown.

British National Corpus, World Edition

#Corpora @WING(cte) @Sunfire
The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English. See the home page of the BNC at http://www.natcorp.ox.ac.uk/ for more details. We have a five-year license for this product.
Installed at corpora/text-corpora/BNC-World from University of Oxford by kanmy on May 11, 2004. Maintained by kanmy. Language: English. Status: Limited for research purposes, see the maintainers for details if you wish to utilize this corpus. The texts and documentation are installed but the SARA utility has not been compiled nor set up.

Chinese Treebank

#Corpora @WING(cte) @Sunfire
The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000, and it was later corrected and released in 2001 as Chinese Treebank 2.0. More information about the project is available on the Penn Chinese Treebank website at: http://www.cis.upenn.edu/%7Echinese/ .
Installed at corpora/languages/chinese/text-corpora/treebank from Penn Chinese Treebank by kanmy on Apr 21, 2004. Maintained by kanmy. Language: Chinese. Status: restricted access to researchers (as per LDC policy).

CiteSeer OAI Records

#Corpora @Sunfire
OAI records are in two formats: (1) oai_dc.tar.gz - Uses the Dublin Core metadata standard, and (2) oai_citeseer.tar.gz - The Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.
Installed at corpora/text-corpora/citeseerOAI on Jun 09, 2004. Status: Unknown.
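
As a rough illustration of working with Dublin Core records like those in oai_dc.tar.gz, the sketch below collects the Dublin Core fields from one record. The sample record is invented for illustration, and the layout of the files inside the local tarballs has not been checked, so treat the names SAMPLE and dc_fields as hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI Dublin Core records.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record (an assumption, not taken from the local files).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:creator>A. Author</dc:creator>
  <dc:creator>B. Author</dc:creator>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Map each Dublin Core element name to the list of its text values."""
    root = ET.fromstring(xml_text)
    prefix = "{" + NS["dc"] + "}"
    fields = {}
    for element in root.iter():
        # Keep only elements in the Dublin Core namespace.
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


if __name__ == "__main__":
    print(dc_fields(SAMPLE))
```

Repeated fields such as dc:creator are kept as lists, since a record may name several authors.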

Cora datasets

#Corpora @WING(cte) @Sunfire
This is the data from Andrew McCallum's home page on the scientific search engine CORA. It includes the citation matching, research paper classification and information extraction datasets.
Installed at corpora/text-corpora/cora from Andrew McCallum's home page by kanmy on May 11, 2004. Maintained by kanmy. Language: English. Status: publicly available.

Cotraining Web KB Data

#Corpora @WING(cte) @Sunfire
This is a subset of the WebKB text classification corpus containing both the hyperlinks and the documents, with judgments classifying the web pages into two categories, course and non-course. The relevant web page has been downloaded into the root directory.
Installed at corpora/learning-datasets/course-cotrain-data or corpora/text-corpora/course-cotrain-data from source by kanmy on Apr 14, 2003. Maintained by kanmy. Language: English.

cuMARC

#Corpora @Sunfire
unknown MARC data
Installed at corpora/metadata/cuMARC on Jul 11, 2007. Status: Restricted.

DBLP XML records

#Corpora @WING(cte) @Sunfire
These are the XML records of the entire DBLP database. The copy here is dated from Jul 18, 2005.
Installed at corpora/metadata/dblp or corpora/text-corpora/DBLP from source by kanmy on Jul 19, 2005. Maintained by kanmy. Language: English. Status: freely available for all to use.

DUC 2001-2007 data

#Corpora @WING(cte) @Sunfire
Data (mostly testing data) from the Document Understanding Conference for the years 2001-2007. This is a summarization competition, held by NIST of the USA. You might also check out the DUC-processed files, see localInstallations.html. See the DUC web site for details.
Installed at corpora/text-corpora/duc/ from DUC by qiul on Oct 05, 2007. Maintained by kanmy. Language: English. Status: restricted to academic research. You have to sign an individual agreement with NIST before the data can be released to you. See the maintainer for details.

Excite Query Logs

#Corpora @WING(cte) @Sunfire
The 2,477 million queries for Excite on Dec 20, 1999. For research purposes only. Anyone connected to corporate research may not use this resource. Access is restricted.
Installed at corpora/queries/excite/ by kanmy on Feb 07, 2003. Maintained by kanmy. Language: English. Status: Access is restricted.

Hong Kong News Parallel Text

#Corpora @WING(cte) @Sunfire
This FTP publication contains the Hong Kong News Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T46, ISBN 1-58563-169-8. The Hong Kong News Parallel Text was created when the LDC collected parallel Cantonese - English news articles from the Information Services Department of Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
Installed at corpora/text-corpora/hksar_news by kanmy on Dec 15, 2003. Maintained by kanmy. Language: English/Chinese.

ILP learning dataset

#Corpora @WING(cte) @Sunfire
Another subset of the WebKB text classification corpus as used in the ILP 98 paper. See the root directory README for more details.
Installed at corpora/learning-datasets/ilp or corpora/text-corpora/ilp by kanmy on Apr 14, 2003. Maintained by kanmy. Language: English.

ISL Meeting transcripts

#Corpora @Sunfire
The ISL Meeting Corpus Part 1 is a first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes and averages at 34 minutes. The audio files are available as ISL Meeting Speech Part 1. See the home page for the corpus at: http://wave.ldc.upenn.edu/Catalog/docs/LDC2004T10/.
Installed at corpora/text-corpora/meeting-transcripts/isl_meeting_transcripts from source by kanmy on Jun 03, 2004. Maintained by kanmy. Language: English. Status: An LDC corpus. Use restricted to LDC members.

Jansen Search Logs

#Corpora @Sunfire
Jansen search logs.
Installed at corpora/queries/jansenSearchLogs on Mar 01, 2006. Status: Restricted.
Recommendation Data Set from the PhD thesis of Lawrence Kai Shih, November 17th 2003. All the files are sql commands that can be imported directly into mysql. The data is collected from the 176-person user study.
Installed at corpora/relevance-judgments/webpageSegmentation on Nov 18, 2003. Status: Unknown.

LDC English Gigaword Corpus

#Corpora @WING(cte) @Sunfire
A large newspaper article corpus from the LDC that overlaps with the WSJ and AQUAINT corpora. Here's a link to its description from the LDC: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05.
Installed at corpora/text-corpora/gigaword from source by kanmy on Nov 21, 2003. Maintained by kanmy. Language: English. Status: restricted access within the department only, as per LDC's policy.
This MARC data was captured from the NUS LINC training system in September - Oct 2004. See the README in directory for more details.
Installed at corpora/metadata/nusMARC on Sep 28, 2004. Status: Restricted; not for use outside of NUS.

Microsoft Research Paraphrase Corpus

#Corpora @Sunfire
This dataset consists of 5801 pairs of sentences gleaned over a period of 18 months from thousands of news sources on the web. Accompanying each pair is a judgment reflecting whether multiple human annotators considered the two sentences to be close enough in meaning to be considered close paraphrases. For more information, please visit the Microsoft Research Paraphrase Corpus web site.
Installed at corpora/text-corpora/MSRParaphraseCorpus from Microsoft Research Paraphrase Corpus by qiul on Sep 29, 2005. Maintained by qiul. Status: protected under Microsoft Research Shared Source license agreement ("MSR-SSLA").

Misc Lists

#Corpora @Sunfire
Miscellaneous list of items from unclear source. See the directory listing for details.
Installed at corpora/gazetteers/miscLists by qiul on Mar 11, 2005. Maintained by qiul. Status: Limited to research purposes only.
MITRE's CBC4Kids corpus of online news stories for teenagers.
Installed at corpora/text-corpora/CBC4Kids on Dec 09, 2003. Status: Unknown.

Moby corpus' complete works of Shakespeare

#Corpora @WING(cte) @Sunfire
The Moby corpus' version of the unabridged works of William Shakespeare. The Moby project has a number of other lexica, see below and at the source home page: http://www.dcs.shef.ac.uk/research/ilash/Moby/.
Installed at text-corpora/mobyShakespeare from source by kanmy on Jul 03, 2003. Maintained by kanmy. Language: English. Status: in the public domain, do with it as you please.

Movie Review

#Corpora @Sunfire
Unknown details.
Installed at corpora/text-corpora/MovieReview on Mar 01, 2004.

MovieLens Collaborative Filtering dataset

#Corpora @WING(cte) @Sunfire
Two datasets used for collaborative filtering research. The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. Before using these datasets, please review the included readme files for the usage license. More information is available from the GroupLens webpage: http://www.grouplens.org/.
Installed at corpora/relevance-judgments/collab-filtering/movielens from source by kanmy on Jun 01, 2004. Maintained by kanmy. Status: Publicly available from their web site.
This corpus contains 530 news articles manually annotated using an annotation scheme for opinions and other private states (e.g., beliefs, emotions, sentiment, speculation, etc). The annotation of the corpus was performed by 5 trained annotators over a period of about 15 months.
Installed at corpora/text-corpora/MPQA/ by cuihang on Mar 05, 2004. Maintained by cuihang. Status: restricted access.

MUC 6 co-reference data

#Corpora @WING(cte)
Message Understanding Conference 6 data, from the Linguistic Data Consortium. See the README file in the source directory for details.
Installed at corpora/text-corpora/muc6 by kanmy on Oct 01, 2003. Maintained by kanmy. Language: English. Status: restricted to research use only, as per LDC policy.

Multi-lingual summarization dataset

#Corpora @WING(cte)
All the dataset files related to the MultiLing 2011 Pilot at TAC. This includes source texts, human summaries, system summaries, and evaluation data. The dataset is derived from publicly available WikiNews (http://www.wikinews.org/) English texts. The source texts were under CC Attribution Licence V2.5 (cf. http://creativecommons.org/licenses/by/2.5/). Texts in other languages have been translated by native speakers of each language.
Installed at corpora/text-corpora/tac/2011/summarization/Multi Lingual Summarization by Praveen bysani on Apr 10, 2012. Language: Arabic, Czech, English, French, Greek, Hebrew, Hindi.

North American News Text Corpus

#Corpora @WING(cte) @Sunfire
Contains text from the Wall Street Journal, Reuters, New York Times and the LA Times-Washington Post News Service.
Installed at corpora/text-corpora/nantc by kanmy on Jan 21, 2003. Maintained by kanmy. Language: English. Status: Only NUS members can access this corpus, as per LDC's policies.
NPIC is a research project which performs image classification (especially for synthetic i.e., non-photographic images). NPIC does its work by supervised machine learning on datasets noisily created from image search engine results. This is the image corpus built for NPIC. It is specifically for synthetic (i.e., non-photographic) image classification.
Installed at corpora/image/npic from NPIC site on May 23, 2006.

NTU OPAC query logs

#Corpora @WING(cte) @Sunfire
This is a list of about 700K online public access catalog queries collected by the Nanyang Technological University (NTU) OPAC server in 2002.
Installed at corpora/queries/ntuOPAC by kanmy on Jun 30, 2005. Maintained by kanmy. Language: mostly English. Status: for research staff only. Not for re-distribution or commercial use. Contact the maintainer for details.

NUS Libraries query logs

#Corpora @WING(cte) @Sunfire
About 800 K queries from the simple keyword interface for the LINC online catalog system of NUS. On-going collection of queries likely. Provided by NUS Libraries.
Installed at corpora/queries/nusInnopac/ by kanmy on Apr 10, 2003. Maintained by kanmy. Language: English. Status: For research purposes only.

Open Directory Project web page data

#Corpora @WING(cte) @Sunfire
The ODP is a large, open-source, human-edited directory similar to Yahoo!. The data is distributed under GNU GPL and is provided here for IR research purposes. See their web page for more details.
Installed at corpora/metadata/odp by kanmy on Jan 03, 2003. Maintained by kanmy. Language: English. Status: data is distributed under GNU GPL and provided for IR research purposes.

OPUS Parallel corpus (v0.2)

#Corpora @WING(cte) @Sunfire
OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic data, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus. (Manual corrections have not been made.) See the home page for more details and for their online search interface: http://logos.uio.no/opus/
Installed at corpora/text-corpora/parallel/opus-v0.2 by kanmy on Feb 05, 2005. Maintained by kanmy. Language: Many. Status: Openly available from the web page.

PASCAL Entailment Datasets

#Corpora @Sunfire
These are the Development Set, Test Set and Annotated Test Set of the first and second PASCAL Recognising Textual Entailment Challenge.
Installed at corpora/text-corpora/Pascal by qiul on Sep 24, 2005. Maintained by qiul. Status: freely available for all to use.

Penn Discourse Treebank Version 2.0

#Corpora @Sunfire
The goal of the project is to develop a large scale corpus annotated with information related to discourse structure. Penn Discourse Treebank Version 2.0 contains annotations of discourse relations and their arguments on the one million word Wall Street Journal (WSJ) data in Treebank-2 (LDC95T7).
Installed at corpora/text-corpora/PennDiscourseTreebank2.0 on Feb 29, 2008. Status: Unknown.

Penn Treebank

#Corpora @WING(cte) @Sunfire
The Penn Treebank contains Wall Street Journal text that has been tagged and parsed by both machine and linguists. It is a benchmark corpus for parsing and part-of-speech tagging tasks. Contains binaries for grepping on tree nodes (e.g., tgrep).
Installed at corpora/text-corpora/treebank by kanmy on Jan 21, 2003. Maintained by kanmy. Language: English. Status: Only NUS members can access this corpus, as per LDC's policies.

PropBank

#Corpora @WING(cte) @Sunfire
The PropBank project is creating a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations are being added to the syntactic trees of the Penn Treebank. See http://www.cis.upenn.edu/~ace/ for details.
Installed at corpora/text-corpora/PropBank by cuihang on Aug 22, 2003. Maintained by cuihang. Language: English. Status: restricted.

Question Answering

#Corpora @Sunfire
Unknown source.
Installed at corpora/queries/questionAnswering on Aug 07, 2007. Status: Restricted.

remedia_release

#Corpora @Sunfire
Unknown details
Installed at corpora/remedia_release on Jun 21, 2002.

Reuters 21578 Classic text categorization corpus

#Corpora @WING(cte) @Sunfire
The classic text categorization corpus. Found from http://www.daviddlewis.com/resources/testcollections/reuters21578/.
Installed at corpora/learning-datasets/reuters21578 from source by kanmy on Jan 19, 2003. Maintained by kanmy. Language: English.

Reuters Corpus

#Corpora @Sunfire
Installed at corpora/text-corpora/rcv1 on Sep 08, 2004. Status: Unknown.
Collection of about 10.1K messages of SMS service corpus collected by How Yijue as part of her honors year thesis work. Please see How Yijue's thesis for more documentation.
Installed at corpora/text-corpora/sms/ from source by kanmy on Apr 28, 2004. Maintained by kanmy. Language: mostly English. Status: open to all under a license similar to the Open Directory Project license.

Summbank

#Corpora @WING(cte) @Sunfire
Summary corpus linked to the HKSAR news corpus. Produced and studied extensively by one of the JHU Workshops in 2001. More information about the corpus is at: http://www.summarization.com/summbank/.
Installed at corpora/text-corpora/summbank by kanmy on Dec 15, 2003. Maintained by kanmy. Language: English/Chinese. Status: Restricted to LDC members; open only for general academic research.

Surname List

#Corpora @WING(cte) @Sunfire
A list of 23K+ English surnames compiled from the rootsweb mailing list. See the local README file for more information.
Installed at corpora/gazetteers/surnames/ by kanmy on May 06, 2005. Maintained by kanmy. Language: English. Status: Available on the web, locally post-processed for use.

Text Retrieval Conference (TREC) English Queries

#Corpora @WING(cte) @Sunfire
The Text Retrieval Conference (TREC) has been held for numerous years. The queries for the competition are housed here. The TREC English queries home page is at: http://trec.nist.gov/data/topics_eng/index.html.
Installed at corpora/queries/trec* by kanmy on Jan 09, 2003. Maintained by kanmy. Language: English. Status: Currently available for research purposes, cleared by TREC administrators.

The PH Corpus

#Corpora @Sunfire
The PH Corpus is a cleaned up, segmented version of the Mandarin Chinese corpus compiled by Guo Jin. It contains 2,447,7719 words of news text published by Xinhua News Agency between January 1990 and March 1991.
Installed at corpora/languages/chinese/text-corpora/ph from source on Oct 12, 2004. Language: Chinese.

Tipster Text Research Collection, Vol 1-3.

#Corpora @WING(cte) @Sunfire
The TIPSTER Text research collections were used extensively for the Text Retrieval Conferences (TREC). Still a good source of text corpora for the research community.
Installed at corpora/text-corpora/tipster by kanmy on Jan 21, 2003. Maintained by kanmy. Language: English. Status: Only NUS members can access this corpus, as per LDC's policies.

Topic Detection & Tracking

#Corpora @WING(cte) @Sunfire
The TDT dataset is used for Topic Detection & Tracking (TDT) research. Currently, TDT2, used for 1998 TDT test; TDT3, used for 1999 ~ 2001 TDT tests; and TDT4, used for 2002 ~ 2003 TDT tests are installed. Please refer to http://www.nist.gov/speech/tests/tdt/index.htm for details of TDT research.
Installed at corpora/text-corpora/TDT by zhangya on Jun 22, 2005. Maintained by zhangya. Language: English & Chinese. Status: Only NUS members can access this corpus, as per LDC's policies.

TREC 2003 QA Main Task Questions and Judgments

#Corpora @WING(cte) @Sunfire
Questions used in TREC 2003 QA main task, including factoid, list and definition questions, as well as their judgments.
Installed at corpora/queries/trec12.questions by cuihang on Nov 21, 2003. Maintained by cuihang.

trecWeb

#Corpora @Sunfire
Unknown source.
Installed at corpora/relevance-judgments/trecWeb on Feb 13, 2003. Status: Unknown.

UN/LOCODE

#Corpora @Sunfire
United Nations Code for Trade and Transport Locations
Installed at corpora/gazetteers/un-locode on Oct 18, 2005.

Web corpora wt10g and wt2g

#Corpora @WING(cte) @Sunfire
These are two 10 GB and 2 GB corpora used by the TREC web track. Compiled by CSIRO. See the directory for more information. More details on the corpus can be found on the TREC website and at the CSIRO website.
Installed at corpora/text-corpora/wt[10|2]g by kanmy on Aug 08, 2003. Maintained by kanmy. Language: English. Status: Restricted access. Anyone wishing to use this corpus must sign an individual license agreement before proceeding.

Web Pages of Biographies

#Corpora @Sunfire
Crawled web pages of biographies.
Installed at corpora/text-corpora/biographies by cuihang on Jun 19, 2003. Maintained by cuihang. Status: restricted.

Web1T

#Corpora @Sunfire
Unknown details.
Installed at corpora/text-statistics/web1T on Jul 20, 2007.

WebBase statistics

#Corpora @WING(cte) @Sunfire
Statistics on the Stanford WebBase corpus as compiled by UC Berkeley. Scripts and files that compute the IDF value of words over 133 M web pages are included. Big file!
Installed at corpora/text-statistics/webBase/ by kanmy on Jun 06, 2003. Maintained by kanmy. Language: Any. Status: open to all.
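
For readers unfamiliar with IDF, here is a minimal sketch of the textbook inverse document frequency calculation over a collection of this size. The exact formula and file formats used by the installed scripts are not documented here, so the function name idf and the plain log(N / df) form are assumptions; check the scripts in the directory for the variant they actually compute.

```python
import math

# Collection size quoted in the entry above (133 M web pages).
NUM_PAGES = 133_000_000


def idf(doc_freq, num_docs=NUM_PAGES):
    """Textbook IDF: log(N / df), where df is the number of pages containing the word."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log(num_docs / doc_freq)


if __name__ == "__main__":
    # Hypothetical document frequencies, chosen only for illustration.
    for df in (1_000, 1_000_000, NUM_PAGES):
        print(df, round(idf(df), 3))
```

Words that occur on fewer pages receive a higher IDF, and a word that occurs on every page receives an IDF of zero.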

WebKB webpages and judgments

#Corpora @WING(cte) @Sunfire
This is the WebKB text classification corpus. The relevant home page is in the root directory and can be found at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/. It contains a corpus of 4000+ web pages and their classification into 7 categories.
Installed at corpora/learning-datasets/webkb or corpora/text-corpora/webkb by kanmy on Apr 14, 2003. Maintained by kanmy. Language: English.

Wikipedia (en)

#Corpora @Sunfire
Unknown details.
Installed at corpora/text-corpora/wikipedia on Oct 10, 2006.
The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. For more information, consult the TREC data home page, http://trec.nist.gov/data.html.
Installed at corpora/text-corpora/trec/ohsu-trec/ by kanmy on Nov 07, 2003. Maintained by kanmy. Status: open for all to use, as publicly available for download from NIST's web site.

World Gazetteer

#Corpora @Sunfire
The World Gazetteer provides a comprehensive set of population data and related statistics. See http://world-gazetteer.com/ for details.
Installed at corpora/gazetteers/worldgazetteer on Dec 21, 2004.

Grammars

hand crafted grammars for analysis and generation

Surge 2.2

#Grammars @WING(cte)
A comprehensive unification grammar for English language generation. Widely used with FUF. Developed by Jacques Robin from Brazil. Home page: http://www.cs.bgu.ac.il/surge/index.htm.
Installed at grammars/surge-2.2 from source by kanmy on Dec 28, 2002. Maintained by kanmy. Language: English.

Lexicons

lexicons and ontologies for word senses, word relations and conflation
Files that describe the verb classes from Levin's seminal work on verb classification by their case frames and alternations. Flat text files.
Installed at lexicons/evca by kanmy on May 29, 2004. Maintained by kanmy. Language: English. Status: open to all (was made available on the LINGUIST LIST), copyright for the material is held by the University of Chicago Press, 1993.

CMU Pronunciation Dictionary

#Lexicons @WING(cte) @Sunfire
The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. This format is particularly useful for speech recognition and synthesis, as it has mappings from words to their pronunciations in the given phoneme set. The current phoneme set contains 39 phonemes, for which the vowels may carry lexical stress. See http://www.speech.cs.cmu.edu/cgi-bin/cmudict. See the README in the directory for more details.
Installed at lexicons/cmudict-0.6/ by kanmy on Dec 09, 2003. Maintained by kanmy. Language: English. Status: open for all to access.
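
The word-to-phoneme mapping described above can be read with a few lines of code. This is a sketch only: it assumes each entry is a word followed by whitespace-separated phonemes, that vowels carry a trailing stress digit, and that comment lines start with ";;;" or "##". The sample line is illustrative; consult the README in the directory for the actual layout of the installed file.

```python
def parse_cmudict_line(line):
    """Parse one dictionary line into (word, [(phoneme, stress_or_None), ...])."""
    line = line.strip()
    # Skip blank lines and comment lines (comment prefixes are an assumption).
    if not line or line.startswith(";;;") or line.startswith("##"):
        return None
    word, *phones = line.split()
    parsed = []
    for phone in phones:
        if phone[-1].isdigit():
            # Vowels carry a lexical stress digit, e.g. AE1.
            parsed.append((phone[:-1], int(phone[-1])))
        else:
            parsed.append((phone, None))
    return word, parsed


if __name__ == "__main__":
    print(parse_cmudict_line("ABANDON  AH0 B AE1 N D AH0 N"))
```

Separating the stress digit from the phoneme symbol makes it easy to compare pronunciations with or without stress.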

Extended WordNet 2.0

#Lexicons @WING(cte) @Sunfire
In the eXtended WordNet the WordNet glosses are syntactically parsed, transformed into logic forms and content words are semantically disambiguated. This data is made available in XML form. I have only installed the version that tracks WordNet 2.0. This is work by Moldovan et al. at U Texas. See their home page at: http://xwn.hlt.utdallas.edu/index.html.
Installed at lexicons/XWN by kanmy on Dec 29, 2003. Maintained by kanmy. Language: English. Status: open to all, see license at http://xwn.hlt.utdallas.edu/downloads.html.

Java WordNet Library (JWNL) 1.3 RC3

#Lexicons @WING(cte) @Sunfire
JWNL is a Java API for accessing the WordNet relational dictionary. WordNet is widely used for developing NLP applications, and a Java API such as JWNL will allow developers to more easily use Java for building NLP applications. Home page at http://jwordnet.sourceforge.net/. Usage notes: Please refer to the README-SOC.TXT file for some usage notes.
Installed at lexicons/jwnl by tanyeefa on Jun 19, 2005. Maintained by tanyeefa. Language: English. Status: Installed and working. BSD license.
Unknown details. Available from http://www.loc.gov/rr/print/tgm1/tgm1.txt.
Installed at lexicons/tgm on Oct 12, 2004.

Moby Lexica

#Lexicons @WING(cte) @Sunfire
The Moby lexicons containing: Hyphenator - 185,000 entries fully hyphenated. Moby Language - Word lists in five of the world's great languages. Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech, listed in priority order. Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded. Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words. Moby Words - 610,000+ words and phrases. The largest word list in the world. The source Moby website is at: University of Sheffield.
Installed at lexicons/moby/ by kanmy on Jul 03, 2003. Maintained by kanmy. Language: mostly English; but French, German, Japanese, and Italian also present. Status: public domain, do what you will with it.
OPTED is a public domain English word list dictionary, based on the public domain portion of "The Project Gutenberg Etext of Webster's Unabridged Dictionary" which is in turn based on the 1913 US Webster's Unabridged Dictionary. See OPTED site for more details.
Installed at lexicons/v003 from source by kanmy on Jun 19, 2004. Maintained by kanmy. Status: Unknown.

WordNet 1.7.1

#Lexicons @WING(cte) @Sunfire
Probably the most famous lexical ontology. Home page at http://wordnet.princeton.edu/. Documentation and papers available from its home page. Usage notes: Make sure either $WNHOME is properly set to /home/rsch/rpnlpir/lexicons/WordNet-1.7.1 or $WNSEARCHDIR is properly set to /home/rsch/rpnlpir/lexicons/WordNet-1.7.1/dict.
Installed at lexicons/WordNet-1.7.1 by kanmy on Dec 28, 2002. Maintained by kanmy. Language: English.

WordNet 2.0

#Lexicons @WING(cte) @Sunfire
An update to 1.7.1 featuring quite a lot of changes. Documentation and papers available from its home page. The change log can be found here. Usage notes: Make sure either $WNHOME is properly set to /home/rsch/rpnlpir/lexicons/WordNet-2.0 or $WNSEARCHDIR is properly set to /home/rsch/rpnlpir/lexicons/WordNet-2.0/dict.
Installed at lexicons/WordNet-2.0 by kanmy on Sep 25, 2003. Maintained by kanmy. Language: English.

WordNet 2.1

#Lexicons @WING(cte) @Sunfire
An update to 2.0 featuring quite a lot of changes. Documentation and papers available from its home page. The change log can be found here. Usage notes: Make sure either $WNHOME is properly set to /home/rsch/rpnlpir/lexicons/WordNet-2.1 or $WNSEARCHDIR is properly set to /home/rsch/rpnlpir/lexicons/WordNet-2.1/dict. Note that Tcl/Tk must be installed before WordNet 2.1 can be installed. There is no such requirement for the previous versions of WordNet.
Installed at lexicons/WordNet-2.1 by tanyeefa on Jul 24, 2005. Maintained by tanyeefa. Language: English.

WordNet 3.0

#Lexicons @WING(cte) @Sunfire
An update to 2.1, featuring a few changes to the graphical interface. WordNet 2.0 and 2.1 have been reported to hang on sunfire, hence the installation of this newer version. Documentation and papers available from its home page. The change log can be found here. Usage notes: Make sure either $WNHOME is properly set to /home/rsch/rpnlpir/lexicons/WordNet-3.0 or $WNSEARCHDIR is properly set to /home/rsch/rpnlpir/lexicons/WordNet-3.0/dict. Note that Tcl/Tk must be installed before WordNet 3.0 can be installed. There is no such requirement for versions of WordNet before 2.1.
Installed at lexicons/WordNet-3.0 by kanmy on Sep 28, 2008. Maintained by kanmy. Language: English.
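The usage notes for the WordNet installations all follow the same pattern: point $WNHOME at the installation root, or $WNSEARCHDIR at its dict subdirectory. A minimal Python sketch of preparing such an environment for a child process is shown below. The helper name wordnet_env is hypothetical, the paths are the ones quoted in the WordNet 3.0 entry, and setting both variables at once (rather than either one) is an assumption made for simplicity.

```python
import os

# Installation root as quoted in the WordNet 3.0 usage notes.
WN_ROOT = "/home/rsch/rpnlpir/lexicons/WordNet-3.0"


def wordnet_env(root=WN_ROOT, base_env=None):
    """Return a copy of an environment with WNHOME and WNSEARCHDIR set.

    Hypothetical helper: the usage notes only require one of the two
    variables; this sketch sets both so either lookup works.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["WNHOME"] = root
    env["WNSEARCHDIR"] = root + "/dict"
    return env


if __name__ == "__main__":
    env = wordnet_env()
    print(env["WNHOME"])
    print(env["WNSEARCHDIR"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a WordNet-dependent program. To use another installed version, pass its root, for example the WordNet-2.1 path listed above.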

WordNet log likelihood statistics

#Lexicons @WING(cte) @Sunfire
Negative log likelihood statistics for WordNet 1.6 synsets. Can be coupled to compute (or partially compute) semantic similarity of words, similar to lexical chaining. See the directory's README file for more information.
Installed at lexicon/lexicon-statistics/ by kanmy on Jun 06, 2003. Maintained by kanmy. Language: English. Status: open to the public.
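For readers unfamiliar with the statistic, the following Python sketch shows one common way a negative log likelihood can be derived from raw frequency counts. It is an illustration only: the function name, the toy counts, the synset labels and the choice of natural logarithm are all assumptions, and the installed files may use a different format or estimator (see the directory's README).

```python
import math


def negative_log_likelihood(counts):
    """Map each key to -ln(count / total), given a dict of positive counts.

    Hypothetical illustration; not the format of the installed statistics.
    """
    total = sum(counts.values())
    return {key: -math.log(c / total) for key, c in counts.items()}


if __name__ == "__main__":
    # Made-up counts for two made-up synset labels.
    toy_counts = {"synset_a": 1, "synset_b": 3}
    for key, nll in negative_log_likelihood(toy_counts).items():
        print(key, round(nll, 4))
```

Rarer items receive larger values, which is why such scores are often read as a measure of how informative a synset is.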

Libraries

customized libraries to link software to

LibWWW 5.4.0

#Libraries @WING(cte) @Sunfire
Libwww is a highly modular, general-purpose client side Web API written in C for Unix and Windows (Win32). It's well suited for both small and large applications, like browser/editors, robots, batch tools, etc. Pluggable modules provided with libwww include complete HTTP/1.1 (with caching, pipelining, PUT, POST, Digest Authentication, deflate, etc), MySQL logging, FTP, HTML/4, XML (expat), RDF (SiRPAC), WebDAV, and much more. The purpose of libwww is to serve as a testbed for protocol experiments. See the home page at http://www.w3.org/Library/.
Installed at tools/internetTools/lib/libwww by kanmy on Nov 29, 2003. Maintained by kanmy. Status: installed, untested. Configured with zlib, md5 and regexp support. See installation notes for more details. GPL code.

Proceedings

proceedings and workshop notes from previous research congresses in IR and NLP

ACL 2003

#Proceedings @WING(cte) @Sunfire
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo Convention Center, Sapporo, Japan, 7-12 July 2003.
Installed at proceedings/acl-2003 by kanmy on Jul 21, 2003. Maintained by kanmy.

ACL 2004

#Proceedings @Sunfire
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain.
Installed at proceedings/acl-2004 on Aug 07, 2004.

ACL Anthology

#Proceedings @Sunfire
Unknown details.
Installed at proceedings/acl-anthology on Aug 16, 2006.

ACL-EACL 2001

#Proceedings @WING(cte) @Sunfire
Proceedings of the ACL-EACL Conference, Student Research Workshop, Workshops and local information.
Installed at proceedings/aclEacl-2001 by kanmy on Jan 03, 2003. Maintained by kanmy.

ACM Multimedia 2002

#Proceedings @WING(cte) @Sunfire
Proceedings of the 10th ACM International Conference on Multimedia (MM2002) - Juan-les-Pins, France, December 1 - 6 2002.
Installed at proceedings/ACM-Multimedia-2002 by cuihang on May 26, 2003. Maintained by kanmy.

ACM Multimedia 2004

#Proceedings @Sunfire
Proceedings of the 12th ACM International Conference on Multimedia, October 10-16, 2004, New York, NY, USA.
Installed at proceedings/ACM-Multimedia-2004 on Oct 26, 2004.

ACM Multimedia 2005

#Proceedings @Sunfire
Proceedings of the 13th ACM International Conference on Multimedia, November 6-11, 2005, Singapore.
Installed at proceedings/ACM-Multimedia-2005 on Dec 23, 2009.

CHI 2009

#Proceedings @WING(cte) @Sunfire
4-9 April 2009, Boston, MA, USA. 27th CHI Conference. http://www.chi2009.org/
Installed at proceedings/chi-2009 by kanmy on May 07, 2009. Maintained by kanmy. Status: restricted to local use, copyrighted by ACM.

CIKM 2004

#Proceedings @Sunfire
Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004.
Installed at proceedings/cikm04 on Nov 10, 2004.

COLING 2004

#Proceedings @Sunfire
Proceedings of the 20th International Conference on Computational Linguistics at the University of Geneva, Switzerland, on August 23rd-27th, 2004.
Installed at proceedings/COLING-2004 on Aug 31, 2004.

COLING-ACL 2006

#Proceedings @WING(cte) @Sunfire
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia, from 17th-21st July 2006.
Installed at proceedings/coling-acl-2006 by qiul on Jul 26, 2006. Maintained by qiul.

EACL 2006

#Proceedings @WING(cte) @Sunfire
Proceedings of the 11th European Association for Computational Linguistics 2006 meeting and associated workshops. Trento Italy, April 3-7 2006.
Installed at proceedings/eacl-2006 by kanmy on Apr 11, 2006. Maintained by kanmy.

EMNLP 2006

#Proceedings @Sunfire
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Sydney, Australia, from 22nd-23rd July 2006.
Installed at proceedings/emnlp-2006 by qiul on Jul 26, 2006. Maintained by qiul.

EMNLP-HLT 2005

#Proceedings @Sunfire
Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing, held in Vancouver, B.C., Canada, October 6-8, 2005.
Installed at proceedings/EMNLP_HLT-2005 by qiul on Oct 16, 2005. Maintained by qiul.

HCII 2005

#Proceedings @WING(cte) @Sunfire
These are the proceedings of the HCI International conference held in Caesar's Palace, Las Vegas, USA on July 22-27, 2005. HCII is formed of 7 different meetings that are colocated: * Symposium on Human Interface (Japan) 2005 * 6th International Conference on Engineering Psychology & Cognitive Ergonomics * 3rd International Conference on Universal Access in Human-Computer Interaction * 1st International Conference on Virtual Reality * 1st International Conference on Usability and Internationalization * 1st International Conference on Online Communities and Social Computing * 1st International Conference on Augmented Cognition.
Installed at proceedings/HCII-2005 by kanmy on Jul 29, 2005. Maintained by kanmy.

HLT/NAACL 2004

#Proceedings @WING(cte) @Sunfire
The Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2004) - Boston, USA, 2-7 May 2004.
Installed at proceedings/HLT-NAACL-2004 by kanmy on May 31, 2004. Maintained by kanmy.

HLT/NAACL 2007

#Proceedings @Sunfire
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, April 22-27, 2007, Rochester, New York, USA.
Installed at proceedings/hlt-naacl-2007 on Apr 30, 2007.

IJCNLP 2008

#Proceedings @Sunfire
Proceedings of the Third International Joint Conference on Natural Language Processing, January 7-12, 2008, Hyderabad, India.
Installed at proceedings/ijcnlp-2008 on Jan 10, 2008.

LREC 2002

#Proceedings @WING(cte) @Sunfire
The proceedings for the Language Resources and Evaluation Conference, held in the Canary Islands, Spain, in May 2002. Contains workshop and poster session papers as well.
Installed at proceedings/lrec-2002 by kanmy on Jan 21, 2003. Maintained by kanmy.

LREC 2004

#Proceedings @WING(cte) @Sunfire
The proceedings of the Language Resources and Evaluation Conference, held in Lisbon, Portugal, in May 2004. Contains workshop and poster session papers as well.
Installed at proceedings/lrec-2004 by qiul on Jun 03, 2004. Maintained by qiul.

LREC 2008

#Proceedings @Sunfire
Proceedings of the sixth international conference on Language Resources and Evaluation, 28-30 May 2008, in Marrakech.
Installed at proceedings/lrec-2008 on Jun 01, 2008.

Multimedia Data Mining 2001

#Proceedings
Proceedings of the KDD 01 workshop
Installed at nowhere by kanmy on Aug 23, 2003. Maintained by kanmy.

Multimedia Data Mining 2002

#Proceedings
Proceedings of the KDD 02 workshop
Installed at nowhere by kanmy on Aug 23, 2003. Maintained by kanmy.

NAACL 2001

#Proceedings @WING(cte) @Sunfire
The Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001) - Carnegie Mellon University - Pittsburgh, PA USA 2-7 June 2001.
Installed at proceedings/naacl-2001 by kanmy on Jan 03, 2003. Maintained by kanmy.

PAKDD 2003

#Proceedings @Sunfire
Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-03), Seoul, KOREA, April 30 - May 2, 2003.
Installed at proceedings/pakdd-2003 on May 16, 2003.

SIGIR 2007

#Proceedings @Sunfire
Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007.
Installed at proceedings/sigir-2007 on Oct 24, 2007.

SIGIR 2009

#Proceedings @Sunfire
Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009.
Installed at proceedings/sigir-2009 on Dec 23, 2009.

WebDB 2004

#Proceedings @Sunfire
Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, June 17-18, 2004, Maison de la Chimie, Paris, France.
Installed at proceedings/webDB-2004 on Aug 04, 2004.

WWW 2003

#Proceedings @WING(cte) @Sunfire
The Twelfth International World Wide Web Conference (WWW-2003) - Budapest, HUNGARY, 20-24 May 2003. The proceedings contain 77 refereed papers, 207 posters and 38 alternate track papers.
Installed at proceedings/WWW-2003 by cuihang on May 26, 2003. Maintained by kanmy.

WWW 2004

#Proceedings @WING(cte) @Sunfire
The Thirteenth International World Wide Web Conference (WWW-2004) - New York, USA, 17-22 May 2004.
Installed at proceedings/WWW-2004 by kanmy on May 31, 2004. Maintained by kanmy.

Tools

a large list of language analysis and generation tools, including parsers, chunkers, part-of-speech taggers, etc
The 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data (Linguistic Data Consortium (LDC) catalog number LDC2009T05 and isbn 1-58563-508-1): NIST MetricsMATR is a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics that correlate highly with human assessments of MT quality. See index.html for more details.
Installed at tools/evalTools/metricsMATR08 on Mar 09, 2009.
Unknown details.
Installed at tools/internetTools/linkAnalysis on May 14, 2003.

Alignment-Based Learning

#Tools @Sunfire
Alignment-Based Learning (ABL) is a symbolic grammar inference framework that has successfully been applied to several unsupervised machine learning tasks in Natural Language Processing (NLP). Given sequences of symbols only, a system that implements ABL induces structure by aligning and comparing the input sequences. As a result, the input sequences are augmented with the induced structure. See README or http://www.ics.mq.edu.au/~menno/research/software/abl/ for more details.
Installed at tools/frameworks/abl-1.0 on Dec 21, 2006.

Ant

#Tools @WING(cte) @Sunfire
The build utility for Java projects. From http://ant.apache.org/. You may need to unset your CLASSPATH to get this tool running properly.
Installed at tools/buildTools/apache-ant/ by kanmy on Dec 20, 2004. Maintained by kanmy. Language: Any. Status: Open source available software.
Installed at tools/finiteState/fsm on Dec 12, 2003.
Utility to draw finite state transducers, acceptors, and machines. See their homepage at http://www.research.att.com/sw/tools/graphviz/. Installation notes: really a pain to install, requires the gd library package and a working jpeg lib (had to install the jpeg 6b patch).
Installed at tools/drawingTools/graphviz/ by kanmy on Nov 07, 2003. Maintained by kanmy. Status: installed, untested.

BoosTexter

#Tools @WING(cte) @Sunfire
BoosTexter is a machine learning algorithm that computes a classifier from simple single level decision trees (a.k.a. decision stumps) via boosting.
Installed at tools/leaners/BoosTexter by kanmy on Jan 19, 2003. Maintained by kanmy. Language: Any. Status: installed, not tested. Use restricted to research only.

BOW machine learning toolkit

#Tools @WING(cte) @Sunfire
Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students. The library provides facilities for: Recursively descending directories, finding text files. Finding `document' boundaries when there are multiple documents per file. Tokenizing a text file, according to several different methods. Including N-grams among the tokens. Mapping strings to integers and back again, very efficiently. Building a sparse matrix of document/token counts. Pruning vocabulary by word counts or by information gain. Building and manipulating word vectors. Setting word vector weights according to Naive Bayes, TFIDF, and several other methods. Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing. Scoring queries for retrieval or classification. Writing all data structures to disk in a compact format. Reading the document/token matrix from disk in an efficient, sparse fashion. Performing test/train splits, and automatic classification tests. Operating in server mode, receiving and answering queries over a socket. The code conforms to the GNU coding standards. It is released under the Library GNU Public License (LGPL). Home Page: http://www-2.cs.cmu.edu/~mccallum/bow.
Installed at tools/learners/bow-20020213 by kanmy on Dec 30, 2002. Maintained by kanmy. Language: Any. Status: installed but currently broken on the local system.
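Two of the facilities listed for Bow, mapping strings to integers and building a sparse matrix of document/token counts, can be illustrated with a short Python sketch. This is not libbow's C API: the function name, the whitespace tokenizer and the dictionary-based row format are assumptions chosen to keep the example small.

```python
from collections import Counter


def build_count_matrix(docs):
    """Return (vocab, rows) for a list of document strings.

    vocab maps each token string to an integer id (assigned in order of
    first appearance); rows holds one sparse row per document as a
    {token id: count} dict. Hypothetical sketch, not libbow's interface.
    """
    vocab = {}
    rows = []
    for text in docs:
        counts = Counter()
        for token in text.lower().split():
            token_id = vocab.setdefault(token, len(vocab))
            counts[token_id] += 1
        rows.append(dict(counts))
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = build_count_matrix(["a b a", "b c"])
    print(vocab)
    print(rows)
```

Only the tokens that actually occur in a document are stored in its row, which is the point of the sparse representation: most document/token cells are zero and take no space.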

c2html-0.9.5-1

#Tools @WING(cte) @Sunfire
From Ashley Clark's debian linux package. Compiles fine on Solaris. A converter for C code to colorize and write markup in HTML.
Installed at tools/htmlTools/c2html/ by kanmy on Aug 02, 2003. Maintained by kanmy. Language: Any. Status: GNU GPL.

C4.5 decision tree learner

#Tools @WING(cte) @Sunfire
The classic decision tree learner by Quinlan. Superseded by his commercial C5.0 product. Handles numerical and categorical features. More information from http://www.cse.unsw.edu.au/~quinlan/.
Installed at tools/learners/c4.5 by kanmy on Jan 19, 2003. Maintained by kanmy. Language: Any. Status: installed and tested. Works fine.

CFUF

#Tools @Sunfire
CFUF is a graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter. Developed by Michael Elhadad and Mark Kharitonov.
Installed at tools/generators/cfuf on Jun 13, 2004.

Charniak Parser

#Tools
Eugene Charniak's parser, as made available from his Brown homepage, at http://www.cs.brown.edu/people/ec/#software
Installed at nowhere by kanmy on Dec 30, 2002. Maintained by kanmy. Language: English. Status: Currently installed and working.

Collins Parser

#Tools @WING(cte) @Sunfire
The Collins parser as made available by Michael Collins of MIT. Michael Collins' home page: http://www.ai.mit.edu/people/mcollins/.
Installed at tools/parsers/COLLINS-PARSER by kanmy on Dec 30, 2002. Maintained by kanmy. Language: English. Status: Currently installed and working. See also in this file the daemonized version of the Collins parser.

Coloring HTML Annotation Tool

#Tools @Sunfire
Coloring works by processing an input HTML file or a URL. The output is the original file with extra javascript added and its <A HREF> tags altered so that the text can be annotated. A user can then annotate this file in a javascript-enabled browser by simply highlighting spans (starting on a word and ending on a word) and selecting an appropriate annotation from the annotation pane. The user can also annotate images with the same tags by clicking on them directly. See README in directory for more details.
Installed at tools/annotationTools/coloring on Nov 08, 2008.

CRUNCH HTML Content Extractor Proxy

#Tools @WING(cte) @Sunfire
Described in Gupta et al.'s paper in WWW 2003.
Installed at tools/htmlTools/proxy by kanmy on Jun 03, 2003. Maintained by kanmy. Status: Restricted license for research purposes only, contact the maintainer for access to this tool.

Daemonized Collins Parser

#Tools @WING(cte) @Sunfire
The modified Collins parser as made available by Min-Yen Kan of NUS. Modified to allow the parser to load the hash tables once and stay resident (as a background daemon process) so that the parser can parse multiple files without having to re-load the hash tables each time. See the on-line README for details.
Installed at tools/parsers/daemonCollins by kanmy on Aug 04, 2003. Maintained by kanmy. Language: English. Status: Currently installed and working. See also in this file the original version of the Collins parser.

DUCView

#Tools @Sunfire
The DUCView tool is pertinent to the creation of a model pyramid from multiple human summaries. It is not relevant if you are interested in peer annotation, that is, in evaluating a new summary against the pyramid. Specifically for DUC 2005, participants will receive already annotated pyramids and will do only peer annotation. See DUCView site for more details.
Installed at tools/evalTools/DucView on Jul 22, 2005.

Duke University's Autobib

#Tools @WING(cte) @Sunfire
The Autobib project proposes and implements a framework for extracting and integrating bibliographic information on the Web automatically using Hidden Markov Models. Here, you will find code and documentation related to this project, and you can also browse the experimental bibliographic data and check its quality. This project is done in the Computer Science Department at Duke University, under the supervision of Prof. Jun Yang.
Installed at tools/internetTools/autobib from source by kanmy on Jul 13, 2005. Maintained by kanmy. Language: English. Status: freely available data.
This is a toolkit of perl scripts to manipulate and (hopefully) recover a hierarchy of headers/topics from HTML files. The resulting output is a document topic tree (or variously called a document map, or document structure tree). The toolkit here is an extraction-based method that looks for what seems like stand-alone phrases that may be headers. The toolkit is constructed in a serial pipeline fashion.
Installed at tools/htmlTools/extractDTT on Jun 06, 2003.

FUF

#Tools @WING(cte) @Sunfire
Functional unification based natural language generation system developed by Michael Elhadad. Home page at: http://www.cs.bgu.ac.il/research/projects/surge/index.htm. Runs in LISP.
Installed at tools/generators/fuf by kanmy on Dec 28, 2002. Maintained by kanmy. Language: Any. Status: untested on the local system. Runs in LISP.

GATE

#Tools @WING(cte) @Sunfire
The General Architecture for Text Engineering, from the University of Sheffield's NLP group. Has a GUI for tools that do named entity tagging, part-of-speech tagging, co-reference, and other things. Is a bit slow; is implemented in Java. You will want to see the online documentation at their site. The information extraction system, ANNIE (A Nearly-New Information Extraction system), comes as part of the installation.
Installed at tools/frameworks/gate by kanmy on Jul 03, 2003. Maintained by kanmy. Language: Any. Status: Is under GPL, so it is free for all. Works fine.

Google Tools

#Tools @Sunfire
These are tools to deal with Google search. These are developed for local deployment with NUS SoC only. For more information contact the authors.
Installed at tools/internetTools/googleTools on May 30, 2003.

Google Web API

#Tools @WING(cte) @Sunfire
API for accessing the Google search results, preferable to screen / page scraping. You need to register with Google in order to use this service. They require individual registration. Home page at: http://www.google.com/apis/
Installed at tools/internetTools/googleapi by kanmy on Jan 10, 2003. Maintained by kanmy. Language: Any. Status: tested, okay on the local system.

Grok

#Tools @Sunfire
The Grok build system is based on Jakarta Ant, which is a Java building tool originally developed for the Jakarta Tomcat project but now used in many other Apache projects and extended by many developers.
Installed at tools/frameworks/grok on Oct 23, 2001.

Hidden Markov Model Toolkit (HTK)

#Tools @WING(cte) @Sunfire
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples. See http://htk.eng.cam.ac.uk/ for more information.
Installed at tools/frameworks/htk/ by kanmy on Jan 22, 2005. Maintained by kanmy. Status: restricted use, you have to be a registered user on the HTK site in order to use this software. Please abide by the usage agreements before using this software.

HMM Tagger (Xerox tagger)

#Tools @WING(cte) @Sunfire
Xerox part-of-speech tagger. XPOST is a hidden Markov model based part-of-speech tagger. Given a sentence, each token is assigned a part-of-speech ambiguity class from a lexicon (e.g. "package" is in the ambiguity class {noun,verb}). Words not in the lexicon are subjected to suffix analysis. A probabilistic model that assesses the likelihood of particular part-of-speech assignments based on word order is then applied to disambiguate the available choices. The final output is a sentence with each word tagged with the most likely part-of-speech tag. XPOST can process all the languages for which word order predicts part-of-speech tag. FTP site at: ftp://ftp.parc.xerox.com/pub/tagger/. Use within Common LISP.
Installed at tools/taggers/xpost-1.2 by kanmy on Dec 30, 2002. Maintained by kanmy. Language: English. Status: currently tested and working.

HTML Language Detector

#Tools @Sunfire
The Language detector is used to detect the language of an HTML webpage. See README for more details.
Installed at tools/languages/languageDetector on Dec 04, 2003.

Illinois Chunker

#Tools @Sunfire
The chunker partitions plain text into sequences of semantically related words. The type of partition is also computed. The installed version is in perl. See README for more details.
Installed at tools/chunkers/shallow-parser from CCG on Jul 20, 2004.

JavaRAP

#Tools @Sunfire
JavaRAP is an implementation of the classic Resolution of Anaphora Procedure (RAP) given by Lappin and Leass (1994). It resolves third person pronouns and lexical anaphors, and identifies pleonastic pronouns. The original purpose of the implementation was to provide anaphora resolution results to our TREC 2003 Q&A system. See the site for more details.
Installed at tools/anaphoraResolvers/JavaRAP or tools/coreference/JavaRAP from JavaRAP site on Mar 30, 2007.

KEA 2.0

#Tools @Sunfire
The KEA Keyphrase extractor. Meant to build keywords from a document, much like the keywords used in the indexing terms for scientific papers. Uses the Lovins stemmer. Described in more detail at http://www.nzdl.org/Kea/.
Installed at tools/chunkers/KEA-2.0 by kanmy on Sep 18, 2003. Maintained by kanmy. Language: English. Status: Installed but not tested. Distributed under GNU GPL by the New Zealand DL group.
Klex is a finite-state lexical transducer for the Korean language, with the lexical string on the upper side and the inflected surface string on the lower side. Klex was developed on the XFST (Xerox Finite State Tool) software platform. Developed by Na-Rae Han. Homepage at: http://www.cis.upenn.edu/~nrh/klex.html.
Installed at tools/languages/korean/morphologyTools/klex by kanmy on Apr 21, 2004. Maintained by kanmy. Language: Korean. Status: restricted access to researchers (as per LDC policy).
Klex: Finite-State Lexical Transducer for Korean was produced by Linguistic Data Consortium (LDC) catalog number LDC2004L01 and ISBN 1-58563-283-x. Klex is a finite-state lexical transducer for the Korean language, with the lexical string on the upper side and the inflected surface string on the lower side. Klex was developed on the XFST (Xerox Finite State Tool) software platform, developed and distributed by the Xerox Corporation. The most common application for such lexical transducers is morphological analysis and generation.
Installed at tools/languages/korean/morphologyTools/klex on May 11, 2004. Language: Korean.

Language Technology Platform

#Tools @Sunfire
See http://ir.hit.edu.cn/ltp/ for details.
Installed at tools/frameworks/HIT_IRLab_LTP_Sharing_Package_Full_v1.1 on Nov 19, 2006.

LinkIT 1.0

#Tools @Sunfire
This is a chunker and statistical for simplex noun phrases (SNP). We present a linguistically-motivated technique for the recognition and grouping of simplex noun phrases (SNPs) called LinkIT. Our system has two key features: (1) we efficiently gather minimal NPs, i.e. SNPs, as precisely and linguistically defined and motivated in our paper; (2) we apply a refined set of postprocessing rules to these SNPs to link them within a document. The identification of SNPs is performed using a finite state machine compiled from a regular expression grammar, and the process of ranking the candidate significant topics uses frequency information that is gathered in a single pass through the document. The paper, Document Processing with LinkIT, was published in RIAO 2000. Also mentioned in Automatic identification and organization of index terms for interactive browsing.
Installed at tools/chunkers/LinkIT by kanmy on Dec 06, 2003. Maintained by kanmy. Status: restricted to academic use.

Lovins' Stemmer

#Tools @WING(cte) @Sunfire
Three different implementations of the stemmer are available from Eibe Frank's home page on the Lovins stemmer (http://www.cs.waikato.ac.nz/~eibe/stemmers/index.html). The software is downloadable from Sourceforge.
Installed at tools/stemmers/Lovins_Java by kanmy on Sep 18, 2003. Maintained by kanmy. Language: English. Status: GNU GPL: perl, Java versions installed and working, C version downloaded, but doesn't currently compile.
Adwait Ratnaparkhi's Maximum-Entropy based tagger, as per his 1997 ACL paper. This tool outputs the format expected by Collins' parser (also locally installed). Note that you have to use standard input to pass the input texts in.
Installed at tools/taggers/mxTag by kanmy on Jul 03, 2003. Maintained by kanmy. Language: English. Status: Restricted to research, educational and academic use only. Currently works without any problems.

MITRE's Alembic Workbench 4.40

#Tools @Sunfire
A tool to help in the development of tagged corpora. Uses a Tcl interface. See the AWB home page for more details at http://www.mitre.org/tech/alembic-workbench/. Usage notes: go to the directory and source the awb.cshrc or awb.bashrc file before running the awb utility.
Installed at tools/frameworks/awb/ by kanmy on Nov 07, 2003. Maintained by kanmy. Status: For research purposes only. Cannot be used for commercial development.
Tools for inflectional morphological analysis and generation, and for determining the orthography of the indefinite article are now available. Written by John Carroll of the University of Sussex. See the home page for more information.
Installed at tools/morphers/morph/ by kanmy on Jun 15, 2004. Maintained by kanmy. Language: English. Status: free for academic and research purposes from Carroll's tool home page.

nguyent6Spider

#Tools @Sunfire
UrlBasedFocusedCrawler, BreadthFirstCrawler, PageTextBasedFocusedCrawler. See README for more details.
Installed at tools/internetTools/nguyent6Spider on May 23, 2005.

nlparser (2005 May 26)

#Tools @WING(cte) @Sunfire
A natural language parser for English and Chinese. See the README file for more information. Home page: http://www.cs.brown.edu/software/
Installed at tools/parsers/nlparser by tanyeefa on Jun 17, 2005. Maintained by tanyeefa. Language: English. Status: Currently installed and working. Free for use for any non-commercial purposes.

OpenNLP

#Tools @Sunfire
The OpenNLP build system is based on Jakarta Ant, which is a Java building tool originally developed for the Jakarta Tomcat project but now used in many other Apache projects and extended by many developers.
Installed at tools/frameworks/opennlp on Mar 25, 2002.
The opennlp.maxent package is a mature Java package for training and using maximum entropy models. The documentation has some details about maximum entropy and using the opennlp.maxent package. It is updated only periodically, so check out the Sourceforge page for Maxent for the latest news. You can also ask questions and join in discussions on the forums.
Installed at tools/learners/maxent from Sourceforge by kanmy on Oct 14, 2010. Maintained by kanmy. Status: publicly available from sourceforge.

ParaCite

#Tools @Sunfire
Citation parser tool; Part of the OpCit project. See site for details.
Installed at tools/citationTools/opcit_modules on Jul 30, 2003. Status: Unknown.

Perl 5.8.0

#Tools @Sunfire
Perl version 5.8.0. Was installed because I couldn't find it on sf3. Have downloaded and quickinstalled a slew of modules for NLP/IR research. See the complete listings of installed modules here. See the documentation on installing new Perl modules at the end of this file; email the maintainer for more information on installing the files. Modules of particular interest to NLP/IR people include the WordNet::QueryData, WordNet::Similarity modules.
Installed at tools/languages/programming/perl-5.8.0 by kanmy on Jun 02, 2003. Maintained by kanmy.

Perl 5.8.2

#Tools @Sunfire
Perl version 5.8.2. Have downloaded and quickinstalled a slew of modules for NLP/IR research, mostly mirroring the 5.8.0 installation. See the complete listings of installed modules here. See the documentation on installing new Perl modules at the end of this file; email the maintainer for more information on installing the files. See also notes for Perl 5.8.0 above.
Installed at tools/languages/programming/perl-5.8.2 by kanmy on Dec 23, 2003. Maintained by kanmy.

Porter's Stemmer

#Tools @WING(cte) @Sunfire
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems. Detailed description and a host of downloadable versions of it in different languages can be found at Porter Stemming Algorithm.
Installed at tools/stemmers/Porter by qiul on Sep 19, 2003. Maintained by qiul. Language: English. Status: ANSI C thread-safe version installed and working.
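To give a flavour of the suffix stripping described above, here is a Python sketch of only the first rule group (step 1a) of Porter's published algorithm, which handles plural endings. It is deliberately incomplete: the remaining steps and their measure conditions are omitted, the function name is hypothetical, and it is not the installed ANSI C version.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm to a lowercase word.

    Rules, tried in order (longest suffix first):
      SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    Sketch of one step only, not a full stemmer.
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))
```

The output of a stemmer step need not be a dictionary word ("ponies" becomes "poni"); what matters for term normalisation is that related forms collapse to the same string.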

Prescript

#Tools @WING(cte) @Sunfire
Versions 0.1 and 2.2 are installed. This is a Postscript to text converter, developed by the NZDL group. I believe this is the converter used by Google for PDF files too.
Installed at tools/formatTools/prescript by kanmy on Jun 25, 2003. Maintained by kanmy. Language: Any. Status: Currently installed but NOT working.

Python

#Tools @WING(cte) @Sunfire
The Python programming language. An older version central to sf3/sunfire can be found at /opt/sfw/bin/python.
Installed at tools/languages/programming/python by kanmy on Sep 08, 2003. Maintained by kanmy. Status: Public-domain, downloaded from Sourceforge.

Python 2.3

#Tools @Sunfire
Deprecated; superseded by version 2.5.2.
Installed at tools/languages/programming/python-2.3 by kanmy on Sep 08, 2003. Maintained by kanmy. Status: Public-domain, downloaded from Sourceforge.

ROUGE 1.5.5, 1.5.4 and 1.4.2

#Tools @WING(cte) @Sunfire
ROUGE is an automated summarization evaluation program used by NIST in the DUC conferences to evaluate summarization systems. It is based on the BLEU machine translation scoring metric. See http://www.isi.edu/~cyl/ROUGE/ for more information.
Installed at tools/evalTools/rouge by kanmy on Sep 21, 2005. Maintained by kanmy. Status: open to the research community.
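
As a rough illustration of the n-gram overlap idea behind ROUGE-N, the following Python sketch computes clipped n-gram recall of a candidate summary against a single reference. It is not the interface of the installed ROUGE script: the function names are hypothetical, tokenization is plain whitespace splitting, and stemming, stopword removal, and multiple references are all left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the candidate.

    Matches are clipped: an n-gram is credited at most as many times
    as it appears in the candidate.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    cand = "the cat sat on the mat"
    ref = "the cat was on the mat"
    print(rouge_n_recall(cand, ref, n=1))
    print(rouge_n_recall(cand, ref, n=2))
```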

Ruby 1.8.7

#Tools @Sunfire
The Ruby programming language.
Installed at tools/languages/programming/ruby-1.8.7 by kanmy on Oct 23, 2008. Maintained by kanmy. Status: Public-domain, downloaded from Sourceforge.

SecondString (20030401)

#Tools @WING(cte) @Sunfire
An open-source Java package containing implementations for approximate string-matching techniques, such as Jaccard, Jaro and TF-IDF. Home page: http://secondstring.sourceforge.net/
Installed at tools/citationTools/secondstring by tanyeefa on Aug 27, 2005. Maintained by tanyeefa. Status: released under the University of Illinois/NCSA Open Source License.
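
For readers unfamiliar with the measures named above, here is a small Python sketch of token-level Jaccard similarity. It does not use or mirror SecondString's Java API; the function name and the lowercase, whitespace-split tokenization are assumptions made for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two strings.

    Defined as |A intersect B| / |A union B|; two empty strings are
    treated as identical (similarity 1.0) by convention here.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    print(jaccard("William W. Cohen", "William Cohen"))
    print(jaccard("Min-Yen Kan", "Kan Min-Yen"))
```

Because the measure works on sets of tokens, it ignores word order, which is useful when matching names or citation fields that appear in different orders.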

Segmenter 1.10

#Tools @WING(cte) @Sunfire
Min-Yen Kan's linear topical segmentation program, as described in Coling-ACL 1998.
Installed at tools/segmenters/segmenter/ by kanmy on Jul 21, 2003. Maintained by kanmy. Language: Any languages with word delimiters. Status: working, available for research use only.

SPADE

SPADE is a discourse parser at sentence level written by Radu Soricut at USC/ISI. You can find details about the approach implemented by SPADE in the paper: Radu Soricut and Daniel Marcu (2003). Sentence Level Discourse Parsing using Syntactic and Lexical Information. See details and license in Daniel Marcu's web page http://www.isi.edu/licensed-sw/spade/.
Installed at tools/parsers/SPADE by cuihang on Feb 17, 2003. Maintained by cuihang. Status: works well, but it requires running under bash shell instead of C-Shell.

SMART

#Tools @Sunfire
SMART is an implementation of the vector-space model of information retrieval proposed by Salton in the 1960s. The primary purpose of SMART is to provide a framework in which to conduct information retrieval research. Standard versions of indexing, retrieval, and evaluation are provided.
Installed at tools/frameworks/ir/smart-11.0 on Nov 07, 2003.
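
The vector-space model mentioned above can be sketched in a few lines of Python: documents become term-weight vectors and are compared by cosine similarity. This is a generic illustration, not SMART's code or configuration; the function names are hypothetical, and the weighting (raw term frequency times log(N/df)) is just one simple choice among the many schemes a system such as SMART supports.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: tf * idf} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = [
        "information retrieval systems",
        "information retrieval research",
        "support vector machines",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]))  # share two terms
    print(cosine(vecs[0], vecs[2]))  # share no terms
```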

SNoW POS Tagger

#Tools @Sunfire
A POS tagger from UIUC, available at http://l2r.cs.uiuc.edu/~cogcomp/eoh/pos.html
Installed at tools/taggers/SNOW_UIUC by cuihang on Jun 19, 2003. Maintained by cuihang. Language: English. Status: Currently installed and working.

SOAP::Lite for Perl

#Tools @Sunfire
SOAP::Lite for Perl is a collection of Perl modules that provides a simple and lightweight interface to the Simple Object Access Protocol (SOAP) on both the client and server side. To learn more about SOAP, go to http://www.soaplite.com/#LINKS.
Installed at tools/internetTools/perlModules/SOAP-Lite on Apr 16, 2002.

SVM-light

#Tools @WING(cte) @Sunfire
SVMlight is an implementation of Vapnik's Support Vector Machine for pattern recognition, regression, and learning a ranking function. The optimization algorithms used in SVMlight are described in the references listed on the home page. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently. Home page: http://svmlight.joachims.org/
Installed at tools/learners/svmLight-5.0 by kanmy on Dec 30, 2002. Maintained by kanmy. Language: Any. Status: Works.

SWI-Prolog 5.4.7

#Tools @WING(cte) @Sunfire
SWI-Prolog offers a comprehensive Free Software Prolog environment. See its home page at: http://www.swi-prolog.org/.
Installed at tools/languages/programming/pl by kanmy on Mar 10, 2005. Maintained by kanmy. Status: LGPL. Free for use.

Tcl/Tk 8.4.11

#Tools @Sunfire
A software system providing a simple command language and a set of widgets for use in building GUIs. Home page: http://www.tcl.tk/. Tcl/Tk was installed because WordNet 2.1 requires it to install, and only Tcl (not Tk) is found on sf3.
Installed at tools/languages/programming/tcltk by tanyeefa on Jul 24, 2005. Maintained by tanyeefa. Status: Installed and untested. You may use Tcl/Tk in any way you wish, even in commercial applications.

Tidy 4 (Aug 00 distribution)

#Tools @WING(cte) @Sunfire
A tool to convert non-conformant HTML into compliant HTML code. From Sourceforge, based on the original version from Dave Raggett.
Installed at tools/htmlTools/tidy by kanmy on Jun 03, 2003. Maintained by kanmy.

Tiny SVM 0.09

#Tools @Sunfire
TinySVM is an implementation of Support Vector Machines (SVMs) for the problem of pattern recognition. This installation includes the shared library under the lib/ subdirectory. See Taku Kudoh's web page (http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/) and the doc/index.html file for more information on the tool. Usage notes: as TinySVM's binaries have exactly the same names as those created by SVM light, the executables are not included in the rpnlpir group account's path.
Installed at tools/learners/TinySVM by kanmy on Nov 07, 2003. Maintained by kanmy. Status: installed, compiled, tested. For public use, under GNU LGPL.

Brill's Rule-Based Tagger 1.14

Brill's part-of-speech tagger, generating Penn treebank tags. Home page at: http://www.cs.jhu.edu/~brill/.
Installed at tools/taggers/RULE_BASED_TAGGER_V1.14 by kanmy on Dec 28, 2002. Maintained by kanmy. Language: English.

umdhmm-v1.02

#Tools @Sunfire
An HMM tool from Tapas Kanungo's software page. Implementation of the Forward-Backward, Viterbi, and Baum-Welch algorithms.
Installed at tools/learner/HMM/umdhmm-v1.02 by cuihang on Sep 15, 2003. Maintained by cuihang. Status: works well.
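
Of the three algorithms listed, Viterbi decoding is the easiest to show compactly. The Python sketch below finds the most likely hidden state sequence for a discrete HMM. It is a generic textbook version, not umdhmm's C code or file format; the function signature, the dictionary-based model layout, and the toy Healthy/Fever model are all assumptions made for the example, and it multiplies raw probabilities (no log-space scaling), so it is only suitable for short sequences.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_state_path, probability) for a non-empty observation list."""
    # best[t][s]: probability of the most likely path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


if __name__ == "__main__":
    states = ["Healthy", "Fever"]
    start_p = {"Healthy": 0.6, "Fever": 0.4}
    trans_p = {
        "Healthy": {"Healthy": 0.7, "Fever": 0.3},
        "Fever": {"Healthy": 0.4, "Fever": 0.6},
    }
    emit_p = {
        "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
    }
    print(viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p))
```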

URLSegEval

#Tools @Sunfire
Calculates Precision, Recall, F1-measure and an improved Pk measure (referred to as the WindowDiff measure in Pevzner and Hearst's paper "A Critique and Improvement of an Evaluation Metric for Text Segmentation"). See README for usage information.
Installed at tools/evalTools/segmentation/URLSegEval on Mar 10, 2005.
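
The measures named above can be sketched as follows in Python, under the assumption that a segmentation is given as a list of 0/1 flags, one per position, where 1 marks a boundary after that position. This is one common formulation and not necessarily the convention URLSegEval uses (check its README); the function names and the choice of window size k are hypothetical.

```python
def boundary_prf(ref, hyp):
    """Precision, recall and F1 of hypothesized boundaries against the reference."""
    true_pos = sum(1 for r, h in zip(ref, hyp) if r and h)
    n_hyp, n_ref = sum(hyp), sum(ref)
    precision = true_pos / n_hyp if n_hyp else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def window_diff(ref, hyp, k):
    """Fraction of length-k windows whose boundary counts differ (lower is better)."""
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    if not 0 < k <= len(ref):
        raise ValueError("k must be between 1 and the sequence length")
    n_windows = len(ref) - k + 1
    errors = sum(
        1 for i in range(n_windows) if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / n_windows


if __name__ == "__main__":
    ref = [0, 1, 0, 0, 1, 0, 0]
    hyp = [0, 0, 1, 0, 1, 0, 0]  # first boundary is off by one position
    print(boundary_prf(ref, hyp))
    print(window_diff(ref, hyp, k=3))
```

The example shows why a window-based measure is useful: exact-match precision and recall treat the near miss as a plain error, while the window-based score penalizes it only in the windows where the boundary counts actually disagree.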

Weka

#Tools @WING(cte) @Sunfire
A collection of machine learning algorithms for data mining tasks. Home page: http://www.cs.waikato.ac.nz/ml/weka/.
Installed at tools/learners/weka by tanyeefa on Jul 05, 2005. Maintained by tanyeefa. Language: English. Status: Currently installed and working. Released under GPL, free for public use.

WT10G URL Locator

#Tools @Sunfire
This is a small utility that locates a given URL within the WT10G collection. See README for details.
Installed at tools/corpusTools/wt by kanmy on Oct 12, 2004. Maintained by kanmy.

xmlAbbrevCoref

#Tools @Sunfire
xmlAbbrevCoref is a program that further annotates an XML document that already carries part-of-speech and named-entity tags, adding simple abbreviation expansions and lemmatization of simplex NP entities. It was written expressly to patch a hole in the TREC 2003 run for coreference resolution; it is *not* meant to be state-of-the-art by any stretch of the imagination. See README for details.
Installed at tools/coreference/xmlAbbrevCoref by kanmy on Jul 21, 2003. Maintained by kanmy. Status: copyright Min-Yen Kan and the School of Computing, NUS.

YamCha Chunker v 0.27

#Tools @Sunfire
YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses Support Vector Machines (SVMs), a state-of-the-art machine learning algorithm first introduced by Vapnik in 1995. Installed from http://cl.aist-nara.ac.jp/~taku-ku/software/yamcha/.
Installed at tools/chunkers/yamcha/ by kanmy on Nov 07, 2003. Maintained by kanmy. Status: installed, compiled, tested. For public use, under GNU LGPL.