
ParsCit: An open-source CRF Reference String Parsing Package

This is the home page of the ParsCit project, which performs reference string parsing. It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service (coming soon!). The code contains the training data, the feature generator, and shell scripts that connect the system to a web service (also used here).

Some definitions (thanks to Robert Dale):

Reference String:
A text string in the bibliography or reference section of a work, usually at the end of the document, that refers to a unique document. It usually occurs alongside other reference strings that point to other documents, and may also appear in footnotes.
Citation:
A text string (usually explicit) in the document body that points to a corresponding reference string at the end of the document. Several citations may co-refer to a single reference string.

This project deals with two problems: parsing reference strings, and parsing the metadata found on the title page of a document. The first task is handled by the module that gives the project its name, ParsCit, and the second by a separate module, ParsHed. Other projects related to ParsCit (some here in WING, some elsewhere) deal with identifying and linking citations to reference strings.


Download

You can download the open-source code for ParsCit here (coming soon). The source requires you to re-compile the CRFPP source code and assumes that perl is installed on your system and can be invoked using perl (must be in your path).
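
As a small sketch (not part of the ParsCit distribution), the path requirement above can be checked before building. The function name have_cmd is our own, chosen for illustration.

```shell
# have_cmd NAME: succeed (exit status 0) if NAME can be found on the PATH.
# Hypothetical helper for illustration; ParsCit does not ship it.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}
```

For example, `have_cmd perl || echo "perl not found on PATH" >&2` prints a warning when perl cannot be invoked by name.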

Web Service

More NLP services are now being made available on the web. Following this trend, you can send your plain text citations to us via our web service. We will parse these for you free of charge (as and when time and processing power allow; these jobs run at lower priority).

N.B. We keep logs of what's parsed in these demos, to improve the accuracy and productivity of ParsCit. If you'd like your data to be kept private, or you find you use this service a lot, why not install a local copy of ParsCit for yourself? If you do, please let us know where you are so we can acknowledge you here and re-direct some traffic your way.

Web-based Demonstration

N.B.: We keep logs of what's parsed in these demos, to improve the accuracy and productivity of ParsCit. If you'd like these to be kept private, why not install a local copy of ParsCit for yourself?

You can also run ParsCit directly in your browser. The form below submits your text input (after suitable cleaning) to the ParsCit service to parse the input file or strings.

Demo #1: Parsing the citation contexts and the reference strings from a whole text file

NB - this demo does not handle PDF input at this time. You can use another web service or software to convert PDFs to text.

Method 1) Enter a URL to a file on the web (e.g., http://wing.comp.nus.edu.sg/~forecite/samples/E06-1050.txt or W06-0102.txt).

Method 2) Upload a .txt file (ASCII; UTF-8)

Method 3) Paste the whole file here:

For all three methods, parse the header metadata using ParsHed model at


Demo #2: Parsing individual strings only

Method 1) Enter a URL to a file on the web in the correct format (each line should be a separate citation; e.g., http://wing.comp.nus.edu.sg/~forecite/samples/E06-1050.cite).

Method 2) Upload a file (again, each line should be a separate citation)

Method 3) Enter a list of plain text citations (again, one per line):
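
Reference strings copied from a document are often wrapped over several lines, while all three methods above expect one citation per line. The following is a minimal sketch of one way to reformat them, assuming the references are separated by blank lines; the function name one_per_line is our own and is not part of ParsCit.

```shell
# one_per_line: read reference strings separated by blank lines on stdin
# and print each reference as a single line on stdout.
# Hypothetical helper for illustration; ParsCit does not ship it.
one_per_line() {
    awk 'BEGIN { RS = "" }   # paragraph mode: blank lines separate records
         {
             gsub(/[ \t]*\n[ \t]*/, " ")   # join wrapped lines with one space
             sub(/^[ \t]+/, "")            # trim leading whitespace
             sub(/[ \t]+$/, "")            # trim trailing whitespace
             print
         }'
}
```

Running `one_per_line < wrapped.txt > refs.cite` (both file names are placeholders) produces a file in the one-citation-per-line layout. Separator lines that contain only spaces are not treated as blank, so strip trailing whitespace first if your input has them.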


Publications

International Refereed Conference Publications:

Others:

Gold Standard Input and Sample Output

Group Members

Troubleshooting

ParsCit's output is occasionally wrong. Below are some common problems people have encountered. If you find other problems, email the lead developer at <kanmy@comp.nus.edu.sg>. Please use the subject "[ParsCit]" to ensure that it reaches our attention. If you have hand-corrected tagged data that you don't mind providing us, we can use that to further improve ParsCit's extraction capabilities.

Issue numbers don't get extracted.
We're looking into this. The training data does not distinguish between volume and issue number. We'd like to fix that in a subsequent release.
Separation of author names and publishing year fails
In some reference data with non-standard sequences of first names and family names, e.g.
  Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
  psychology: theory and application of intellectual functioning; in:
  Annual Review of Psychology, 50, 471-507
ParsCit's post processing step may not detect and deal with these problems reliably. We're working to fix these too.
I passed ParsCit plain text in a non-English language. I didn't get good results or I got empty results. Can you help?
Aside from English, ParsCit can handle Italian and German to a limited extent, thanks to the multilingual training data. However, the demo web interface uploads non-ASCII data (e.g., UTF-8 or UTF-16) as binary data and fails to execute ParsCit. If you download a copy of ParsCit, the libraries do work on such data. Here's a sample. We'd love to help make a more universal model that can accommodate reference strings in other languages. If you're willing to contribute ground truth data, we'd love to hear from you!
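
Because the demo interface mishandles non-ASCII uploads, it can help to check a file locally before submitting it. The following is a rough sketch using only standard tools; the function name count_non_ascii is our own and is not part of ParsCit.

```shell
# count_non_ascii FILE: print the number of bytes in FILE that fall
# outside the 7-bit ASCII range (0 means the file is pure ASCII).
# Hypothetical helper for illustration; ParsCit does not ship it.
count_non_ascii() {
    # In the C locale tr works byte by byte: delete every ASCII byte,
    # then count what is left. The last tr strips padding that some
    # wc implementations add around the number.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

If the count is greater than zero, the file contains non-ASCII data and is better parsed with a local copy of ParsCit than through the demo.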
How about retraining ParsCit for another language/domain?
Put your supervised exemplar data into the same format as tagged_references.txt, found in crfpp/traindata/. Once you have this file, you can generate the model for ParsCit with three commands (run from the crfpp/traindata directory):

$ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
$ ../crf_learn parsCit.template parsCit.train.data model
$ mv model ../../resources/parsCit.model

The first command creates, from the training data, the input feature file that crfpp uses. The second trains the model using the crf_learn command. The final command moves the model file into the resources/ subdirectory, where ParsCit can use it. Note that this replaces the default model that comes with ParsCit.
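
The three commands can also be wrapped in a small shell function that stops at the first failure. This is only a sketch: the function name retrain and the backup copy of the existing model are our own additions for illustration, not part of the recipe above.

```shell
# retrain [TAGGED_FILE]: run the three retraining commands in order,
# stopping as soon as one fails. Call it from the crfpp/traindata
# directory. TAGGED_FILE defaults to tagged_references.txt.
# Hypothetical helper for illustration; ParsCit does not ship it.
retrain() (
    tagged=${1:-tagged_references.txt}
    [ -f "$tagged" ] || { echo "missing training file: $tagged" >&2; exit 1; }

    # Assumption, not in the original recipe: keep a copy of the model
    # that is about to be replaced, so it can be restored later.
    if [ -f ../../resources/parsCit.model ]; then
        cp ../../resources/parsCit.model ../../resources/parsCit.model.bak || exit 1
    fi

    # The same three commands as above, chained so that a failure in
    # one step prevents the later steps from running.
    ../../bin/tr2crfpp.pl "$tagged" > parsCit.train.data &&
    ../crf_learn parsCit.template parsCit.train.data model &&
    mv model ../../resources/parsCit.model
)
```

The body runs in a subshell (note the parentheses), so an early exit does not close your interactive shell. To restore the previous model, move parsCit.model.bak back over parsCit.model.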

Kudos

ParsCit owes its continued maintenance and support to its user base. Here we'd like to thank them for their help. Thanks to Artemy Kolchinsky for fixes in Preprocess.pm (v090625). Thanks to Matteo Romanello for the humanities training datasets. Thanks to Dain Kaplan for helping us fix the Preprocess.pm bug. Thanks to Ayeh Bandeh-Ahmadi for correcting the warning in parseRefString.pl. Thanks to Nick Friedrich and Jöran Beel of scienstein.org for all fixes in the v081201 version of ParsCit.

Related Links

Other, open-source citation parsers:

Other related links. Contact Min below to get your other related software listed here. Thanks!


Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Fri Dec 24 01:48:05 SGT 2004 | Version: 1.0 | Last modified: Sat Jul 4 03:49:51 2009