ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package
This is the home page of the ParsCit project, which performs two
tasks: 1) reference string parsing, sometimes also called citation
parsing or citation extraction, and 2) logical structure parsing of
scientific documents. It is architected as a supervised machine
learning procedure that uses Conditional Random Fields as its learning
mechanism. You can download the code below, parse strings online, or
send batch jobs to our web service. The code contains the
training data, the feature generator and shell scripts to connect the
system to a web service (used on this web site).
Some definitions (thanks to Robert Dale for Citations and Reference
Strings):
- Reference String:
- A text string in the bibliography or
reference section of a work, usually at the end of the document that
refers to a unique document. Usually occurs with other reference
strings that point to other documents. May also appear as
footnotes.
- Citation:
- A text string (usually explicit) in the
document body that points to a corresponding reference string at the
end of the document. Several citations may co-refer to a single
reference string.
- Document Logical Structure:
- A hierarchy of logical
components, for example, titles, authors, affiliations, abstracts,
sections, etc., according to (Mao, Rosenfeld &
Kanungo, 2003). Our logical structure is more comprehensive,
comprising not only header metadata and references, but also the
logical structure of the internals of the document -- sections,
subsections, figures, tables, equations, footnotes and
captions.
This project deals with the problem of parsing the reference
strings and parsing the logical structure of a document. The first
task is handled by a module that shares the project's name, ParsCit, and
the second task by a separate module SectLabel.
License
This software is licensed under the GNU Lesser General Public
License (LGPL), which means you are free to use it for any
purpose, including embedding in commercial products.
Download
You can download the open-source code for ParsCit here. The source requires you to recompile the bundled CRF++ source code
and assumes that perl is installed on your system and can be invoked
as perl (it must be in your path).
- Current version 110505b: Added XML::Twig for XML processing. ParsCit now uses input provided by SectLabel. See CHANGELOG.txt .
The (partially ported) Windows version is here (provided by Yumichika). See the CHANGES FOR WINDOWS.txt
We have also pushed a copy of the ParsCit current distribution into GitHub:knmnyn/parscit.
The Windows version has also been pushed to GitHub:wing-nus/parscit.
While we'll strive to keep the GitHub version as updated as possible, the versions on this page will remain the most authoritative for major updates.
- Other versions:
101101: Incorporated BiblioScript and BibUtils software. See CHANGELOG.txt;
100401d: Added SectLabel (logical structure parsing) software from the NUS team, and Iconip training data from Cheong Chi Hong for ParsCit with new ParsCit model retrained. See CHANGELOG.txt;
090625b: Added documentation for complete re-installation. Improved ParsHed with added line-level CRF model together and post-processing module by NUS team, WSDL file and client for service at NUS and minor bug fixes for ParsCit. See CHANGELOG.txt;
090316: Incorporation of ParsHed (header parsing) software from the NUS team. See CHANGELOG.txt;
081201: Bug fixes and incorporation of byte position offset from the Scienstein.org team. See CHANGELOG.txt;
080917: Minor changes (improved models and multilingual support), see CHANGELOG.txt;
080402: First public release. Comes with precompiled linux binaries for CRF++;
080310: Beta release.
- CRF++: A conditional random fields toolkit that you may need to install, if the compiled one does not work for you. We recommend that you use version 0.51.
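Before building, it may help to confirm that the tools this page names are on your path: perl here, plus ruby and sed as listed in the FAQ. The following is a minimal sketch, not part of the ParsCit distribution; the function name missing_tools is our own.

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not shipped with ParsCit.)
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command resolves.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling missing_tools perl ruby sed prints nothing when all three are available, and otherwise prints one missing tool per line.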
Web Service
More NLP services are now being made available on the web.
Following this trend you can send your plain text citations to us via
our web service. We will parse these for you free of charge (as and
when time and processing power allow; these jobs run at
lower priority).
N.B. We keep logs of what's parsed in these demos, to
improve the accuracy and productivity of ParsCit. If you'd like these
to be kept private or you find you use this service a lot, why not
install a local copy of ParsCit for yourself? If you do, please
let us know where you are so we acknowledge you here and can re-direct
some traffic your way.
Web-based Demonstration
N.B.: We keep logs of what's parsed in these demos, to
improve the accuracy and productivity of ParsCit. If you'd like these
to be kept private, why not install a local copy of ParsCit for
yourself?
You can also run ParsCit directly in your browser. The form below
submits your text input (after suitable cleaning) to the ParsCit
service to parse the input file or strings.
Note that if the system load gets high, your demo call may not be executed. If you want to run this program in batch, please download your own copy.
Demo #1: Parsing the header, logical structure and/or reference strings (and citation contexts) from a text file
Demo #2: As above but using XML input (XML must conform to Omnipage output). This demo is slow so please be patient.
Demo #3: Parsing individual reference strings only (just extract_citations)
Publications
Journal Papers:
- Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan (forthcoming)
Logical Structure Recovery in Scholarly Articles with Rich
Document Features. Forthcoming in the International
Journal of Digital Library Systems.
[ pre-print .pdf ]
International Refereed Conference Publications:
- Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008)
ParsCit: An open-source CRF reference string parsing
package. In Proceedings of the Language Resources and
Evaluation Conference (LREC 08), Marrakesh, Morocco, May.
[ .pdf ]
[ Poster (.png) ]
Others:
- Yong Kiat Ng. (2004) Citation Parsing Using Maximum Entropy
and Repairs. Undergraduate thesis. National University of
Singapore.
[ .pdf ]
Gold Standard Input and Sample Output
Group Members
- Min-Yen Kan - Project leader, NUS
- Isaac G. Councill, The Pennsylvania State University
- C. Lee Giles, The Pennsylvania State University
- Minh-Thang Luong - Research Assistant (alumnus), NUS
- Yong Kiat Ng - Final year undergraduate student (graduated, 2004), NUS
- Thuy Dung Nguyen - Research Assistant (alumnus), NUS
- Huy Nhat Hoang Do - Research Assistant, NUS
FAQ
- What platforms does ParsCit work on?
- ParsCit works on all major platforms: Windows, Linux and MacOS.
The installation requires ruby and perl and the CRF++ embedded
package also requires standard UNIX utilities like sed. You
should have a working knowledge of UNIX and some experience in
installing UNIX tools. Due to our time constraints, we may not be
able to answer your particular problems with installation. Do let us
know if there was something important that you had to do to get
your particular download and installation working; we'll
incorporate it into the Troubleshooting section below.
- What is the difference between SectLabel and the previous ParsHed?
- SectLabel is a newly-developed module that further extends
ParsHed in functionality. It not only classifies header metadata,
but analyzes full documents to output the logical structure of
the internals of the document -- sections, subsections, figures,
tables, equations, footnotes and captions.
For compatibility
reasons, the ParsHed module is still retained in our source code
and command line options.
- How do I retrain ParsCit for a different language? I saw code in
lib/ParsCit/PreProcess' to find the beginning of the bibliography
section, and changed that but it doesn't work.
- The current version does not depend on those regular expressions
anymore; they are for previous versions (e.g., v101101). ParsCit
now first labels each line using the SectLabel module and
discovers which lines to parse references for based on the first
step's output. You need to retrain SectLabel for this, by
providing labeled data about what class of line each line in your
training data is. It's also possible to "downgrade" the current
version to go back to use the rule-based method for identifying
the reference section.
- What is the "genericHeader" in the output of SectLabel? What is
the difference between "genericSect.tagged" and "SectLabel.tagged"?
- Generic headers, such as introduction, methodology, and
evaluation, represent generic purposes of different sections in a
scholarly article. We map all section names to generic ones
(e.g., "5. Text Features" to "Methodology"). This promotes
comparative viewing of sections with identical purpose across
articles. As for the second question, the generic section classifier is
a component of SectLabel. It is responsible for classifying the
section headers of a paper into generic categories such as
Introduction, Methodology, Result, etc. For details refer to our
IJDLS journal paper.
- Why is there an option to input file in XML format? Which DTD
should it follow?
- SectLabel is a robust logical document structure inference
system that can use rich input (font or spatial features, as
produced by OCR software) to boost recognition
performance, while still being able to perform inference on
impoverished input (plain text) with degraded
performance. Currently, the XML input must be in the
XML output format of Nuance OmniPage (version 16), and hence
should follow the OmniPage DTD. Note: The ParsCit team is not
affiliated with Nuance in any way nor does it endorse
OmniPage.
- I need to run ParsCit but I can't get well-formed text from my
PDF documents. Can you help?
- No, we cannot help you with this. We don't perform OCR or text
extraction from PDF documents. You will have to find your own
source for doing the extraction or conversion. We've found
Omnipage useful in our own project work (hence the possibility of
XML input), but we don't endorse any product.
- The OmniPage XML doesn't seem to be well-formed. Is that OK?
- Yes. The sample "XML" files provided in the links (for Demo 2) are
actual outputs for a sequence of XML pages (one XML file per
page). If you use OmniPage to save an XML file for input to
ParsCit, make sure to save individual pages as separate files,
then concatenate them to send to ParsCit. You may want to
download the sample links for inspection (as they are
concatenations of several XML files, your browser will likely
complain about them not being well-formed).
- I ran Demos 1 and 2 with the default "all" settings, but sections
don't seem to be detected.
- There's no problem. The demo just hides the SectLabel output
by default. Click "Show SectLabel output" to reveal it.
- I ran ParsCit using the OmniPage XML output, but encountered malformed UTF8 character errors.
- OmniPage normally outputs XML results in UTF-16 format. Converting them to UTF-8 will solve the problem, as shown below:
iconv --from-code UTF-16 --to-code UTF-8 omnipageOutput.xml > newOmnipageOutput.xml
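The two OmniPage notes above (one XML file per page, and UTF-16 output) can be handled in a single pass. The sketch below is only an illustration under stated assumptions: the function name omnipage_to_utf8 and the file names in the usage note are hypothetical, and it assumes every page file is UTF-16. It uses -f and -t, the short forms of the --from-code and --to-code options shown above.

```shell
#!/bin/sh
# Convert each per-page OmniPage XML file from UTF-16 to UTF-8, then
# concatenate the pages, in the order given, into a single output file.
# Usage: omnipage_to_utf8 OUTPUT PAGE1.xml [PAGE2.xml ...]
# (omnipage_to_utf8 is a hypothetical helper, not shipped with ParsCit.)
omnipage_to_utf8() {
    out=$1
    shift
    : > "$out" || return 1    # start with an empty output file
    for page in "$@"; do
        # Convert page by page: each UTF-16 file may carry its own
        # byte-order mark, so converting an already concatenated file
        # could leave stray marks in the middle of the text.
        iconv -f UTF-16 -t UTF-8 "$page" >> "$out" || return 1
    done
}
```

For example, omnipage_to_utf8 paper.xml page1.xml page2.xml would write the converted pages to paper.xml, which can then be given to ParsCit as XML input.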
Troubleshooting
A list of common problems with ParsCit. If you find problems,
email the lead developer at <kanmy@comp.nus.edu.sg>. Please use
the subject "[ParsCit]" to ensure that it reaches our attention. If
you have hand-corrected tagged data that you don't mind providing us,
we can use that to further improve ParsCit's extracting capabilities.
Nevertheless, there are problems with the output occasionally. Below
are some common problems people have encountered.
- ParsCit v110505 seems to be a lot slower when used on Omnipage
output than the previous versions, why?
- You are correct. We are now using XML::Twig to do the XML
processing correctly, rather than do it ad-hoc ourselves, but this
requires constructing an exhaustive DOM tree for the Omnipage input.
This is the timesink that you are experiencing.
- I'm running ParsCit on Windows but I can't get it to work, even
after installing a perl interpreter. Specifically, the
citeExtract.pl program dies complaining that it Can't open
"/tmp/...." at line 175.
- ParsCit hasn't been fully tested on Windows at NUS, so we can't
vouch for whether it will run correctly. In this specific error
case, the "/tmp/" directory (a standard place for temporary files in
UNIX systems) is normally not available in Windows, and may generate
problems. You may need to change the code and/or create an
appropriate directory for ParsCit to generate such files.
- I tried downloading and running ParsCit but I get complaints
about /bin/sed and crf not being found. Help?
- Please read the INSTALL.txt directions. You need to recompile
CRF++ for your platform. The paths included with the install are
for our version; you need to recompile to have the paths point
correctly.
- When running citeExtract.pl I get some errors complaining about
the wrong ELF class of the binaries. How can I fix this?
- This seems to be a problem with the compiled executables of
CRF++ bundled with the software. Follow the INSTALL instructions
but after step 1 do:
$ cp -Rf * ../../.libs
$ cp crf_learn ../../.libs/lt-crf_learn
$ cp crf_test ../../.libs/lt-crf_test
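A "wrong ELF class" complaint generally indicates a 32-bit versus 64-bit mismatch between a binary and the system loading it. To see which kind a given executable is, a minimal sketch such as the following reads the class byte from the ELF header. The function name elf_class is our own, and this check is not part of the INSTALL instructions.

```shell
#!/bin/sh
# Report whether a file is a 32-bit or 64-bit ELF binary by reading its header.
# Prints "32", "64", "unknown", or "not-elf".
# (elf_class is a hypothetical helper, not shipped with ParsCit.)
elf_class() {
    # An ELF file starts with the four magic bytes 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    # The fifth byte (offset 4) is the class: 1 = 32-bit, 2 = 64-bit.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo "32" ;;
        2) echo "64" ;;
        *) echo "unknown" ;;
    esac
}
```

Running elf_class crf_learn and comparing the result with the output of uname -m shows whether the bundled executable matches your machine; if it does not, recompiling CRF++ as described is the remedy.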
- I'm trying to install parscit v110505 using the instructions in the install file, and when I get to the point where you're supposed to recompile CRF, it exits with an error:
In file included from node.h:13:0,
from node.cpp:9:
path.h:26:52: error: 'size_t' has not been declared
make[1]: *** [node.lo] Error 1
make[1]: Leaving directory `/home/agarnett/parscit/crfpp/CRF++-0.51'
make: *** [all] Error 2
The install file mentions that this may fail the first time; unfortunately for me, it keeps failing. Any help?
- The error is from the CRF++ package (not from ParsCit). There are two ways to fix it:
1. Add the line #include <iostream> to node.cpp and compile CRF++ again; or
2. Go to http://crfpp.googlecode.com/svn/trunk/doc/index.html and download the latest version. The installation instructions are the same. Hope this helps.
- Issue numbers don't get extracted.
- This issue should be fixed as of the v110505
release. There is now some heuristic postprocessing code to
take care of breaking single or multiple tokens for issues and
volumes.
- Separation of author names and publishing year fails
- In some reference data with non-standard sequences of
first names and family names, e.g.
Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
psychology: theory and application of intellectual functioning; in:
Annual Review of Psychology, 50, 471-507
ParsCit's post processing step may not detect and deal with these
problems reliably. We're working to fix these too.
- I passed ParsCit plain text output but in another, non-English
language. I didn't get good results or I got empty results. Can
you help?
- Aside from English, ParsCit can handle Italian and German to a
limited extent, thanks to the multilingual training data.
The demo web interface, however, uploads non-ASCII data (e.g., UTF-8 or
UTF-16) as binary data and fails to execute ParsCit.
If you download a copy of ParsCit, the libraries do work
on such data. Here's a sample. We'd love to help make
a more universal model that can accommodate reference strings in
other languages. If you're willing to help contribute ground
truth data, we'd love to hear from you!
- How about retraining ParsCit for another language/domain?
- You can put your supervised exemplar data into the same format
as tagged_references.txt found in crfpp/traindata/. Once you have
this file you can generate the appropriate model for ParsCit, by
using three commands (assumes you are in the crfpp/traindata
directory):
$ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
$ ../crf_learn parsCit.template parsCit.train.data model
$ mv model ../../resources/parsCit.model
The first command creates the input feature file that crfpp uses
from the training data. The second creates the model using the
crf_learn command. You can then move the model file to the
resources/ subdirectory where it can be utilized. To replace the
default model that comes with ParsCit, just execute the final
command.
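The three commands above can be wrapped so that a failure in one step stops the rest. This is a minimal sketch under the same assumption as the original commands (it is run from the crfpp/traindata directory). The function name retrain_parscit and the backup of the existing model are our additions, not part of the ParsCit distribution.

```shell
#!/bin/sh
# Run the three retraining steps in order, stopping at the first failure.
# Must be called from the crfpp/traindata directory, as the commands assume.
# Usage: retrain_parscit [TAGGED_FILE]
# (retrain_parscit is a hypothetical wrapper, not shipped with ParsCit.)
retrain_parscit() {
    tagged=${1:-tagged_references.txt}
    model_dest=../../resources/parsCit.model

    if [ ! -f "$tagged" ]; then
        echo "training file not found: $tagged" >&2
        return 1
    fi

    # Step 1: generate the CRF++ feature file from the tagged references.
    ../../bin/tr2crfpp.pl "$tagged" > parsCit.train.data || return 1

    # Step 2: train a new model from the template and the feature file.
    ../crf_learn parsCit.template parsCit.train.data model || return 1

    # Added precaution (not in the original steps): keep the previous model.
    if [ -f "$model_dest" ]; then
        cp "$model_dest" "$model_dest.bak" || return 1
    fi

    # Step 3: install the new model where ParsCit looks for it.
    mv model "$model_dest"
}
```

Keeping a .bak copy makes it easy to return to the default model if the retrained one performs worse on your data.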
- Can I retrain the package for a different set of tags if I
change the tagset in the training data?
- Yes, you should be able to change the tagset to suit your
dataset. You can add, eliminate and change the tagset as you
wish. You need to retrain the parser system after creating your
tag data. For more details on the training process, see the
documentation for CRF++, which is on the web at SourceForge.
- When retraining I get a "bad_alloc" error. What gives?
- We're not entirely sure of the cause. CRF training is quite memory
intensive and training on a large number of data tuples may
cause the embedded CRF++ package to fail. You can try with less
training data, or try training on a machine with a larger amount
of RAM.
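One simple way to try with less training data is to thin the tagged file before generating features. The sketch below keeps every K-th line. It assumes that the tagged training file holds one reference per line, which you should verify for your own data, and the function name thin_training_data is hypothetical.

```shell
#!/bin/sh
# Keep every K-th line of a tagged training file, starting with the first.
# Assumes one tagged reference per line (an assumption; check your file).
# Usage: thin_training_data K INPUT > OUTPUT
# (thin_training_data is a hypothetical helper, not shipped with ParsCit.)
thin_training_data() {
    # K must be a positive integer; reject anything else.
    [ "$1" -gt 0 ] 2>/dev/null || return 1
    awk -v k="$1" '(NR - 1) % k == 0' "$2"
}
```

For example, thin_training_data 2 tagged_references.txt > tagged_half.txt keeps roughly half of the references; the retraining commands can then be run on tagged_half.txt instead.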
- Does the web service actually work? I can't seem to run it.
- Occasionally our school's networking staff changes the firewall
settings, so the port for our group's web services may be blocked
(port 4000 on host wing.comp.nus.edu.sg). If you find you can't
reach our services (they time out), please let us know.
- I get funny errors with crf_test not being useful. How do I
fix this?
- The updated README.txt file in the 090625b
distribution fixes this. Basically you need to recompile CRF++
0.51 and place the libraries and the executables in the proper
place. See the README for details.
Kudos
ParsCit owes its continued maintenance and support to its user
base. Here we'd like to thank them for their help.
Thanks to David Judd who reconfigured how CRF++ is located with
respect to the main code. Thanks to Alex Garnett for spotting more
problems with CRF dependencies. Thanks to George E. Raptis and Eric
Tran for the port to Windows. Thanks to Zhu Ying-Bo
(yumichika@163.com) from the Language Computing and Web Mining Group,
Institute of Computer Science and Technology of Peking University for
the partial port to Windows. Thanks to Yustus Oktian for questions
about training for another language. Thanks to Madhur Kapoor for
asking questions about PDF conversion. Thanks to Behrang Qasemizadeh
for reporting problems with truncation of XML entities in XML output
(v110505). Thanks to Tim Brody for his BiblioScript patch. Thanks to
David Jurgens for suggesting that we remove temporary files after runs
(v110505). Thanks to Nikolay Nikolov for suggesting the conversion of
OmniPage XML results from UTF-16 to UTF-8 to avoid encoding
problems. Thanks to Matteo Romanello for the suggestion and permission
to incorporate BiblioScript software (v101101). Many thanks to Kris
Jack for pointing out problems with the ELF binaries and an
appropriate fix. Thanks to Cheong Chi Hong for fixing problems with
Preprocess.pm (v100401) and contributing the ICONIP training data and
XML entity problems in reference string parsing (v100401). Thanks to
Priya Venkateshan for pointing out sudo/root installation
possibilities (v100401). Thanks to Mario Lipinski for reporting
punctuation stripping problems in reference string parsing (v100401).
Thanks to Artemy Kolchinsky for fixes in Preprocess.pm
(v090625). Thanks to Matteo Romanello for the humanities training
datasets. Thanks to Dain Kaplan for helping us fix the Preprocess.pm
bug. Thanks to Ayeh Bandeh-Ahmadi for correcting the warning in
parseRefString.pl. Thanks to Nick Friedrich and Jöran Beel of
scienstein.org for all fixes in the v081201 version of ParsCit. Also
thanks to Madian Khabsa for indicating problems with installation on
MacOS.
ParsCit is used by many projects worldwide, and not just in
experimental, research and academic places, but in commercial
enterprises as well. Mendeley
is using ParsCit to parse references from contributed papers, as is
the Citations in Economics
(CitEc) project.
Related Links
Other, open-source citation parsers:
Other related links. Contact Min below to get your other related
software listed here. Thanks!
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Fri Dec 24 01:48:05 SGT 2004
| Version: 1.0
| Last modified:
Mon Mar 4 14:23:46 SGT 2013