2004/07/30
Linear Text Segmentation using a Dynamic Programming Algorithm
Athanasios Kehagias, Fragkou Pavlina, Vassilios Petridis
EACL 03
- Also see: http://citeseer.ist.psu.edu/580132.html, and http://acl.ldc.upenn.edu/eacl2003/html/main.htm, and their Int'l J of Intelligent Systems paper
- Relevant to: text segmentation, partition product models.
- Printed, highlighted and filed
An linear segmentation approach that takes into account within-segment cohesion (they call this homogenity) as well as global information about segment length. Characterize global segment length as a normal distribution with (mu, sigma). Uses a balancing parameter, gamma, to weight the two sources of evidence.
They experimented with Choi's dataset (who consolidated much earlier work on segmentation) and use Beeferman's Pk to rate its success. Note that Pk penalizes segmentation mistakes near true boundaries much less than other mistakes (definitely a good thing).
As this approach has a model component that relies on the global text segment length information, it is important to see whether its performance could be an incidental result of using a training/testing corpus that is rather uniform. From the data in Table 1 it appears that this might be the case, as the average segment length is quite uniform in sets 1, 2 and 3.
From the results, the optimal gamma is set quite low, between .08 and .4, indicating that the segment length model is used as a refinement to the stronger cohesion/homogenity component.
Two other results of the paper are intersting and need to be investigated in more depth. One is that the generalized density (parameter r) seems to improve performance. Why this might be isn't immediately obvious to me. Secondly, they note that the segment length is better modeled by words rather than sentences. This intuitively makes sense to me, but again, its not obvious why sentences perform significantly worse.
Finally, the authors' end with a memorable sentences which should be considered: Choi uses local optimization of global information, and Heinonen uses global optimiziation of local information.
Relevance:
- how do we adapt this segmentation algorithm to do hierarchical segmentation for documents on the web?
- what exactly is a PPM role in the algorithm?
