Applying Text Resegmentation during Text Collation
Abstract of the paper
In almost all current approaches, collation of large texts is applied to a fixed, given segmentation of the two text witnesses to be compared and consists of two subsequent steps. First, the segments of the two texts are aligned, then the aligned segments are compared in detail. For larger manuscripts or books consisting of many pages, the paragraphs of the texts are usually taken as segments. However, if, for example, two texts are compared with each other, the second of which was created by revising the first, this procedure can lead to poor local alignments. This occurs at points in the text where paragraphs have been split into two smaller paragraphs to insert a new paragraph in between, or where several consecutive sentences have been moved from one paragraph to the previous or next paragraph. Most paragraph collation tools cannot handle these scenarios properly because they align each paragraph with at most one paragraph of the other text. In this paper, we discuss this problem in detail and present a heuristic for the resegmentation of both texts to be compared, with the aim of obtaining a better collation.
Why is it needed?
The task of finding differences and similarities between texts is part of many scientific questions in the Humanities, e.g., textual criticism or investigating the origin and transmission of a text. To address such questions, a collation of the text witnesses under consideration is usually created first, today typically with digital tools such as CollateX, TSAligner, LERA, VecAlign or SentAlign. For larger manuscripts and books, the collation step must be divided into two substeps: the alignment of the paragraphs and the subsequent detailed comparison of the aligned paragraphs.
There are different scenarios in which the (given) segmentation poses problems for the collation. For example, a large part of a paragraph may have been removed during the revision of a text. If the original text is to be collated with the revised text, it would make sense to split the paragraph of the original text to get matching pairs again. Another scenario is that the revision moved several consecutive sentences from one paragraph to the previous or next paragraph. The collation tool will not find a good collation if it is not allowed to (slightly) change the given segmentation.
Our approach tries to resolve these issues by changing the segmentation, i.e., by moving sentences between segments. It also tries to minimize the number of changes to the segmentation, leaving unproblematic parts untouched. The program also offers the option to only color the affected segments, so that the humanities scholar can decide which changes to the segmentation are acceptable.
The heuristic
The heuristic consists of three steps: preprocessing, classification and reconstruction.
Preprocessing
In the preprocessing step we compute a sentence alignment with a method known from the literature. It is based on "Putting collation of text witnesses on a formal basis" by Dähne et al. (2021).
They represent the alignment problem as a directed, weighted graph and use dynamic programming to find a shortest path.
This method has the advantage of having only one parameter, c3 (we call it simply c), which is used to limit the distances considered for the alignment.
Another advantage is that it can be used at both the segment and the sentence level if the user has not provided an alignment.
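To illustrate the idea, the following is a minimal sketch of such a shortest-path alignment by dynamic programming, with a uniform gap cost and the threshold c cutting off pairs whose distance is too large. The exact graph construction and edge weights of Dähne et al. (2021) may differ in detail, and a `jaccard_distance` function on the preprocessed text units is assumed to be given (see the bit-set sketch below).

```python
# Sketch only: align two sequences of text units (segments or sentences) by
# finding a cheapest path of match/gap steps; pairs with distance >= c are
# not considered as matches.
def align(units_a, units_b, c=0.8, gap_cost=1.0):
    n, m = len(units_a), len(units_b)
    # dp[i][j] = minimal total cost of aligning the first i units of A
    # with the first j units of B.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], back[i][0] = i * gap_cost, "up"
    for j in range(1, m + 1):
        dp[0][j], back[0][j] = j * gap_cost, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = jaccard_distance(units_a[i - 1], units_b[j - 1])  # assumed helper
            candidates = [
                (dp[i - 1][j] + gap_cost, "up"),    # unit of A left unaligned
                (dp[i][j - 1] + gap_cost, "left"),  # unit of B left unaligned
            ]
            if d < c:  # the parameter c limits the distances considered
                candidates.append((dp[i - 1][j - 1] + d, "diag"))
            dp[i][j], back[i][j] = min(candidates)
    # Trace back the cheapest path to obtain the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```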
Classification
In the classification step we classify the sentence units into three classes: fitting, non-fitting and not aligned.
Fitting sentences are those that are aligned and properly placed in the segment alignment.
Non-fitting sentences are those that are aligned but not properly placed in the segment alignment: they are aligned to sentences that belong to a segment in a different row of the segment alignment.
Not aligned sentences are those that are not aligned to any sentence of the other witness.
To simplify processing, consecutive sentences that are classified the same way are grouped into blocks.
Then special rules are applied to handle not aligned sentences and invalid non-fitting sentences.
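The following sketch illustrates the classification and the grouping into blocks for the sentences of one witness. It assumes the sentence alignment is given as index pairs and that `row_of_a` / `row_of_b` map each sentence to the row of the segment alignment its segment currently belongs to; these names and data structures are illustrative, not the tool's actual interface.

```python
from itertools import groupby

def classify(num_sentences_a, sentence_pairs, row_of_a, row_of_b):
    # Row of the partner sentence for every aligned sentence of witness A.
    partner_row = {a: row_of_b[b] for a, b in sentence_pairs}
    labels = []
    for a in range(num_sentences_a):
        if a not in partner_row:
            labels.append("not aligned")
        elif partner_row[a] == row_of_a[a]:
            labels.append("fitting")        # partner sits in the same alignment row
        else:
            labels.append("non-fitting")    # partner sits in a different row
    # Group consecutive sentences with the same label into blocks.
    blocks, start = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        blocks.append((label, start, start + length - 1))
        start += length
    return blocks
```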
Reconstruction
In the reconstruction step we rebuild the original segment alignment from the blocks, while moving blocks of non-fitting sentences to new segment alignment rows.
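The simplified sketch below shows the idea for one witness: sentences of fitting and not aligned blocks stay grouped by their original row, while every non-fitting block is split out into a row of its own. The data structures (blocks as `(label, start, end)` triples, `row_of_a` as a mapping from sentence index to original row) are illustrative choices; in particular, the pairing of moved blocks with their partner rows on the other side is omitted here.

```python
def reconstruct(blocks, row_of_a):
    """Rebuild the segment rows of one witness from the classified blocks."""
    new_rows, current, current_row = [], [], None
    for label, start, end in blocks:                 # blocks come in text order
        if label == "non-fitting":
            if current:
                new_rows.append(current)             # close the row built so far
                current, current_row = [], None
            new_rows.append(list(range(start, end + 1)))  # block gets its own row
        else:
            for s in range(start, end + 1):
                if current and row_of_a[s] != current_row:
                    new_rows.append(current)         # keep original row boundaries
                    current = []
                current.append(s)
                current_row = row_of_a[s]
    if current:
        new_rows.append(current)
    return new_rows
```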
Parameters
Our heuristic has no parameters if both sentence alignment and segment alignment are given.
If one alignment is missing (or both), the parameter c needs to be set.
The parameter c is an upper bound on the Jaccard distance (JD) that two aligned sentences may have.
For example, setting c = 0.7 will only allow distances less than 0.7.
If c is set to 1.0, all distances are allowed.
How to set the parameters
A good approach is to start with c = 0.8.
This allows only sentence pairs with a similarity greater than 20% (i.e., a Jaccard distance below 0.8) to be aligned.
After that, search the sentence alignment for the pairs with the worst distances and check whether these sentences should be aligned at all.
If you find any bad pairs, decrease c and run the heuristic again.
We have found that c = 0.8 or c = 0.7 is a good value for most texts, depending on how similar they are.
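As a hypothetical illustration of this workflow, a small helper that lists the aligned sentence pairs with the largest Jaccard distances can make the check easier; `jaccard_distance` is assumed to be available, and the function is not part of the tool itself.

```python
def worst_pairs(sentence_pairs, sentences_a, sentences_b, top=10):
    """Return the aligned sentence pairs with the largest Jaccard distance."""
    scored = [(jaccard_distance(sentences_a[a], sentences_b[b]), a, b)
              for a, b in sentence_pairs]
    return sorted(scored, reverse=True)[:top]

# Typical workflow: run the heuristic with c = 0.8, inspect worst_pairs(...)
# and, if the top entries look like wrong alignments, rerun with c = 0.7.
```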
Details on the experiments
We tested our approach with different parameters and on six different texts. Our heuristic found potential for improvement in all text alignments.
As distance measure between two segments or sentences we used the Jaccard distance (JD) on the sets of their words. Dähne et al. (2021) showed in "Putting collation of text witnesses on a formal basis" that it gives good results for aligning paragraphs. We speed up the calculation by using bit sets (bit arrays), and our experiments show that it is also a good distance measure for sentences.
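The following minimal sketch shows how the Jaccard distance on word sets can be computed with bit sets, using plain Python integers as bit masks; the vocabulary indexing is an illustrative choice, and `int.bit_count()` requires Python 3.10 or newer.

```python
def to_bitset(words, vocabulary):
    """Encode a word set as an integer bit mask; vocabulary maps word -> bit."""
    mask = 0
    for w in words:
        index = vocabulary.setdefault(w, len(vocabulary))
        mask |= 1 << index
    return mask

def jaccard_distance_bits(mask_a, mask_b):
    union = mask_a | mask_b
    if union == 0:
        return 0.0                               # two empty word sets are identical
    intersection = mask_a & mask_b
    # int.bit_count() needs Python >= 3.10; bin(x).count("1") works otherwise.
    return 1.0 - intersection.bit_count() / union.bit_count()

vocabulary = {}
a = to_bitset(["der", "sinn", "des", "lebens"], vocabulary)
b = to_bitset(["der", "wert", "des", "lebens"], vocabulary)
print(jaccard_distance_bits(a, b))               # 1 - 3/5 = 0.4
```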
We applied normalization to the texts in order to work with different texts and languages. On the text level we preprocessed the paragraphs and converted tabs and new lines to a single whitespace in order to get better results from the sentence boundary detection. For the distance calculation in the sentence unit alignment we extracted words by splitting at whitespace and removed stop words for the specific language. We then normalized the remaining words by applying canonical decomposition (NFD) and removing all characters that are neither letters nor spaces (according to their Unicode properties).
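A sketch of this normalization pipeline could look as follows; the stop word list is assumed to be provided per language, the case-insensitive stop word matching is an illustrative choice, and the exact cleanup rules of our implementation may differ slightly.

```python
import re
import unicodedata

def normalize_words(text, stop_words):
    # Tabs and new lines are converted to a single whitespace.
    text = re.sub(r"\s+", " ", text)
    words = []
    for word in text.split():
        # Stop word removal for the given language.
        if word.lower() in stop_words:
            continue
        # Canonical decomposition (NFD), then drop everything that is
        # neither a letter nor a space according to its Unicode category.
        decomposed = unicodedata.normalize("NFD", word)
        cleaned = "".join(ch for ch in decomposed
                          if unicodedata.category(ch).startswith("L") or ch == " ")
        if cleaned:
            words.append(cleaned)
    return words

print(normalize_words("Der  Sinn\tund Wert des Lebens.", {"und", "des", "der"}))
# -> ['Sinn', 'Wert', 'Lebens']
```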
Test data
We tested our approach on six texts: "Wally die Zweiflerin" (in the following abbreviated as "Wally"), "Aus der Knabenzeit" ("Knab"), "Der Sinn und Wert des Lebens" ("SWL"), "Der grüne Heinrich" ("Heinrich"), "Histoire philosophique et politique des établissements et du commerce des Européens dans les deux Indes" ("Histoire") and "Nahar". We chose them because they are freely available and vary in size.
Text | Edition | # sentences | # aligned rows | Description
---|---|---|---|---
Nahar | 1922 / 1930 | 2001 / 1998 | 621 / 820 | Novel by Ernst Weiß
SWL | 1907 / 1914 | 1002 / 1325 | 30 / 461 | Book by Rudolf Eucken
Heinrich | 1854 / 1879 | 8027 / 8863 | 786 / 2943 | Novel by Gottfried Keller
Histoire | 1780 / 1820 | 1950 / 2224 | 305 / 506 | Book by Guillaume Raynal (French)
The Edition column indicates the years of origin of the two witnesses, the # sentences column gives the number of sentences in each of them, and the # aligned rows column shows the number of aligned segments (rows) in relation to the total number of alignment rows.
Experimental results
As the assessment of alignments is often subjective and depends on the humanities question under study, we can only present the results here and give our own observations.
Text | Parameter c | Result
---|---|---
Nahar | 0.8 | Original colored / Improved
SWL | 0.7 | Original colored / Improved
Heinrich | 0.8 | Original colored / Improved
Histoire | 0.7 | Original colored / Improved
For the text "SWL," the heuristic identified only 11 non-fitting blocks. Although this number seems small, it's significant given there are only 30 aligned segments in the original alignment. This indicates that the two text versions are quite dissimilar and have probably undergone substantial rework. With two exceptions, the non-fitting blocks typically contain large segments of non-aligned blocks (consecutive non-aligned sentences, or NABs). This suggests that while the created segments do not match well, the non-fitting sentences within these segments are on average 48% similar. NABs are excluded from this measure because they don't match any part of the other text version. Moving NABs to their own segments would improve this, but we aim to change the original segment alignment as little as possible, respecting any manual changes users may have made. "SWL" is also the only text with invalid non-fitting blocks — sentences that are aligned, but in the original segment alignment there are aligned segments between the aligned sentences. The first such block is in row 205, with its proposed partner in row 250. Refer to the sentence alignment to find these pairs.
The text versions of "Nahar" match well, except for the beginning. After the first section, almost all segments are aligned. The segments are short, and so are the non-fitting blocks, which typically consist of only one or two sentences that are moved to the next segment. Excluding NABs, the average similarity of non-fitting blocks is very high at 89%, indicating a strong match.
The French text "Histoire" falls in the middle. It contains long passages that match well, but occasionally long passages with not matching segments. The heuristic found 34 non-fitting blocks with an average similarity of 46%. Similar to "SWL," the non-fitting blocks often contain large NABs.
The largest text, "Heinrich," contains segments of varying length (sometimes more than 10 sentences). The beginning contains relatively large non-fitting blocks (more than three or four sentences). After that, the first half of the text versions are quite similar, with most segments aligned. The second half contains only a few passages with matching segments. The heuristic found 298 non-fitting blocks with an average similarity of 60%. Manually identifying and correcting these parts would be time-consuming for humanities scholars, but our approach significantly reduces this workload.
To address the issue of NABs in non-fitting blocks, it may be beneficial to implement this approach in an interactive environment. This would allow users to manually decide how to handle NABs and non-fitting blocks on a case-by-case basis.
Downloads
Contact
Martin Luther University Halle-Wittenberg | https://www.informatik.uni-halle.de/ti/mitarbeiter | https://www.informatik.uni-halle.de/ti/


