Document reconstruction is a blanket term I am using for a single process that can hypothetically be applied in a few different ways. Basically, we want to perform a “semantic analysis” on a set of written texts such that we can determine their thematic relatedness to one another.
Hypothetically, we want to be able to apply the same procedure to The Federalist Papers as we would to The Dresden Codex. All that will be required is some information about how the symbols are grouped together.
Imagine that we have been asked to help do some or all of the following: digitize, preserve, archive, catalogue, and/or analyze a set of textual documents belonging to an artist, scholar, or institution. These may number in the hundreds of thousands and have varying degrees of organization. In the case of personal papers, they may contain limited (or no) page numbering or dating.
What we want is a way for our digitization procedures to also yield some clues about how all of these papers (texts) fit together: an assist at the jumping-off point of our organization process.
How do we do this? We use NLP to help form some initial grouping for us to examine.
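As a first pass, that initial grouping can be as simple as measuring vocabulary overlap between pages. Here is a minimal sketch; the page texts, the page names, and the pairing rule (match each page to its most lexically similar neighbor) are all invented for illustration, not a full methodology:

```python
# Toy sketch: group a handful of "pages" by shared-vocabulary overlap.
# The documents and the pairing rule are invented for illustration.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Shared-vocabulary overlap between two token sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b)

docs = {
    "page_1": "the committee met to discuss the budget",
    "page_2": "budget discussions continued in committee",
    "page_3": "jasper is an older dog who loves walks",
}

# Pair every page with its most lexically similar neighbor.
pairs = {}
for name, text in docs.items():
    others = [(jaccard(tokens(text), tokens(t)), n)
              for n, t in docs.items() if n != name]
    pairs[name] = max(others)[1]

print(pairs["page_1"])  # page_2: the two budget pages group together
```

Even this crude overlap measure clusters the two budget pages away from the dog-walking page, which is the kind of "jumping off point" we are after before any human examination begins.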
Computers do not have access to conceptual content the way that humans do. They technically “know” nothing and therefore do not recognize what a given text is about, nor can they generate inferences about it.
For example, given the sentence “Ramona walked her older dog Jasper,” even a child can make a few potential inferences about the information. Here's a partial list.
Depending on the Context of Utterance (the conversation or discourse context), a child might also be able to tell you some other things.

One thing that we definitely know is that our Machine does not have World Knowledge either, and will not be able to engage in any Pragmatic reasoning.
However, they can do something super cool, and very quickly: they can calculate word vectors (a measure of one word's relation to another) and build a representational space based on these measurements. If we help our machines do this properly, then we humans can use that output to extrapolate about the relatedness of two or more texts. (Word vectors are covered in their own tutorial.)
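To make the measurement concrete, here is a minimal sketch. The three-dimensional vectors are hand-invented stand-ins (real systems learn hundreds of dimensions from huge corpora), but the cosine-similarity calculation is the standard way of comparing two word vectors:

```python
import math

# Hand-invented 3-dimensional vectors, stand-ins for learned embeddings.
vectors = {
    "dog":    [0.9, 0.8, 0.1],
    "puppy":  [0.8, 0.9, 0.2],
    "treaty": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Angle-based relatedness of two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    return dot / norm

dog_puppy = cosine(vectors["dog"], vectors["puppy"])
dog_treaty = cosine(vectors["dog"], vectors["treaty"])
print(dog_puppy > dog_treaty)  # True: "dog" sits nearer "puppy" than "treaty"
```

The representational space is just this: every word gets a position, and relatedness falls out of the geometry rather than out of any conceptual understanding.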
It's not all that easy to imagine what goes on when algorithms are run over millions of sentences, in other words, what a machine learns from its training. It is much easier to think about a few sentences and then abstract away a bit.
Let's look at three sentences and think about what our NLP semantic analysis might discover.

(1) Bea played a trick on Ramona.
(2) Ramona played a trick on Bea.
(3) Bea was able to fool Ramona.
One thing that our Machine will determine based on semantic analyses is that there is a high degree of relatedness between (1) and (2). This makes some sense. Both sentences contain all of the same words. The sentences contain information about identical entities, Ramona and Bea, albeit taking different grammatical roles. The sentences also have identical predicates.
Because of this, our machine will find them to be more related to each other than to (3). This might seem counter-intuitive, and it certainly is if your expectation is that the Machine applies some conceptual content. But this it cannot do. On the other hand, a child will likely tell you that (1) and (3) basically tell us, or mean, the same thing: namely, that Bea intentionally performed some action which had the effect of misleading Ramona. A child knows who the “agent” of the bamboozling is in each case and also how the predicates “to play a trick” and “able to fool” relate conceptually to one another. That is, the child and the machine will make different choices about which sentences are related. They do this because, fundamentally, they evaluate them differently.
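A toy calculation makes the machine's surface-level choice concrete. Assuming three illustrative sentences of the shape described above (the first two sharing every word with the roles swapped, the third a conceptual paraphrase), a plain bag-of-words comparison, with no conceptual content at all, already favors the (1)-(2) pairing:

```python
from collections import Counter
import math

# Illustrative stand-ins for sentences (1)-(3): the first two share every
# word (roles swapped); the third paraphrases the first in different words.
s1 = "Bea played a trick on Ramona"
s2 = "Ramona played a trick on Bea"
s3 = "Bea was able to fool Ramona"

def bow_cosine(a, b):
    """Cosine similarity over raw word counts: pure surface statistics."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

print(bow_cosine(s1, s2))  # ~1.0: identical word counts
print(bow_cosine(s1, s3))  # much lower: only the two names overlap
```

The machine's verdict here comes entirely from counting shared words; nothing in the calculation can see that playing a trick and being able to fool describe the same event.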
Our hope as researchers using NLP as a tool in our work is that, with enough training over large enough corpora, the machine develops the proper statistical generalizations such that the two predicates, in some sense, become linked. This may happen, and it may not.
We should know going in that, whether we are talking single sentences or whole manuscripts, we're going to have more success correlating things like (1) and (2) than we are (1) and (3). Our goal is to create a methodology that gets us closer to that latter kind of identification, linking (1) with (3).