Document reconstruction is a blanket term I am using for a single process that can hypothetically be applied in a few different ways. Basically, we want to perform a “semantic analysis” on a set of written texts such that we can determine their thematic relatedness to one another.
Hypothetically, we want to be able to apply the same procedure to The Federalist Papers as we would to The Dresden Codex. All that will be required is some information about how the symbols are grouped together.
Imagine that we have been asked to help do some or all of the following: digitize, preserve, archive, catalogue, and/or analyze a set of textual documents belonging to an artist, scholar, or institution. These may number in the hundreds of thousands and have varying degrees of organization. In the case of personal papers, they may contain limited (or no) page numbering or dating.
What we want is a way for our digitization procedures to also yield some clues about how all of these papers (texts) fit together: an assist at the jumping-off point of our organization process.
How do we do this? We use NLP to help form some initial grouping for us to examine.
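As a first pass, that initial grouping can be as simple as measuring vocabulary overlap between pages. Here is a minimal sketch; the page texts, the page names, and the pairing rule (match each page to its most lexically similar neighbor) are all invented for illustration, not a full methodology:

```python
# Toy sketch: group a handful of "pages" by shared-vocabulary overlap.
# The documents and the pairing rule are invented for illustration.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Shared-vocabulary overlap between two token sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b)

docs = {
    "page_1": "the committee met to discuss the budget",
    "page_2": "budget discussions continued in committee",
    "page_3": "jasper is an older dog who loves walks",
}

# Pair every page with its most lexically similar neighbor.
pairs = {}
for name, text in docs.items():
    others = [(jaccard(tokens(text), tokens(t)), n)
              for n, t in docs.items() if n != name]
    pairs[name] = max(others)[1]

print(pairs["page_1"])  # page_2: the two budget pages group together
```

Even this crude overlap measure clusters the two budget pages away from the dog-walking page, which is the kind of "jumping off point" we are after before any human examination begins.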
Computers do not have access to conceptual content the way that humans do. They technically “know” nothing and therefore do not recognize what a given text is about, nor can they generate inferences about it.
For example, given the sentence “Ramona walked her older dog Jasper,” even a child can make a few potential inferences about the information. Here's a partial list.
Depending on the Context of Utterance (the conversation or discourse context), a child might also be able to tell you some other things.

One thing that we definitely know is that our Machine does not have World Knowledge either, and will not be able to engage in any Pragmatic reasoning.
However, they can do something super cool, and very quickly: they can calculate word vectors (a measure of one word's relation to another) and build a representational space based on these measurements. If we help our machines do this properly, then we humans can use that output to extrapolate about the relatedness of two or more texts. (Word vectors are covered in their own tutorial.)
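To make the measurement concrete, here is a minimal sketch. The three-dimensional vectors are hand-invented stand-ins (real systems learn hundreds of dimensions from huge corpora), but the cosine-similarity calculation is the standard way of comparing two word vectors:

```python
import math

# Hand-invented 3-dimensional vectors, stand-ins for learned embeddings.
vectors = {
    "dog":    [0.9, 0.8, 0.1],
    "puppy":  [0.8, 0.9, 0.2],
    "treaty": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Angle-based relatedness of two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    return dot / norm

dog_puppy = cosine(vectors["dog"], vectors["puppy"])
dog_treaty = cosine(vectors["dog"], vectors["treaty"])
print(dog_puppy > dog_treaty)  # True: "dog" sits nearer "puppy" than "treaty"
```

The representational space is just this: every word gets a position, and relatedness falls out of the geometry rather than out of any conceptual understanding.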
It's not all that easy to imagine what goes on when algorithms are run over millions of sentences, in other words, what a machine learns from its training. It is much easier to think about a few sentences and then abstract away a bit.
Let's look at three sentences and think about what our NLP semantic analysis might discover.

(1) Bea played a trick on Ramona.
(2) Ramona played a trick on Bea.
(3) Bea was able to fool Ramona.
One thing that our Machine will determine based on semantic analyses is that there is a high degree of relatedness between (1) and (2). This makes some sense. Both sentences contain all of the same words. The sentences contain information about identical entities, Ramona and Bea, albeit taking different grammatical roles. The sentences also have identical predicates.
Because of this, our machine will find them to be more related to each other than to (3). This might seem counter-intuitive, and it certainly is if your expectation is that the Machine applies some conceptual content. But this it cannot do. On the other hand, a child will likely tell you that (1) and (3) basically tell us, or mean, the same thing: namely, that Bea intentionally performed some action which had the effect of misleading Ramona. A child knows who the “agent” of the bamboozling is in each case and also how the predicates “to play a trick” and “able to fool” relate conceptually to one another. That is, the child and the machine will make different choices about which sentences are related. They do this because, fundamentally, they evaluate them differently.
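A toy calculation makes the machine's surface-level choice concrete. Assuming three illustrative sentences of the shape described above (the first two sharing every word with the roles swapped, the third a conceptual paraphrase), a plain bag-of-words comparison, with no conceptual content at all, already favors the (1)-(2) pairing:

```python
from collections import Counter
import math

# Illustrative stand-ins for sentences (1)-(3): the first two share every
# word (roles swapped); the third paraphrases the first in different words.
s1 = "Bea played a trick on Ramona"
s2 = "Ramona played a trick on Bea"
s3 = "Bea was able to fool Ramona"

def bow_cosine(a, b):
    """Cosine similarity over raw word counts: pure surface statistics."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

print(bow_cosine(s1, s2))  # ~1.0: identical word counts
print(bow_cosine(s1, s3))  # much lower: only the two names overlap
```

The machine's verdict here comes entirely from counting shared words; nothing in the calculation can see that playing a trick and being able to fool describe the same event.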
Our hope as researchers using NLP as a tool in our work is that, with enough training over large enough corpora, the machine develops the proper statistical generalizations such that the two predicates, in some sense, become linked. This may happen, and it may not.
We should know going in that, whether we are talking single sentences or whole manuscripts, we're going to have more success correlating things like (1) and (2) than we are (1) and (3). Our goal is to create a methodology that gets us closer to that latter kind of identification, linking (1) with (3).