NLP Series: Language tasks

My first week of university just ended. The biggest chunk went to our mandatory course, “Advanced Natural Language Processing”. We went through a lot of material on regular expressions and some basic data tools in Python. The most interesting part of my week was the introduction to “language tasks” and thinking about what they are. There is very little information if you google the term, so I needed to dig to understand it.

I sometimes get stuck insisting on fully understanding these kinds of fundamental theoretical concepts, because to learn I need to grasp the whole system and how everything relates. How do I envision an NLP project, and what is it actually? How does one even start to think about coming up with the concept for one? Language tasks seemed to point in that direction.

We are following the book by Daniel Jurafsky & James H. Martin, “Speech and Language Processing”. The book is freely available to read online. As the professor made clear to us, the book does start somewhat at the beginning, but it quickly picks up pace. We plan to cover the whole book during these few short months. I am both thrilled and terrified, as I have three other courses with a similar workload as well. And only 5-6 hours of childcare :). Did I mention I still don’t have a computer?

Anyway, language tasks.

What is a language task?

A language task is a task where the input or the output is a linguistic object (or both). Now, before you say duh, think about how much you, as a non-linguist, know about linguistic objects and when each one is useful. I actually found that pretty tricky!

It took a lot of google query optimisations to get to some proper theory, and I was happy to find a paper that addresses this: “Language Tasks and Language Games: On Methodology in Current Natural Language Processing Research” by David Schlangen. To relate to the burning meme above once again, it took me almost a full read to realise that the name seemed familiar at the beginning because it belongs to the same professor I had just spent my first week with. Lots of hugs for my tired brain cells.

I am happy that there is a paper, because I can freely cite it and include things from it here.

The paper proposes an even more specific definition than the one centring on linguistic objects: “A language task is a mapping between an input space and an output or action space, at least one of which contains natural language expressions.” In the image below we can see how the task relates to the other parts of the overall NLP computation.

The paper offers a more mathematical definition as well (at the very end of this page, as part of the Appendix), if you are the mathematical type.
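
One way to picture the mapping idea is a small Python sketch. This is only an illustration, not code from the paper: the names `Sentence`, `Tokens`, `TokenisationTask` and `whitespace_tokeniser` are invented here, and tokenisation is just one arbitrary example of a task where both the input and the output are natural language.

```python
from typing import Callable, List

# Input space: sentences. Output space: lists of word tokens.
# Both sides contain natural language, so this counts as a language task.
Sentence = str
Tokens = List[str]

# Hypothetical name for the task: any function from Sentence to Tokens.
TokenisationTask = Callable[[Sentence], Tokens]


def whitespace_tokeniser(sentence: Sentence) -> Tokens:
    """One possible (very naive) processor for the task: split on whitespace."""
    return sentence.split()


processor: TokenisationTask = whitespace_tokeniser
print(processor("Language tasks map inputs to outputs"))
```

The task is the type (what goes in, what comes out), while `whitespace_tokeniser` is only one of many processors that could try to carry it out.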

I saw this diagram in our slides for the week, but it somehow didn’t click until I read the paper.

Now, we have a task. A task has a description. The description can be:

Intensional: an informal way to describe, or from the paper “making reference to theoretical or pre-theoretical constructs external to the definition”.
In mathematics, an intensional definition means “a definition that gives the meaning of a term by specifying all the properties of the things to which the term applies.”
In one place, the intensional definition was explained as a hypernym plus the characteristics that distinguish members of the set referred to by the original term.
Examples of such tasks in the NLP world would be:
- “Find out the emotions in a given text by analysing the sentiment”
- “Identify types of words in a sentence”
Extensional: a more formal way to describe, “through pairs of action”. We can also think about this one in terms of input and output, and the ways to move between them. In the same place as above, an extensional description “is usually a list naming every object (or at least enough of a list to create clarity in the reader’s mind) that belongs to the concept”. In our case, that means being more specific about what exactly constitutes this task.
If we extend the examples we gave above:
- “Using a dataset of collected social media posts and using lexicons as a reference, perform sentiment analysis on the text, with sentences classified by emotion as the output”.
- “Using a dataset of collected texts and a dataset with tagged words, have annotated words as an output”.
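
The two kinds of description can be put side by side in a short Python sketch. Everything here is an assumption made for illustration: the `TaskDescription` class, its field names, the two example sentences and the tag labels (`DET`, `NOUN`, `VERB`) are invented, not taken from the paper or from the course.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaskDescription:
    # Intensional: the task put into words, leaning on concepts
    # such as "types of words" that are defined elsewhere.
    intensional: str
    # Extensional: concrete (input, output) pairs that show the task.
    extensional: List[Tuple[str, List[str]]] = field(default_factory=list)


pos_tagging = TaskDescription(
    intensional="Identify types of words in a sentence",
    extensional=[
        ("the cat sleeps", ["DET", "NOUN", "VERB"]),
        ("a dog barks", ["DET", "NOUN", "VERB"]),
    ],
)

# Print each word next to the tag that the extensional pair assigns to it.
for sentence, tags in pos_tagging.extensional:
    print(list(zip(sentence.split(), tags)))
```

The intensional string only makes sense if we already know what a "type of word" is, while the extensional pairs show the expected behaviour directly, without explaining it.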

I think these descriptions can be improved, and I will come back to just that once I learn more about how to describe the tasks themselves.

The task description relates to a cognitive capability of a speaker of that language. We see that these cognitive capabilities are defined as a “set of capabilities of a competent language speaker”. Competency, in this case, means the unconscious knowledge of grammar, the knowledge that allows communication. It is related to the concept of linguistic competence. If you speak a language well enough to communicate, you are competent!

If we look at the mini examples in the types of descriptions, we can see that we refer to our human capabilities of grammar and contextual analysis, something we do very easily. If we look at the processors (the ways to get to the output in the computation), we will see that the ways we approach this are also closely modelled on how we imagine we do this as humans. One of the interesting questions of the week was: to what extent should we do this at all? How closely should we model our task processors on the human? Currently, the only entity in the world that can reliably process language tasks is a human. That used to be the case with maths too, but it is not any more.

Furthermore, as we can see in the diagram, the dataset “exemplifies” the task. The professor described the dataset as an “inductive specification”. In this case, the dataset helps us specify what exactly we are trying to process and, with that, what rules will apply.
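
A toy Python sketch can show what a dataset acting as an "inductive specification" might look like in practice. The dataset, the tag labels and the function names `induce_lexicon` and `tag` are all invented for this illustration, and picking the most frequent tag per word is just one simple technique chosen here, not a method from the paper or the course.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Invented toy dataset: each item pairs a sentence with one tag per word.
dataset: List[Tuple[str, List[str]]] = [
    ("the cat sleeps", ["DET", "NOUN", "VERB"]),
    ("the dog barks", ["DET", "NOUN", "VERB"]),
    ("a cat barks", ["DET", "NOUN", "VERB"]),
]


def induce_lexicon(data: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Derive word -> tag rules from the examples (most frequent tag wins)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence, tags in data:
        for word, tag_label in zip(sentence.split(), tags):
            counts[word][tag_label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(sentence: str, lexicon: Dict[str, str], unknown: str = "UNK") -> List[str]:
    """Apply the induced rules; words never seen in the dataset get `unknown`."""
    return [lexicon.get(word, unknown) for word in sentence.split()]


lexicon = induce_lexicon(dataset)
print(tag("a dog sleeps", lexicon))
```

Nothing in the code says what a noun is. The rules come only from the examples, which is also why a word that never appears in the dataset ends up tagged as unknown.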

Summary

A language task helps us define what we are to do when starting an NLP project. A language task is ideally rooted in a user need and sometimes, as we learned in class, in a need that comes from the imagination of the researcher.

To describe the task, we have two ways: intensional and extensional. The intensional way is more informal, freely using lots of language-related theoretical concepts. We can express the description in an extensional way too, which will more closely demonstrate what we need to do and what resources we have. Then we have the dataset(s) we use, to give further structure to our project and help us shape the way we will process the task.

The tasks are closely related to the linguistic cognitive abilities of humans, and so are the ways we computationally process these tasks. The question of whether this is the right way to do it remains open.

I’ve carried around some ideas about what I’d like to do if I had NLP skills for a while now, and learning how to define them is already helping me see what actually makes sense, and is doable.

There are a lot more interesting things to talk about, like the types of tasks (understanding, interpretation, generation, reference and inference), the ways to go about these tasks, both classical and more current machine learning ones, what makes a good task, etc. I hope to come back and talk about some of these points in the future.

I struggle the most with all the maths vocabulary and terms that are inevitably part of everything we do. It feels like I need to learn a new language, one that brings concepts that are a stretch to understand from where I stand at the moment. I hope, slowly, very slowly, it will all come together.
