Natural language processing involves the reading and understanding of spoken or written language by a computer. This includes, for example, the automatic translation of one language into another, but also speech recognition and the automatic answering of questions. Such tasks are often difficult for computers because they tend to interpret the meaning of each individual word rather than the sentence or phrase as a whole. A translation program, for instance, can struggle with the linguistic nuance of the word ‘Greek’ in the examples ‘My wife is Greek’ and ‘It’s all Greek to me’. Through natural language processing, computers learn to accurately assign overall linguistic meaning to text excerpts like phrases or sentences. But this isn’t just useful for translation or customer service chatbots: computers can also use it to process spoken commands or even generate audible responses, which can aid communication with blind users, for example. Summarizing long texts, or targeting and extracting specific keywords and information within a large body of text, also requires a deeper understanding of linguistic syntax than computers had previously been able to achieve.

How does natural language processing work?

It doesn’t matter whether it’s processing an automatic translation or a conversation with a chatbot: all natural language processing methods are the same in that they all involve understanding the hierarchies that dictate the interplay between individual words. But this isn’t easy – many words have multiple meanings. ‘Pass’, for example, can mean a physical handover of something, a decision not to partake in something, or a measure of success in an exam or another test format. It also takes the same form as both a verb and a noun. The difference in meaning comes from the words that surround ‘pass’ within the sentence or phrase (I passed the butter/on the opportunity/the exam). These difficulties are the main reason that natural language processing is seen as one of the most complicated topics in computer science. Language is littered with double meanings, so telling them apart requires extensive knowledge of the context in which the different meanings are used. Many users have first-hand experience of failed communication with chatbots, which are increasingly used as replacements for live chat support in customer service. But despite these difficulties, computers are steadily improving their understanding of human language and its intricacies. To help speed this process up, computational linguists draw on the knowledge of various traditional linguistic fields:

  • Morphology is concerned with the internal structure of words and how they are formed and modified
  • Syntax defines how words are put together into sentences
  • Semantics is the study of the meaning of words and groups of words
  • Pragmatics explains the meaning of expressions in their context
  • And lastly, phonology covers the acoustic structure of spoken language and is essential for speech recognition

Part-of-speech tagging (PoS)

The first step in natural language processing draws on morphology: defining the function of each individual word. Most people will be familiar with a simplified form of this process from school, where we’re taught that words can be classified as nouns, verbs, adverbs or adjectives. But determining a word’s function isn’t such a simple task for a computer, because – as we’ve seen with the example of ‘pass’ earlier – the classification of a word can depend on its role in the sentence, and many words serve changing functions.

There are various methods to help resolve the ambiguity of words with multiple functions and meanings. The oldest and most traditional method is based on large tagged text corpora like the Brown Corpus or the British National Corpus. These corpora consist of millions of words of prose that have been manually tagged. Computers can learn rules from the way that different words have been tagged in these texts. For example, the Brown Corpus has regularly been used to teach computers that a word form no longer functions as a verb if an article appears before it.

More recently, modern tagging programs use self-learning algorithms: they automatically derive rules from the text corpora as they read, and use these rules to classify further word functions. One of the best-known examples of a tagging method based on algorithms like this is the Brill tagger, which first assigns each word its most frequently occurring function, and then applies contextual rules to refine the functions of the surrounding words. A rule could be something like: ‘If the first word of the sentence is a proper noun, then the second word is likely to be a verb’. Applying this rule to the sentence ‘Jason bought a book’, the word ‘bought’ can be tagged as a verb.
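The two-stage idea behind the Brill tagger can be sketched in a few lines of Python. This is a minimal illustration, not the real Brill tagger: the mini-lexicon and the single contextual rule are hypothetical stand-ins for what would normally be learned from a tagged corpus.

```python
# Stage 1: the most frequent tag per word, as it would be learned
# from a tagged corpus (hypothetical mini-lexicon for illustration).
LEXICON = {
    "jason": "NOUN", "bought": "VERB", "a": "DET", "book": "NOUN",
    "pass": "VERB",  # 'pass' gets its most common function first
}

# Stage 2: contextual rules of the form (previous tag, word, corrected tag).
RULES = [
    ("DET", "pass", "NOUN"),  # after an article, 'pass' is a noun
]

def tag(sentence):
    words = sentence.lower().split()
    tags = [LEXICON.get(w, "NOUN") for w in words]  # initial guesses
    for i in range(1, len(words)):
        for prev_tag, word, new_tag in RULES:
            if tags[i - 1] == prev_tag and words[i] == word:
                tags[i] = new_tag  # contextual correction
    return list(zip(words, tags))

print(tag("Jason bought a pass"))
# the initial guess VERB for 'pass' is corrected to NOUN by the rule
```

The initial dictionary lookup gets most words right on its own; the contextual rules only fire where the most-frequent tag is wrong, which is exactly the division of labour the Brill tagger uses.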

Parse trees/tree diagrams

In the second step, knowledge derived from syntax is used to understand the structure of sentences. Here, the computational linguistics program uses tree diagrams to break a sentence down into phrases. Examples of phrases are noun phrases, consisting of a proper noun or a noun and an article, or verb phrases, which consist of a verb and a noun phrase.

Dividing a sentence into phrases is known as ‘parsing’, and so the tree diagrams that result from it are known as parse trees. Each language has its own grammar rules, meaning that phrases are put together differently in each one and that the hierarchy of phrases varies. Grammar rules for a given language can be programmed into a computer by hand, or learned by using a text corpus to recognize and understand sentence structure.
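A parse tree for a sentence like ‘Jason bought a book’ can be sketched with a toy grammar. The grammar rules here (S → NP VP, VP → V NP, NP → Name | Det Noun) and the tiny lexicon are hypothetical simplifications chosen for this one sentence shape, not a general parser.

```python
# Toy lexicon mapping words to their categories (hypothetical).
LEXICON = {"jason": "Name", "bought": "V", "a": "Det", "book": "Noun"}

def parse(sentence):
    """Return a nested-tuple parse tree for a 'Name V Det Noun' sentence,
    using the toy grammar S -> NP VP, VP -> V NP, NP -> Name | Det Noun."""
    w = sentence.lower().split()
    tags = [LEXICON[word] for word in w]
    if tags == ["Name", "V", "Det", "Noun"]:
        np_subject = ("NP", ("Name", w[0]))           # noun phrase: proper noun
        np_object = ("NP", ("Det", w[2]), ("Noun", w[3]))  # article + noun
        vp = ("VP", ("V", w[1]), np_object)           # verb phrase: verb + NP
        return ("S", np_subject, vp)
    raise ValueError("sentence not covered by the toy grammar")

tree = parse("Jason bought a book")
print(tree)
```

The nesting of the tuples mirrors the hierarchy of the parse tree: the sentence node S dominates a noun phrase and a verb phrase, and the verb phrase in turn contains the object noun phrase.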

Semantics

The third step in natural language processing takes development into the realm of semantics. Even if a word carries the same tag and syntactical function, it may still have a variety of different possible meanings. This is best illustrated with a simple example:

Chop the carrots on the board.
She’s the chairman of the board.

Anyone with a good enough understanding of English will recognize straight away that the first example refers to a chopping board and the second to a board of directors or similar. But this isn’t so easy for a computer. In fact, it’s rather difficult for a computer to learn when the noun ‘board’ refers to a board used to chop vegetables and other food on, and when it refers to a group of people charged with making important decisions (or any of the many other uses of the noun ‘board’, for that matter).

As a result, computers mostly attempt to define a word by using the words that appear before and after it. This means that a computer can learn that if the word ‘board’ is preceded or followed by the word ‘carrots’, it probably refers to a chopping board, and if it is preceded or followed by the word ‘chairman’, it most likely refers to a board of directors. This learning process relies on text corpora in which every possible meaning of a given word is correctly represented through many different examples.

All in all, natural language processing remains a complicated subject: computers have to process a huge amount of data on individual cases to get to grips with a language, and words with double meanings quickly increase the chance of a computer misinterpreting a given sentence. There’s plenty of room for improvement, most notably in the area of pragmatics, which concerns the context that surrounds a sentence.
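The neighbouring-words idea can be sketched as a tiny disambiguator for ‘board’. The clue-word sets below are hypothetical stand-ins for the co-occurrence statistics a real system would learn from a corpus.

```python
# Hypothetical clue words for two senses of 'board', standing in for
# co-occurrence counts learned from a corpus.
SENSE_CLUES = {
    "chopping board": {"carrots", "chop", "knife", "vegetables"},
    "board of directors": {"chairman", "meeting", "directors", "vote"},
}

def disambiguate(sentence, target="board"):
    """Pick the sense whose clue words overlap most with the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    return max(SENSE_CLUES, key=lambda sense: len(SENSE_CLUES[sense] & words))

print(disambiguate("Chop the carrots on the board"))   # chopping board
print(disambiguate("She's the chairman of the board"))  # board of directors
```

Counting overlapping clue words is a crude form of the context-window approach described above; real systems weight the surrounding words statistically rather than treating every clue equally.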
Most sentences rely on a context that requires a general understanding of the human world and human emotions, which can be difficult to teach a computer. Some of the greatest challenges lie in attempts to understand irony, sarcasm, and humorous metaphors – although attempts have already been made to classify these.

Natural language processing tools

If you’re interested in trying out natural language processing yourself, there are plenty of practical tools and instructions online. Deciding which tool is best suited to your needs depends on which language and which natural language processing methods you want to use. Here’s a rundown of some of the best-known open source tools:

  • The Natural Language Toolkit is a collection of language processing tools in Python. The toolkit offers access to over 100 text corpora in many different languages, including English, Portuguese, Polish, Dutch, Catalan and Basque. It also offers text-processing techniques like part-of-speech tagging, parsing, tokenization (splitting a text into individual words and sentences; a popular preparation step for natural language processing), and stemming (reducing words to their root forms). The Natural Language Toolkit also features an introduction to programming and detailed documentation, making it suitable for students, faculty, and researchers.
  • Stanford NLP Group software: presented by one of the leading research groups in the world of natural language processing, this software offers a variety of functions. It can be used to split text into basic units (tokenization), determine the function of words (part-of-speech tagging), and analyze the structure of sentences (parsing). There are also additional tools for more complex processes like deep learning, which studies the context of sentences. The basic functions are bundled in Stanford CoreNLP. All programs of the Stanford NLP Group are written in Java and available for English, Chinese, German, French, and Spanish.
  • VisualText is a toolkit written in a programming language developed specifically for natural language processing: NLP++. This scripting language was primarily developed to aid deep text analyzers, whose analysis is used to improve a computer’s global understanding (in other words, information about environments and society). VisualText’s main focus is extracting targeted information from huge quantities of text. You could use VisualText to summarize long texts, for example, but also to collect information about particular topics from different web pages and present it as an overview. VisualText is free of charge for non-commercial use.