Natural Language
Patrick Doyle | AI Qual Summary | May 14, 1997
To understand something is to transform it from one representation into another, where this latter representation is chosen to correspond to a set of available actions that could be performed, and for which a mapping is designed so that for each event an appropriate action will be performed.
Natural language is only one medium for human-machine interaction, but has several obvious and desirable properties:
There are several major reasons why natural language understanding is a difficult problem. They include:
Speech act theory offers a view of language in which utterances can be understood as acts rather than representations. In giving a command or making a promise, for example, a speaker is entering into an interaction pattern, playing a certain role, committing both the speaker and the hearer to future actions.
The implications of this view include:
We often use indirect speech acts (e.g., Do you know what time it is? or It's cold in here!)
Searle presents a linguistic theory of indirect speech acts that rests on certain conversational postulates shared by all speakers. These include
Language processing can be divided into two tasks: understanding and generation.
The steps in the process of natural language understanding, each described below, are signal processing, syntactic parsing, semantic analysis, and discourse/pragmatic analysis.
Signal processing is the task of taking a spoken bit of language and turning it into a sequence of words. The problems involved include digitizing the signal and then distinguishing word segments that can be assembled into whole words.
The linguistic elements handled by signal processing are phonemes. Identifying these sounds in real time is relatively easy, since the data rate of auditory signals is low. However, assembling them into words can be difficult, e.g.:
Syntactic parsing determines the structure of the sentence being analyzed, extracting whatever information the word order contains; it is computationally less expensive than semantic processing.
A grammar is a declarative representation that defines the syntactic facts of a language. The most common way to represent grammars is as a set of production rules, and the simplest structure for a parser to build is a parse tree, which records the rules and how they are matched.
Sometimes backtracking is required (e.g., The horse raced past the barn fell), and sometimes multiple interpretations may exist for the beginning of a sentence (e.g., Have the students who missed the exam -- ).
Example: Syntactic processing interprets the difference between "John hit Mary" and "Mary hit John."
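To make this concrete, here is a minimal sketch (in Python; the grammar and lexicon are invented for illustration, not taken from these notes) of a set of production rules and a recursive-descent parser that builds parse trees:

    # Toy production rules: each nonterminal maps to its alternative
    # right-hand sides.  Preterminals map to word sets in the lexicon.
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["Det", "N"], ["Name"]],
        "VP": [["V", "NP"], ["V"]],
    }
    LEXICON = {
        "Det":  {"the", "a"},
        "N":    {"horse", "barn", "exam"},
        "Name": {"john", "mary"},
        "V":    {"hit", "raced", "fell"},
    }

    def parse(symbol, words, i):
        """Yield (parse_tree, next_position) for every way `symbol`
        can derive a prefix of words[i:]."""
        if symbol in LEXICON:                    # preterminal: match one word
            if i < len(words) and words[i] in LEXICON[symbol]:
                yield (symbol, words[i]), i + 1
            return
        for rhs in GRAMMAR[symbol]:              # try each production in turn
            for children, j in expand(rhs, words, i):
                yield (symbol, children), j

    def expand(rhs, words, i):
        """Derive each symbol of a right-hand side in sequence."""
        if not rhs:
            yield [], i
            return
        for tree, j in parse(rhs[0], words, i):
            for rest, k in expand(rhs[1:], words, j):
                yield [tree] + rest, k

    # "John hit Mary" and "Mary hit John" get different trees, since the
    # subject NP precedes the verb and the object NP follows it.
    for tree, end in parse("S", "john hit mary".split(), 0):
        if end == 3:                             # consumed the whole input
            print(tree)

The generators make backtracking implicit: when a production fails downstream, the enumeration simply abandons it and moves on to the next alternative, as in the garden-path example above.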
After (or sometimes in conjunction with) syntactic processing, we must still produce a representation of the meaning of a sentence, based upon the meanings of the words in it. The following steps are usually taken to do this:
Example: Semantic processing determines the differences between such sentences as "The pig is in the pen" and "The ink is in the pen."
To understand most sentences, it is necessary to know the discourse and pragmatic context in which they were uttered. In general, for a program to participate intelligently in a dialog, it must be able to represent its own beliefs about the world, as well as the beliefs of others (and their beliefs about its beliefs, and so on).
The context of goals and plans can be used to aid understanding. Plan recognition has served as the basis for many understanding programs -- PAM is an early example.
Speech acts can be axiomatized just like other operators, except that they require modal operators to describe states of belief, knowledge, and so on.
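For instance, an INFORM act might be axiomatized schematically (after Cohen and Perrault's plan-based account of speech acts; the predicate names here are illustrative, not from these notes):

    INFORM(S, H, P)
        precondition:  BELIEVE(S, P)
        effect:        BELIEVE(H, BELIEVE(S, P))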
Chomsky delineated four types of grammars, numbered 0 to 3: type 0 (unrestricted), type 1 (context-sensitive), type 2 (context-free), and type 3 (regular).
Transformational grammars were introduced by Chomsky in 1957 in Syntactic Structures. An utterance is characterized as the surface manifestation of a "deeper" structure representing the "meaning" of the sentence. The deep structure can undergo a variety of transformations of form (e.g., word order, endings, etc.) on its way up, while retaining its essential meaning.
Transformational grammars have three components. The first is a phrase-structure grammar generating strings of morphemes representing simple, declarative, active sentences, each with an associated phrase marker or derivation tree. The second is a set of transformational rules rearranging these strings and adding or deleting morphemes to form representations of the full variety of legal sentences. Finally, a sequence of morphophonemic rules would map each sentence representation to a string of phonemes.
(The first level strikes me as highly similar to Schank's conceptual dependency representations -- POD)
A simple example: the active sentence "John opened the door" and its passive counterpart "The door was opened by John" share a single deep structure; a passive transformation reorders the constituents and inserts "was" and "by" on the way to the surface form.
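A minimal sketch (in Python; the tree encoding and morphology are simplifications invented for illustration) of such a transformational rule operating on a phrase marker:

    # Phrase markers from the base grammar are encoded as nested tuples.
    # The passive transformation reorders constituents and inserts the
    # auxiliary "was" and the preposition "by", preserving deep structure.
    def passivize(tree):
        _, subject, (_, verb, obj) = tree        # ("S", NP, ("VP", V, NP))
        return ("S", obj, ("VP", ("was", verb + "ed"), ("PP", "by", subject)))

    print(passivize(("S", "John", ("VP", "open", "the door"))))
    # -> ('S', 'the door', ('VP', ('was', 'opened'), ('PP', 'by', 'John')))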
Semantic grammars combine syntactic, semantic, and pragmatic knowledge into a single set of rules in the form of a grammar. This is usually just a context-free grammar in which the choice of nonterminals and production rules is governed by semantic as well as syntactic function. The approach works well for restricted natural-language interfaces.
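For example, a semantic grammar for a toy database front-end might look like this (the domain and rules are invented for illustration); the nonterminals are semantic categories rather than purely syntactic ones:

    SEMANTIC_GRAMMAR = {
        "QUERY":     [["what", "is", "the", "ATTRIBUTE", "of", "SHIP"]],
        "ATTRIBUTE": [["length"], ["speed"], ["home", "port"]],
        "SHIP":      [["the", "SHIPNAME"]],
        "SHIPNAME":  [["kennedy"], ["enterprise"]],
    }
    # Parsing "what is the length of the kennedy" binds ATTRIBUTE=length
    # and SHIP=kennedy, which maps directly onto a database query.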
[Halliday, 1961] Systemic grammars distinguish three general functions of language, all of which are ordinarily served by every act of speech.
The ideational function serves for the expression of content. It says something about the speaker's experience of the world. Analyzing a clause in terms of its ideational function involves asking questions like: What kind of process does the clause describe -- an action, a mental process, or a relation? Who is the actor? Are there other participants in the process?
The interpersonal function relates to the purpose of the utterance. The speaker may be asking a question, answering one, making a request, giving information, or expressing an opinion. The mood system of English grammar expresses these possibilities in terms of categories such as statement, question, command, and exclamation.
Lastly, the textual function reflects the need for coherence in language use. Concepts for use in textual terms include theme (the element the speaker chooses to put at the beginning of a clause) and information (the distinction between what is new in a message and what is assumed to be given).
[Fillmore, 1968] Fillmore characterized the relationships between the verb and noun phrase as "semantically relevant syntactic relationships" and called them cases. The case assignment comes from the deep structure, even though the surface structure is different. (So, for example, the sentence "John opened the door with the key" is semantically equivalent to "The door was opened by John with the key.")
The structure that is built by the parse contains some semantic information, although further interpretation may be necessary. Grammar rules are written to describe syntactic rather than semantic regularities. But the structures the rules produce correspond to semantic relations rather than to strictly syntactic ones. For example, "John opened the door with the key" and "The door was opened by John with the key" yield the same case structure.
Parsing using case grammars is usually expectation-driven. Once the verb of the sentence has been located, it can be used to predict the noun phrases that will occur and to determine the relationship of those phrases to the rest of the sentence.
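A rough sketch (in Python; the case frame, animacy test, and lexicon are invented for illustration) of such expectation-driven case assignment:

    # The verb's case frame supplies the expectations; surface position
    # plus a semantic test (animacy) decides which deep case each noun
    # phrase fills.
    CASE_FRAMES = {"open": ("agent", "object", "instrument")}
    ANIMATE = {"john", "mary"}

    def assign_cases(verb, subject, obj, with_pp=None):
        assert verb in CASE_FRAMES               # expectations come from the verb
        cases = {"object": obj}
        # An inanimate subject of "open" fills the instrument case, so
        # "The key opened the door" and "John opened the door with the
        # key" receive the same deep-case structure.
        head = subject.split()[-1]
        cases["agent" if head in ANIMATE else "instrument"] = subject
        if with_pp:
            cases["instrument"] = with_pp
        return cases

    print(assign_cases("open", "john", "the door", with_pp="the key"))
    # -> {'object': 'the door', 'agent': 'john', 'instrument': 'the key'}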
A wide range of recent approaches to natural language processing can be described in terms of unification grammars. Most such systems share a common basis -- context-free grammars -- and are extended using the concept of unification (or matching) on partially-specified tree structures.
The principal problem with earlier augmented parsing systems (such as ATNs) is that they force certain search strategies; ATNs usually take a top-down approach. This works poorly for other strategies, such as island-driven search centered on words recognized with high confidence. If the search isn't sequential, many of the registers in an ATN will not be set appropriately, and arcs can't be followed as needed.
Restrictions can be placed on augmented systems to remedy this problem. In particular, if registers are not allowed to be reassigned once they have a value, many problems disappear, because a system can then be developed that delays the evaluation of tests until all the necessary registers are set. Instead of treating register assignments like variable assignments in a programming language, unification-based systems use the notion of unification between logical variables.
The most notable advantages of this approach stem from the independence of the grammar from a particular parsing algorithm, so each rule in the grammar can be better isolated as a unit of study and modification.
A simple unification system can be described as follows. It is an extension of a standard CFG in which each rule is annotated with a set of unification equations. Specifically, two register-value structures unify if each of the specified register values is consistent with the same register's value in the other structure. Registers specified in one and not in the other are simply copied.
Example (the structures here are illustrative): the register-value structures

    [cat: NP, agr: [num: sing]]   and   [agr: [num: sing, per: 3]]

unify to produce the structure

    [cat: NP, agr: [num: sing, per: 3]]
All unification equations are of the form structure = structure. The structure can be defined precisely by defining each slot name as a function from its containing constituent to its value; this can be represented with directed, acyclic graphs in which each constituent and value is a node, and the slots are labeled arcs.
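A minimal sketch (in Python; the register names are invented for illustration) of unification over register-value structures represented as nested dictionaries:

    def unify(a, b):
        """Unify two register-value structures.  Shared registers must
        unify recursively; a register present in only one structure is
        simply copied into the result.  Returns None on failure."""
        if isinstance(a, dict) and isinstance(b, dict):
            result = dict(a)
            for reg, bval in b.items():
                if reg in result:
                    merged = unify(result[reg], bval)
                    if merged is None:
                        return None          # inconsistent values
                    result[reg] = merged
                else:
                    result[reg] = bval       # copied unchanged
            return result
        return a if a == b else None         # atomic values must agree

    print(unify({"cat": "NP", "agr": {"num": "sing"}},
                {"agr": {"num": "sing", "per": 3}}))
    # -> {'cat': 'NP', 'agr': {'num': 'sing', 'per': 3}}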
As described above, ATNs (augmented transition networks) are RTNs (recursive transition networks) that allow tests on their arcs, and have registers to record input values or functions of register values during processing. The basic ATN parsing algorithm traverses the network arc by arc, applying each arc's test to the current input and performing its register actions when the test succeeds.
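As a rough illustration of this machinery (a simple left-to-right interpreter, not the full bottom-up algorithm; the states, arcs, tests, and vocabulary below are invented):

    # Each arc is (test, register_action, next_state).  Tests examine the
    # current word; actions record values in registers.
    ATN = {
        "S0": [(lambda w: w in {"john", "mary"},
                lambda regs, w: regs.update(subj=w), "S1")],
        "S1": [(lambda w: w in {"hit", "saw"},
                lambda regs, w: regs.update(verb=w), "S2")],
        "S2": [(lambda w: w in {"john", "mary"},
                lambda regs, w: regs.update(obj=w), "DONE")],
    }

    def run_atn(words):
        state, regs = "S0", {}
        for w in words:
            for test, action, nxt in ATN.get(state, []):
                if test(w):                  # arc test on the current word
                    action(regs, w)          # register assignment
                    state = nxt
                    break
            else:
                return None                  # no arc applies: parse fails
        return regs if state == "DONE" else None

    print(run_atn("john hit mary".split()))
    # -> {'subj': 'john', 'verb': 'hit', 'obj': 'mary'}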
[Schank] Schank developed conceptual dependency (CD) as a representation for the meanings of phrases and sentences. The basic tenet of CD theory is
For any two sentences that are identical in meaning, regardless of language, there should be only one representation of that meaning in CD.
Schank thus accepts the early machine translation concept of an interlingua or intermediate language. The other, more dubious (POD) idea in CD theory is
Any information in a sentence that is implicit must be made explicit in the representation of the meaning of the sentence.
Primitive acts in conceptual dependencies include: PTRANS (physical transfer), ATRANS (abstract transfer, e.g., of possession), PROPEL (physical force), MTRANS (mental transfer, like tell), etc. There are 11 of these primitive acts [1977]. Relations among concepts are called dependencies, of which there are a fixed number (e.g., "R" for the recipient-donor dependency).
Here is the translation of the sentence John gives Mary a book:
                  o           R +---> Mary
    John <===> ATRANS <--- book <---|
                                    +---> John
In the CD, "o" means that the book is the object of the ATRANS, and the three-pointed "R" relation is a recipient-donor dependency between Mary, John, and the book.
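As a data structure, a conceptualization like this one might be sketched as follows (in Python; the slot layout is an assumption for illustration, not Schank's notation):

    from dataclasses import dataclass

    @dataclass
    class Conceptualization:
        actor: str
        act: str                    # one of the ~11 primitive acts
        object: str = None
        recipient: str = None       # the "R" dependency: to ...
        donor: str = None           # ... from

    give = Conceptualization(actor="John", act="ATRANS", object="book",
                             recipient="Mary", donor="John")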
Conceptual dependencies also model primitive states, which are integer-valued variables such as HEALTH or MENTAL STATE. States and actions can be combined, so for example John told Mary that Bill was happy can be represented, roughly, as an MTRANS by John to Mary of the conceptualization that Bill's MENTAL STATE has a positive value.
In CDs, syntactic and semantic knowledge are combined into a single
interpretation system that is driven by semantic knowledge. Syntactic
parsing is subordinated to semantic interpretation, which is usually
used to set up strong expectations for particular sentence structures.
Parsing is heavily driven by a set of expectations that are set up on
the basis of the sentence's main verb. Because the representation of
a verb in CD is at a lower level than that of a verb in a case
grammar, CD usually provides a greater degree of predictive power.
Simply using bilingual dictionaries doesn't solve the machine translation problem; there are other troubles as well.
A proposed approach for translating between arbitrary languages was
to go through an intermediate "universal language" or
interlingua. However, without commonsense world knowledge,
certain kinds of translations are impossible. Bar-Hillel gives the example of "The pen is in the box" as opposed to "The box is in the pen." No translator without world knowledge could translate these sentences correctly, since choosing between the senses of "pen" (writing instrument vs. enclosure) requires knowing the typical sizes of the objects involved.
Since speech is our most natural form of communication, using spoken language to access computers has long been an important research goal. Research in the 1960s focused on isolated word recognition; speech understanding systems came later.
A physical problem with such systems is noise. Microphone and background noise can cause interference with the recording of the spoken utterance. Another problem is that the speaker doesn't always pronounce the same word in the same way. Sometimes whole syllables are dropped or "swallowed" at word boundaries. In fact, finding the boundaries between words in a connected-speech signal is one of the difficult problems in speech understanding.
With increased information available to the hearer beyond the acoustic signal (such as context), expectations about the content can be formed.
In the early 1970's, the Advanced Research Projects Agency of the
U.S. Department of Defense funded a five-year program in speech
understanding research (the SUR project). The systems to be
designed were to accept normally spoken sentences in a constrained
domain with a 1,000 word vocabulary, and were to respond in reasonable
time with less than 10% error. (This was one of the few times that AI
programs had any design objectives specified before development.)
[Kaplan, 1973] GSP (the General Syntactic Processor) is a versatile system for the parsing and generation of natural-language strings. GSP can directly emulate several other syntactic processors, including Woods' ATN grammar. It is not in itself an approach to language processing, but a system in which various approaches can be described. The basic idea is a chart that represents both the grammar and the input sentence as a modified tree (modified by making it binary and interchanging nodes and arcs). Chart manipulation mechanisms operate on the grammar itself.
[Lindsay, 1963] SAD-SAM (Syntactic Appraiser and Diagrammer -- Semantic Analyzing Machine) was programmed by Robert Lindsay in 1963 at CMU. It used a Basic English vocabulary (1,700 words) and followed a context-free grammar. It parsed input from left to right, built derivation trees, and passed them to SAM, which extracted the semantically relevant information to build family trees and find answers to questions.
[Bert Green, 1963] BASEBALL was an information retrieval program with a large database of facts about all American League games over a given year. It accepted input questions from the user, limited to one clause with no logical connectives.
[Bertram Raphael, 1968]
SIR (Semantic Information Retrieval) was a prototype "understanding" machine, since it could accumulate facts and then make deductions about them in order to answer questions.
[Daniel Bobrow, 1968] STUDENT was a pattern-matching natural language program written by Bobrow as his doctoral thesis work at MIT. It could solve high-school level algebra story problems.
[Weizenbaum, 1966] The most famous pattern-matching natural language program, ELIZA was built at MIT in 1966. The program assumes the role of a Rogerian, or "nondirective," therapist in its dialog with the user.
It operated by matching the left sides of its rules against the user's last sentence, and using the appropriate right side to generate a response. Rules were indexed by keywords so only a few had to be matched against a particular sentence. Some rules had no left side, so they could apply anywhere with replies like "Tell me more about that." Note that these rules are "approximate" matchers. This accounts for ELIZA's major strength, its ability to say something reasonable most of the time, as well as its major weakness, the superficiality of its understanding and its ability to be led completely astray.
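The flavor of these rules can be sketched in a few lines (in Python; the patterns and replies are invented, not Weizenbaum's):

    import re

    # Rules are tried in order; the catch-all at the end has, in effect,
    # no left side, so it applies to any input.
    RULES = [
        (re.compile(r"\bi am (.*)", re.I), "Why do you say you are {0}?"),
        (re.compile(r"\bmy (\w+)", re.I),  "Tell me more about your {0}."),
        (re.compile(r""),                  "Tell me more about that."),
    ]

    def respond(sentence):
        for pattern, reply in RULES:
            m = pattern.search(sentence)
            if m:
                return reply.format(*m.groups())

    print(respond("I am unhappy"))      # -> Why do you say you are unhappy?
    print(respond("The sky is blue"))   # -> Tell me more about that.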
[William Woods, 1973] LUNAR answered questions about the rock samples brought back from the moon using two databases -- the chemical analyses and the literature references. Specifically, it helped geologists access, compare, and evaluate chemical analysis data on moon rocks and soil composition obtained from the Apollo-11 mission. It operated by translating a question entered in English into an expression in a formal query language. The translation was done with an ATN parser coupled with a rule-driven semantic interpretation procedure.
[Winograd, 1972] SHRDLU carried on a dialog with a user in which the system simulated a robot manipulating a set of simple objects on a tabletop. Knowledge was represented as procedures within the system.
The design of the system was based on the belief that, to understand language, a program must deal in an integrated way with syntax, semantics, and reasoning. The basic viewpoint guiding its implementation was that meanings (of words, phrases, and sentences) can be embodied in procedural structures and that language is a way of activating appropriate procedures in the hearer.
The implemented SHRDLU system consisted of four basic elements: a parser, a recognition grammar for English, programs for semantic analysis (to change a sentence into a sequence of commands to the robot or into a query of its database), and a problem solver (which knows how to accomplish tasks in the blocks world).
SHRDLU's grammar is based on the notion of systemic grammar, a system of choice networks that specifies the features of a syntactic unit, how the unit functions, and how it influences other units.
By functionally integrating its knowledge of syntax and semantics, SHRDLU can avoid exploring alternative choices in an ambiguous situation. The parser is oriented toward making an original correct choice rather than performing exhaustive backtracking.
SHRDLU combined a sophisticated syntax analysis with a fairly general deductive system, operating in a "world" with visible analogs of perception and action. It was "comprehensive" in the sense that it dealt with all the aspects of language comprehension. Further naturalness came from the substantial body of programs dealing with linguistic phenomena of conversation such as context, pronouns, substitute nouns, and ellipsis. Ideas incorporated included the procedural embedding of meaning and systemic grammar, both described above.
[Schank, 1973] MARGIE (Memory, Analysis, Response Generation, and Inference on English) was developed at Stanford in 1973. It provided an intuitive model of the process of natural language understanding; see Conceptual Parsing above.
MARGIE consisted of three components. The first, the conceptual analyzer, converted English sentences into a CD representation using production-like rules called "requests." The middle phase was an inference system that accepted CD propositions and deduced facts from them, given the current system memory. Inference knowledge was represented in a semantic net. (So, for example, if told John hit Mary the system might infer Mary might get hurt.)
The final component was a text-generation module that took internal CD representations and converted them into English-like output.
[Schank] SAM (Script Applier Mechanism) makes use of frame-like data structures called scripts, which represent stereotyped sequences of events, to understand simple stories. Prototype frames make it possible to use expectations about the usual properties of known concepts and about what typically happens in a variety of familiar situations to help understand sentences about those objects and situations.
[Wilensky, 1978] PAM (Plan Applier Mechanism) understands stories by determining the goals that are to be achieved in the story and attempting to match the actions of the story with the methods that it knows will achieve the goals.
Plans are the means by which goals are accomplished; understanding a plan-based story involves discerning the goals of the main actor and the methods by which the actor chooses to fulfill those goals.
PAM tries to determine the main goal and the D-goals that will satisfy it. It analyzes input conceptualizations for their potential realization of one of the plan boxes called by one of the determined D-goals. PAM uses two kinds of knowledge structures in understanding goals: named plans and themes (such as LOVE), which contain background information from which it can be predicted that individuals will have certain goals.
The distinction between plan-based and script-based stories is simple: in a script-based story, parts or all of the story correspond to one or more scripts available to the story understander; in a plan-based story, the understander must discern the goals of the main actor and the actions that accomplish those goals.
[Hendrix, 1977] LIFER, built at SRI, is an off-the-shelf system for building "natural language front-ends" for applications in any domain. It has a set of interactive functions for specifying a language and parser.
Speech understanding for voice chess [CMU, 1976]. HEARSAY uses a blackboard architecture with knowledge sources posting constraints. HEARSAY-II understood spoken queries about computer science abstracts stored in a database. HEARSAY-III is a general blackboard architecture.
The HEARSAY project was meant to overcome the limitations of syntax-directed methods of parsing from left to right.
HEARSAY uses three knowledge sources: acoustics and phonetics, syntax of legal utterances, and semantics of the domain. The search space was constrained by using expectations about likely utterances.
The blackboard was divided into a number of levels corresponding to a hierarchical breakdown of speech into units such as segments, syllables, words, and phrases. Hypotheses about these sentence units were posted to the appropriate levels, along with a time frame that indicated when the unit was hypothesized to occur in the utterance.
Based on the idea of cooperating, independent knowledge sources, hypotheses were posted to a global blackboard. This is a modular architecture; knowledge sources don't address each other directly, thus allowing different combinations of knowledge sources and control strategies to be tried.
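A minimal sketch of the posting mechanism (in Python; the levels, hypotheses, and scoring are invented for illustration):

    from collections import defaultdict

    # level name -> list of (start, end, hypothesis, score) tuples
    blackboard = defaultdict(list)

    def syllable_ks(bb):
        """A knowledge source: combine time-adjacent segment hypotheses
        into syllable hypotheses.  It reads and writes the blackboard
        but never calls another knowledge source directly."""
        for t0, t1, seg_a, s_a in bb["segment"]:
            for u0, u1, seg_b, s_b in bb["segment"]:
                if t1 == u0:
                    bb["syllable"].append((t0, u1, seg_a + seg_b,
                                           (s_a + s_b) / 2))

    blackboard["segment"] += [(0, 1, "sh", 0.9), (1, 2, "ow", 0.8)]
    syllable_ks(blackboard)
    print(blackboard["syllable"])   # -> [(0, 2, 'show', 0.85...)]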
Speech understanding for document retrieval [CMU, 1980]. Uses a precompiled network knowledge structure. The net is searched using beam search.
The HARPY system was developed at CMU after extensive evaluation of HEARSAY-I and DRAGON. HARPY uses a single precompiled network knowledge structure. The network contains knowledge at all levels: acoustic, phonemic, lexical, syntactic, and semantic. The network contains all possible legal sentences in an efficient representation (though it also represents nonsense sentences as a result).
HARPY does a beam search so it never needs to backtrack, and only extends paths in the beam as long as they retain sufficiently high scores. HARPY was the most accurate and efficient performer of all the ARPA SUR systems.
The system is very sensitive to missing or poorly labeled segments, since the network is composed of a sequential phonetic representation. Its design cannot easily accommodate the pragmatics of the utterance, which other systems use to constrain search.
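The search itself can be sketched as follows (in Python; the network and scoring function are invented for illustration):

    import heapq

    def beam_search(network, start, score, n_steps, beam_width=3):
        """Extend partial paths through a precompiled transition network,
        keeping only the `beam_width` best at each step -- so, as in
        HARPY, no backtracking is ever needed."""
        beam = [(0.0, [start])]                  # (cumulative score, path)
        for _ in range(n_steps):
            candidates = [(s + score(path[-1], nxt), path + [nxt])
                          for s, path in beam
                          for nxt in network.get(path[-1], [])]
            if not candidates:
                break
            beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0])

    net = {"start": ["w1", "w2"], "w1": ["end"], "w2": ["end"]}
    print(beam_search(net, "start", lambda a, b: 0.5, 2))
    # -> (1.0, ['start', 'w1', 'end'])  (ties broken arbitrarily)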
Hear What I Mean [BBN, 1976]. Answers questions in the role of a travel agent, using a database of facts about trips and expenses.
This was the most demanding task of any of the ARPA SUR domains. All processing passed through the control strategy component which used knowledge sources as subroutines (in contrast to HEARSAY which has autonomous knowledge sources).
HWIM's organization permits more direct experimentation with control strategies -- manipulating, for example, how many word islands are generated, direction of expansion, and interaction of system components.
In this joint system, SDC built the bottom end (speech processing) and SRI the top end (language understanding). The task was to have the computer act as a consultant in a maintenance workstation setting.
Important features included concern with the nature of real
man-machine discourse, a language definition system for specifying the
input language to be understood, techniques for focusing the system's
attention on certain aspects of the dialog, top-down process control
stressing phrase-level hypothesizing, knowledge representation using
partitioned semantic networks, and experimental evaluation of system
design parameters.
Ronny Kohavi, AI Qual Note #16: Natural Language.
Feigenbaum and Cohen, Handbook of Artificial Intelligence, Vol. I, Chapters 4-5; Vol. IV, Chapter 19.
Ginsberg, Essentials of Artificial Intelligence (?), Chapter 17.
See the AI Qual Reading List for further details on these sources.
Patrick Doyle | pdoyle@cs.stanford.edu