Research Note 12-1
Stemming and NLP
August 6, 2003
We want to give some background on the processes that are going on.
One of the stemming methods in previous work was to remove letters from a word until a different word was created, that word being in the "word database". In this case, the database supplies a type of "ground truth" as to what a "word" is. The Porter stemmer, by contrast, removes letters based on patterns of strings of consonants and strings of consecutive vowels. These patterns are statistically suggestive of rule patterns that end up stemming well about 80% of the time. By using domain-specific word databases (key-less hash tables) and allowing these word databases to be edited by hand by the user community, one can bring this measure of correctness to near perfect.
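The database-driven method described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the earlier implementation: the function name `truncation_stem`, the `min_len` cutoff, and the sample entries in `WORD_DB` are all hypothetical, and a Python set stands in for the key-less hash table.

```python
def truncation_stem(word, word_db, min_len=3):
    """Strip trailing letters until a shorter entry of word_db is found.

    word_db is any container supporting `in` (a set here, standing in
    for the key-less hash table). min_len is an assumed guard that keeps
    the search from collapsing words to one or two letters.
    """
    w = word.lower()
    # Try progressively shorter prefixes: w[:-1], w[:-2], ...
    for end in range(len(w) - 1, min_len - 1, -1):
        candidate = w[:end]
        if candidate in word_db:
            return candidate
    # No shorter database word was found; leave the word unchanged.
    return w


# Hypothetical domain word database.
WORD_DB = {"connect", "stem", "parse", "relate"}

for token in ["connected", "stemming", "parsing"]:
    print(token, "->", truncation_stem(token, WORD_DB))
```

With these sample entries, "connected" and "stemming" reduce to database words, while "parsing" is left alone because no prefix of it is in the database ("parse" is not a prefix of "parsing"). Cases like this are where hand editing of the database by the user community, or a rule-based stemmer, would have to take over.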
The word database becomes a community monitored "controlled vocabulary" when the user community is given control over this resource.
As we add the stemmer, we might also add an NLP parser, if we can show that this would be fairly easy. Using the NLP++ language, Satish and Nathan will be able to demonstrate this ease of development and use.
Nathan is currently working for me at 12 hours per week on programming support. Satish will be back August 11th. Amnon Meyers, the developer of NLP++, is a friend who is willing to spend some of his time and provide a development license at no cost. We also have the (no cost) advice and support from Dr. Bill Rose and Dr. Bruce Lund (now at In-Q-Tel) who were doctoral students in computational linguistics while I was the Director of the Neural Network Research Facility at Georgetown University in the early 1990s.
So we will get the technical details in line with what has been learned by the machine translation and computational linguistics community over the past several decades. A linguist at Mitre and several ontologists, including John Sowa, are also advising us (privately and in confidence).
This first-rate advice is a gift.
Two other gifts are available. The first is a potential business relationship (teaming agreement) involving SchemaServer, the enterprise software from SchemaLogic for mediating the terminology reconciliation process. The second is the general rule formation technology from ClearForest. This rule-forming technology is well developed and well accepted in the intelligence community and by some major news organizations such as ABC. The rules are executed on the front end as a type of web harvester.
The purpose of both the stemmer and the NLP parser would be to produce a list of significant words and phrases that the domain expert could edit.
Then when the rollup takes place, this list would serve as a stemming list.
The advantages of this are numerous. First, the user community is able to form re-usable, domain-specific "controlled vocabularies" that govern a high percentage of the stemming and phrase identification. The Porter stemmer would then be used only when a word not in the list is found during real-time processing.
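The lookup order described here, controlled vocabulary first and rule-based stemmer only as a fallback, might look like the following sketch. All names are hypothetical. In particular, `crude_suffix_strip` is only a placeholder marking where the Porter stemmer would be called; it is not the Porter algorithm.

```python
def make_stemmer(controlled_vocab, fallback):
    """Return a stem function that consults the controlled vocabulary first.

    controlled_vocab maps a surface form to the stem chosen by the domain
    experts; fallback is any callable used for words not in the list.
    """
    def stem(token):
        key = token.lower()
        if key in controlled_vocab:
            return controlled_vocab[key]
        return fallback(key)
    return stem


def crude_suffix_strip(word):
    """Placeholder for a rule-based stemmer. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-edited controlled vocabulary.
VOCAB = {"ontologies": "ontology", "parsing": "parse", "parsed": "parse"}

stem = make_stemmer(VOCAB, crude_suffix_strip)
for token in ["Parsing", "ontologies", "stemmed", "rollup"]:
    print(token, "->", stem(token))
```

Passing the fallback in as a parameter keeps the vocabulary lookup independent of whichever rule-based stemmer is eventually used, so a real Porter implementation could be substituted without touching the lookup code.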
The other significant advantage would be the introduction of some NLP passes. Satish is supposed to spend September and October developing a mapping between NLP parsing and OWL ontology constructions. Only part of this work could be introduced within an NdCore deliverable in October, but the hooks for all of it could be developed, and then an additional contract could be made with the client to deliver an NLP++ front end and a controlled vocabulary management interface.
The NLP++ language is available to us for demonstration purposes. A deployment would require an NLP++ license from Amnon, but this is a reasonable path towards providing a flexible preprocessor. Moreover, NLP++ is a set of general-purpose linguistic tools and can handle Arabic language structure. Stemming in Arabic is quite different from stemming in English.
The relationship between a controlled vocabulary and an RDF+OIL = OWL ontology is fairly natural, but will require about a month of engineering. My sense is that we are internally funded to demonstrate how OWL, NLP++ and stemming would work in the experimental system. Our Statement of Work requires me to complete this R&D by mid-September.
We have talked about getting the client to fund several months, say November and December, so that the experimental work can be moved into NdCore 2.x by the end of the year.
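One possible reading of this relationship, offered only as a sketch and not as the mapping that will actually be engineered, is to treat each stem in the controlled vocabulary as an OWL class and to attach its surface forms as labels. The function name, the `http://example.org/vocab#` namespace, and the choice of `rdfs:label` are all assumptions made for illustration.

```python
def vocab_to_turtle(controlled_vocab, base="http://example.org/vocab#"):
    """Emit a Turtle document with one owl:Class per stem.

    Assumes stems are plain ASCII words (valid as Turtle local names)
    and that surface forms contain no double quotes.
    """
    # Group surface forms under the stem they map to.
    groups = {}
    for surface, stem in controlled_vocab.items():
        groups.setdefault(stem, set()).add(surface)

    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix ex: <{base}> .",
        "",
    ]
    for stem in sorted(groups):
        # The stem itself is included among the labels.
        labels = " , ".join(f'"{s}"' for s in sorted(groups[stem] | {stem}))
        lines.append(f"ex:{stem} a owl:Class ;")
        lines.append(f"    rdfs:label {labels} .")
    return "\n".join(lines)


print(vocab_to_turtle({"parsing": "parse", "parsed": "parse"}))
```

Relations between terms (broader, narrower, related) are where most of the engineering effort would presumably go; this sketch covers only the term-to-class step.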
The core issue is the user interface design: one that allows the user community to build and use, differentially, these auxiliary services. The Entrieva/Semio interface is the best I have seen for this type of "ontology and linguistic services". Claude Vogel developed the Semio theory in the 1990s. I had some opportunity to evaluate this work under a contract to the Office of the Secretary of Defense (Harriet Riofrio) in 1998-99. I also have a good working relationship with the current owners of the Semio patents, Mr. Tom Lewis, CEO of Entrieva. Entrieva is doing an excellent job of breaking into some commercial markets.
Bringing knowledge technologies into the market is what OntologyStream does. Sharing knowledge among the companies developing these "emerging" knowledge technologies benefits everyone.