Text Extraction using Natural Language Processing

A few months ago I was asked to look into an email processing problem. We needed to extract event-related information from consumer-originated email. As a traditional programmer, my first instinct was to think in terms of regular expressions and lookup tables! Experience quickly tempered that thought and I decided to look at Natural Language Processing (NLP).

There were several standard methodologies in place for natural language processing tasks, and quite a few open source tools were available. The jargon was daunting: corpora, entities, gazetteers, POS tags, transducers, and JAPE were just a few of the terms that I had to wade through. The thought of the alternative, debugging code with zillions of unreadable regular expressions, kept me going!

I downloaded GATE and was able to quickly build a prototype that parsed emails to get to our target data. GATE breaks down the task of processing text into small specialized chunks of work strung together in a “pipeline”. Each task works by adding annotations to the text (these can later be exported as XML) or by enhancing or using the annotations added by a previous task. It is a simple and beautiful architecture that lives up to its acronym: General Architecture for Text Engineering.

Each task is called a Processing Resource (PR) in GATE. You can choose from a host of preinstalled resources, or find and install PRs from the internet or just go ahead and write your own. Let us look at a simple GATE pipeline for text processing.
The first PR in the pipeline is a tokenizer: it takes the email text and converts it into a series of tokens such as numbers, upper-case strings, spaces and punctuation. The second PR splits the text into sentences based on the space and punctuation tokens.
We then have a part-of-speech (POS) tagger: it uses the structure of each sentence to label every word as a noun, verb, adjective, pronoun, etc.
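
The idea of stages that each add annotations, or build on annotations added earlier, can be sketched in a few lines of plain Java. This is only a toy model under my own assumptions: the class and field names (ToyPipeline, Doc, Stage and so on) are hypothetical and are not the GATE API, and the tokenizing rules are deliberately crude.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of an annotation pipeline. Hypothetical names; this is NOT the GATE API. */
class ToyPipeline {

    /** A typed span over the document text. */
    static final class Annotation {
        final String type;
        final int start;
        final int end;

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** The text plus every annotation added so far. */
    static final class Doc {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        /** Returns the text covered by each annotation of the given type. */
        List<String> spans(String type) {
            List<String> out = new ArrayList<>();
            for (Annotation a : annotations) {
                if (a.type.equals(type)) {
                    out.add(text.substring(a.start, a.end));
                }
            }
            return out;
        }
    }

    /** One stage of the pipeline (the role a Processing Resource plays). */
    interface Stage {
        void process(Doc doc);
    }

    /** Stage 1: mark words, numbers and punctuation as tokens. */
    static final Stage TOKENIZER = doc -> {
        Matcher m = Pattern.compile("[A-Za-z]+|\\d+|[^\\sA-Za-z\\d]").matcher(doc.text);
        while (m.find()) {
            char first = m.group().charAt(0);
            String kind = Character.isDigit(first) ? "Number"
                    : Character.isLetter(first) ? "Word" : "Punctuation";
            doc.annotations.add(new Annotation(kind, m.start(), m.end()));
        }
    };

    /** Stage 2: reuse the Punctuation annotations from stage 1 to find sentence ends. */
    static final Stage SENTENCE_SPLITTER = doc -> {
        int sentenceStart = 0;
        // Iterate over a copy because this stage adds annotations as it goes.
        for (Annotation a : new ArrayList<>(doc.annotations)) {
            if (!a.type.equals("Punctuation")) {
                continue;
            }
            char c = doc.text.charAt(a.start);
            if (c == '.' || c == '!' || c == '?') {
                doc.annotations.add(new Annotation("Sentence", sentenceStart, a.end));
                sentenceStart = a.end;
                // Skip the whitespace between sentences.
                while (sentenceStart < doc.text.length()
                        && Character.isWhitespace(doc.text.charAt(sentenceStart))) {
                    sentenceStart++;
                }
            }
        }
    };

    /** Runs the stages in order over a fresh document. */
    static Doc run(String text, Stage... stages) {
        Doc doc = new Doc(text);
        for (Stage stage : stages) {
            stage.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = run("The party is on 12 May. Bring cake!", TOKENIZER, SENTENCE_SPLITTER);
        System.out.println(doc.spans("Number"));   // [12]
        System.out.println(doc.spans("Sentence")); // [The party is on 12 May., Bring cake!]
    }
}
```

Note that the second stage never looks at the raw characters to find candidates; it only consults what the first stage left behind, which is the property that makes the stages easy to swap or reorder.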

A gazetteer is another useful Processing Resource: it marks any text that matches your lookup tables. Take a list of colleges, for example. If one of these colleges appears in the text, it gets annotated as a College.
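
As a rough illustration of what a gazetteer does, here is a minimal Java sketch that scans text for entries from a lookup list. The class name ToyGazetteer and the college names are made up for the example; a real gazetteer PR is far more capable (this one does not even check word boundaries).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy gazetteer: finds entries of a lookup list in free text. NOT the GATE API. */
class ToyGazetteer {

    /** One match of a list entry, labelled with the type of the list. */
    static final class Lookup {
        final String type;
        final int start;
        final int end;

        Lookup(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    private final String type;
    private final List<String> entries;

    ToyGazetteer(String type, List<String> entries) {
        this.type = type;
        this.entries = entries;
    }

    /** Returns a Lookup for every case-insensitive occurrence of every entry. */
    List<Lookup> annotate(String text) {
        List<Lookup> found = new ArrayList<>();
        for (String entry : entries) {
            if (entry.isEmpty()) {
                continue; // an empty entry would match everywhere
            }
            for (int i = 0; i + entry.length() <= text.length(); i++) {
                // regionMatches compares in place, so offsets stay valid for the original text.
                if (text.regionMatches(true, i, entry, 0, entry.length())) {
                    found.add(new Lookup(type, i, i + entry.length()));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical lookup table.
        ToyGazetteer colleges = new ToyGazetteer("College",
                Arrays.asList("Riverdale College", "Northfield Institute"));
        String text = "I graduated from riverdale college in 2009.";
        for (Lookup hit : colleges.annotate(text)) {
            System.out.println(hit.type + ": " + text.substring(hit.start, hit.end));
        }
    }
}
```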

We are almost there! The last stage is the scary-sounding JAPE transducer. This is nothing but a way of defining regular expressions over the GATE annotations using a rule-based language. But didn’t we switch to NLP to avoid regular expressions?
JAPE is a very different beast from standard regular expressions:

– It works on the annotations added by the pipeline, which capture grammar and lookups, instead of on raw text strings.
– JAPE rules are applied in a declarative manner. Regular expressions are applied sequentially, and on many occasions the order in which they are applied affects the result.
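
To make the first point concrete, here is a small Java sketch of a rule that is matched against a sequence of annotation types rather than against characters. It is not JAPE and does not use JAPE syntax; the class ToyAnnotationRule and the annotation types in the example are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy rule matcher over annotations instead of raw characters. NOT JAPE. */
class ToyAnnotationRule {

    /** An annotation reduced to its type and the text it covers. */
    static final class Annotation {
        final String type;
        final String text;

        Annotation(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Returns the covered text of every run of annotations whose types equal the pattern. */
    static List<String> match(List<Annotation> annotations, List<String> pattern) {
        List<String> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits;
        }
        for (int i = 0; i + pattern.size() <= annotations.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < pattern.size() && matches; j++) {
                matches = annotations.get(i + j).type.equals(pattern.get(j));
            }
            if (matches) {
                StringBuilder covered = new StringBuilder();
                for (int j = 0; j < pattern.size(); j++) {
                    if (j > 0) {
                        covered.append(' ');
                    }
                    covered.append(annotations.get(i + j).text);
                }
                hits.add(covered.toString());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Annotations as earlier pipeline stages might have left them (made-up data).
        List<Annotation> annotations = Arrays.asList(
                new Annotation("Word", "Reunion"),
                new Annotation("Word", "at"),
                new Annotation("College", "Riverdale College"),
                new Annotation("Word", "on"),
                new Annotation("Date", "12 May"));
        // The rule is data: "a College, then one Word, then a Date".
        System.out.println(match(annotations, Arrays.asList("College", "Word", "Date")));
    }
}
```

The rule never mentions a specific college name or date format; those details were settled by earlier stages, which is what keeps rules of this kind short.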

JAPE is a bit difficult to understand; however, the accuracy, stability and maintainability offered by the GATE pipeline are far better than those of traditional programming approaches.

There are several features of NLP that make it an art rather than a science. For each type of processing task there are several different PRs that you can choose from. For example, we found that people use a lot of abbreviations in email and regularly leave out full stops at the end of sentences. A standard sentence splitter fails in such cases. We turned to the RegEx sentence splitter, where we were able to enhance the logic by defining our own regular expressions to detect or ignore such cases.

In addition, the order of tasks in the pipeline can make a big difference to the accuracy. Moving the gazetteer up the chain and using its annotations in sentence splitting helps resolve problems where the PR might split a sentence at an abbreviation such as U.S.A. (in a usage like “U.S.A. Today”, the full stop after the A and the space following it look like a sentence break).
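
The abbreviation problem can be sketched with a small regular-expression splitter in Java. This is an illustration under my own assumptions, not the logic of GATE's RegEx sentence splitter: the class name is hypothetical, and the hard-coded abbreviation set stands in for the lookups a gazetteer placed earlier in the pipeline would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy sentence splitter that refuses to split after known abbreviations. */
class ToySentenceSplitter {

    // Hypothetical list; in a pipeline this role could be played by gazetteer annotations.
    static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("U.S.A.", "e.g.", "Dr."));

    // Candidate boundary: '.', '!' or '?' followed by whitespace.
    private static final Pattern BOUNDARY = Pattern.compile("[.!?]\\s+");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            int end = m.start() + 1; // keep the punctuation mark with the sentence
            // The space-delimited word that ends at the candidate boundary.
            int wordStart = Math.max(start, text.lastIndexOf(' ', m.start()) + 1);
            if (ABBREVIATIONS.contains(text.substring(wordStart, end))) {
                continue; // a full stop inside an abbreviation is not a sentence end
            }
            sentences.add(text.substring(start, end));
            start = m.end();
        }
        // Emails often omit the final full stop, so keep whatever is left over.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("I read U.S.A. Today daily. See you soon"));
        // [I read U.S.A. Today daily., See you soon]
    }
}
```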

The Java interface to GATE is simple. Once you are happy with the pipeline:

– Save it as a .gapp file from the GATE IDE.
– Load the .gapp file (in Java) and load the documents to process into a collection (the “corpus”).
– Execute the pipeline.

For each document you get an annotated XML file which you parse using a standard XML parser to look for the tags your application is interested in.
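
Reading the tags back needs nothing beyond the XML parser that ships with Java. The sketch below uses the standard DOM API; the element names College and Date, the wrapping email element and the sample text are assumptions for illustration, since the real tag names depend on the annotation types your pipeline produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Pulls the text of chosen tags out of an annotated XML document. */
class AnnotatedXmlReader {

    /** Returns the text content of every element with the given tag name. */
    static List<String> textOf(String xml, String tag) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = dom.getElementsByTagName(tag);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed-XML errors all end up here.
            throw new IllegalArgumentException("Could not parse annotated XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up output of a pipeline run; real tag names depend on your annotations.
        String xml = "<email>Reunion at <College>Riverdale College</College>"
                + " on <Date>12 May</Date>.</email>";
        System.out.println(textOf(xml, "College")); // [Riverdale College]
        System.out.println(textOf(xml, "Date"));    // [12 May]
    }
}
```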

A major complexity that I have avoided discussing until now is performance. Look out for the next post to learn more!

Common Myth Regarding ViewState in ASP.NET

Through this article I want to dispel a very common misconception about ViewState. Most ASP.NET developers think that ViewState is responsible for holding the values of controls such as TextBoxes so that they are retained after a postback.

But that is not the case!

Let’s take an example to understand the above:

Place a web server TextBox control (tbControl) and a web server Label control (lblControl) on a page.

Set the “Text” property of the label and the textbox to “Initial Label Text” and “Initial TextBox Text” respectively, and set the “EnableViewState” property of both controls to false.

Place two button controls and set their text to “Change Label Text” and “Post to Server”. The first button changes the label’s text in its click event handler; the second button only causes a postback.

private void btnChangeLabel_Click(object sender, System.EventArgs e)
{
    lblControl.Text = "Label's Text Changed";
}

On running this application, you see the initial text you set in each control.

Now, change the text in the TextBox to “Changed TextBox Text”.

Now click the “Post to Server” button. The textbox retains its value, in spite of its EnableViewState property being set to false.

The reason for this behavior is that ViewState is not responsible for storing the modified values of controls such as TextBox, DropDownList and CheckBoxList, that is, controls which implement the IPostBackDataHandler interface.

After Page_Init(), there is a stage known as LoadViewState, in which the Page class loads values from the hidden __VIEWSTATE form field for those controls (e.g., Label) whose ViewState is enabled.

Then the postback data is loaded: the Page class calls LoadPostData on each control that implements the IPostBackDataHandler interface (e.g., TextBox), so that the control can read its value from the form data posted in the HTTP request.

Now click the “Change Label Text” button, which changes the label text programmatically (through the event handler shown earlier), and then click “Post to Server”. The page reloads and the programmatic change is lost: the label text reverts to its initial value, “Initial Label Text”.

This is because the Label control does not implement the IPostBackDataHandler interface, so ViewState is responsible for persisting its value across postbacks.

Since ViewState has been disabled for the Label, the value set by the “Change Label Text” button is lost on the next postback.

Now enable ViewState for the Label control and repeat the steps: the modified value (“Label’s Text Changed”) is retained after the postback.
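
The order of the stages described above can be imitated with a toy model. The sketch below is plain Java, not ASP.NET code, and every name in it (ToyPostback, Request, handle) is hypothetical; it only mimics the sequence "initialise, load view state, load posted form data" for the two controls in this walkthrough.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one postback. It imitates the order of the stages, not ASP.NET itself. */
class ToyPostback {

    /** What the browser sends back: posted form fields plus the hidden view-state field. */
    static final class Request {
        final Map<String, String> form = new HashMap<>();
        // Left empty when view state is disabled, because nothing was saved.
        final Map<String, String> viewState = new HashMap<>();
    }

    /** Rebuilds the two controls for a postback and returns their final text. */
    static Map<String, String> handle(Request request) {
        Map<String, String> controls = new HashMap<>();

        // 1. Init: every control starts from the value declared in the page.
        controls.put("tbControl", "Initial TextBox Text");
        controls.put("lblControl", "Initial Label Text");

        // 2. Load view state: restore values saved with the previous response, if any.
        controls.putAll(request.viewState);

        // 3. Load post data: only input controls such as the TextBox are in the form data.
        if (request.form.containsKey("tbControl")) {
            controls.put("tbControl", request.form.get("tbControl"));
        }
        return controls;
    }

    public static void main(String[] args) {
        // View state disabled: the label's programmatic change was never saved.
        Request disabled = new Request();
        disabled.form.put("tbControl", "Changed TextBox Text");
        System.out.println(handle(disabled));

        // View state enabled for the label: its changed text travels in the hidden field.
        Request enabled = new Request();
        enabled.form.put("tbControl", "Changed TextBox Text");
        enabled.viewState.put("lblControl", "Label's Text Changed");
        System.out.println(handle(enabled));
    }
}
```

In both runs the textbox keeps the typed value because it arrives with the posted form; only the label depends on whether anything was saved in the view state.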

So we conclude that controls which implement the IPostBackDataHandler interface retain their values even if ViewState has been disabled. This is because their values are sent back to the server in the form data of the HTTP POST request.