Machine Learning for Text Extraction
Administrator - 12 July 2010 - 12min
In a previous post we looked at the use of Natural Language Processing techniques in text extraction. Several steps are involved in the processing as each document passes through a pipeline of chained tasks.
A deep pipeline can take several seconds per document, so if one is dealing with thousands of documents an hour, the processing requirements could make the system nonviable. Care needs to be taken to weigh the accuracy gained by adding pipeline tasks against the additional processing power they require.
One reason for the slow speed of our email processing is that we parse every email in its entirety, regardless of whether it is of importance to us. In our case only 2% of the emails received are of interest, so we would like to reduce the amount of text we process by ignoring the rest. This weeding out of irrelevant text must itself be fast, otherwise the purpose is lost!
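To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Only the 2% figure comes from our system; the volume of 5,000 documents an hour, the 3 seconds per document and the 0.05 seconds for the filter are assumed numbers chosen purely for illustration, and the function name is hypothetical.

```python
def hourly_cpu_seconds(docs_per_hour, pipeline_secs, filter_secs=0.0, pass_rate=1.0):
    """CPU seconds needed per hour of incoming mail.

    Every document pays the filter cost; only the fraction that passes
    the filter (pass_rate) pays for the full pipeline.
    """
    return docs_per_hour * (filter_secs + pass_rate * pipeline_secs)


# Illustrative assumptions, not measurements.
DOCS_PER_HOUR = 5000   # "thousands of documents an hour"
PIPELINE_SECS = 3.0    # "several seconds" per document
FILTER_SECS = 0.05     # assumed cost of a cheap pre-filter
PASS_RATE = 0.02       # only 2% of emails are of interest (ideal filter)

unfiltered = hourly_cpu_seconds(DOCS_PER_HOUR, PIPELINE_SECS)
filtered = hourly_cpu_seconds(DOCS_PER_HOUR, PIPELINE_SECS, FILTER_SECS, PASS_RATE)

print(f"without filter: {unfiltered:.0f} CPU-seconds per hour")
print(f"with filter:    {filtered:.0f} CPU-seconds per hour")
```

Under these assumptions the unfiltered pipeline needs about 15,000 CPU-seconds for every hour of mail, more than four cores' worth, while an ideal filter brings that down to roughly 550. A real filter will let some duds through, so the actual saving would be smaller.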
Machine Learning (ML), which is a key area in AI, offers a solution. GATE comes with various machine learning Processing Resources implementing common ML algorithms like Support Vector Machine (SVM), Bayes classification and K-nearest neighbor (KNN). You “train” the algorithm using training sets of text samples.
Training is done by manually classifying sentences in a binary fashion: is this sentence of interest to me or not? Ideally you need thousands of representative sentences. The algorithm is then trained on this data: internally the various features and annotations are used to reverse engineer patterns based on the manual classification.
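As an illustration of what training on manually classified sentences means, here is a minimal bag-of-words naive Bayes sketch in plain Python. It is not GATE's learning Processing Resource and does not use GATE's features or annotations; the function names and the four training sentences are made up for the example, and a real training set would need thousands of sentences.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Simplistic tokenizer: lowercase and split on whitespace.
    return sentence.lower().split()


def train(labelled_sentences):
    """labelled_sentences: list of (sentence, label); label True means 'of interest'."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for sentence, label in labelled_sentences:
        class_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def log_score(model, sentence, label):
    """Log of prior times word likelihoods, with add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in tokenize(sentence):
        if word in vocab:  # words never seen in training are skipped
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def is_of_interest(model, sentence):
    return log_score(model, sentence, True) > log_score(model, sentence, False)


# Hypothetical, manually classified training sentences.
TRAINING = [
    ("the board approved a merger with acme", True),
    ("acme announced the acquisition of a rival", True),
    ("lunch menu for friday is attached", False),
    ("please reset your password today", False),
]

model = train(TRAINING)
print(is_of_interest(model, "acme merger approved"))
print(is_of_interest(model, "lunch menu attached"))
```

With this toy data the first sentence is classified as of interest and the second is not. The binary labels are the only supervision: the word statistics are learned from the manual classification.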
In production you first run your input text through the Machine Learning pipeline task. If it predicts that the text is of interest then you run it through the rest of the pipeline, otherwise you ignore it. The problem is that this prediction is probabilistic, so two kinds of mistakes are possible. In the first, a false positive, it wrongly tells you that a dud document is of interest, wasting CPU cycles. The more troublesome mistake is a false negative, where a valid document is marked as of no interest.
In our case, for example, this is an unacceptable error: we would miss reporting valid events to customers, and they could no longer rely on our service. Unfortunately ML algorithms are such that these two types of errors cannot be reduced independently: if you want to catch all valid documents you also let through a lot of duds that eat up your CPU cycles.
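The coupling between the two error types can be sketched as follows, assuming the classifier outputs a confidence score between 0 and 1 and that a document is sent to the full pipeline when its score reaches a cut-off. The scores, the cut-off values and the function name are invented for illustration.

```python
def error_counts(scored_docs, threshold):
    """Count both kinds of mistakes for a given cut-off.

    scored_docs: list of (score, is_valid) pairs.
    Documents scoring at or above the threshold go to the full pipeline.
    Returns (wasted, missed): duds that were processed, valid documents dropped.
    """
    wasted = sum(1 for score, valid in scored_docs if score >= threshold and not valid)
    missed = sum(1 for score, valid in scored_docs if score < threshold and valid)
    return wasted, missed


# Made-up classifier scores: three valid documents, five duds.
SCORED = [
    (0.95, True), (0.80, True), (0.40, True),
    (0.70, False), (0.45, False), (0.30, False), (0.10, False), (0.05, False),
]

for threshold in (0.9, 0.6, 0.35):
    wasted, missed = error_counts(SCORED, threshold)
    print(f"threshold {threshold}: {wasted} duds processed, {missed} valid documents missed")
```

On this toy data, lowering the cut-off from 0.9 to 0.35 takes the missed valid documents from two down to zero, but the duds processed go from zero up to two. When missing a valid document is unacceptable, the cut-off has to sit low and the extra CPU cost is the price.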
In addition ML can give you strange results. Bad data in your training sets can have a significant impact on your results. Debugging such issues is very difficult because of the non-deterministic nature of learning algorithms. A lot of trial and error is involved, mostly tedious work manually annotating documents, running different training sets and validating the results on real data.
However, as with the deterministic NLP process using JAPE, the result is magic. Once your training sets are clean and complete, the ML task can weed out a significant share of the unwanted documents. Iteratively adding runtime learning to the system (where you enhance the training sets as you go along) can bring dramatic improvements over time.
After the first experience with email parsing we are now using NLP in another project. We have a product for recruiters where resume parsing is an important piece. It currently parses candidate information using regular expressions and string matches.
The accuracy is around 80% for basic information, which is a problem since 1 out of 5 fields is missed or wrong. Using a slightly different pipeline from the one described above and building some heuristics into a custom PR, we have been able to get to over 95% accuracy in the lab. In addition, we are now extracting several other types of information that were considered too difficult to extract using traditional programming.
Our experiences have made us look at other aspects of NLP like collaborative filtering and content-based recommendation engines as well as enhanced search using NL techniques. You might see a post on this soon!