Software testing is essentially an exercise in continuous exploration, learning, and questioning. It becomes especially interesting and challenging when the application under test is as complex as a maps application. You have probably used applications such as Google Maps or Yahoo Maps. Their primary use is to help users find a route: the user gives a source and a destination, and the application returns directions from one to the other. From this description the application may sound simple, but testing it poses numerous challenges. As a tester, you need to identify which queries are relevant and assess the quality of the results the system produces for them.
During beta testing of the application, we collected thousands of queries and other input data entered by end users. To give an idea of the volume, there were more than 8000 queries for every city, for example "Hotels in Mumbai", "Escort Mumbai", and "Taj Mumbai". Finding the relevant data among these queries is a difficult and time-consuming task.
This data can be analyzed for relevant queries in two ways: have people go through it manually, or write a smart tool that uses artificial intelligence. Since human resources are very expensive :) , we decided to develop a tool to classify the input data.
After looking at various possible solutions, we decided to use a Bayesian classifier. For those who want to know more about the theory behind it, this is what Wikipedia says about Bayes' theorem:
Bayes' theorem (also known as Bayes' rule or Bayes' law) is a result in probability theory, which relates the conditional and marginal probability distributions of random variables. In some interpretations of probability, Bayes' theorem tells how to update or revise beliefs in light of new evidence a posteriori.
The probability of an event A conditional on another event B is generally different from the probability of B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the statement of that relationship.
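In symbols, for two events A and B with P(B) greater than zero, the relationship described above is usually written as:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Read for our problem, A could stand for "this query is good" and B for "this query contains a given word". That mapping is our own illustration and is not part of the Wikipedia text.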
The use of classifiers based on Bayes' theorem is well known in email spam filtering. A spam filter typically has a large set of mail already labeled as good mail or spam. It works on the probability that certain words are more likely to appear in spam than in normal email. A spam filtering system also learns from its users every time a user hits the "report spam" or "not spam" button.
So we decided to write our own tool based on Bayes' theorem, with the ability to learn what is good data and what is bad data. The tool learns how to classify data based on how we train it. In simple terms, the input to the tool is a definition of what is good, a definition of what is bad, and the sample data. Based on this, it classifies the data as good or bad, as simple as that.
Normally, to classify a set of text, we first have to teach the tool what is good and what is bad. During training, the classifier keeps track of how often each word shows up in each category, good or bad.
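The bookkeeping described above can be sketched in a few lines of Ruby. This is a minimal illustration under our own assumptions, not the code of our tool or of the classifier gem: the class name TinyBayes, the tokenizer, the add-one smoothing, and the labels given to the training phrases are all made up for the example.

```ruby
# Minimal naive Bayes sketch (hypothetical TinyBayes class, not the classifier gem).
class TinyBayes
  def initialize(*categories)
    @word_counts = {} # category => { word => times seen in that category }
    @doc_counts  = {} # category => number of training texts
    categories.each do |category|
      @word_counts[category] = Hash.new(0)
      @doc_counts[category] = 0
    end
  end

  # Record every word of +text+ under +category+.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each { |word| @word_counts[category][word] += 1 }
  end

  # Return the category with the highest log-probability score.
  def classify(text)
    words = tokenize(text)
    vocabulary = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum
    scores = {}
    @word_counts.each do |category, counts|
      total_words = counts.values.sum
      # Smoothed prior: share of training texts that belong to this category.
      score = Math.log((@doc_counts[category] + 1.0) / (total_docs + @word_counts.size))
      # Add-one smoothing keeps a word unseen in training from zeroing the score.
      words.each do |word|
        score += Math.log((counts[word] + 1.0) / (total_words + vocabulary))
      end
      scores[category] = score
    end
    scores.max_by { |_category, score| score }.first
  end

  private

  # Assumed tokenizer: lowercase alphanumeric runs.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

# Hypothetical training data; the good/bad labels are for illustration only.
bayes = TinyBayes.new(:good, :bad)
bayes.train(:good, "Hotels in Mumbai")
bayes.train(:good, "Taj Mumbai")
bayes.train(:bad, "asdf qwerty")
puts bayes.classify("hotels near Taj")
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow on long queries, and the smoothing means a single unknown word does not rule a category out.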
Implementation
This tool was developed in Ruby, because Lucas Carlson's Classifier library is already available as the classifier gem. This library provides a naive Bayesian classifier. More information about this can be found here.
In our implementation, the code reads three files:
* good.yml
* not_good.yml
* input file
For execution, we need to give two command-line arguments: the city name and the input file name. Based on the definitions of good and bad, the tool creates a directory named after the city and puts good.txt and bad.txt in it, containing the queries classified as good or bad.
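The file handling described above could look roughly like the following sketch. It is not our original code. It assumes that good.yml and not_good.yml are flat YAML lists of example phrases and that the input file holds one query per line. The helper names load_examples, build_classifier, and classify_queries are hypothetical, and build_classifier is a simple word-overlap stand-in for the Bayesian classifier so that the example runs without any gem.

```ruby
require "yaml"
require "fileutils"

# Assumption: each YAML file is a flat list of example phrases.
def load_examples(path)
  Array(YAML.load_file(path)).map(&:to_s)
end

# Hypothetical stand-in for a trained classifier: it counts how many words of
# a query occur in the good and in the bad examples. Ties count as :good.
def build_classifier(good_examples, bad_examples)
  tokenize = ->(text) { text.downcase.scan(/[a-z0-9]+/) }
  good_words = good_examples.flat_map { |example| tokenize.call(example) }.uniq
  bad_words  = bad_examples.flat_map { |example| tokenize.call(example) }.uniq
  lambda do |query|
    words = tokenize.call(query)
    good_hits = words.count { |word| good_words.include?(word) }
    bad_hits  = words.count { |word| bad_words.include?(word) }
    bad_hits > good_hits ? :bad : :good
  end
end

# Create a directory named after the city and write each non-empty query
# from the input file to good.txt or bad.txt inside it.
def classify_queries(city, input_path, classifier, out_root: ".")
  out_dir = File.join(out_root, city)
  FileUtils.mkdir_p(out_dir)
  File.open(File.join(out_dir, "good.txt"), "w") do |good|
    File.open(File.join(out_dir, "bad.txt"), "w") do |bad|
      File.foreach(input_path) do |line|
        query = line.strip
        next if query.empty?
        target = classifier.call(query) == :good ? good : bad
        target.puts(query)
      end
    end
  end
  out_dir
end

# Command-line entry point: city name first, input file name second.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  city, input_path = ARGV
  classifier = build_classifier(load_examples("good.yml"), load_examples("not_good.yml"))
  puts "Wrote results to #{classify_queries(city, input_path, classifier)}"
end
```

Saved under a name of your choice, for example classify.rb, the sketch would be run as `ruby classify.rb Mumbai queries.txt` from the directory that holds the two YAML files. Because the classifier is passed in as any object that responds to call, a real Bayesian classifier can replace the stand-in without touching the file handling.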