Sunday, December 20, 2009

Research paper in NDSS 2010: Improving Spam Blacklisting Through Dynamic Thresholding and Speculative Aggregation

Our blacklist paper titled:

Improving Spam Blacklisting Through Dynamic Thresholding and Speculative Aggregation

and authored by:

Sushant Sinha, Michael Bailey, and Farnam Jahanian
University of Michigan, Ann Arbor, MI - 48109.

will be presented at the Network and Distributed System Security (NDSS) Symposium 2010, from 28th February to 3rd March in San Diego, California.

Here is the abstract:


Unsolicited bulk e-mail (UBE) or spam constitutes a significant
fraction of all e-mail connection attempts and routinely frustrates
users, consumes resources, and serves as an infection vector for
malicious software. In an effort to scalably and effectively reduce
the impact of these e-mails, e-mail system designers have increasingly
turned to blacklisting. Blacklisting (blackholing, block listing) is a
form of coarse-grained, reputation-based, dynamic policy enforcement
in which real-time feeds of spam sending hosts are sent to networks so
that the e-mail from these hosts may be rejected. Unfortunately,
current spam blacklist services are highly inaccurate and exhibit
both false positives and significant false negatives. In this paper, we
explore the root causes of blacklist inaccuracy and show that the
trend toward stealthier spam exacerbates the existing tension between
false positives and false negatives when assigning spamming IP
reputation. We argue that to relieve this tension, global aggregation
and reputation assignment should be replaced with local aggregation
and reputation assignment, utilizing preexisting global spam
collection, with the addition of local usage, policy, and reachability
information. We propose two specific techniques based on this premise,
dynamic thresholding and speculative aggregation, whose
goal is to improve the accuracy of blacklist generation. We
evaluate the performance and accuracy of these solutions in the
context of our own deployment consisting of 2.5 million production
e-mails and 14 million e-mails from spamtraps deployed in 11 domains
over a month-long period. We show that the proposed approaches
significantly improve the false positive and false negative rates when
compared to existing approaches.

Friday, November 13, 2009

Anonymization of Court Judgments

Indian Kanoon has been pretty useful for people seeking information on Indian law. Over time Indian Kanoon's reputation on general search engines like Google and Yahoo improved, and as a result pages from Indian Kanoon started surfacing among the top few search results. One completely inadvertent consequence was that people would go to Google or Yahoo, type someone's name, and see the Indian Kanoon court judgment related to that person. As many of these judgments concern issues that society may look down upon, many people started feeling embarrassed by their court judgments. To be fair, in many cases these people were in the right. But one has to read the judgments to realize that.

The next thing that happened was that many folks sent me legal threats demanding that I remove the court judgments corresponding to them. Some even claimed that they are currently citizens of countries in which such information is private and that I should remove these documents immediately. A few more nuanced requests only asked that search engines be kept from indexing these documents by listing their URLs in robots.txt.
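For readers unfamiliar with robots.txt, here is a minimal sketch of what such a request amounts to. The /doc/12345/ path and the example.org host are made up for illustration and are not Indian Kanoon's actual URL scheme. Python's standard urllib.robotparser shows how a crawler that honors the file would interpret a Disallow rule:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /doc/12345/ path is invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /doc/12345/
"""

def is_crawlable(robots_txt, url, agent="*"):
    """Return True if a crawler that honors robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# The listed judgment is off limits to compliant crawlers; others are not.
blocked = is_crawlable(ROBOTS_TXT, "http://example.org/doc/12345/")
allowed = is_crawlable(ROBOTS_TXT, "http://example.org/doc/67890/")
```

Note that such a rule only asks well-behaved crawlers to stay away. The document itself remains on the website and can still be found through the site's own search.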

As court judgments constitute public records, removing them from the website was out of the question. However, restricting generic search engines from indexing such judgments was not that bad. One problem is that someone may be looking for information on a person and will miss it because it is not indexed by generic search engines. After all, a person acquitted by a court judgment is only legally in the right, and societal or individual values may differ. So people should be free to decide on the issue by reading these court judgments. The second issue is that these court judgments also bring new users to Indian Kanoon and make them more knowledgeable about law. So till now I have not acted on such requests.

Recently a lawsuit has been filed in the Andhra Pradesh High Court seeking name anonymization in one of the court judgments. The order is here. That particular court judgment refers to a copyright infringement case related to a woman. It seems strange that a court would ask to anonymize the name in this case, as it involves neither a woman who is a victim of a sexual offence nor a minor. However, the AP High Court has taken up this writ petition and it seems possible that the court may decide in favor of the plaintiff.

One question of interest is what the bar is for name anonymization in court judgments. If there is no such bar, then no one would want their name on the records. My search on Indian Kanoon did not reveal much, as this issue does not seem to have been addressed in depth till now.

The search on IK reveals a bunch of cases where names have been withheld: "name+withheld"

However, almost all of these are cases where a woman is the victim of a sexual offence. So it will be interesting to know whether the courts order anonymization in cases related to copyright infringement.

Saturday, October 3, 2009

MCC Murugappa Gold Cup 2009 - Schedule and Results

MCC Murugappa Gold Cup Hockey tournament 2009 - October 1 to October 11, 2009 at Mayor Radhakrishnan Stadium, Egmore, Chennai

Pool A
Air India
Indian Overseas Bank (IOB)
Karnataka XI

Pool B
Indian Oil Corporation (IOC)
Army XI
Mumbai XI
Indian Railways
Punjab National Bank (PNB)

Date        Day        Time     Match                      Score
01-10-2009  Thursday   2:15 pm  IOB vs Karnataka XI        2-2
02-10-2009  Friday     2:15 pm  Railways vs Army XI        1-2
                       4:00 pm  Air India vs ONGC          1-5
                       5:45 pm  IOC vs Mumbai XI           6-3
03-10-2009  Saturday   2:15 pm  Karnataka XI vs ONGC       1-3
                       4:00 pm  Mumbai XI vs Railways      0-2
                       5:45 pm  Army XI vs PNB             3-1
04-10-2009  Sunday     2:15 pm  Karnataka XI vs BPCL       1-3
                       4:00 pm  Army XI vs IOC             0-1
                       5:45 pm  Air India vs IOB           2-0
05-10-2009  Monday     2:15 pm  IOC vs PNB                 4-1
                       4:00 pm  IOB vs ONGC                1-3
                       5:45 pm  Air India vs BPCL          0-3
06-10-2009  Tuesday    4:00 pm  Mumbai XI vs PNB           1-4
                       5:45 pm  IOC vs Railways            1-0
07-10-2009  Wednesday  2:15 pm  IOB vs BPCL                0-2
                       4:00 pm  Air India vs Karnataka XI  3-3
                       5:45 pm  Army XI vs Mumbai XI       3-1
08-10-2009  Thursday   4:00 pm  BPCL vs ONGC               3-4
                       5:45 pm  Railways vs PNB            3-2
09-10-2009  Friday     REST DAY
10-10-2009  Saturday   TBA      Winner A vs Runner B
                       TBA      Runner A vs Winner B
11-10-2009  Sunday     TBA      Finals

Tuesday, July 28, 2009

Failure of software upgrade on Indian Kanoon server: Should be back soon

A recent upgrade to the Indian Kanoon server broke the glibc dependency of many software packages. Glibc is one of the most critical libraries on the system, and its upgrade has definitely hosed the box. I came to the data center today to try to fix this problem, and even the package manager was not working.

Booting from scratch and then reinstalling core components has produced some results. At least the package manager is working now and I am able to recompile glibc. The compilation is still going on, and after that the packages that depend on glibc need to be recompiled. Hopefully the server should be up soon.

I am extremely sorry for my mistake due to which other folks are suffering. And I promise not to do blind updates on a production server from now on.


Unfortunately the server could not be fixed today and I had to bring it home. The server is fixed now but not accessible at You can access the server at You can get the search and access to documents. However, forums are not back yet because of the domain verification. The server will be put back in the data center tomorrow and it will be back on Since the DNS queries are cached for a day, it is not worth modifying the DNS entry.


The server was restored in the data center as of 1:00 AM IST.

Friday, July 24, 2009

Final PhD Defense: Context Aware Network Security

Computer Science and Engineering
CSE Defense

Context Aware Network Security
Sushant Sinha
Friday, July 31, 2009
3:00 pm - 5:00 pm
3725 CSE

Chair: F. Jahanian

The public is invited to attend

Thursday, May 21, 2009

New Text Search Goodies in Postgresql 8.4

Postgresql has a full text search engine built into it. Teodor Sigaev and Oleg Bartunov, who started a text search engine called OpenFTS, merged their code base into Postgresql 7.4 as a separate contrib module called tsearch2.

Tsearch2 is a highly extensible and sophisticated code base for text search. It is flexible in the sense that you can write your own stemmer and parser, or even use the default one for any new language. Keeping up with the tradition of extensibility in Postgres, Tsearch2 lets users define their own ranking function or headline generation. Besides that, it offers two kinds of text index, GiST and GIN, with different performance characteristics for initial indexing and index updates. Tsearch2 was merged into the core of Postgres last year in the 8.3 release.

Postgres 8.4 is being released after more than a year of development and testing. A few patches that many people wanted, like a default replication scheme in the core and SELinux support in Postgres called SE-PostgreSQL, were punted to 8.5. The main reason was that these patches were too big for people to review in the last months.

Postgres 8.4 brings a number of improvements in text search. Here is the list of new text search features that you may want to look out for and that may push you to upgrade:

1. Optimizer selectivity function for @@ text search operations (Jan Urbanski)

This was a Google Summer of Code (GSoC) project taken up by Jan Urbanski and it is great that the project was successful. Though it took a lot of time for this patch to be accepted, it provides a quite accurate selectivity measure for text search. This will enable the Postgres planner to produce better plans for SQL queries when the text search match operator is combined with other equally complicated operators.

2. Fast prefix matching in full text searches (Teodor Sigaev, Oleg Bartunov)

Earlier, prefix matching had to be done using the LIKE operator. For example, to search for all documents containing words that begin with "sush", the WHERE clause needed to contain LIKE 'sush%'. However, the LIKE operator does not use the text index and is very slow. This patch introduces fast prefix matching in Postgres. Now the WHERE clause can be something like:

xt_tsvector @@ to_tsquery('sush:*')
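To see why an index can make prefix queries cheap, here is a minimal Python sketch. It only illustrates the general idea and is not PostgreSQL's GIN code: it assumes the indexed lexemes are kept in sorted order, so every lexeme starting with a given prefix sits in one contiguous range that a binary search can locate. The lexeme list and the function name are made up for the example.

```python
from bisect import bisect_left

def prefix_matches(sorted_lexemes, prefix):
    """Return all lexemes in a sorted list that start with prefix."""
    # Binary search jumps straight to the first candidate.
    position = bisect_left(sorted_lexemes, prefix)
    matches = []
    # Matching lexemes are contiguous, so stop at the first mismatch.
    while position < len(sorted_lexemes) and sorted_lexemes[position].startswith(prefix):
        matches.append(sorted_lexemes[position])
        position += 1
    return matches

# Hypothetical lexemes standing in for the contents of a text index.
lexemes = sorted(["search", "sushant", "sushi", "sun", "text", "susan"])
result = prefix_matches(lexemes, "sush")
```

In contrast, a LIKE pattern applied to the raw document text has to examine every row, which is why it gets slow on large tables.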

3. Support multi-column GIN indexes (Teodor Sigaev)

Earlier, if you had to index two separate text columns like "documents" and "comments", you could only have a separate index for each column, and a query then had to be matched against each column (q @@ document and q @@ comments). With this patch, such queries can take advantage of a multi-column index if the developer has created one. Here is the performance of a multi-column index compared with two single-column indexes, as observed by Teodor:

          Multicolumn index   2 single column indexes
Size:     539 Mb              538 Mb
Speed:    *1.885* ms          4.994 ms
Index:    ~340 s              ~200 s
Insert:   72 s/10000          66 s/10000

4. Improve full text search headline() function to allow extracting several fragments of text (Sushant Sinha)

This patch was contributed by me. Headline generation is the identification of text fragments in a document where query terms appear. The default headline generation function showed only one text fragment for a set of query terms. Further, the existing headline generation function did not show good headlines (this is a somewhat subjective judgment, as there is no definitive way to identify a good fragment).

My patch allows more than one non-overlapping fragment to be displayed for a set of query terms. Further, the text fragments are chosen so that they contain the query terms in the most compact way. A lot of databases will be envious of production quality headline generation in Postgres now.
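The idea of a "most compact" fragment can be sketched in a few lines of Python. This is not the code from the patch, only an illustration under simplifying assumptions: the document is already split into words, every query term must appear in the fragment, and a single fragment is returned rather than several. The function name and sample sentence are made up.

```python
from collections import Counter

def most_compact_cover(words, query_terms):
    """Return (start, end) indexes of the shortest run of words that
    contains every query term, or None if no such run exists."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = Counter()   # occurrences of each query term inside the window
    covered = 0          # number of distinct query terms inside the window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            counts[word] += 1
            if counts[word] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            if words[left] in needed:
                counts[words[left]] -= 1
                if counts[words[left]] == 0:
                    covered -= 1
            left += 1
    return best

text = ("the court held that the search of the premises without "
        "a warrant made the search illegal").split()
span = most_compact_cover(text, ["search", "warrant"])
fragment = " ".join(text[span[0]:span[1] + 1])
```

Here the fragment chosen is the four-word run around the second occurrence of "search", rather than the longer run that starts at the first occurrence.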

5. Improve support for Nepali language and Devanagari alphabet (Teodor)

I do not know much about this but looking at the CVS log, here was the bug that was fixed:

"Some languages have symbols with zero display's width or/and vowels/signs which
are not an alphabetic character although they are not word-breakers too.
So, treat them as part of word."

Tuesday, May 19, 2009

Net Neutrality: No Caps on Internet Bandwidth!

ISPs have been warning us for a long time that bandwidth on the Internet is running out, and that the doomsday when we will not be able to use the Internet is pretty close. They have attributed this problem to a variety of causes: a small group of users consuming significantly more bandwidth than others, applications like BitTorrent consuming the entire bandwidth, websites like YouTube consuming the bandwidth, and so on. A solution has been proposed for each of these problems: cap user bandwidth and penalize any extra usage, use protocol shapers that slow down (read: discriminate against) certain protocols, or charge website owners like YouTube for reaching the customer.

Cory Doctorow, in his Guardian article, argues that all of these mechanisms are designed to prohibit people from using the Internet. I have always been against middle boxes like Packeteer that restrict bandwidth usage based on protocol, because such discrimination is arbitrary and can be used to target and extort any application the ISPs want. Similarly, charging websites for reaching their customers is totally flawed, as this can again be extortionist. And as Cory points out, it can also be used to curb public protests and free speech. In his words:

"by allowing ISPs to silently block access to sites that displease them, we invite all the ills that accompany censorship – Telus, a Canadian telcom that blocked access to a site established by its striking workers where they were airing their grievances."

I used to think that a solution in which ISPs cap the total data transferred over a month was fair, since it does not discriminate against any particular website or application. However, I did not realize that such a policy can significantly hinder the free usage of the Internet. As Cory says:

But the real problem of per-usage billing is that no one – not even the most experienced internet user – can determine in advance how much bandwidth they're about to consume before they consume it. Before you clicked on this article, you had no way of knowing how many bytes your computer would consume before clicking on it. And now that you've clicked on it, chances are that you still don't know how many bytes you've consumed. Imagine if a restaurant billed you by the number of air-molecules you displaced during your meal, or if your phone-bills varied on the total number of syllables you uttered at 2dB or higher.

In India, people who exceed the bandwidth cap are in fact penalized significantly. So capping bandwidth usage with penalties is going to either scare people off from using the bandwidth or make them too conservative to try any new website or application on the Internet.

If we do not want any restrictions and ISPs keep complaining about clogged routers, what is the solution? If ISPs keep adding more bandwidth, we probably don't need to talk about it. But if they keep insisting, what type of plan might I agree to? The only thing I might agree to is a pricing model that gives an incentive for using more bandwidth. It could be a fixed cost up to, say, 10GB a month, after which they charge me at a reduced rate for the next 10GB. Such a graduated rate will give people more incentive to try new things on the Internet. After all, we want more people to use the Internet freely without worrying about details that they do not understand.
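To make the graduated plan concrete, here is a small Python sketch of the arithmetic. All the numbers are hypothetical: a flat fee of 500 for the first 10GB (an effective 50 per GB) and a reduced rate of 30 per GB after that. The post only talks about the next 10GB, so the sketch simply assumes the reduced rate keeps applying beyond 20GB.

```python
def monthly_bill(usage_gb, base_fee=500.0, included_gb=10.0,
                 reduced_rate_per_gb=30.0):
    """Bill under a graduated plan: a flat fee covers the first
    included_gb, and every extra GB costs a reduced per-GB rate.
    All default amounts are made-up numbers for illustration."""
    if usage_gb < 0:
        raise ValueError("usage cannot be negative")
    extra_gb = max(0.0, usage_gb - included_gb)
    return base_fee + extra_gb * reduced_rate_per_gb

light_user = monthly_bill(5)    # stays within the flat fee
heavy_user = monthly_bill(15)   # 5 extra GB at the reduced rate
```

With these numbers the heavy user pays 650 for 15GB, about 43 per GB, so the cost per GB keeps falling as usage grows instead of being penalized.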

Sunday, May 3, 2009

Press Release: Indian Kanoon - Making Law Accessible To Common People

Please spread the press release for Indian Kanoon far and wide

Wednesday, April 29, 2009

Ubuntu - Upgrade to Jaunty Jackalope

I recently upgraded Ubuntu on my Dell Inspiron 13 laptop from Intrepid Ibex (8.10) to Jaunty Jackalope (9.04). "upgrade -d" was very slow on 23rd April, the release date for Jaunty, so I downloaded the Jaunty image using BitTorrent. Once I had enough peers, the average speed was topping 10 MB/s. Looks like I was behind a 100 Mbps switch.

Jaunty brings in Gnome 2.26 and a number of new features to the desktop. A common notification system for all applications is one of the very useful features. The desktop is much slicker and the new compiz and xorg are super fast. Compositing windows and the 3-D desktop effects are very useful for quickly switching to the window of interest. My music player rhythmbox used to hang while downloading multiple podcast feeds. This bug has been fixed in Jaunty. Jaunty also brings a very fast boot-up, with the boot time averaging just 7 seconds on my laptop. And of course all devices on the laptop, including the webcam, bluetooth, wireless card and sound card, were automatically identified and the correct drivers were loaded.

Two things that did not work out of the box were the Intel GM965 graphics card and the in-built microphone. The new Xorg (an open source implementation of the X window system) has changed significantly from the previous release. Compiz, when used with the new Xorg, freezes on Intel GM965 cards. As a result Ubuntu blacklisted many Intel graphics cards and will not turn on compiz for them. So you can use the metacity package, which only supports 2-d graphics and which also had some performance regression with respect to the previous release. Overall, people with Intel graphics cards who wanted 3-d acceleration were left out in the cold. This has been surprising and disappointing considering that Intel has been very good about supporting their graphics cards on Linux. That has been one of the reasons that I exclusively buy Intel hardware.

Fixing the graphics bug was definitely not easy. As you can see from the bug report on Ubuntu Launchpad, it has been very difficult to figure out where the bug is. Figuring out whether Xorg or the driver needs fixing has been further complicated by the possibility of multiple bugs. The new UXA support has been added in Xorg to fix possibly one of the problems and I used it according to the instructions provided here. Then I turned off the blacklist and started compiz. Since then compiz has been working great with excellent performance and no freezes.

Support for the in-built microphone in the snd-hda-intel driver has been hard because of the large number of laptops that have different variants of the sound card. Currently many people have reported this problem on the alsa website and hopefully support for my microphone will be added soon.

I wish there was better support for my graphics card and the in-built microphone. But overall Jaunty brings a great desktop for everyday users.

Sunday, April 26, 2009

Letter to Allahabad High Court - Removing restrictions to court judgments

To Allahabad High Court,

I am Sushant Sinha, a PhD candidate in the Department of Computer
Science and Engineering at the University of Michigan. I am also a founder
of a legal search engine called Indian Kanoon.
Indian Kanoon provides state-of-the-art free search and free access to
Indian court judgments to the common people.

Indian Kanoon daily crawls different Indian court websites and adds the
set of updated judgments for Supreme court and high courts to its
database. Since court judgments do not have copyright protection,
Indian Kanoon does not violate any copyright law. However, recently the
elegalix portal used by the Allahabad High Court has introduced an image
captcha to restrict automated access to court judgments. For example,
someone needs to solve an image captcha to access the following

Indian Kanoon provides just another portal for people to get access to
court judgments and thereby allows more widespread distribution of court
judgments. Restricting access to judgments in this particular fashion
will hinder Indian Kanoon's ability to provide access to Allahabad High
Court decisions and thereby hinder people's easy access to court
judgments.

Indian Kanoon fills in many voids which exist in current Indian court
websites. Restricting access to judgments also forces people to stay
with the court websites and keeps them from using the law search tools
provided by other providers like Indian Kanoon. I think providing
unhindered access to court judgments is in the interest of Indian people
as they can use any research tools provided by any competitive portal.
If such restrictions are removed, people can choose whichever website
they like most.

Besides that, image captchas cannot be solved by many people who are
blind, elderly, or have impaired eyesight. While there are tools (like
text to speech) that allow such people to get information available on
the Internet, there are no tools available for solving image captchas.
Therefore, image captchas on the Allahabad High Court website restrict
access to court judgments for an important section of the Indian population.

I would like to know the reason for restricting the free access to court
judgments that was previously provided on Allahabad High Court website.
If the problem was the Allahabad server getting overloaded because of
Indian Kanoon crawling, I would be happy to follow any guidelines that
you provide. Even without additional guidelines, Indian Kanoon
crawling only starts at 12:00 am IST, when there is little chance of
affecting any normal user on your website. Further, replicating court
judgments on Indian Kanoon reduces the load on Allahabad court servers
as many people can access the judgments directly on Indian Kanoon.

Having given my reasons for removing such restrictions and stated my
willingness to follow any guidelines that you provide, I would like to
know your decision in this respect.

Thank you,

Monday, April 20, 2009

Research Paper - One Size Does Not Fit All: 10 Years of Applying Context-Aware Security

Our new paper One Size Does Not Fit All: 10 Years of Applying Context-Aware Security will be published in May 2009 at the International Conference on Technologies for Homeland Security 2009.

Here is the abstract:

Defenders of today's critical cyber-infrastructure (e.g., the Internet) are equipped with a wide array of security techniques including network-based intrusion detection systems (IDS), host-based anti-virus systems (AV), and decoy or reconnaissance systems such as host-based honeypots or network-based telescopes. While effective at detecting and mitigating some of the threats posed to critical infrastructure, the ubiquitous nature of malicious activity (e.g., phishing, spam, DDoS) on the Internet indicates that the current deployments of these tools do not fully live up to their promise. Over the past 10 years our research group has investigated ways of detecting and stopping cyber-attacks by using the context available in the network, host, and the environment. In this paper, we explain what exactly we mean by context, why it is difficult to measure, and what one can do with context when it is available.
We illustrate these points by examining several studies in which context was used to enable or enhance new security techniques. We conclude with some ideas about the future of context-aware security.

Tuesday, April 14, 2009

Book Review: India Unbound - Gurcharan Das

I had the book titled "India Unbound" (by Gurcharan Das) for a long time. Finally I got time to read this book on the flight.

Gurcharan Das discusses the economic policies of the Indian government since independence. Narrated in the first person, the book does an extremely good job of putting the economic policies of the Indian government in perspective and showing how they impacted Indian society. Gurcharan Das, born and brought up in India, went to Harvard for his bachelor's degree in Philosophy and then came back to India for work. He started working in a small team of 12 people trying to market Vicks Vaporub.

He discusses the huge investment in the public sector as planned by Nehru, then the license raj and rationing perpetuated by Indira Gandhi, and finally the economic liberalization brought in by Narasimha Rao. He blames Nehru not for the large public sector investment but for the poor management of his own vision. He believes the delay in implementing economic liberalization and the perpetuation of a stricter license raj were the worst things that happened to India, and the blame for them falls squarely on Indira Gandhi.

The book argues that the economic model of controlling production, distribution and consumption of goods through the license raj, rationing and higher import duties has been the main reason for India's economic backwardness. It argues that the economic freedom that was obtained from the British was unfortunately handed to Indian bureaucrats who had no idea about business. The balance of payments crisis that came in 1991 was handled much better by Narasimha Rao and Manmohan Singh. Instead of increasing import duty and restricting money flow outside, the economy was liberalized by reducing import duties, reducing protection for domestic industries and removing the red tape of the license raj. Opening up the market definitely turned out to be better than bureaucratic control.

Besides an accessible analysis of Indian economic policies, what I found really interesting in the book is the large number of important people that Gurcharan Das met himself or read about. Understanding the stories of these people, how they became successful or why they failed, is one of the important contributions of this book. His experience as director of different companies and his work with venture capital firms show more of his breadth, and that makes the book more wholesome.

One common criticism of the book that I found on the Internet was that Gurcharan Das is a supporter of unfettered capitalism. However, I feel such boxing of him into a class is at odds with the large set of specific problems he points out in the license raj. I do not see why anyone should control the market in the fashion the Indian government did till 1991. If the argument is for the protection of Indian industry, I feel that 40 years is far too long for this. The government may have a role in regulating the market, which the author agrees with in the book. However, the economic policies were not just regulating but also setting the production, distribution and price of a large number of products.

I largely agree with his view of the mishandling of the Indian economy by politicians with a very simplistic view of a planned economy and little regard for human creativity as a source of growth. However, I feel that there are some virtues of freedom and liberty which should not be sacrificed for any short-term gain, because in many situations it is very hard to move forward once a person is stuck with a monopolistic product. Most consumer goods definitely do not fall under such a category and he is right on that. But I found his casual dismissal of the values of freedom and liberty quite disturbing. These values are important for future economic growth.

I would highly recommend this book, and I have a copy of it if you are interested in borrowing it.

Sunday, March 22, 2009

Flying to India

Just a quick update that I am in New York waiting to board the Air India flight to India. I will be in India from 25th March to 12th April.

Sunday, March 15, 2009

Banking crisis: What is the problem?

I just read Michael Mandel's article in BusinessWeek about what caused the banking crisis and why it is difficult for people to comprehend the problem.

I have seen the broad facts and the arguments earlier. But Michael Mandel does an extremely good job of presenting them in a simple fashion, as the title of the article claims. I remember reading The Economist and BusinessWeek 4 years back, when people were talking about the huge trade deficit of the US. Many economists were arguing at the time that the trade deficit was not a big issue and that the situation could continue for the next 15 years. The main threat that people were discussing at the time was the strength of the dollar. Certainly the problem of the trade deficit has manifested itself in ways no one imagined at that time. And, surprisingly, the dollar has strengthened.

A good explanation of why Wall Street is still in danger (quoting from the article):

And when there wasn’t enough “safe assets” to sell to willing foreigners, the intrepid investment bankers created more. Consider, for example, credit default swaps, which pay off if a bond defaults—in effect, insurance on debt. Wall Street saw this as a ‘two-fer.’ They would sell corporate bonds to foreign investors, and at the same time collect fees on credit default swaps on the bonds in order to reassure those apparently too-nervous investors from another part of the world.

But the joke in the end was on Wall Street. The foreign investors bought the bonds, but they also bought the protection—which much to everyone’s surprise was needed. And the U.S. banks and investment banks were left with piles of ‘toxic assets’—the obligation to pay off all sorts of bonds and derivatives

Saturday, February 28, 2009

Tutorial for dpkt

dpkt is a very useful Python package for constructing and parsing packets of different protocols. However, it is poorly documented and a lot of magic is hidden inside the code. Jon Oberheide presents the first tutorial on dpkt.

Monday, February 23, 2009

Cincinnati Field Hockey Tournament

I went to play in the Midwest field hockey championship in Cincinnati last weekend. There were 8 teams in total. Most of them, like Chicago, Notre Dame, Indiana and our team (Michigan), were from the Midwest, except a team from Florida. It was fun to play some competitive field hockey after a while. Being an indoor tournament, the game was restricted to 7 players a side with at most 3 substitutes.

Every team had a few good players, and the team with the best talent in the tournament was the Flying Pigs from Cincinnati. All their players were quite talented and that made playing against that team difficult. Otherwise I felt most other teams were quite competitive.

We played 6 games on the first day: we won 2, drew 2 and lost 2, a fairly balanced day. We won against Notre Dame and Indiana, lost to the Flying Pigs and Chicago, and drew with Columbus and the Miami team. On the second day we played our last league match against the Purple People from Cincinnati. We won that game and were placed 3rd in the league with 3 wins, 2 draws and 2 losses.

Then we went to the title games and lost to Columbus. Our bid for the cup was over and we drove back to Ann Arbor. After every tournament, there is a desire to look into what went wrong and whether we can correct it in future tourneys. I myself felt good and got some goals in the tournament. I regret missing a few close shots on goal, and a bit more patience would have been useful. Otherwise, winning or losing a tournament is after all a team responsibility and it is hard to pinpoint the mistakes. Besides, the tournament was for fun and it is not worth splitting hairs over how we could have won a few more games.

Saturday, February 14, 2009

Guest post on E-Legal: The Government efforts, shortcomings and suggestions

My post on government efforts in digitizing primary legal resources. I highlight some problems in the efforts and some suggestions for future improvements.

Saturday, February 7, 2009

Ayodhya: Ram birth place or that of Unified India

BJP was quite successful in using the Ayodhya issue to become a dominant national party in India. However, the Ayodhya issue has not given it any significant benefit in subsequent elections. The time has come for BJP to turn into a more development-oriented party.

The rift between the Hindu and Muslim communities needs to be healed too. A large symbol of secularism can be constructed in Ayodhya to show the rebirth of a unified India. This will benefit BJP as it can focus on the development agenda and beat Congress and other regional parties in the upcoming election.

Wednesday, January 28, 2009

Interview with

My interview with Kishore Buddha on Indian Kanoon and other stuff is here.

Sunday, January 18, 2009

Indian Kanoon - The road so far and the road ahead

I was quite pleased to find law information publicly available on the judis and
the indiacode. However, it was too difficult to look for anything on these
websites and so I started building tool sets to play with law data. At a
certain point I felt that integrating these small software pieces would be
very interesting. I was still skeptical as to whether search on law documents
meant anything to common people who do not know the legal jargon. In any case I
integrated the tool sets into a search engine and was pleasantly surprised when
many of my common queries were well answered. So I deployed it as a publicly
available service, called it Indian Kanoon, and fortunately many people have
found it useful over time.

When actual people start using a service (whether free or fee-based), the
demand for correctness and usability increases significantly. The need to
understand the problems, think about the issues and fix them has kept me in
a tight grip. Indian Kanoon was announced last January in a very crude form and
a number of changes have gone in over the past year. So this post is mostly to
highlight the work that has gone into Indian Kanoon in the last year, what the
challenges were and what features are planned for the future.

Integrating more legal documents

Indian Kanoon started with only Supreme Court judgments and central laws.
Clearly this was not sufficient for many people who wanted to search high
court judgments, law commission reports and law journals. Over the last year, a
number of other legal documents have been added. First, the law commission
reports and a law journal were added. The law journal "Central India Law
Quarterly" had been digitized and put up on the Internet by Devaranjan. The
only problem in integrating them was that many of these documents were images
scanned from books. So I used tesseract, a free OCR software supported by
Google, to extract text from these images. However, the text extraction
quality was just 90% and I am skeptical that Google uses tesseract for its own
Google Books project.

Tarunabh pointed out the availability of the constituent assembly debates,
which could be integrated. He pointed out two main problems in integrating
them. First, the article numbers in the debates were different from those in
the Constitution. Second, debates are cited in court judgments using page
numbers from the official books. Neither of these numbers was available in
the digital copy provided by the government. So the only way out was to go back
to the actual books. We did not want to give up on the digital route yet. So we
went to Google Books, which had a scanned copy of the debates. Tarunabh emailed
Google to release those books into the public domain, as the copyright on them
had expired the previous year. Google replied saying that they were not sure
about the copyright expiration and would be conservative in making books
publicly available. Finally, I borrowed the books from a library, manually
copied the page numbers and the association list between the article numbers in
the debates and the article numbers in the Constitution, and integrated the
constituent assembly debates.

Indian Kanoon was highly deficient in terms of high court judgments and even in
Supreme court judgments as Dilip earlier pointed out on my blog. So I
integrated the high court judgments and made Indian Kanoon more comprehensive.


Besides making Indian Kanoon comprehensive in terms of legal documents, a number
of features to make searching easier have been added. The most common problem
was the misspelling of Indian names, so I first added the most critical
feature: spelling suggestions. The ability to search and order documents by
date was added next. The search and forums were redesigned to look more
aesthetically appealing. In order to provide notifications for new judgments,
an RSS feed for court judgments was recently added. Finally, people may like to
monitor documents related to certain words or phrases. So on Tarunabh's
suggestion I added an RSS feed for any arbitrary query.
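
As a rough illustration of how spelling suggestions for names can work, here is a minimal Python sketch. It is not the code Indian Kanoon runs: the `KNOWN_NAMES` vocabulary and the `suggest_spelling` helper are made up for this example, and the matching relies on the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical vocabulary; a real engine would build this from the
# words that actually occur in its indexed documents.
KNOWN_NAMES = ["jethmalani", "gilani", "sandhu", "kandahar", "ayodhya"]

def suggest_spelling(word, vocabulary=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known word, or None if no suggestion is needed or found."""
    word = word.lower()
    if word in vocabulary:
        return None  # already spelled like a known word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Calling `suggest_spelling("jethmalni")` proposes "jethmalani", while a correctly spelled or completely unrelated word yields no suggestion. The `cutoff` value is a guess and would need tuning against real queries.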

Contributing code back

Developing the Indian Kanoon software has been possible because of the
availability of a large amount of free software. As a result I was able to
modify this software and customize it for law search. Indian Kanoon uses
PostgreSQL, a feature-rich open source database, as the backend. When users
submit a query, matching documents are found, ordered and the top few are
shown. For each document, the search engine also displays a small text excerpt
where the query terms appear. The text excerpt allows people to quickly
evaluate whether the document is relevant to the query. The headline function
developed for Indian Kanoon was contributed back to Postgres and has been added
to the Postgres CVS head. Besides that, a bug in Postgres was fixed as well. I
also sent the phrase search function to the Postgres list. But Teodor Sigaev,
who merged OpenFTS into PostgreSQL, wants a generic operator that can check for
an arbitrary distance between the lexemes. I have not yet had time to work on
this operator.

Besides development on the database, the Indian Kanoon forums have been released
as djangobb, a Django bulletin board that uses the django web application framework. The judis recently moved to a really obfuscated website where the judgments did not have a
stable URL. Prashant Iyengar pointed out that we were not getting the live feed from the judis. So I reverse engineered the website and released the judis reverse engineering code.

Future work

Even after so much work, a number of things need to be improved on Indian
Kanoon. Here is a list of changes that I think are required to make Indian
Kanoon more comprehensive, richer and better at search. Please feel free to
suggest more.

1. Reverse engineering different court and tribunal websites so that indian
kanoon can provide a live feed of all Indian court and tribunal judgments.

2. Currently Indian Kanoon cannot answer questions like "list of judgments in
which a particular law section was held" and "search only in family law
judgments". The problem is that we do not have enough semantic information
about judgments. So I want to enable common users to start tagging documents.
There will be two kinds of tagging: first, categorizing court judgments and
laws into broad categories like family law, constitutional law, right to
equality, etc., and second, tagging whether a judgment explains, bolsters, or
overturns a given law or judgment. The tags generated by the users will be
available to everyone under the Creative Commons Attribution-Share Alike 3.0
license.

3. A number of people type natural language into the search box. For example,
someone will type "recent judgments from delhi high court". Even though we can
answer such questions, we currently match the query words directly against the
documents. For example, the above query could have been reduced to "doctypes:
delhi sortby: mostrecent". So what we need is a small natural language
processor that can automatically convert such natural language queries into a
more precise query that the engine can evaluate.

4. I only support searching for a set of words in the documents. Roy wanted a
more sophisticated query language that supports boolean queries. This will
enable people to issue more complicated queries like (freedom OR speech) AND
(NOT expression).

5. With the addition of more data over time, Indian Kanoon takes more than a
second to evaluate some queries. A number of software changes (or possibly a
hardware upgrade) are required to bring the evaluation time back to sub-second.
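
To make item 3 concrete, here is a minimal rule-based sketch in Python of how a natural language query could be reduced to a more precise one. It is only an assumption about how such a processor might start out: the lookup tables and the `rewrite_query` helper are hypothetical, and only the "doctypes: delhi" and "sortby: mostrecent" forms come from the example in this post.

```python
# Hypothetical lookup tables; a real processor would need many more entries.
COURT_DOCTYPES = {"delhi high court": "delhi"}
RECENCY_WORDS = {"recent", "latest", "newest"}
FILLER_WORDS = {"judgment", "judgments", "from", "of", "the", "in"}

def rewrite_query(query):
    """Turn a natural language query into keywords plus structured filters."""
    text = " ".join(query.lower().split())
    filters = []
    # Replace a recognised court name with a doctypes filter.
    for phrase, doctype in COURT_DOCTYPES.items():
        if phrase in text:
            filters.append(f"doctypes: {doctype}")
            text = text.replace(phrase, " ")
    words = text.split()
    # Words such as "recent" become a sort order instead of a search term.
    if any(w in RECENCY_WORDS for w in words):
        filters.append("sortby: mostrecent")
    keywords = [w for w in words if w not in RECENCY_WORDS | FILLER_WORDS]
    return " ".join(keywords + filters)
```

With these tables, `rewrite_query("recent judgments from delhi high court")` gives "doctypes: delhi sortby: mostrecent". Fixed tables like these break easily on unusual phrasing, which is why the item asks for a proper natural language processor.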

Tuesday, January 13, 2009

Challenges with constitutional democracy in South Asia

In "A constitutional state", Rasul Bakhsh Rais points out the reasons for weak democracy in Pakistan.
While the problem usually cited is the military's excessive interference in civilian government, Rasul points out other aspects that have made civilian governments weak. He writes: "The answer lies in the undemocratic mindset of traditional leaders who control the political parties and, through them, the electoral process." He continues: "One point that is debated often but never understood is why there is no democracy within political parties; and why and how families and oligarchs dominate them. In essence, these elements use political parties to maintain their dominance, using the party’s name, social support base and elite network to control access to electoral politics and power."

This problem is not confined to Pakistan; it resonates very well with India too.

Tuesday, January 6, 2009

Extradition of Mumbai suspects using SAARC convention

An informative article from the Daily Times, Pakistan (Next steps after evidence from India) talks about India invoking the "SAARC Regional Convention on Suppression of Terrorism (1987)" for the extradition of Pakistani suspects in the Mumbai attack. Article 3(4) says:
"If a Contracting State which makes extradition conditional on the existence of a treaty receives a request for extradition from another Contracting State with which it has no extradition treaty, the requested State may, at its option, consider this Convention as the basis for extradition in respect of the offences set forth in Article I or agreed to in terms of Article II. Extradition shall be subject to the law of the requested State"

It is important to note that the extradition is optional and not binding on the "Contracting State". It appears that this is just an enabling clause and Pakistan is not obliged to extradite in this case. The Daily Times notes that India did not invoke this treaty after the 1999 hijacking of an Indian airliner to Kandahar.

Saturday, January 3, 2009

A brief analysis of Supreme Court judgment on SAR Gilani

The Supreme Court judgment in which Afzal Guru, Shaukat Hussain Guru, his wife Navjot Sandhu, and SAR Gilani were tried is here:
State (N.C.T. Of Delhi) vs Navjot Sandhu@ Afsan Guru on 4 August, 2005

The judgment narrates the events, then covers the police investigation, the confessions and finally the verdict. It also discusses in detail the different provisions relating to confessions under POTA and has actually struck them down. Finally, the court went with a normal criminal confession before a magistrate.

SAR Gilani was defended by Ram Jethmalani. The court rejected some witnesses who said that they had seen Shaukat and Gilani together while procuring room and board for the terrorists (who were killed). This was mostly because the witnesses made mistakes while identifying Gilani.

However, one piece of evidence that was irrefutable was the constant phone calls between Gilani, Shaukat and Afzal. This evidence was furnished by AIRTEL and ESSAR after warrants were issued under the Indian Telegraph Act. The court accepted these phone calls as evidence. However, the Supreme Court upheld the High Court's view that phone calls between Shaukat and Gilani alone did not confirm that Gilani knew about the conspiracy. Here is the text from the judgment:

"The High Court after holding that the disclosure statement of Gilani
was not admissible under Section 27 of the Evidence Act and that the
confession of co-accused cannot also be put against him, observed thus:

"We are, therefore, left with only one piece of evidence against
accused S.A.R. Gilani being the record of telephone calls between
him and accused Mohd. Afzal and Shaukat. This circumstance, in
our opinion, do not even remotely, far less definitely and unerringly
point towards the guilt of accused S.A.R. Gilani. We, therefore,
conclude that the prosecution has failed to bring on record
evidence which cumulatively forms a chain, so complete that there
is no escape from the conclusion that in all human probabilities
accused S.A.R. Gilani was involved in the conspiracy.""

The police could only get the call records for previous conversations. However, they recorded a call between Gilani and his brother after the incident. Here is the text excerpt translated from Kashmiri:

"Caller: (Brother of Gilani) What have you done in Delhi?
Receiver: (Gilani) It is necessary to do (while laughing) ( Eh che zururi).
Caller: Just maintain calm now.
Receiver: O.K. (while laughing)Where is Bashan?
This portion of the conversation appears almost towards the end of talk.
The defence version of translation is as follows:
Caller: (Brother of Gilani) What has happened?
Receiver: (Gilani) What, in Delhi?
Caller: What has happened in Delhi?
Receiver: Ha! Ha! Ha! (laughing)
Caller: Relax now.
Receiver: Ha! Ha! Ha!, O.K. Where are you in Srinagar?"

The police made another mistake here: the recording was so poor that the High
Court rejected the first two lines as inaudible. The police need to do a
better job than this. On the other part, the Supreme Court said:

"However, we would like to advert to one disturbing feature. Gilani rejoiced and laughed heartily when the Delhi event was raised in the conversation. It raises a serious suspicion that he was approving of the happenings in Delhi. Moreover, he came forward with a false version that the remark was made in the context of domestic quarrel. We can only say that his conduct, which is not only evident from this fact, but also the untruthful pleas raised by him about his contacts with Shaukat and Afzal, give rise to serious suspicion at least about his knowledge of the incident and his tacit approval of it. At the same time, suspicion however strong cannot take the place of legal proof. Though his conduct was not above board, the Court cannot condemn him in the absence of sufficient evidence pointing unmistakably to his guilt."

Finally the judgment:

"In view of the foregoing discussion we affirm the verdict of the High
Court and we uphold the acquittal of S.A.R. Gilani of all charges."

On the whole I felt that there was clearly not enough evidence (and a
significant number of police mistakes) to implicate Gilani as a conspirator in
the unfortunate events. However, a significant amount of doubt still remains
about his character.


Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 2.5 India License.