I was quite pleased to find legal information publicly available on judis and
indiacode. However, it was too difficult to look for anything on these
websites, so I started building tool sets to play with law data. At a
certain point I felt that integrating these small software pieces would be
very interesting. I was still skeptical as to whether search on law documents
meant anything to common people who do not know legal jargon. In any case I
integrated the tool sets into a search engine and was pleasantly surprised when
many of my common queries were answered well. So I deployed it as a publicly
available service, called it Indian Kanoon, and fortunately many people have
found it useful over time.
When actual people start using a service (whether free or fee-based), the
demand for correctness and usability increases significantly. The need to
understand the problems, think about the issues and fix them has kept me in
a tight grip. Indian Kanoon was announced last January in a very crude form, and a
number of changes have gone in over the past year. So this post is mostly to
highlight the work that has gone into Indian Kanoon in the last year, what the
challenges were and what features are planned for the future.
Integrating more legal documents
Indian Kanoon started with only Supreme Court judgments and central laws.
Clearly this was not sufficient for the many people who wanted to search high
court judgments, law commission reports and law journals. Over the last year, a
number of other legal documents have been added. First, the law commission reports
and a law journal were added. The law journal "Central India Law Quarterly" was
digitized and put up on the Internet by Devaranjan. The only problem in integrating
them was that many of these documents were images scanned from books. So I used
Tesseract, a free OCR program supported by Google, to extract text from these images.
However, the text extraction quality was just 90%, and I am skeptical that Google
uses Tesseract for its own Google Books project. Tarunabh pointed out the availability
of constituent assembly debates that could be integrated. He pointed out two main
problems in integrating them. First, the article numbers in the debates are different
from those in the Constitution. Second, court judgments cite the debates by their
page numbers in the official books. Neither of these numbers was available in
the digital copy provided by the government. So the only way out was to go back to
the actual books. We did not want to give up on the digital route yet, so we went to
books.google.com, which had a scanned copy of the debates. Tarunabh emailed Google
asking them to release those books into the public domain, as the copyright on them
had expired the previous year. Google replied that they were not sure about the
copyright expiration and would be conservative in making books publicly available.
Finally, I borrowed the books from a library, manually copied the page numbers and the
association list between the article numbers in the debates and the article numbers
in the Constitution, and integrated the constituent assembly debates.
Indian Kanoon was highly deficient in high court judgments, and even in
Supreme Court judgments, as Dilip pointed out earlier on my blog. So I
integrated the high court judgments and made Indian Kanoon more comprehensive.
Besides making Indian Kanoon comprehensive in terms of legal documents, I have
added a number of features to make searching easier. The most common problem
was the misspelling of Indian names, so I first added the most critical
spelling suggestions. The ability to search and order documents by date was added
next. The search and forums were redesigned to look aesthetically appealing. To
provide notifications of new judgments, an RSS feed for court judgments was
recently added. Finally, people may like to monitor documents related to certain
words or phrases, so on Tarunabh's suggestion I added an
RSS feed for any arbitrary query.
Contributing code back
Developing the Indian Kanoon software has been possible because of the availability
of a large amount of free software, which I was able to modify and
customize for law search. Indian Kanoon uses PostgreSQL, a feature-rich
open source database, as the
backend. When users submit a query, matching documents are found, ordered and
the top few are shown. For each document, the search engine also displays a
small text excerpt where the query terms appear. The text excerpt allows people
to quickly evaluate whether the document is relevant to the query. The
headline function developed for Indian Kanoon was contributed back to Postgres
and has been added to the Postgres CVS head. Besides that, a bug in Postgres was
fixed as well. I also sent the phrase search function to the Postgres list. However,
Teodor Sigaev, who merged OpenFTS into PostgreSQL, wants a generic operator that can
check for an arbitrary distance between lexemes. I have not yet had time to work
on this operator.
Besides the development on the database, the Indian Kanoon forums have been released
as djangobb, a Django bulletin board that uses the Django web application framework.
The judis recently moved to a really obfuscated website where judgments do not have
stable URLs. Prashant Iyengar pointed out that we were not getting the live feed from
the judis. So I reverse engineered the website and released the judis reverse
engineering code.
Even after so much work, a number of things need to be improved on Indian
Kanoon. Here is a list of changes that I think are required to make Indian
Kanoon more comprehensive, richer and better at search. Please feel free to
1. Reverse engineering different court and tribunal websites so that Indian
Kanoon can provide a live feed of all Indian court and tribunal judgments.
2. Currently Indian Kanoon cannot answer questions like "list of judgments in
which a particular law section was held" and "search only in family law
judgments". The problem is that we do not have enough semantic information
about judgments. So I want to enable common users to start tagging documents.
There will be two kinds of tagging: first, categorizing court judgments and laws
into broad categories like family law, constitutional law, right to equality etc.;
and second, tagging whether a judgment explains, bolsters, or overturns a given law
or judgment. The tags generated by the users will be available to everyone
under the Creative Commons Attribution-Share Alike 3.0 license.
3. A number of people type natural language into the search box. For example,
someone will type "recent judgments from delhi high court". Even though we can
answer such questions, at present we search for the query words directly in the
documents. For example, the above query could have been reduced to "doctypes: delhi sortby:
mostrecent". So what we need is a small natural language processor that can
automatically convert such natural language queries into a more precise query
that the engine can evaluate.
4. I only support searching for a set of words in the documents. Roy wanted a
query language that supports boolean queries. This will enable people to
issue more complicated queries like (freedom OR speech) AND (NOT expression).
5. With the addition of more data over time, Indian Kanoon takes more than a
second to evaluate some queries. A number of software changes (or possibly a
hardware upgrade) are required to bring the evaluation time back under a second.
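To make item 3 concrete, here is a minimal sketch of the kind of rewriting I have in mind. Only the target syntax "doctypes: delhi sortby: mostrecent" comes from the example above; the lookup tables, the list of filler words and the function name `rewrite_query` are assumptions made up for this illustration, and a real processor would need a far larger vocabulary.

```python
# Hypothetical lookup tables, for illustration only.
COURTS = {"delhi high court": "delhi"}
SORT_HINTS = {"recent": "mostrecent", "latest": "mostrecent"}
FILLER = {"judgment", "judgments", "from", "of", "the", "in"}


def rewrite_query(query):
    """Turn a natural language query into a more precise engine query."""
    text = " ".join(query.lower().split())
    filters = []

    # Replace a recognized court name with a doctypes filter.
    for phrase, doctype in COURTS.items():
        if phrase in text:
            filters.append("doctypes: " + doctype)
            text = text.replace(phrase, " ")

    # Separate sort hints from the words that should still be searched.
    sort = None
    remaining = []
    for word in text.split():
        if word in SORT_HINTS:
            sort = SORT_HINTS[word]
        elif word not in FILLER:
            remaining.append(word)

    # Nothing recognized: leave the query exactly as the user typed it.
    if not filters and sort is None:
        return " ".join(query.lower().split())

    if sort is not None:
        filters.append("sortby: " + sort)
    return " ".join(remaining + filters)
```

With these tables, "recent judgments from delhi high court" becomes "doctypes: delhi sortby: mostrecent", while a plain query such as "freedom of speech" passes through unchanged. Filler words are dropped only when a filter or sort hint was recognized, so ordinary word searches are not altered.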
Sunday, January 18, 2009