Urdu Search Engine (USE) - [funded by ICT R&D, PKR 33.3 M] - Introduction

“Search engines are as critical to Internet use as any other part of the network infrastructure, but they differ from other components in two important ways. First, their internal workings are secret, unlike, say, the workings of the DNS (domain name system). Second, they hold political and cultural power, as users increasingly rely on them to navigate online content” (Cafarella and Cutting, 2004). Moreover, the most popular search engines, such as Google, Yahoo, and Bing, still do not serve the needs of all types of users. For example, they do not provide a search infrastructure that is generic to all the languages spoken in the world. There are several reasons for this, but a main one is that languages differ from one another not only in morphology but also, and greatly, in syntax. In classical philosophy of language, linguists are divided over how language should be treated (for example, Chomsky’s view vs. Quine’s), but each group acknowledges that there are unique language-specific complexities that need to be catered for. Unfortunately, Natural Language Processing (NLP) also remains less explored for Urdu. Urdu is the 5th most spoken language in the world [1]. The majority of Urdu speakers are in Pakistan, India, Canada, and the UK.

In light of all this, building a search engine that searches Urdu content against Urdu queries is both a challenging and an exciting research area. It will not only give Pakistani researchers an opportunity to investigate Urdu-specific challenges in general but also serve a large group of user communities who prefer to search and view information in Urdu because of the English language barrier.

Modern NLP applications perform computations over large corpora. With increasing frequency, NLP applications use the Web as their corpus and rely on queries to commercial search engines to support these computations. But search engines are designed and optimized to answer people’s queries, not to serve as building blocks for NLP applications. In response, Google has created the “Google API” to shunt programmatic queries away from Google.com and has placed hard quotas on the number of daily queries a program can issue to the API. Other search engines have also introduced mechanisms to block programmatic queries, forcing applications to introduce “courtesy waits” between queries and to limit the number of queries they issue. Having a language-specific search engine would enable an NLP application to issue a much larger number of queries quickly (Cafarella and Etzioni, 2005). For Urdu content, Google performs a mere string search without any language-specific intelligence. Moreover, a lot of Urdu content is available only as images, and Google and other search engines fail to parse the text inside them.

Finally, modern search engines also overlook the requirements of low-end mobile users. A primary reason is that in the developed world the penetration of smart devices is overwhelming, and with high-speed Internet services (for example, 3G data transfer) widely available, there is little perceived need to develop convenient search mechanisms for users of low-end mobile phones. Addressing this gap includes developing an SMS-based search facility for ordinary mobile users, which requires extracting a succinct summary of the search results and sending it back to the user via SMS. Text summarization is an active research area, and exploring it with a view to handling the complexities of the Urdu language will give us an opportunity to contribute to the field while providing a unique feature to a huge segment of society.
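
As an illustration of the SMS constraint only, and not the project's actual summarizer, the following minimal Python sketch trims an already produced summary so that it fits in one message. It assumes the message is sent in UCS-2 encoding, where a single non-concatenated SMS carries 70 characters; the function names and the sentence-splitting rule (Urdu full stop U+06D4, Arabic question mark U+061F, and the Latin terminators) are hypothetical choices made for this example.

```python
import re

# Assumption: Urdu text is sent as UCS-2, so one non-concatenated SMS
# holds 70 characters (Urdu letters are one UCS-2 unit each).
UCS2_SINGLE_SMS_LIMIT = 70


def split_sentences(text):
    """Split after the Urdu full stop, Arabic question mark, or . ! ?"""
    parts = re.split(r"(?<=[\u06D4\u061F!?.])\s+", text.strip())
    return [p for p in parts if p]


def fit_to_sms(summary, limit=UCS2_SINGLE_SMS_LIMIT):
    """Keep as many whole leading sentences as fit within `limit`.

    If even the first sentence is too long, cut at the last word
    boundary and append an ellipsis.
    """
    text = summary.strip()
    if not text:
        return ""
    picked = []
    length = 0
    for sentence in split_sentences(text):
        extra = len(sentence) + (1 if picked else 0)  # +1 for joining space
        if length + extra > limit:
            break
        picked.append(sentence)
        length += extra
    if picked:
        return " ".join(picked)
    cut = text[: limit - 1]  # leave room for the ellipsis
    if " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut + "\u2026"
```

A real deployment would also have to decide whether to use concatenated messages, which carry fewer characters per segment, and would rank sentences by relevance rather than simply taking the leading ones.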

Benefits of the Project

Direct Customers / Beneficiaries of the Project:

  • NLP researchers, who will be able to issue a much larger number of queries quickly.
  • Official policy makers who require user search trends in order to formulate government-level policies.
  • Commercial or public sector organizations that want to improve their services or processes based on trends in user queries.
  • A large user group in Pakistan and abroad who wish to search the WWW in Urdu.
  • Urdu content providers who want to give their content visibility on the WWW.

Outputs Expected from the Project:

  • Development of new research for Urdu NLP.
  • A working content and language identifier.
  • A working indexer for Urdu corpora.
  • An elementary text summarizer for the Urdu language.
  • A working system for SMS-based search.
  • The initial phase focuses on exploring the R&D issues and, primarily, on building a returning user base of students and middle-aged Urdu users by providing culture-specific search patterns, so that they use the system consistently.
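
To give a feel for what the language identifier listed above has to decide, here is a rough Python sketch of a script-based first pass. It is an assumption-laden heuristic rather than the project's method: it checks that most letters fall in the Unicode Arabic block (U+0600 to U+06FF) and that at least one letter typical of Urdu orthography is present (for example U+0679, U+0688, U+0691, U+06BA, U+06BE, U+06C1, U+06D2). The names `URDU_MARKERS`, `arabic_script_ratio`, and `looks_like_urdu`, and the 0.5 threshold, are hypothetical.

```python
# Letters common in Urdu but absent from standard Arabic orthography.
# Assumption: some of these also occur in other languages written in
# Arabic script (e.g. Punjabi Shahmukhi), so this is only a first filter.
URDU_MARKERS = frozenset("\u0679\u0688\u0691\u06BA\u06BE\u06C1\u06D2")


def arabic_script_ratio(text):
    """Fraction of alphabetic characters in the Arabic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06FF")
    return arabic / len(letters)


def looks_like_urdu(text, min_ratio=0.5):
    """Heuristic: mostly Arabic script and at least one Urdu marker letter."""
    has_marker = any(c in URDU_MARKERS for c in text)
    return has_marker and arabic_script_ratio(text) >= min_ratio
```

A script check like this cannot separate Urdu from every other language written in Arabic script, so a working identifier would likely combine it with statistical evidence such as character n-gram frequencies.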