WILDRE 2018 - 4th Workshop on Indian Language Data: Resources and Evaluation (WILDRE-4)



Venue: Miyazaki, Japan



Topics/Call for Papers

In the past couple of decades, the Indian NLP and Speech Technology community has shown an ever-increasing interest in the development of Language Resources for Indian Languages. As the community grew, increasing research in and development of Language Technology brought an acute awareness of a serious lack of appropriate resources across the languages of India. A number of initiatives have been taken to address this issue, by the Government of India as well as by academia and industry. Many of these initiatives have targeted specific NLP and Speech technologies, fostering collaborations between several academic institutions across the country, with the active involvement of industry partners. As expected, when a number of resources are simultaneously being developed by several research groups across many languages, the need for standards also takes on some urgency. In the past few years, the Govt. of India, in consultation with experts from academia and industry, has taken the lead in developing appropriate standards for NLP resources. This concentrated effort has resulted in a number of resources, standards, tools and technologies becoming available for many Indian languages. While the activity in the Indian Language community may still not be comparable to, for example, the work done on European languages, we firmly believe that the community has come of age and is at a point where sharing of ideas and experience is necessary, not only within the community but also with other communities working in similar situations, so that India can move forward in planning for future language technology resources and requirements while maintaining its linguistic diversity.
India has 4 language families, of which Indo-Aryan (76.87% of speakers) and Dravidian (20.82% of speakers) are the major ones. These families have contributed 22 constitutionally recognized (‘scheduled’ or ‘national’) languages, of which Hindi has ‘official’ status in addition to ‘national’ status. Besides these, India has 234 mother tongues reported by the recent census (2001), and many more (more than 1600) languages and dialects. Of the major Indian languages, Hindi is spoken in 10 (out of a total of 25) states of India, with over 60% of the total population, followed by Telugu and Bangla. There are more than 18 scripts in India which need to be standardized and supported by technology. Devanagari is the most widely used script, serving more than 10 languages.
Indian languages are under the exclusive control of the respective states in which they are spoken. Every state may therefore decide on measures to promote its language. However, since these 22 languages are national (constituent) languages, the center (the Union of India) also has a responsibility towards each of them, with certain additional responsibilities towards Hindi, which is the national as well as the official language of the Indian union. From time to time, minor or neglected languages claim constituent status. The situation becomes more complex when such a language becomes the rallying point for the demand for a new state or autonomous region.
This complex linguistic scene puts tremendous pressure on the Indian government not only to have comprehensive language policies, but also to create resources for the maintenance and development of these languages. In the age of information technology, there is a greater need to strike a fine balance in allocating resources to each language, keeping in view political compulsions, the electoral potential of a linguistic community and other issues.
Language promotion and maintenance by the Ministry of Human Resource Development
The MHRD, through its language agency CIIL and many academic institutions across the country, has set up a Linguistic Data Consortium for Indian Languages (LDC-IL). This consortium, being set up along the lines of the LDC at the University of Pennsylvania (USA), will not only create and manage large Indian language databases but will also provide a forum for researchers in India and other countries working on Indian languages to publish and to build products based on such databases that would not otherwise be possible.
LDC-IL is expected to:
Become a repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
Facilitate creation of such databases by different organizations which could contribute and enrich the main LDC-IL repository.
Set appropriate standards for data collection and storage of corpora for different research and development activities.
Support language technology development and sharing of tools for language-related data collection and management.
Facilitate training and manpower development in these areas through workshops, seminars etc. in technical as well as process related issues.
Create and maintain the LDC-IL web-based services that would be the primary gateway for accessing its resources.
Design, or help create, appropriate language technology based on the linguistic data for mass use.
Provide the necessary linkages between academic institutions, individual researchers and the masses.
The Technology Development for Indian Languages (TDIL) program of the Ministry of Communications and IT (MCIT)
The MCIT started a program called TDIL in 1991 for building technology solutions for Indian languages. The stated objectives of TDIL are:
(i) to develop information processing tools and techniques,
(ii) to facilitate human-machine interaction without language barrier,
(iii) to create and access multilingual knowledge resources and integrate them to develop innovative user products and services.
TDIL has made many basic software tools and fonts for 22 Indian languages available in the public domain. On the language resources front, TDIL is running several language corpora projects in consortium mode. Some of the significant projects are:
• Development of LRs for English to Indian Languages Machine Translation (MT) System
• Development of LRs for Indian Language to Indian Language Machine Translation System
• Development of LRs for Sanskrit-Hindi Machine Translation
• Development of LRs for Robust Document Analysis & Recognition System for Indian Languages
• Development of LRs for On-line Handwriting Recognition System
• Development of LRs for Cross-lingual Information Access
• Development of Speech Corpora/Technologies
• Parallel Language Corpora development in all 22 national languages (ILCI)

