
CodeSwitch 2014 - First Workshop on Computational Approaches to Code Switching

Date: 2014-10-25

Deadline: 2014-05-31

Venue: Doha, Qatar


Website: https://emnlp2014.org/workshops/CodeSwit...

Topics/Call for Papers

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the inter-sentential, intra-sentential (mixing of words from multiple languages in the same utterance), and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies, including parsing, machine translation (MT), automatic speech recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input from another language is mixed in. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed-language material present.
CS is pervasive in informal text communications such as newsgroups, tweets, blogs, and other social media of multilingual communities. Such genres are increasingly being studied as rich sources of social, commercial, and political information. Apart from the informal genre challenge associated with such data within a single-language processing scenario, the CS phenomenon adds another significant layer of complexity to the processing of the data. Efficiently and robustly processing CS data presents a new frontier for our NLP algorithms on all levels. This workshop aims to bring together researchers interested in solving the problem and to raise awareness in the community at large of viable solutions that reduce the complexity of the phenomenon.
The workshop invites contributions from researchers working on NLP approaches for the analysis and/or processing of mixed-language data, especially with a focus on intra-sentential code switching. Topics of relevance to the workshop include the following:
Development of linguistic resources to support research on code switched data
NLP approaches for language identification in code switched data
NLP techniques for the syntactic analysis of code switched data
Domain/dialect/genre adaptation techniques applied to code switched data processing
Language modeling approaches to code switched data processing
Crowdsourcing approaches for the annotation of code switched data
Machine translation approaches for code switched data
Position papers discussing the challenges of code switched data to NLP techniques
Methods for improving ASR in code switched data
Survey papers of NLP research for code switched data
Sociolinguistic aspects of code switching
Sociopragmatic aspects of code switching
Shared Task: Language Identification in Code-Switched (CS) Data
You thought language identification was a solved problem? Think again! Recent research has shown that fine-grained language identification is still a challenge, and is particularly error-prone when the spans of text are small. Now imagine you have more than one language in those small text spans! We are organizing a shared task on language identification of CS data. The goal is to allow participants to explore the use of unsupervised and supervised approaches to the detection of language at the word level in code-switching data. We will release a small gold standard data set for tuning systems in four language pairs: Spanish-English, Modern Standard Arabic and Arabic dialects, Chinese-English, and Nepalese-English.
Task Definition
For each word in the Source, identify whether it is Lang1, Lang2, Mixed, Other, Ambiguous, or NE (for named entities, which are proper names that represent names of people, places, organizations, locations, movie titles, and song titles). The focus of the task is on microblog data, so we will use Twitter as the source of data, although each language combination will have data from a "surprise genre" as additional test data as well.
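As a rough illustration of the word-level labeling described above, the following Python sketch assigns one of the six labels to each token using plain vocabulary lookups. It is a toy baseline under stated assumptions, not the organizers' system: the label strings, the `tag_tokens` function, and the word lists are all hypothetical, and this lookup never produces the Mixed label, which would require analysis below the word level.

```python
from enum import Enum


class Label(Enum):
    # The six categories come from the task definition;
    # the string values are assumptions made for this sketch.
    LANG1 = "lang1"
    LANG2 = "lang2"
    MIXED = "mixed"
    OTHER = "other"
    AMBIGUOUS = "ambiguous"
    NE = "ne"


def tag_tokens(tokens, lang1_vocab, lang2_vocab, named_entities=frozenset()):
    """Assign one Label per token using simple set lookups (toy baseline)."""
    labels = []
    for token in tokens:
        word = token.lower()
        if word in named_entities:
            labels.append(Label.NE)
        elif not any(ch.isalpha() for ch in word):
            # Punctuation, numbers, and similar non-words.
            labels.append(Label.OTHER)
        elif word in lang1_vocab and word in lang2_vocab:
            # The word exists in both languages, so a lookup cannot decide.
            labels.append(Label.AMBIGUOUS)
        elif word in lang1_vocab:
            labels.append(Label.LANG1)
        elif word in lang2_vocab:
            labels.append(Label.LANG2)
        else:
            # Unknown words fall back to Other in this sketch.
            labels.append(Label.OTHER)
    return labels


if __name__ == "__main__":
    # Hypothetical miniature vocabularies, for illustration only.
    english = {"i", "love", "no"}
    spanish = {"mi", "casa", "no"}
    entities = {"madrid"}
    tokens = ["I", "love", "mi", "casa", "no", "!", "Madrid"]
    for token, label in zip(tokens, tag_tokens(tokens, english, spanish, entities)):
        print(token, label.value)
```

A lookup like this ignores context, which is exactly where short, mixed spans become hard: a word shared by both languages can only be resolved by looking at its neighbors.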
Participants for this shared task will be required to submit output of their systems following the schedule proposed below in order to qualify for evaluation under the shared task. They will also be required to submit a paper describing their system.
Since we are using Twitter data, we are following the now-usual procedure, used by other researchers, for releasing labeled data. Participants can use their own scripts or download our Python script to collect the data directly from Twitter, and we will release character offsets with the label information.
Please join our Google group to receive announcements and other relevant information for the workshop: codeswitching_workshop-AT-googlegroups.com
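To make the offset-based release concrete, here is a minimal Python sketch of how character offsets might be mapped back onto a downloaded tweet. The annotation format is an assumption made for illustration: each entry is a hypothetical `(start, end, label)` tuple with 0-based offsets and an inclusive end. The released files may use a different layout, in which case the slicing needs adjusting.

```python
def apply_offsets(text, annotations):
    """Return (token, label) pairs for character-offset annotations.

    Assumed format (not confirmed by the task description): each annotation
    is (start, end, label) with 0-based offsets and an inclusive end.
    """
    pairs = []
    for start, end, label in annotations:
        # A tweet that was changed or truncated since annotation would
        # make the offsets point outside the text, so fail loudly.
        if not (0 <= start <= end < len(text)):
            raise ValueError(
                f"offset ({start}, {end}) outside text of length {len(text)}"
            )
        pairs.append((text[start:end + 1], label))
    return pairs


if __name__ == "__main__":
    # Invented example tweet and labels, for illustration only.
    tweet = "me gusta the beach"
    annotations = [
        (0, 1, "lang2"),
        (3, 7, "lang2"),
        (9, 11, "lang1"),
        (13, 17, "lang1"),
    ]
    for token, label in apply_offsets(tweet, annotations):
        print(token, label)
```

Releasing offsets rather than text means the labels are only usable once the tweet itself has been retrieved, so a range check like the one above is a cheap guard against mismatched downloads.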
To register your team please follow this link: Registration Form
Data Release
The script to crawl Twitter data is this one: twitter. You will need to have Beautiful Soup installed for this Python script to work.
A second method to crawl Twitter data using the Twitter API is also available: Twitter via API. You will need to have the Launchy gem for Ruby installed, which can be done via 'gem install launchy' in the command line. You will also need a Twitter account to authenticate with the application.
*** Updated 3/31/14 ***
Nepalese-English Trial data (20 tweets)
Spanish-English Trial data (20 tweets)
Mandarin-English Trial data (20 tweets)
Modern Standard Arabic-Arabic dialects (20 tweets)
Spanish-English Training data (11,400 tweets)
Nepali-English Training data (6,000 tweets)

Last modified: 2014-05-01 22:17:16