Khallaf, Nouran; Arfon, Elin; El-Haj, Mo; Morris, Jonathan; Knight, Dawn; Rayson, Paul; Hammouda, Tymaa Hasanain
(2023) Open-Source Thesaurus Development for Under-Resourced Languages: a Welsh Case Study. In: Carvalho, Sara and Khan, Anas Fahad and Anić, Ana Ostroški and Spahiu, Blerina and Gracia, Jorge and McCrae, John P and Gromann, Dagmar and Heinisch, Barbara and Salgado, Ana, (eds.) Proceedings of the 4th Conference on Language, Data and Knowledge. (pp. 306-315). NOVA CLUNL, Portugal: Vienna, Austria.
Text: Arfon_2023.ldk-1.30.pdf (439kB)
Abstract
This paper introduces an open-access, user-friendly online thesaurus for the Welsh language, aimed at enriching digital resources for Welsh speakers and learners. Utilising advances in Natural Language Processing (NLP), our approach combines pre-existing word embeddings, a Welsh semantic tagger, and human evaluation to establish related terms. An initial list of 250 words was expanded with 6,953 synonyms provided by linguists, creating a more extensive foundation for building the gold standard. With this expanded list, when a user queries a particular word, the thesaurus presents all of its synonyms, giving the user a wider range of options to choose from. This is especially helpful when a user is unsure of the exact word they want or wishes to explore different ways of expressing a concept. The resulting thesaurus offers a comprehensive, reliable resource for Welsh language users, fostering enhanced communication and expression. Our work promotes Welsh NLP and showcases NLP's potential to support under-resourced languages. The thesaurus will be accessible via a bilingual website, and the accompanying Python code will be available in a bilingual, public GitHub repository. Our approach presents a more efficient, cost-effective method for thesaurus creation, with potential applicability to other under-resourced languages.
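The abstract does not spell out how the word embeddings and the semantic tagger are combined, so the following is only a minimal sketch of one plausible reading: rank candidate words by cosine similarity to the query word, keep those that share its semantic tag, and pass the result to human reviewers. The toy vectors, the tag labels, the thresholds, and the function names `cosine` and `candidate_synonyms` are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import math

# Toy 3-dimensional embeddings: made-up values for illustration only.
EMBEDDINGS = {
    "hapus": [0.90, 0.10, 0.00],   # "happy"
    "llawen": [0.85, 0.15, 0.05],  # "joyful"
    "trist": [0.80, 0.20, 0.10],   # "sad"
    "ci": [0.00, 0.10, 0.95],      # "dog"
}

# Hypothetical semantic labels standing in for a tagger's output.
SEMANTIC_TAGS = {
    "hapus": "emotion+",
    "llawen": "emotion+",
    "trist": "emotion-",
    "ci": "animal",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_synonyms(word, embeddings, tags, top_k=5, min_similarity=0.5):
    """Return up to top_k candidates that share the query word's semantic tag,
    ordered by decreasing cosine similarity. Unknown words yield an empty list."""
    if word not in embeddings:
        return []
    query_vec = embeddings[word]
    query_tag = tags.get(word)
    scored = []
    for other, vec in embeddings.items():
        if other == word:
            continue
        # Embedding neighbours often include antonyms; the tag filter drops them.
        if tags.get(other) != query_tag:
            continue
        similarity = cosine(query_vec, vec)
        if similarity >= min_similarity:
            scored.append((other, similarity))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in scored[:top_k]]


# Candidates would then go to linguists for review before entering the thesaurus.
print(candidate_synonyms("hapus", EMBEDDINGS, SEMANTIC_TAGS))
```

In this toy data, "trist" lies close to "hapus" in the embedding space but carries a different tag, so the filter removes it. This illustrates why a tagger might usefully complement embeddings before human evaluation.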
| Type: | Proceedings paper | 
|---|---|
| Title: | Open-Source Thesaurus Development for Under-Resourced Languages: a Welsh Case Study | 
| Event: | LDK 2023 – 4th Conference on Language, Data and Knowledge | 
| Location: | Vienna, Austria | 
| Dates: | 12 Sep 2023 - 15 Sep 2023 | 
| Open access status: | An open access version is available from UCL Discovery | 
| Publisher version: | https://aclanthology.org/2023.ldk-1.30 | 
| Language: | English | 
| Additional information: | Creative Commons Attribution 4.0 International License, https://creativecommons.org/licenses/by/4.0/. | 
| UCL classification: | UCL; UCL > Provost and Vice Provost Offices > School of Education; UCL > Provost and Vice Provost Offices > School of Education > UCL Institute of Education; UCL > Provost and Vice Provost Offices > School of Education > UCL Institute of Education > IOE - Culture, Communication and Media | 
| URI: | https://discovery.ucl.ac.uk/id/eprint/10190716 | 