South African Classic Project

Last updated on Oct 2, 2020

About

The South African Centre for Digital Language Resources (SADiLaR) is a national centre supported by the Department of Science and Innovation (DSI) as part of the new South African Research Infrastructure Roadmap (SARIR).

“SARIR is a high-level strategic and systemic intervention to provide research infrastructure across the entire public research system, building on existing capabilities and strengths, and drawing on future needs.” (DST SARIR brochure).

SADiLaR has an enabling function, with a focus on all official languages of South Africa, supporting research and development in the domains of language technologies and language-related studies in the humanities and social sciences. The Centre supports the creation, management and distribution of digital language resources, as well as applicable software, which are freely available for research purposes through the Language Resource Catalogue.

SADiLaR clients include academic scholars and professionals in all domains of Humanities and Social Sciences, Language Technologies, Natural Language Processing, Computer Science, as well as potential end-users in education, business and industry.

https://www.districtsix.co.za/

See GLAM’s intro here

Challenges:

Awareness; remote access to cultural heritage data, artefacts and archives

Datasets available:

Natural Language Processing data
Language corpora in the 11 official languages
Audio recordings

Pronunciation dictionaries

General phonemic pronunciations for frequently occurring words in South African languages. The dictionaries were developed to be practically usable in speech technology systems rather than phonetically accurate. Audio samples of all phonemes are included, as is a letter-to-sound rule set for predicting the pronunciations of generic words. (A separate entry describes the rule sets.)

link
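How a pronunciation dictionary and a letter-to-sound rule set typically work together can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entries, phoneme symbols, rule table and function names below are hypothetical and are not taken from the SADiLaR resources, whose actual file formats are not described on this page. Real letter-to-sound rule sets are usually context-dependent; the greedy longest-match lookup here is a simplification.

```python
# Illustrative only: the entries, phoneme symbols and rules below are
# made up for this sketch and are not taken from the SADiLaR resources.
PRONUNCIATIONS = {
    "sawubona": ["s", "a", "w", "u", "b", "o", "n", "a"],
}

# Hypothetical context-free rules mapping a grapheme to its phonemes.
LETTER_TO_SOUND = {
    "sh": ["S"],
    "s": ["s"],
    "h": ["h"],
    "a": ["a"],
    "b": ["b"],
}


def predict_pronunciation(word, rules=LETTER_TO_SOUND):
    """Apply the rules greedily, trying the longest grapheme first."""
    longest = max(len(grapheme) for grapheme in rules)
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.extend(rules[chunk])
                i += size
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return phonemes


def pronounce(word, dictionary=PRONUNCIATIONS, rules=LETTER_TO_SOUND):
    """Prefer the dictionary entry; fall back to the rules otherwise."""
    word = word.lower()
    if word in dictionary:
        return list(dictionary[word])
    return predict_pronunciation(word, rules)
```

A word found in the dictionary is returned directly, while an unseen word such as "shaba" is passed through the rules instead, which is the role the rule set plays for generic words.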

Speech corpora

This speech corpus, consisting of 16 female and 17 male speakers, was recorded in Lagos, Nigeria, for speech recognition research. Each speaker recorded about 130 utterances read from short texts selected for phonetic coverage. Recordings were made using a microphone connected to a laptop computer in a quiet office environment.

link

Text corpora

This directory contains corpora developed during project NCHLT: Text. Languages included are: Afrikaans, English*, isiXhosa, isiNdebele, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, Xitsonga. These corpora are based on documents from the South African government domain, mainly crawled from gov.za websites and collected from various language units.

link

Other

Click here to access two years' worth of a Xitsonga community newspaper.
Click here to access the rest of the data.

The above dataset is licensed under

Problem statement

Creating awareness of digital and computational approaches, and providing showcase projects and examples of what is possible through linked open data. Creating awareness of what digital humanities entails as a domain.