About the project
The Corpus of Written Tatar is a collection of electronic texts in the Tatar language.
Work on the Corpus of Tatar texts began in 2010. The project grew out of the authors' discussions about two directions of research:
By studying the relevant literature we became aware that modern systems for machine translation (MT) and automatic speech recognition rely on national corpora of the languages in question, applying the “hypothesis-check” method. This fact prompted us to commit ourselves to creating a similar corpus of the Tatar language.
The Corpus of Written Tatar is based mainly on materials available on the web. By following the web addresses given after the examples (sentences) in the search results, the user can obtain more information about the sources used in creating the corpus.
Texts originating from different sources were processed automatically before being included in the Corpus of Tatar language: HTML tags were deleted, sentences in foreign languages were removed, the encoding of the texts was converted to UTF-8, sentence boundaries were marked in the material, etc.
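The processing steps listed above can be illustrated with a minimal Python sketch. It is not the code used to build the corpus: the function names, the regular expressions and the script-ratio heuristic are all assumptions made for illustration. In particular, the foreign-language filter here only drops sentences that are mostly non-Cyrillic, which is a crude stand-in for real language identification (it would not remove Russian sentences, for example).

```python
import html
import re

# All names and heuristics below are hypothetical, for illustration only.
TAG_RE = re.compile(r"<[^>]+>")
# Split after ., !, ? or an ellipsis when followed by whitespace and a capital letter.
SENT_RE = re.compile(r"(?<=[.!?…])\s+(?=[A-ZА-ЯЁӘӨҮҖҢҺ«\"])")
# Russian Cyrillic letters plus the additional letters of the Tatar alphabet.
CYRILLIC_RE = re.compile(r"[А-Яа-яЁёӘәӨөҮүҖҗҢңҺһ]")


def strip_html(text: str) -> str:
    """Delete HTML tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Mark sentence boundaries with a simple punctuation-based rule."""
    return [s.strip() for s in SENT_RE.split(text) if s.strip()]


def looks_cyrillic(sentence: str, threshold: float = 0.5) -> bool:
    """Crude foreign-language filter: keep mostly-Cyrillic sentences."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if CYRILLIC_RE.match(c))
    return cyrillic / len(letters) >= threshold


def preprocess(raw: bytes, source_encoding: str) -> list[str]:
    """Decode the source bytes, clean them and return filtered sentences."""
    text = raw.decode(source_encoding, errors="replace")
    return [s for s in split_sentences(strip_html(text)) if looks_cyrillic(s)]


if __name__ == "__main__":
    page = "<p>Мин китап укыйм. Ул мәктәпкә бара!</p><p>This is English text.</p>"
    for sentence in preprocess(page.encode("utf-16"), "utf-16"):
        # Sentences are written out as UTF-8, matching the conversion step.
        print(sentence.encode("utf-8").decode("utf-8"))
```

Tags are stripped before entities are decoded so that escaped angle brackets in the text are not mistaken for markup. The source encoding must be supplied by the caller, since it differs from source to source.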
The collection and processing of materials is ongoing. After learning about the existence of the Corpus of Written Tatar, many writers and scholars have provided us with electronic versions of their books and articles. Our practice is to update the published version of the Tatar corpus when newly acquired contributions reach 5 to 6 million word occurrences. The user interface is updated at the same time.
The Corpus of Written Tatar can also be regarded as an enormous reference book, giving the user an orderly view into the world of the Tatar language.
The majority of the texts included in the corpus of Tatar language belong to three styles: journalism (≈ 60%), fiction (≈ 35%) and scientific literature in the humanities (≈ 5%).
The basic purpose of the Corpus of Written Tatar is to assist research into the Tatar lexicon. Furthermore, the corpus can be used in language learning and as a source of models for various types of documents.
The user interface of the Tatar language corpus makes it possible to perform the following operations:
The searches described above allow the following tasks to be accomplished:
The list of applications of the Corpus of Tatar language given above is, of course, not exhaustive. Electronic corpus materials are also indispensable in work on automatic speech recognition and machine translation.
Today the Tatar corpus is balanced and representative of the language as it is actually used.
New contributions to the Corpus of Tatar are welcomed with gratitude. If you want to help us, please send us electronic versions of your own books, articles and other documents for inclusion in the corpus.
To protect the copyrights of the authors, texts are stored in the corpus as individual sentences, which means that it is not possible to extract whole texts from the corpus. Each sentence is provided with a link to the work it comes from.
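The sentence-level storage described above might be modelled roughly as follows. This is a hypothetical sketch, not the corpus's actual schema: the `SentenceRecord` fields, the `to_records` and `search` helpers and the example URL are all assumptions. The point is only that each record holds one sentence together with its source reference, and that no record stores the sentence's position in the original text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceRecord:
    """One sentence plus a reference to its source (hypothetical fields)."""
    text: str
    author: str
    title: str
    source_url: str


def to_records(sentences, author, title, source_url):
    """Wrap sentences as records; the position in the text is not kept."""
    return [SentenceRecord(s, author, title, source_url) for s in sentences]


def search(records, word):
    """Return the records whose sentence contains the word as a whole token."""
    needle = word.casefold()
    return [r for r in records
            if needle in re.findall(r"\w+", r.text.casefold())]


if __name__ == "__main__":
    records = to_records(
        ["Мин китап укыйм.", "Ул мәктәпкә бара!"],
        author="Example Author",
        title="Example Title",
        source_url="https://example.org/text",  # placeholder address
    )
    for hit in search(records, "китап"):
        print(hit.text, hit.author, hit.title, hit.source_url)
```

A search result built this way carries the author, title and link needed to cite the source, while the full text itself is never returned.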
All texts of the Corpus of Written Tatar on this site are made available only for non-commercial scientific or educational use (Article 19 of the Russian Copyright Law). No text on the site can be downloaded and/or read in full.
If you quote text excerpts retrieved from the Corpus of Written Tatar, please cite the Corpus of Written Tatar as the source, as well as the author and the title of the text in question.
An offline version of the corpus, limited in size, will be made available later.
Using the Corpus of Written Tatar is free of charge.
In order to represent the Tatar language adequately and to be called the national corpus of the Tatar language, a corpus should contain no fewer than 100 million word occurrences. We reached this figure in 2014.
The Corpus of Written Tatar was created by the following people: