CORPUS OF WRITTEN TATAR

General Information

This website contains a text corpus of the modern Tatar language consisting of over 500 million word occurrences (more than 620 million tokens).
The corpus represents the modern written Tatar language in electronic form.
The corpus contains about 5 million distinct word forms.
This collection of Tatar texts in electronic form is intended for those interested in the structure, current state and prospects of the Tatar language.
The Corpus of Written Tatar is indispensable for everyone who wants to study Tatar using the methods of corpus linguistics.

Attention!

This project receives no financial support from any scientific fund or organization.
All work on the Corpus of Written Tatar is done by the project participants in their spare time.

Project news

24.01.2026 - Some minor improvements and bug fixes.
16.03.2022 - A new Personal Names section has been added to the site; it lists Tatar names, patronymics and surnames.
26.12.2021 - Some changes in the project:
- The Corpus has been moved to new hosting.
- Access via the HTTP protocol has been removed because it is obsolete. Currently only HTTPS is used.
- Some sections of the site were updated.
21.10.2019 - The Tatar language spelling checker has been updated.
20.10.2019 - The 4th version of the Corpus has been released:
- the Corpus volume has increased from 356 million word occurrences to 500 million word occurrences;
- the number of sources has reached 17,000;
- the quality of morphological annotation has been improved.
We are very grateful to all people who participated in the preparation of this material!
20.03.2019 - Tatar palindromes have been added to the Statistics section.
24.01.2019 - Sorting of search results has been added to the Search by n-grams section.
21.01.2019 - A new Thesaurus section has been added to the site; it contains word embeddings generated with word2vec technology, which is based on a shallow neural network.
28.12.2018 - The quality of the Tatar language spelling checker has been significantly improved:
- the system is now based on our new corpus;
- the morphological analyzer of the Apertium project is now used.
27.11.2018 - The 3rd version of the Corpus has been released:
- the Corpus volume has increased from 116 million word occurrences to 356 million word occurrences;
- the number of sources has reached 16,000.
Many thanks to the numerous enthusiasts and organizations who helped prepare this material!
27.06.2018 - At the request of users, the old style of displaying sources has been restored as an additional option.
23.06.2018 - Numerous changes have been made to adapt the website for mobile devices.
03.06.2018 - KWIC mode (Key Word in Context) for displaying search results is now available.
25.03.2018 - It is now possible to search not only in the whole corpus, but also in particular texts. Masks and regular expressions can be used there as well.
16.03.2018 - The NoSketchEngine search system has been integrated into the Corpus.
16.02.2018 - Many improvements have been made:
- Initial support for the Corpus Query Language (CQL), the de facto standard in corpus linguistics, has been added.
- Search using Extended POSIX Regular Expressions has been implemented.
- A function for viewing the context of found sentences has been added (use the "Expand context" button).
- Known bugs and errors have been fixed.
08.02.2018 - Wildcard search (using the "*" and "?" symbols) is now available not only for word forms but also for lemmas; for example, (ат*) matches (ат), (атна), (атла), (атаклы)...
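The mask behaviour in this entry can be illustrated with a minimal sketch. It uses Python's standard fnmatch module rather than the Corpus's own search engine, whose implementation is not described here. The function name match_mask and the extra lemma "бат" are assumptions made for illustration only. Note that fnmatch also treats square brackets as character sets, which the Corpus masks may not.

```python
from fnmatch import fnmatchcase


def match_mask(mask, lemmas):
    """Return the lemmas matching a mask: '*' is any run of characters, '?' is exactly one."""
    return [lemma for lemma in lemmas if fnmatchcase(lemma, mask)]


# Illustrative lemma list: the first four items come from the news entry,
# "бат" is added here only to show a non-matching item.
lemmas = ["ат", "атна", "атла", "атаклы", "бат"]

print(match_mask("ат*", lemmas))   # every lemma that begins with "ат"
print(match_mask("ат??", lemmas))  # "ат" followed by exactly two characters
```

Since "*" may also match an empty string, the mask (ат*) matches the bare lemma (ат) itself, as in the example above.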
26.11.2017 - The search system for n-grams has been completely revised; the fastngrams (GitHub) program was developed for this purpose. Major changes in n-gram functionality:
- search is now many times faster;
- searches can now use parameters such as wordform, lemma, grammatical tags (parts of speech, morphological categories), case sensitivity and mask.
03.07.2017 - Changes in the Spellcheck service:
- detection of spelling errors has been improved;
- the formatting of the source text is now preserved;
- a list of similar words from the Corpus is now suggested.
21.06.2017 - The fastmorph system now shows the set of grammatical tags for every word in a sentence.
03.06.2017 - We have integrated into our website the Tatar speech synthesizer "Talgat", which was developed on the basis of the RHVoice system at the Republican Special Library for the Blind and Visually Impaired.
27.02.2017 - The 5th version of the fastmorph corpus search engine has been released. It now uses about 2.5 times less RAM.
23.01.2017 - A spellchecker for the Tatar language has been launched in the Online SpellCheck section.
09.01.2017 - N-gram-based search has been launched in the Search the Corpus section. 1-, 2-, 3-, 4-, 5- and 6-grams are supported.
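For readers unfamiliar with the term, an n-gram is a sequence of n consecutive tokens. The sketch below shows how n-grams of sizes 1 to 6 can be extracted from a tokenized sentence. It is only an illustration and not the Corpus's own code; the function names and the placeholder tokens are assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(tokens, max_n=6):
    """Map each n in 1..max_n to the list of n-grams of that size."""
    return {n: ngrams(tokens, n) for n in range(1, max_n + 1)}


tokens = ["w1", "w2", "w3", "w4"]  # placeholder tokens, not corpus data
for n, grams in all_ngrams(tokens).items():
    print(n, grams)
```

A sentence shorter than n tokens simply yields no n-grams of that size.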
22.11.2016 - We have released the source code of the "fastmorph" corpus search engine under the GNU General Public License v3.0 and published it on GitHub.
18.11.2016 - The 4th version of the fastmorph corpus search engine has been released. List of changes:
- a case-sensitive search option has been added;
- RAM usage of the search system has been halved;
- owing to substantial changes in the application architecture, search queries now run 3-5 times faster.
17.11.2016 - The Corpus has been reannotated with the most recent version of the Apertium morphological tagger.
12.10.2016 - Frequency lists of Tatar language lemmas have been added to the Statistics section.
19.07.2016 - Some improvements in the Complex morphological search engine "fastmorph":
- in addition to the existing mask "*", which matches any number of any characters, the mask "?", which matches any single character, has been added. More information can be found in the updated Guides;
- on the technical side, memory usage of the search system has been reduced by up to 25%;
- known bugs have been fixed.
01.07.2016 - The User's Guides in Tatar, Russian and English have been updated.
13.06.2016 - Search by the middle part of a word has been added to the fastmorph module. For example, if you type *әме*, words such as ярдәмендә, бәйрәмен, үткәрәмен, өйдәме will be found...
21.04.2016 - Thanks to processor optimizations and multithreading support implemented in the "fastmorph" module, complex morphological search now runs up to five times faster.
03.04.2016 - The features of the Complex morphological search system have been significantly extended. More information is available in the Guides, updated to version 3.0 and higher.
29.03.2016 - A graphical mode for entering grammatical features in a search query has been added to the Complex morphological search section.
22.02.2016 - A Complex morphological search function has been added to the Corpus of Written Tatar; it lets you combine parameters such as wordform, lemma, grammatical tags, beginning and end of words, and the distances between them.
21.11.2015 - Support for the writing system of the Finnish Tatars has been implemented in the "Tatar Text-To-Speech" synthesizer.
20.11.2015 - A Manual in English is now available in the User's Guide section.
06.10.2015 - A new User's Guide section has been created. Users can currently download the Russian version of the Manual for the Corpus of Written Tatar there. English and Tatar versions of the Guide will be available somewhat later.
16.08.2015 - The "Tatar Text-To-Speech" system has been added to the Corpus site. It is being developed by the team of the Corpus of Written Tatar. We invite everyone interested to take part in this project!
11.06.2015 - For users without a Tatar keyboard layout on their computers, we have added a virtual keyboard to the Search page of the Corpus. After launching it, you can type by clicking with the mouse or simply by pressing keys on your physical keyboard.
18.04.2015 - Template-based search by the end of a word has been implemented in the Corpus.
29.03.2015 - The limit for viewing right, left and semantic contexts has been increased from 100 to 10,000 units. To view them in table format, click the "Show all" link.
26.03.2015 - The Corpus is now also available at the new address www.corpus.tatar. The old address corpus.tatfolk.ru can still be used as well.
14.03.2015 - Template-based search by the beginning of a word has been implemented in the Corpus.
12.10.2014 - A system for listening to displayed sentences has been implemented (click the corresponding button to the left of a sentence).
05.10.2014 - The morphological annotation of the Corpus has been completed. The metalanguage of grammatical labels is based on the system of "tags" for Turkic languages developed by the international Apertium project.
14.08.2014 - A new version of the Corpus has been released:
- the Corpus volume has increased from 45 million word occurrences to 116 million word occurrences;
- the sources now include numerous novels, collections of scientific works, monographs, newspapers and magazines, religious literature and others;
- sentences in which a given phrase occurs can now be viewed (by clicking on words in the right and left context sections);
- new types of statistical data have been introduced ("Log-likelihood");
- a Statistics section has been created and will be gradually expanded;
- completion of the Publications section.
16.03.2014 - List of changes:
- because of heavy server load caused by various robots, certain limits on the number of queries have been introduced;
- for easier viewing, the search word is highlighted in red in the found sentences;
- completion of the Publications section;
- known bugs have been fixed.
24.03.2013 - Many improvements have been made:
- the interface is available in Tatar, Russian and English;
- database and search engine optimizations;
- the limit of 50 sentences per view has been removed;
- the context of most sentences can now be viewed (press "Find text").
15.03.2012 - The main work on creating the Corpus of Written Tatar has been completed. The basic versions of the website and the search engine have been developed. The service has been launched.