Contact us

Address: Rooms 215-217, Building B, China Publishing Creative Industry Park, No. 838 Guangji Road, Hongkou District, Shanghai

Telephone: 021-61072106

Fax: 021-23081199





The State Language Commission Modern Chinese Corpus

The Modern Chinese Corpus offers free online retrieval of about 20 million characters of word-segmented and POS-tagged text.

The Ancient Chinese Corpus

The site has also added an ancient Chinese corpus of about one hundred million characters, which researchers of ancient Chinese can query and download. It also provides word segmentation and POS tagging software, word frequency statistics software, word frequency statistics released by the American National Corpus, and a thesaurus for language learning research, all for teachers and students to use.
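As a rough illustration of the kind of word frequency count such software produces, here is a minimal Python sketch. It assumes the segmented, POS-tagged text is stored as whitespace-separated `word/TAG` tokens. That format, the sample sentences, and the function names `parse_tagged_line` and `word_frequencies` are all hypothetical, not taken from the site or its tools.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split one segmented, POS-tagged line into (word, tag) pairs.

    Assumes whitespace-separated tokens written as word/TAG
    (a hypothetical format chosen for this sketch).
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash found: treat the whole token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def word_frequencies(lines):
    """Count how often each word occurs, ignoring its POS tag."""
    counts = Counter()
    for line in lines:
        counts.update(word for word, _ in parse_tagged_line(line))
    return counts


# Made-up sample sentences, already segmented and tagged.
sample = ["我/r 喜欢/v 语料库/n", "语料库/n 很/d 大/a"]
freq = word_frequencies(sample)
print(freq.most_common(3))
```

A real corpus may use a different delimiter or tag set, in which case only `parse_tagged_line` would need to change.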

Academia Sinica, Taiwan


1 Modern Chinese Balanced Corpus

This corpus is designed specifically for linguistic analysis: every sentence is segmented into words, and each word is tagged with its part of speech. The data were collected to cover as many topics and styles of modern Chinese as possible, so that the corpus is a representative sample of the unlimited set of modern Chinese sentences. The corpus was built by the Academia Sinica lexicon group and carries complete linguistic information. The site includes an introduction and usage instructions. The current version of the corpus is 4.

2 Ancient Chinese Corpus

The ancient Chinese corpus consists of five sub-corpora: Old Chinese, Middle Chinese (including the Tripitaka), Early Modern Chinese, Other, and Unearthed Documents. Part of the data comes from the Chinese full-text database of the Institute of History and Philology, so the two overlap somewhat. The Unearthed Documents sub-corpus comes entirely from databases produced by the Han Dynasty group of the Institute of History and Philology.

3 Early Modern Chinese Tagged Corpus

This corpus was built to meet the needs of research on the history of Chinese. The materials collected so far cover most of the important texts of Old Chinese (pre-Qin to Western Han), Middle Chinese (Eastern Han, Wei, Jin, and the Northern and Southern Dynasties), and Early Modern Chinese (from the Tang and Five Dynasties onward), and they are open for use. In the tagged corpus, tagging of the Old Chinese and Early Modern Chinese texts has produced completed results, which are gradually being made available for online retrieval.

4 Treebank

The Chinese sentence structure tree database (Sinica Treebank Version 3) contains 6 files, 61,087 Chinese trees, and 361,834 words. The Academia Sinica lexicon group extracted the sentences from the Balanced Corpus (Sinica Corpus), parsed them automatically by computer, and then corrected the results by hand. Each tree shows the semantic and grammatical information of a Chinese sentence. The database is now open for online retrieval and data transfer, as a reference for scholars and experts working on Chinese syntax and semantic relations. 1,000 sentence structure trees are available for download.

5 Bilingual Ontological WordNet

A lexical knowledge base that combines WordNet, an ontology, and domain labels.

6 Searching Characters

Contains four units: "Searching Words and Characters", "The Beauty of Literature", "Games and Puzzles", and "The World of Ancient Characters". In the first unit, entries can be cross-searched by component, radical, character, pronunciation, and word. Sources can be looked up in the Four Books, Laozi, Zhuangzi, and Tang poetry, with a direct link to the source so that the text can be read.

7 The Hunt

Built on Searching Characters and aimed at learners of Chinese, this site combines the character and word retrieval functions with three editions of Chinese language textbooks and with literary works such as Three Hundred Tang Poems, Three Hundred Song Lyrics, A Dream of Red Mansions, and Outlaws of the Marsh, to form online language learning material.

8 Three Hundred Tang Poems

Aimed mainly at primary and junior high school students, this site provides multimedia data (singing, painting, and calligraphy) and text data, including the author's life, pronunciation annotation, translation, notes, commentary, and allusions. Retrieval covers poems, verses, authors, comprehensive information, and genre classification. Results can be listed, and related text and multimedia data can be selected and marked. The site also provides a "metrical pattern automatic detection index teaching system" that automatically checks meter and rhyme and corrects errors, to help children write poetry according to the rhyme rules and to assist teachers in marking compositions.

9 Chinese Electronic Documents

Contains the complete Twenty-Five Histories, the Ruan edition of the Thirteen Classics, more than 20 million characters of Taiwan historical materials, 10 million characters of the Tripitaka, and other classics.

10 A Dream of Red Mansions online teaching and research data center

Yuan Ze University developed the "Online Reading: Chinese Literature Network System", a network research system for Chinese literature, under the leadership of Luo Fengzhu of its research center. A Dream of Red Mansions is one of its subsystems; others cover rare books, the Book of Songs, Tang and Song poetry, and lyric poetry. The website is the largest domestic online database of Chinese literature and offers users the most complete Chinese literature data.

Communication University of China

1 Communication University of China text corpus retrieval system


2 online tagging system


3 new words research resources


Harbin Institute of Technology

HIT-IR shared corpus resources:

Chinese-English bilingual corpus: 100,000 aligned bilingual sentence pairs; text file format.

Tongyici Cilin, extended edition: 77,343 words; follows the compilation style of "Tongyici Cilin" (the synonym word forest) and uses a five-level encoding system.

Multi-document automatic summarization corpus: 40 themes; text file format; the documents under one theme are different reports of the same event.

Chinese dependency treebank: 50,000 lines and 10,000 lines; LTML format; word segmentation, POS tagging, and syntactic annotation, with a graphical view.

Question answering problem set: 6,264 questions, each marked with its question type; LTML format; word segmentation, part of speech, syntactic, and semantic annotation (shallow processing).

Single-document summarization corpus: 211 documents, divided into different genres; LTML format; sentence and word segmentation, part-of-speech tagging, syntax, semantics, text classification, and anaphora resolution (shallow processing).
