ZCTC - corpus markup

3. Corpus markup

The ZCTC corpus is marked up in Extensible Markup Language (XML) in compliance with the Corpus Encoding Standard (CES). Each of the 500 data files has two parts: a corpus header and a body.

The cesHeader gives general information about the corpus (publicationStmt) as well as specific attributes of the text sample (fileDesc). Details in the publicationStmt element include the name of the corpus in English and Chinese, authors, distributor, availability, publication date, and history. The fileDesc element shows the original title(s) of the text(s) from which the sample was taken, the individuals responsible for sampling and text processing, the project that created the corpus file, date of creation, language usage, writing system, character encoding, and mode of channel.
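As a rough illustration of how the header fields might be read, the sketch below uses Python's standard xml.etree.ElementTree module. Only the element names cesHeader, publicationStmt and fileDesc come from the description above; the child elements and their values in the sample are hypothetical placeholders, and the real files may nest these elements differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment. Only cesHeader, publicationStmt and fileDesc
# are named in the corpus description; the children are illustrative guesses.
HEADER = """<cesHeader>
  <publicationStmt>
    <distributor>example distributor</distributor>
    <availability>example availability note</availability>
  </publicationStmt>
  <fileDesc>
    <language>Chinese</language>
    <encoding>UTF-8</encoding>
  </fileDesc>
</cesHeader>"""


def section_fields(header_xml, section):
    """Return {child tag: text} for one direct child section of the header."""
    root = ET.fromstring(header_xml)
    node = root.find(section)
    if node is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in node}


fields = section_fields(HEADER, "fileDesc")
for tag, value in fields.items():
    print(f"{tag}: {value}")
```

The function deliberately makes no assumption about which fields a section contains, so it should work unchanged once the actual child element names are known.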

The body of the corpus file contains the textual data, which is marked up for structural organisation such as paragraphs (p) and sentences (s). Sentences are consecutively numbered for easy reference. Part-of-speech annotation is also given in XML, with the POS attribute of each w element indicating the part-of-speech category of the token (see corpus annotation for the tagset).
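The body structure described above can be traversed with a few lines of Python. This is a minimal sketch, not the project's own tooling: the element names p, s and w and the POS attribute follow the description, while the "n" attribute used for the sentence number, the enclosing body element and the tag values in the sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical body fragment. p, s, w and POS follow the description above;
# the "n" numbering attribute and the tag values are assumptions.
SAMPLE = """<body>
  <p>
    <s n="1"><w POS="r">我</w><w POS="v">爱</w><w POS="n">语言</w></s>
    <s n="2"><w POS="r">他</w><w POS="v">来</w></s>
  </p>
</body>"""


def tagged_sentences(xml_text):
    """Return a list of (sentence number, [(token, POS), ...]) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("s"):
        tokens = [(w.text, w.get("POS")) for w in s.iter("w")]
        result.append((s.get("n"), tokens))
    return result


sentences = tagged_sentences(SAMPLE)
for number, tokens in sentences:
    # Print each sentence in token/TAG form, prefixed by its number.
    print(number, " ".join(f"{tok}/{pos}" for tok, pos in tokens))
```

Keeping the tokens as (token, POS) pairs makes it straightforward to filter by tag or to rebuild the untagged sentence by joining the tokens.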

The XML markup of the ZCTC corpus is well-formed and has been validated using Altova XMLSpy 2008. The XML elements of the corpus are defined in the Document Type Definition (DTD).
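Well-formedness can also be checked independently of XMLSpy. The sketch below relies on the parser in Python's standard library, which reports malformed markup but does not validate against a DTD, so it covers only the first of the two properties mentioned above. The sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False.

    This does not check validity against the corpus DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments: the second one is missing the closing </w> tag.
good = '<s n="1"><w POS="n">语言</w></s>'
bad = '<s n="1"><w POS="n">语言</s>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

Validation against the DTD itself needs a validating parser, which the standard library does not provide.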

The ZCTC corpus is encoded in Unicode using the 8-bit Unicode Transformation Format (UTF-8), which represents Chinese characters losslessly while keeping the XML files compact.
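The effect of the encoding choice on file size can be illustrated with a single tagged token. The line below is a hypothetical example in the style of the corpus markup; the comparison with UTF-16 is included only for contrast and is not taken from the corpus documentation.

```python
# Hypothetical tagged token in the style of the corpus markup.
line = '<w POS="n">语言</w>'

utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")  # no byte order mark

# Round trip: decoding gives back exactly the original string.
round_trip_ok = utf8_bytes.decode("utf-8") == line

print("characters:", len(line))
print("UTF-8 bytes:", len(utf8_bytes))
print("UTF-16 bytes:", len(utf16_bytes))
print("lossless round trip:", round_trip_ok)
```

The ASCII markup takes one byte per character in UTF-8 and each Chinese character takes three, whereas UTF-16 spends two bytes on every character here. For heavily tagged text the markup dominates, so UTF-8 tends to be the smaller of the two; the exact saving depends on the ratio of markup to text.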