Instructions : Semantic annotation - Chinese Text Project

Semantic annotation

Introduction

Semantic annotation involves adding computer-readable data to a text about the meaning of words and phrases in their given context. This enables further processing, and allows the system to display additional relevant information. For example, in the following passage, the semantically annotated version provides useful contextual information about dates, people, and written works:
With annotation:
1 夏四月乙巳,呂夷簡上《景祐法寶新錄》。甲子,呂夷簡、王曾、宋綬、蔡齊罷,以王隨為門下侍郎、同中書門下平章事、昭文館大學士,陳堯佐同中書門下平章事、集賢殿大學士,盛度知樞密院事,韓億、程琳、石中立參知政事,王鬷同知樞密院事。
Without annotation:
1 夏四月乙巳,呂夷簡上《景祐法寶新錄》。甲子,呂夷簡、王曾、宋綬、蔡齊罷,以王隨為門下侍郎、同中書門下平章事、昭文館大學士,陳堯佐同中書門下平章事、集賢殿大學士,盛度知樞密院事,韓億、程琳、石中立參知政事,王鬷同知樞密院事。

General principles

Semantic annotation in the Chinese Text Project involves creating three types of closely related data:

  1. Annotations. An annotation locates a short region of text - usually a word or short phrase - and provides information about what that word or phrase means in the particular context in which it occurs. For example, in the sentence "孔子適齊。" we might want to add an annotation for the word "孔子" indicating that in this sentence, "孔子" refers to a particular person: the historical individual Confucius.
    Two types of annotation are supported in ctext:
    • Entity annotations - indicate that the annotated text refers to a particular entity, such as "ctext:855132" (王安石).
    • Date annotations - indicate that the annotated text refers to a particular historical date. The date is specified by recording the era (or ruler) to which the date belongs, such as "ctext:27110" (天禧 era), as well as data about the meaning of the date, such as "year 1, month 2".
  2. Entity records. An entity record represents a unique thing. This may be a concrete object - such as a person, or a physical building - or an abstract or constructed object, like a bureaucratic office. For example, factual and fictional historical people - like Wang Anshi - have entity records; so do works - like the History of Song - and dynasties - like Northern Song. Entity records are used to contain information about entities, and as a reference point for annotations: the annotation of "孔子" in the example above would point to the entity record for Confucius. Entity records help distinguish between different things that sometimes have the same name, and identify the same thing when it may be referred to by different names. Every entity record has a unique identifier, e.g. "ctext:27110" (天禧 era). Using these identifiers allows us to precisely distinguish between entities with the same name - such as "ctext:474358" for the 紹興 era of the Song dynasty, and "ctext:63988" for the 紹興 era of the Western Liao dynasty. The page for each entity lists its identifier immediately below the title.
  3. Knowledge claims. A knowledge claim represents one piece of information about an entity; entity records are made up of knowledge claims about that entity. A knowledge claim primarily connects three things: a subject (the entity the claim relates to), a verb or relation, and an object or target of the relation. For example, a knowledge claim about Wang Anshi might connect Wang Anshi (subject) and Wang Yi (object), with the verb "father" - thus recording the fact that Wang Anshi's father is Wang Yi. As a second example, we might connect Wang Anshi and the office Hanlin Academic through the relation "held-office", to indicate that Wang Anshi held this particular bureaucratic office.
    Sometimes it is useful to record additional information about a claim. This can be done by adding one or more qualifiers to the claim. A qualifier is an additional part of a claim which connects that claim with two other pieces of information: an additional verb (the qualifier), and an additional object. For example, while it is true to say that Wang Anshi held the office of Hanlin Academic, it is useful to further explain this by indicating that he held the office starting from a particular date - this is done by adding the from-date qualifier to the claim, together with an object representing that particular date.
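The relationship between these three types of data can be sketched in a few lines of Python. This is only an illustration of the concepts described above, not the data format ctext actually uses: the class and field names are invented for the example, and every identifier beginning with "example:" is a placeholder rather than a real ctext identifier. The "ctext:" identifiers and the verbs "father", "held-office" and "from-date" are the ones mentioned in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualifier:
    verb: str  # the additional verb, e.g. "from-date"
    obj: str   # the additional object, e.g. an identifier for a date


@dataclass
class Claim:
    subject: str  # identifier of the entity the claim is about
    verb: str     # the relation, e.g. "father" or "held-office"
    obj: str      # the target of the relation
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class Entity:
    identifier: str   # unique, e.g. "ctext:855132"
    names: List[str]  # names are NOT unique across entities
    claims: List[Claim] = field(default_factory=list)


@dataclass
class EntityAnnotation:
    start: int      # offset of the first annotated character
    end: int        # offset just past the last annotated character
    entity_id: str  # the entity record the text refers to


def entities_named(entities: List[Entity], name: str) -> List[str]:
    """Return the identifiers of every entity that carries the given name."""
    return [e.identifier for e in entities if name in e.names]


# An entity record is made up of knowledge claims about that entity.
wang_anshi = Entity("ctext:855132", ["王安石"])
wang_anshi.claims.append(Claim("ctext:855132", "father", "example:wang-yi"))
wang_anshi.claims.append(
    Claim(
        "ctext:855132",
        "held-office",
        "example:hanlin-academic",
        qualifiers=[Qualifier("from-date", "example:some-date")],
    )
)

# Two different eras share one name; only the identifier tells them apart.
eras = [Entity("ctext:474358", ["紹興"]), Entity("ctext:63988", ["紹興"])]
same_name_ids = entities_named(eras, "紹興")

# An annotation ties a region of text to an entity record.
sentence = "孔子適齊。"
annotation = EntityAnnotation(start=0, end=2, entity_id="example:confucius")
annotated_text = sentence[annotation.start:annotation.end]
```

Here `same_name_ids` contains both era identifiers, which is why an annotation stores an identifier rather than a name, and `annotated_text` is the two-character region "孔子" that the annotation covers.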

Citations

Citations are required for most types of claim. A citation is a specific textual reference in ctext citation format. A citation is composed of two parts: a URN identifying a particular chapter of one edition of a text, and the literal content of the text being cited (in Traditional Chinese); these two parts are combined using the symbol "@". For example:

The citation should be chosen to be a complete sentence or meaningful sentence fragment that justifies the claim. Context does not need to be cited, because the text will be linked directly to its source.
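As a rough sketch of the two-part structure described above, the following Python splits a citation at the "@" symbol and joins the parts back together. It assumes that the URN itself never contains "@", so the first "@" is treated as the separator. The function names are invented for this example, and the URN shown is a made-up placeholder, not a real ctext URN.

```python
from typing import NamedTuple


class Citation(NamedTuple):
    urn: str   # identifies one chapter of one edition of a text
    text: str  # the literal cited content, in Traditional Chinese


def parse_citation(citation: str) -> Citation:
    """Split 'URN@text' into its two parts at the first '@'."""
    urn, separator, text = citation.partition("@")
    if not separator or not urn or not text:
        raise ValueError(f"not in URN@text form: {citation!r}")
    return Citation(urn, text)


def format_citation(urn: str, text: str) -> str:
    """Combine a URN and the cited text using the '@' symbol."""
    return f"{urn}@{text}"


# Placeholder URN for illustration only.
example = format_citation("ctp:example-text/example-chapter", "孔子適齊。")
parsed = parse_citation(example)
```

A check like the one in `parse_citation` catches the most common slip, a citation that gives the URN but omits the quoted text.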

Most claims require evidence, with the following exceptions:

Annotation conventions

In order to promote consistency in the data and facilitate effective automated processing, please observe the following conventions when marking up texts:

Dates

Dates are important pieces of historical data that need to be annotated carefully. A date annotation connects a date in a text (e.g. a bare month reference ending in "月") with enough additional data to make the date unambiguous - for example, the information that the date refers to a particular year and month within some specific era. The annotation client provides a mechanism to input this information, by connecting each date annotation to an era. In many cases, dates in a text do not directly contain all of this information, as it is provided contextually - as in the following passage:

1 開寶九年冬十月癸丑,太祖崩,帝遂即皇帝位。乙卯,大赦,常赦所不原者咸除之。

The first of the two dates in the above passage is "complete": it directly contains enough information, taken together with the era, to unambiguously point to a particular date - specifically, the information year 9, month 10, day 癸丑. The second date ("乙卯") does not directly contain this information because the information is implied by the context. Date annotation involves explicitly recording these separate values, so that digital systems can correctly process the date.

The annotation client will attempt to suggest appropriate values, but these will sometimes be incorrect. It is important to pay attention to the contextual flow of information when annotating dates, especially where parenthetical references to other years and eras do not affect the interpretation of dates later in the text. For example, in the following passage, purple arrows indicate the correct contextual flow of date information:

The annotation client will help by suggesting the correct values automatically in most cases - e.g. suggesting that "乙卯" refers to year 9, month 10 of the 開寶 era - but in this example it will incorrectly propose that the date beginning "十一月癸" refers to the 11th month of year 8 of 開寶, because year 8 was referenced immediately before it. In cases like these it is important to pay attention to the date flow: if that date is marked as referring to year 8, then the annotation client will infer that 甲子 and 庚午 should also be marked as year 8, whereas in this passage they actually refer to year 9. Mistakes of this kind easily cascade to affect many dates in historical texts, because much of the date information is implied contextually.
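The contextual flow described above can be illustrated with a small Python sketch. It is not the annotation client's actual algorithm, which is not documented here. It shows one plausible rule under stated assumptions: each date inherits the era, year and month it does not state from the running context, and a date marked as parenthetical is resolved but does not change that context. The class, field and function names are invented for the example, and the parenthetical "八年" entry stands in for the year 8 reference mentioned above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

UNITS = ("era", "year", "month")  # largest unit first


@dataclass(frozen=True)
class DateMention:
    text: str
    era: Optional[str] = None
    year: Optional[int] = None
    month: Optional[int] = None
    parenthetical: bool = False  # True: must not affect later dates


def resolve_dates(mentions: List[DateMention]) -> List[DateMention]:
    """Fill in the units each date leaves implicit, in text order."""
    context = {unit: None for unit in UNITS}
    resolved = []
    for mention in mentions:
        values = {}
        inherit = True
        for unit in UNITS:
            explicit = getattr(mention, unit)
            if explicit is not None:
                values[unit] = explicit
                inherit = False  # smaller units are no longer inherited
            else:
                values[unit] = context[unit] if inherit else None
        resolved.append(replace(mention, **values))
        if not mention.parenthetical:
            context = values  # only main-line dates move the context
    return resolved


mentions = [
    DateMention("開寶九年冬十月癸丑", era="開寶", year=9, month=10),
    DateMention("乙卯"),
    DateMention("八年", year=8, parenthetical=True),  # stand-in reference
    DateMention("十一月", month=11),
    DateMention("甲子"),
]
correct = resolve_dates(mentions)

# The same passage with the year 8 reference wrongly treated as main-line.
mistaken = resolve_dates(
    [replace(m, parenthetical=False) for m in mentions]
)
```

In `correct`, the last two dates resolve to year 9, month 11. In `mistaken`, both fall into year 8, showing how a single misjudged reference propagates to every date that depends on it.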

Texts and editions

Only one edition of each text should be annotated. This should normally be the representative edition.

Some annotations have been added to the following texts; please use the editions linked below when adding or correcting annotations:

Standard Histories

  1. 史記
  2. 漢書
  3. 後漢書
  4. 三國志
  5. 晉書
  6. 宋書
  7. 南齊書
  8. 梁書
  9. 陳書
  10. 魏書
  11. 北齊書
  12. 周書
  13. 南史
  14. 北史
  15. 隋書
  16. 舊唐書
  17. 新唐書
  18. 舊五代史
  19. 新五代史
  20. 宋史
  21. 遼史
  22. 金史
  23. 元史
  24. 明史
  25. 清史稿

Other historical works

Bibliographic works and catalogs

The above are only partial lists; other texts can also be annotated, provided that: