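Development of an Ultra-Large-Scale Corpus with the Web as Its Population: Collection and Organization (Webを母集団とした超大規模コーパスの開発 : 収集と組織化)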
Center for Corpus Development, NINJAL; Postdoctoral Research Fellow, Center for Corpus Development, NINJAL; Postdoctoral Research Fellow, Center for Corpus Development, NINJAL; Adjunct Researcher, Center for Corpus Development, NINJAL; Department of Corpus Studies, NINJAL
In 2011, the Center for Corpus Development at the National Institute for Japanese Language and Linguistics (NINJAL) launched a corpus compilation project with the aim of constructing a ten-billion-word Web corpus. The project is split into four sub-projects: page collection, linguistic annotation, release, and preservation. This paper reports on the first two. For page collection, crawling began in the fourth quarter of 2012, and 100 million URLs have been crawled every three months as fixed-point observations. For linguistic annotation, normalization (HTML tag removal and character encoding conversion), Japanese morphological analysis (word segmentation and part-of-speech tagging), and Japanese dependency analysis were performed on the data crawled over one year, from the fourth quarter of 2012 to the third quarter of 2013. We address issues encountered during the page collection and linguistic annotation stages and offer tentative solutions, present the basic statistics of the corpus data as constructed at the end of 2013, and discuss possible theoretical and practical implications of the language resources.
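The normalization step named in the abstract (HTML tag removal and character encoding conversion) could look roughly like the following. This is a minimal sketch using only the Python standard library, not the project's actual pipeline; the function name `normalize_page`, the helper class, and the default `shift_jis` source encoding are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def normalize_page(raw: bytes, source_encoding: str = "shift_jis") -> str:
    """Decode raw page bytes to a Unicode string and strip HTML tags.

    `source_encoding` is an assumed input; a real crawler would have to
    detect it from HTTP headers or <meta> tags.
    """
    html = raw.decode(source_encoding, errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The resulting plain text would then be passed to a morphological analyzer and a dependency parser, which are outside the scope of this sketch.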
The combined effect of the T2DM susceptibility genes is an important risk factor for T2DM in non-obese Japanese: a population based case-control study
Abstract
Background: Type 2 diabetes mellitus (T2DM) is a complex endocrine and metabolic disorder. Recently, several genome-wide association studies (GWAS) have identified many novel susceptibility loci for T2DM, and indicated that there are common genetic causes contributing to the susceptibility to T2DM in multiple populations worldwide. In addition, clinical and epidemiological studies have indicated that obesity is a major risk factor for T2DM. However, the prevalence of obesity varies among ethnic groups. We aimed to determine the combined effects of these susceptibility loci and obesity/overweight on the development of T2DM in the Japanese.
Methods: Single nucleotide polymorphisms (SNPs) in or near 17 susceptibility loci for T2DM, identified through GWAS in Caucasian and Asian populations, were genotyped in 333 cases with T2DM and 417 control subjects.
Results: We confirmed that the cumulative number of risk alleles based on 17 susceptibility loci for T2DM was an important risk factor in the development of T2DM in the Japanese population (P < 0.0001), although the effect of each risk allele was relatively small. In addition, a significant association between an increased number of risk alleles and an increased risk of T2DM was observed in the non-obese group (P < 0.0001 for trend), but not in the obese/overweight group (P = 0.88 for trend).
Conclusions: Our findings indicate that there is an etiological heterogeneity of T2DM between obese/overweight and non-obese subjects.
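The "cumulative number of risk alleles" can be read as a per-subject sum of risk-allele copies over the genotyped loci. The sketch below illustrates that tally and a simple per-group average. It assumes genotypes are coded as 0, 1, or 2 risk-allele copies per locus; the function names and the toy data are hypothetical and are not taken from the study, and no trend test is implemented here.

```python
from statistics import mean


def risk_allele_count(genotypes):
    """Return the total number of risk-allele copies across loci.

    Each entry is the number of risk alleles carried at one locus (0, 1, or 2).
    """
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError(f"invalid genotype code: {g!r}")
    return sum(genotypes)


def mean_count_by_group(subjects):
    """Average risk-allele count for each (stratum, status) group.

    `subjects` is a list of (stratum, status, genotypes) tuples.
    """
    groups = {}
    for stratum, status, genotypes in subjects:
        groups.setdefault((stratum, status), []).append(
            risk_allele_count(genotypes)
        )
    return {key: mean(counts) for key, counts in groups.items()}


# Toy data (illustrative only, not from the study): three loci per subject.
toy_subjects = [
    ("non-obese", "case", [2, 1, 1]),
    ("non-obese", "case", [1, 1, 2]),
    ("non-obese", "control", [0, 1, 1]),
    ("non-obese", "control", [1, 0, 0]),
]
```

In a stratified analysis like the one described, such counts would be computed separately for the non-obese and obese/overweight groups before testing for a trend.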
BCCWJ-TimeBank: Temporal and Event Information Annotation on Japanese Text
Temporal information extraction can be split into the following three tasks: temporal expression extraction, time normalisation, and temporal ordering relation resolution. This paper describes a time expression and temporal ordering annotation schema for Japanese, employing the Balanced Corpus of Contemporary Written Japanese, or BCCWJ. The annotation is aimed at allowing the development of better Japanese temporal ordering relation resolution tools. The annotation schema is based on an ISO annotation standard, TimeML. We extract verbal and adjectival event expressions as ⟨EVENT⟩ in a subset of BCCWJ. Then, we annotate temporal ordering relations ⟨TLINK⟩ on pairs of these event expressions and the time expressions annotated in previous work. We identify several issues in the annotation.
Novel recombinant feline interferon carrying N-glycans with reduced allergy risk produced by a transgenic silkworm system
Abstract
Background: The generation of recombinant proteins for commercialisation must be cost-effective. Despite the cost-effective production of recombinant feline interferon (rFeIFN) by a baculovirus expression system, this rFeIFN carries insect-type N-glycans, with core α 1,3 fucosyl residues that act as potential allergens. An alternative method of production may yield recombinant glycoproteins with reduced antigenicity.
Results: A cDNA clone encoding the fifteenth subtype of FeIFN-α (FeIFN-α15) was isolated from a Japanese domestic cat. This clone encoded a protein of 189 amino acids with a molecular mass of 21.1 kDa. The rFeIFN-α15 was expressed using a transgenic silkworm system, which was expected to yield an N-glycan structure with reduced antigenicity compared with the protein produced by the baculovirus system. The resulting rFeIFN-α15 accumulated in the sericin layer of silk fibres and was easily extracted and purified by column chromatography. The N-terminal amino acid sequence of purified rFeIFN-α15 was identical to that of the mature form of the natural sequence. Moreover, its N-glycans did not include detectable core α 1,3 fucosyl residues. Its anti-vesicular stomatitis virus activity (2.6 × 10^8 units/mg protein) was comparable to that of the baculovirus-expressed rFeIFN.
Conclusions: The lower allergy risk of rFeIFN produced by the transgenic silkworm system than by the baculovirus expression system is due to the former lacking core α 1,3 fucosyl residues in its N-glycans. The rFeIFN-α15 produced by the transgenic silkworm system may be a prospective candidate for the next generation of rFeIFN in veterinary medicine.