Larramendi, Azkoitiko Sermoia: Difference between revisions

From MLV
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
Testu historikoen edukiak errepresentatzeko eta anotazioez aberasteko datu-eredu garatzeko asmotan, Larramendiren [[Item:Q453|Azkoitiko Sermoia]] hartu dugu adibide. Wikitekan (euskarazko Wikisourcen), eskuizkribua eta transkribapena ditugu, eta hemen, MLV Wikibase honetan, transkribaketaren tokenak (hau da, hitzak eta interpuntzio ikurrak segmentu banatan jasotzen duen zatiketa, modu bertikalean errepresentatu daitekeena, aspalditik usadioa den legez (ikus, adibidez, [https://universaldependencies.org/format.html CONLL formatua]). Galdeketak bistarazten duen taularen atzetik, Datu Lotuak daude, hau da, hirukote semantikoak. Corpus datuak Datu Lotu gisan jasotzeko proposatzen dugun eredu honetan, Linguistik Linked Data arloko azkenengo proposamenak hartzen ditugu aintzat (ikus [[Item:Q1260|Stanković et al. 2023]]).
Testu historikoen edukiak errepresentatzeko eta anotazioez aberasteko lan-fluxua eta datu-eredua garatzeko asmotan, Larramendiren [[Item:Q453|Azkoitiko Sermoia]] hartu dugu adibide. Wikitekan (euskarazko Wikisourcen), eskuizkribua eta transkribapena ditugu, eta hemen, MLV Wikibase honetan, transkribaketaren tokenak (hau da, hitzak eta interpuntzio ikurrak segmentu banatan jasotzen duen zatiketa, modu bertikalean errepresentatu daitekeena, aspalditik usadioa den legez (ikus, adibidez, [https://universaldependencies.org/format.html CONLL formatua]). Galdeketak bistarazten duen taularen atzetik, Datu Lotuak daude, hau da, hirukote semantikoak. Corpus datuak Datu Lotu gisan jasotzeko proposatzen dugun eredu honetan, Linguistic Linked Data arloko azkenengo proposamenak hartzen ditugu aintzat (ikus [[Item:Q1260|Stanković, Chiarcos et al. 2023]]).
 
[https://doi.org/10.13140/RG.2.2.30500.86400 2023ko abenduan aurkeztutako posterra ikus ezazu] (euskaraz).
 
''With the aim of proposing a workflow and data model for representing the content of historical texts and their annotations, we use Larramendi's [[Item:Q453|Azkoitiko Sermoia]] as a showcase. On Basque Wikisource, we store the manuscript facsimile and its transcription, and here, on MLV Wikibase, the text tokens (i.e., words and punctuation marks as vertical text, as is customary; see e.g. the [https://universaldependencies.org/format.html CoNLL format]). Behind the table displayed in the SPARQL query interface, there are Linked Data, that is, semantic triples. In the model we propose for representing corpus data, we follow recent proposals made in the domain of Linguistic Linked Open Data (see [[Item:Q1260|Stanković, Chiarcos et al. 2023]]).''
 
In this first experiment, the import from Wikisource is done with [https://github.com/dlindem/wikibase/blob/main/mlv/wikisource-to-wikibase.py this script].
 
See the [https://zenodo.org/records/12078616 poster presented in June 2024] (English version).


== SPARQL ==
== SPARQL ==
=== Token ===
Erabili galdeketa hau Azkoitiko Sermoiaren tokenak eta anotazioak ikusteko.
Erabili galdeketa hau Azkoitiko Sermoiaren tokenak eta anotazioak ikusteko.
''Use this query to view the tokens and their annotations.''


<sparql tryit="1">
<sparql tryit="1">
Line 13: Line 24:
PREFIX mno: <https://monumenta.wikibase.cloud/prop/novalue/>
PREFIX mno: <https://monumenta.wikibase.cloud/prop/novalue/>


#title: Galdetzen du tokenak non dauden wikisourcen, eta zer lema-formei lotuta dauden
#title: Galdetzen du tokenak non dauden wikisourcen, eta zer lema-formei eta wikidatako zer entitateri lotuta dauden


select ?token ?token_zbk ?token_forma  ?mlv_lexema (iri(concat('http://www.wikidata.org/entity/',?wd_qid)) as ?wikidata_lexema)  
select ?token ?token_zbk ?token_forma  ?mlv_lexema (iri(concat('http://www.wikidata.org/entity/',?wikidata_sense_id)) as ?wikidata_sense)
?wd_pos_label
(iri(concat('https://eu.wikisource.org/wiki/',?wikisource)) as ?wikisource_paragraph)  
(iri(concat('https://eu.wikisource.org/wiki/',?wikisource)) as ?wikisource_paragraph)  
  ?lemma ?sense ?forma (group_concat(?morph_label;SEPARATOR="-") as ?morph_labels) ?pos_label
  ?lemma ?sense ?forma (group_concat(?morph_label;SEPARATOR="-") as ?morph_labels) ?pos_label
(iri(concat('http://www.wikidata.org/entity/',?wd_erref)) as ?wd_ent_erref)  
(iri(concat('http://www.wikidata.org/entity/',?wd_erref)) as ?wd_ent_erref)  
(concat(?wd_erref_label," (",?class_label,")") as ?wd_erref_info)
(concat(?wd_erref_label," (",?class_label,")") as ?wd_erref_info)
 
 
where {
where {
   ?token mdp:P5 mwb:Q15 ;
   ?token mdp:P5 mwb:Q15 ;
Line 28: Line 40:
   optional { ?token mp:P7 ?lemmanode . ?lemmanode mps:P7 ?mlv_lexema. ?mlv_lexema wikibase:lemma ?lemma .
   optional { ?token mp:P7 ?lemmanode . ?lemmanode mps:P7 ?mlv_lexema. ?mlv_lexema wikibase:lemma ?lemma .
             optional {?mlv_lexema mdp:P1 ?wd_qid .}
             optional {?mlv_lexema mdp:P1 ?wd_qid .}
             optional {?lemmanode mpq:P155 ?sense_id. ?sense_id skos:definition ?sense .}
             optional {?lemmanode mpq:P155 ?sense_id. ?sense_id skos:definition ?sense; mp:P1 [mps:P1 ?wikidata_sense_id; mpq:P153 ?wd_pos]. ?wd_pos rdfs:label ?wd_pos_label. filter(lang(?wd_pos_label) = "eu")}
             optional {?lemmanode mpq:P156 ?form_id. ?form_id ontolex:representation ?forma .
             optional {?lemmanode mpq:P156 ?form_id. ?form_id ontolex:representation ?forma .
             optional {?form_id mdp:P172 ?morph. ?morph rdfs:label ?morph_label. filter(lang(?morph_label) = "eu")}
             optional {?form_id mdp:P172 ?morph. ?morph rdfs:label ?morph_label. filter(lang(?morph_label) = "eu")}
Line 35: Line 47:
           }
           }
   optional { ?token mdp:P178 ?wd_erref .
   optional { ?token mdp:P178 ?wd_erref .
          bind(iri(concat(str(wd:),?wd_erref)) as ?item)
          SERVICE <https://query.wikidata.org/sparql> {
          select ?item ?wd_erref_label (sample(?class_l) as ?class_label)
          where {?item rdfs:label ?wd_erref_label. filter(lang(?wd_erref_label) = "eu")
                  ?item wdt:P31/rdfs:label|wdt:P279/rdfs:label ?class_l. filter(lang(?class_l) = "eu")}
              group by ?item ?wd_erref_label ?class_label   
                }         
          }
} group by ?token ?token_zbk ?token_forma ?mlv_lexema ?wikidata_sense_id ?wd_pos_label ?wikisource ?lemma ?sense ?forma ?morph_labels ?pos_label ?wd_erref ?wd_erref_label ?class_label
order by xsd:integer(?token_zbk)
</sparql>
=== Span ===
Erabili galdeketa hau Azkoitiko Sermoiaren spanak (anotazioa partekatzen duten token-multzoak) ikusteko.
''Use this query to view the token spans and their annotations.''
<sparql tryit="1">
PREFIX mwb: <https://monumenta.wikibase.cloud/entity/>
PREFIX mdp: <https://monumenta.wikibase.cloud/prop/direct/>
PREFIX mp: <https://monumenta.wikibase.cloud/prop/>
PREFIX mps: <https://monumenta.wikibase.cloud/prop/statement/>
PREFIX mpq: <https://monumenta.wikibase.cloud/prop/qualifier/>
PREFIX mpr: <https://monumenta.wikibase.cloud/prop/reference/>
PREFIX mno: <https://monumenta.wikibase.cloud/prop/novalue/>
#title: Spanak zerrendatzen ditu, zer token hartzen dituzten barne, eta zer anotazio duten
select
?span
(group_concat(strafter(str(?token),str(mwb:))) as ?span_tokenak)
?span_label
(group_concat(?num_forma) as ?span_formak)
(iri(concat('https://eu.wikisource.org/wiki/',sample(?wikisource))) as ?wikisource_paragraph)
(iri(concat('http://www.wikidata.org/entity/',?wd_erref)) as ?wd_ent_erref)
(concat(?wd_erref_label," (",?class_label,")") as ?wd_erref_info)
?phil_anot
where {
?span mdp:P5 mwb:Q20;
      mp:P30 [mps:P30 ?token; mpq:P32 ?ord];
      rdfs:label ?span_label. filter(lang(?span_label) = "eu")
?token mdp:P147 ?token_forma . bind (concat(?ord,":",?token_forma) as ?num_forma) 
?token mdp:P177 ?wikisource .
  optional{?span mdp:P178 ?wd_erref .
           bind(iri(concat(str(wd:),?wd_erref)) as ?item)
           bind(iri(concat(str(wd:),?wd_erref)) as ?item)
           SERVICE <https://query.wikidata.org/sparql> {
           SERVICE <https://query.wikidata.org/sparql> {
Line 43: Line 98:
                 }           
                 }           
           }
           }
} group by ?token ?token_zbk ?token_forma ?mlv_lexema ?wd_qid ?wikisource ?lemma ?sense ?forma ?morph_labels ?pos_label ?wd_erref ?wd_erref_label ?class_label
  optional{?span mp:P180 ?philst . ?philst mps:P180 ?phil_anot .
order by xsd:integer(?token_zbk)
          }
 
   
} group by ?span ?span_tokenak ?span_label ?span_formak ?wd_erref ?wd_erref_label ?class_label ?phil_anot
</sparql>
</sparql>

Latest revision as of 17:38, 7 September 2024

With the aim of developing a workflow and data model for representing the content of historical texts and enriching it with annotations, we use Larramendi's Azkoitiko Sermoia as a showcase. On Wikiteka (Basque Wikisource), we store the manuscript facsimile and its transcription, and here, on MLV Wikibase, the tokens of the transcription, that is, a segmentation that records each word and punctuation mark as a separate segment and that can be represented as vertical text, as has long been customary (see, e.g., the CoNLL format). Behind the table displayed in the SPARQL query interface, there are Linked Data, that is, semantic triples. In the model we propose for representing corpus data as Linked Data, we follow recent proposals made in the domain of Linguistic Linked Open Data (see Stanković, Chiarcos et al. 2023).

See the poster presented in December 2023 (in Basque).
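
The token representation described above can be illustrated with a minimal Python sketch. It is not the import script used for MLV Wikibase: the regular expression, the sample sentence, the identifier scheme, and the predicate names (`partOf`, `tokenNumber`, `tokenForm`) are all assumptions made for illustration. It splits a transcription into word and punctuation tokens, prints them as vertical text, and describes each token with subject-predicate-object triples.

```python
import re


def tokenize(text):
    """Split a transcription into word and punctuation tokens, in reading order."""
    # \w+ matches a run of letters or digits (a word);
    # [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


def to_vertical(tokens):
    """Render tokens one per line as 'number<TAB>form' (vertical text)."""
    return "\n".join(f"{i}\t{form}" for i, form in enumerate(tokens, start=1))


def to_triples(tokens, text_id="text1"):
    """Describe each token with (subject, predicate, object) triples.

    The predicate names are hypothetical placeholders, not MLV Wikibase properties.
    """
    triples = []
    for i, form in enumerate(tokens, start=1):
        token_id = f"{text_id}_token{i}"
        triples.append((token_id, "partOf", text_id))
        triples.append((token_id, "tokenNumber", i))
        triples.append((token_id, "tokenForm", form))
    return triples


# Invented sample sentence, not taken from the sermon.
sample = "Jauna, zu zara."
tokens = tokenize(sample)
print(to_vertical(tokens))
for triple in to_triples(tokens)[:3]:
    print(triple)
```

A simple pattern like this separates every punctuation mark, including hyphens and apostrophes inside words, so a real transcription would likely need more careful rules.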

In this first experiment, the import from Wikisource is done with this script.

See the poster presented in June 2024 (English version).

SPARQL

Token

Use this query to view the tokens of the Azkoitiko Sermoia and their annotations.

PREFIX mwb: <https://monumenta.wikibase.cloud/entity/>
PREFIX mdp: <https://monumenta.wikibase.cloud/prop/direct/>
PREFIX mp: <https://monumenta.wikibase.cloud/prop/>
PREFIX mps: <https://monumenta.wikibase.cloud/prop/statement/>
PREFIX mpq: <https://monumenta.wikibase.cloud/prop/qualifier/>
PREFIX mpr: <https://monumenta.wikibase.cloud/prop/reference/>
PREFIX mno: <https://monumenta.wikibase.cloud/prop/novalue/>

#title: Shows where the tokens are located on Wikisource, and which lemma forms and Wikidata entities they are linked to

select ?token ?token_zbk ?token_forma ?mlv_lexema
  (iri(concat('http://www.wikidata.org/entity/', ?wikidata_sense_id)) as ?wikidata_sense)
  ?wd_pos_label
  (iri(concat('https://eu.wikisource.org/wiki/', ?wikisource)) as ?wikisource_paragraph)
  ?lemma ?sense ?forma
  (group_concat(?morph_label; SEPARATOR="-") as ?morph_labels)
  ?pos_label
  (iri(concat('http://www.wikidata.org/entity/', ?wd_erref)) as ?wd_ent_erref)
  (concat(?wd_erref_label, " (", ?class_label, ")") as ?wd_erref_info)
where {
  # Every token with its number, its form and its Wikisource location
  ?token mdp:P5 mwb:Q15 ;
         mdp:P148 ?token_zbk ;
         mdp:P147 ?token_forma ;
         mdp:P177 ?wikisource .
  # Lexeme statement of the token; sense and form are qualifiers on that statement
  optional {
    ?token mp:P7 ?lemmanode .
    ?lemmanode mps:P7 ?mlv_lexema .
    ?mlv_lexema wikibase:lemma ?lemma .
    optional { ?mlv_lexema mdp:P1 ?wd_qid . }
    # Sense definition, plus the Wikidata identifier and part-of-speech label attached to the sense
    optional {
      ?lemmanode mpq:P155 ?sense_id .
      ?sense_id skos:definition ?sense ;
                mp:P1 [ mps:P1 ?wikidata_sense_id ; mpq:P153 ?wd_pos ] .
      ?wd_pos rdfs:label ?wd_pos_label . filter(lang(?wd_pos_label) = "eu")
    }
    # Form with its morphological labels and part-of-speech label
    optional {
      ?lemmanode mpq:P156 ?form_id .
      ?form_id ontolex:representation ?forma .
      optional { ?form_id mdp:P172 ?morph . ?morph rdfs:label ?morph_label . filter(lang(?morph_label) = "eu") }
      optional { ?form_id mdp:P173 ?pos . ?pos rdfs:label ?pos_label . filter(lang(?pos_label) = "eu") }
    }
  }
  # Referenced Wikidata entity; its label and one class label are fetched from Wikidata
  optional {
    ?token mdp:P178 ?wd_erref .
    bind(iri(concat(str(wd:), ?wd_erref)) as ?item)
    SERVICE <https://query.wikidata.org/sparql> {
      select ?item ?wd_erref_label (sample(?class_l) as ?class_label)
      where {
        ?item rdfs:label ?wd_erref_label . filter(lang(?wd_erref_label) = "eu")
        ?item wdt:P31/rdfs:label|wdt:P279/rdfs:label ?class_l . filter(lang(?class_l) = "eu")
      }
      group by ?item ?wd_erref_label
    }
  }
}
# Aggregate aliases (?morph_labels, and ?class_label in the subquery) are not grouping keys
group by ?token ?token_zbk ?token_forma ?mlv_lexema ?wikidata_sense_id ?wd_pos_label ?wikisource ?lemma ?sense ?forma ?pos_label ?wd_erref ?wd_erref_label ?class_label
order by xsd:integer(?token_zbk)
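
The final `order by xsd:integer(?token_zbk)` casts the token number before sorting. The following Python sketch shows why such a cast matters when results are post-processed outside the query service. It assumes, for illustration only, that token numbers arrive as strings; the sample rows and the function names are hypothetical.

```python
def wikidata_iri(entity_id):
    """Build an entity IRI the same way the query does with iri(concat(...))."""
    return "http://www.wikidata.org/entity/" + entity_id


def sort_by_token_number(rows):
    """Order rows numerically, mirroring ORDER BY xsd:integer(?token_zbk)."""
    return sorted(rows, key=lambda row: int(row["token_zbk"]))


# Hypothetical result rows, not real MLV Wikibase data.
rows = [
    {"token_zbk": "10", "token_forma": "c"},
    {"token_zbk": "2", "token_forma": "b"},
    {"token_zbk": "1", "token_forma": "a"},
]

# Sorting the raw strings puts "10" before "2".
as_strings = [row["token_zbk"] for row in sorted(rows, key=lambda row: row["token_zbk"])]
# Sorting after the integer cast restores reading order.
as_numbers = [row["token_zbk"] for row in sort_by_token_number(rows)]

print(as_strings)
print(as_numbers)
print(wikidata_iri("Q42"))
```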



Span

Use this query to view the spans of the Azkoitiko Sermoia (sets of tokens that share an annotation) and their annotations.

PREFIX mwb: <https://monumenta.wikibase.cloud/entity/>
PREFIX mdp: <https://monumenta.wikibase.cloud/prop/direct/>
PREFIX mp: <https://monumenta.wikibase.cloud/prop/>
PREFIX mps: <https://monumenta.wikibase.cloud/prop/statement/>
PREFIX mpq: <https://monumenta.wikibase.cloud/prop/qualifier/>
PREFIX mpr: <https://monumenta.wikibase.cloud/prop/reference/>
PREFIX mno: <https://monumenta.wikibase.cloud/prop/novalue/>

#title: Lists the spans, including the tokens they contain and the annotations they carry

select
  ?span
  (group_concat(strafter(str(?token), str(mwb:))) as ?span_tokenak)
  ?span_label
  (group_concat(?num_forma) as ?span_formak)
  (iri(concat('https://eu.wikisource.org/wiki/', sample(?wikisource))) as ?wikisource_paragraph)
  (iri(concat('http://www.wikidata.org/entity/', ?wd_erref)) as ?wd_ent_erref)
  (concat(?wd_erref_label, " (", ?class_label, ")") as ?wd_erref_info)
  ?phil_anot
where {
  # Every span with its member tokens (each with an order qualifier) and its Basque label
  ?span mdp:P5 mwb:Q20 ;
        mp:P30 [ mps:P30 ?token ; mpq:P32 ?ord ] ;
        rdfs:label ?span_label . filter(lang(?span_label) = "eu")
  # str() keeps concat() working even if ?ord is stored as a number
  ?token mdp:P147 ?token_forma . bind(concat(str(?ord), ":", ?token_forma) as ?num_forma)
  ?token mdp:P177 ?wikisource .
  # Referenced Wikidata entity; its label and one class label are fetched from Wikidata
  optional {
    ?span mdp:P178 ?wd_erref .
    bind(iri(concat(str(wd:), ?wd_erref)) as ?item)
    SERVICE <https://query.wikidata.org/sparql> {
      select ?item ?wd_erref_label (sample(?class_l) as ?class_label)
      where {
        ?item rdfs:label ?wd_erref_label . filter(lang(?wd_erref_label) = "eu")
        ?item wdt:P31/rdfs:label ?class_l . filter(lang(?class_l) = "eu")
      }
      group by ?item ?wd_erref_label
    }
  }
  # Annotation value of the span's P180 statement
  optional { ?span mp:P180 ?philst . ?philst mps:P180 ?phil_anot . }
}
# Aggregate aliases (?span_tokenak, ?span_formak) are not grouping keys
group by ?span ?span_label ?wd_erref ?wd_erref_label ?class_label ?phil_anot
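
The query prefixes each token form with its order value (`?num_forma`) because the SPARQL specification does not define the order in which `group_concat` joins its values. As a minimal sketch of how the forms of each span could be put back into token order on the client side, the following Python function groups ungrouped (span, order, form) rows. The row layout, the span identifiers, and the sample forms are hypothetical.

```python
from collections import defaultdict


def collect_span_forms(rows):
    """Group (span, ord, form) rows per span and join the forms in token order."""
    members_by_span = defaultdict(list)
    for span, ord_value, form in rows:
        # Cast the order value so that 10 sorts after 2.
        members_by_span[span].append((int(ord_value), form))
    return {
        span: " ".join(f"{number}:{form}" for number, form in sorted(members))
        for span, members in members_by_span.items()
    }


# Hypothetical rows in arbitrary order, not real MLV Wikibase data.
rows = [
    ("S1", "2", "Ignazio"),
    ("S2", "1", "Azkoitia"),
    ("S1", "1", "San"),
]

print(collect_span_forms(rows))
```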
