Modeling LCSH Subject Headings as RDA Linked Data in Sinopia

We are writing a metadata application profile, following the Library of Congress profile syntax for creating BIBFRAME profiles (http://www.loc.gov/bibframe/docs/bibframe-profiles.html). Almost all our properties are taken from Resource Description and Access (RDA; see https://www.rdatoolkit.org/about and the RDA Registry at https://www.rdaregistry.info). The profiles are JSON documents intended to produce data input forms that output linked data. The forms will display in Sinopia (https://sinopia.io/), a linked data editor currently in development, allowing input specialists (catalogers, for example) to enter values for RDA properties. Once input, the data can be output as RDF; that output is a presupposition of our work. In fact, we often create a profile in accordance with the conceptual output we model for a given property. This entry illustrates that profile-creation practice: working out what form structures are required by modeling the form’s expected output. More specifically, it focuses on how the form outputs properties and values for subject headings (categories describing the content of a resource). The values are complex for library data because libraries traditionally use precoordinated headings (https://www2.archivists.org/glossary/terms/p/precoordinate-indexing) as values, and our model must accommodate them. This illustration is not an exhaustive treatment of precoordinated headings but, rather, a clue toward a possible solution.

We created multiple models for outputting subject headings; these models are platform-independent, as is our practice: model the desired output conceptually, without regard to the application we plan to use (Sinopia). If Sinopia cannot provide the desired output, then our practical task becomes the design of work-arounds that allow us to approximate the desired output.

All our models are expressed using Turtle notation with the following prefixes:

@prefix uw: <http://uw.edu/imaginaryResource/> .
@prefix rdac: <http://rdaregistry.info/Elements/c/> .
@prefix rdaw: <http://rdaregistry.info/Elements/w/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .

Model #1:

#This resource’s subject is a single heading “Linked Data” in LC Subject Headings (LCSH) (see http://id.loc.gov/authorities/subjects.html).

uw:0001
a   rdac:C10001 ;      #---C10001 is the identifier for the RDA class “Work”---
rdaw:P10256  <http://id.loc.gov/authorities/subjects/sh2013002090> .

#---P10256 is the RDA property “has subject”---

 

#This resource’s subject is a heading with subheading “Railroad trains—Fiction” in LCSH.

uw:0002
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/childrensSubjects/sj97000515> .

 

#This resource’s subject is heading/subheading “Popcorn—Social aspects”; the full string is NOT in LCSH but both components of the heading ARE in LCSH as separate headings.

uw:0003
a   rdac:C10001 ;
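<http://rdaregistry.info/Elements/w/datatype/P10256>   "Popcorn—Social aspects" .

#---the full IRI above is the literal-valued (datatype) form of “has subject”, referred to in the analysis as rdaw/datatype:P10256---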

 

#This resource’s subject is heading/subheading “Popcorn—Green Lake Way (Seattle, Wash.)” where one component is in LCSH, the other is not.

uw:0004
a   rdac:C10001 ;
<http://rdaregistry.info/Elements/w/datatype/P10256>   "Popcorn—Green Lake Way (Seattle, Wash.)" .

 

Model #2:

#This resource’s subject is a single heading “Linked Data” in LC Subject Headings (LCSH).

uw:0001
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/subjects/sh2013002090> .

 

#This resource’s subject is a heading with subheading “Railroad trains—Fiction” in LCSH.

uw:0002
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/childrensSubjects/sj97000515> .

 

#This resource’s subject is heading/subheading “Popcorn—Social aspects”; the full string is NOT in LCSH but both components of the heading ARE in LCSH as separate headings.

uw:0003
a   rdac:C10001 ;
rdaw:P10256  [
a madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Social aspects" ;
madsrdf:componentList   (<http://id.loc.gov/authorities/subjects/sh85104866>
<http://id.loc.gov/authorities/subjects/sh85123910> )
] .

 

#This resource’s subject is heading/subheading “Popcorn—Green Lake Way (Seattle, Wash.)” where one component is in LCSH, the other is not.

uw:0004
a   rdac:C10001 ;
rdaw:P10256  [
a madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Green Lake Way (Seattle, Wash.)" ;
madsrdf:componentList   (<http://id.loc.gov/authorities/subjects/sh85104866>)
] .
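
Note that in the uw:0004 example the label has two parts while the componentList has a single member. The following Python sketch is illustrative only: the `LCSH` dictionary is a stand-in for a real LCSH lookup, and `component_list` is a hypothetical helper, not part of Sinopia. It suggests why a consumer cannot tell from the list alone which part of the heading the IRI identifies.

```python
SEP = "\u2014"  # the dash that separates a heading from its subheading

# Stand-in for an LCSH lookup (assumption): only "Popcorn" has an IRI here.
LCSH = {"Popcorn": "http://id.loc.gov/authorities/subjects/sh85104866"}


def component_list(heading, lookup=LCSH):
    """Keep the IRI of each component that has one, as Model #2 does."""
    return [lookup[part] for part in heading.split(SEP) if part in lookup]


a = component_list("Popcorn" + SEP + "Green Lake Way (Seattle, Wash.)")
b = component_list("Green Lake Way (Seattle, Wash.)" + SEP + "Popcorn")

# Both headings yield the same one-member list, so the position of the
# component within the heading is not recoverable from the list.
print(a == b)
```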

 

Model #3:
Same as Model #2 except it uses rdf:Seq instead of rdf:List

#This resource’s subject is a single heading “Linked Data” in LC Subject Headings (LCSH).

uw:0001
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/subjects/sh2013002090> .

 

#This resource’s subject is a heading with subheading “Railroad trains—Fiction” in LCSH.

uw:0002
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/childrensSubjects/sj97000515> .

 

#This resource’s subject is heading/subheading “Popcorn—Social aspects”; the full string is NOT in LCSH but both components of the heading ARE in LCSH as separate headings.

uw:0003
a   rdac:C10001 ;
rdaw:P10256  [
a madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Social aspects" ;
madsrdf:componentList
[
a rdf:Seq ;
rdf:_1   <http://id.loc.gov/authorities/subjects/sh85104866> ;
rdf:_2    <http://id.loc.gov/authorities/subjects/sh85123910>
]
] .

 

#This resource’s subject is heading/subheading “Popcorn—Green Lake Way (Seattle, Wash.)” where one component is in LCSH, the other is not.

uw:0004
a   rdac:C10001 ;
rdaw:P10256  [
a madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Green Lake Way (Seattle, Wash.)" ;
madsrdf:componentList
[
a rdf:Seq ;
rdf:_1   <http://id.loc.gov/authorities/subjects/sh85104866>
]
] .

 

Model #4:
Same as Model #2 except the complex subject is identified by an IRI rather than a nameless blank node

#This resource’s subject is a single heading “Linked Data” in LC Subject Headings (LCSH).

uw:0001
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/subjects/sh2013002090> .

 

#This resource’s subject is a heading with subheading “Railroad trains—Fiction” in LCSH.

uw:0002
a   rdac:C10001 ;
rdaw:P10256   <http://id.loc.gov/authorities/childrensSubjects/sj97000515> .

 

#This resource’s subject is heading/subheading “Popcorn—Social aspects”; the full string is NOT in LCSH but both components of the heading ARE in LCSH as separate headings.

uw:0003
a   rdac:C10001 ;
rdaw:P10256  uw:sub900001 .
uw:sub900001
a    madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Social aspects" ;
madsrdf:componentList   (<http://id.loc.gov/authorities/subjects/sh85104866>
<http://id.loc.gov/authorities/subjects/sh85123910> ) .

 

#This resource’s subject is heading/subheading “Popcorn—Green Lake Way (Seattle, Wash.)” where one component is in LCSH, the other is not.

uw:0004
a   rdac:C10001 ;
rdaw:P10256  uw:sub900002 .
uw:sub900002
a    madsrdf:ComplexSubject ;
rdfs:label   "Popcorn—Green Lake Way (Seattle, Wash.)" ;
madsrdf:componentList   (<http://id.loc.gov/authorities/subjects/sh85104866>) .

 

Analysis

Initial analysis will use the following categories:

  1. Simplicity: “simplicity” refers to expected form controls required to produce the desired output.
    1. Values: maximum simplicity/moderate simplicity/complex
  2. Scalability as general data: do we anticipate the model allows for data that will be understood well into the future; this is a concern largely with future data consumption. Also worth considering: does the model permit a wide range of subject heading types to be entered, even if they are not types explicitly anticipated by our models; this is a concern largely with future data production.
    1. Values: highly scalable/moderately scalable/not scalable
  3. Scalability as semantic web data: do we anticipate the data will be useful as linked data well into the future; is it “good RDF.”
    1. Values: highly scalable/moderately scalable/not scalable
  4. Additional modeling required: is the model self-sufficient or does it require additional modeling.
    1. Values: completely modeled/sufficiently modeled/requires more models
  5. Additional data required: will the model result in a form that allows input of data sufficient to describe the subject of the resource.
    1. Values: complete data/some more data required/highly incomplete data
  6. Precision of property selected: are the properties used the “right property for the job.”
    1. Values: highly precise/precise enough/imprecise.

 

Model 1
  1. Simplicity: maximum simplicity
    1. Discussion: The only model that is simpler is one that expects only a literal value for every subject property. Using this model, we would search the LCSH vocabulary for matching input strings; if there is an exact match, we describe the subject using the LCSH IRI by somehow entering the LCSH IRI; if there is not a match, we enter only the string or, in future RDA, we create a Nomen and insert the nomen string. There is no complexity in dealing with subheadings; either the full heading is represented as an IRI or it is not. The anticipated data entry form will reflect that simplicity: it will search LCSH for a string; if it succeeds, it will extract and insert the appropriate IRI; if it fails, it will instruct the data entrist to enter a literal value.
  2. Scalability as general data: highly scalable
    1. Discussion: uses classes and properties from only one ontology, simplifying human understanding of the data and, by extension, form production. The data structure is highly processable: triples with explicit subjects, explicit properties from a single ontology, with either an IRI directly entered as the value or a literal value entered as a value. Processability is further increased because IRI values are distinguished from literal values through the use of different properties. It is difficult to conceive of any sort of subject heading that cannot be efficiently described by this model.
  3. Scalability as semantic web data: not scalable
    1. Discussion: although the model provides one property for IRI values and another for literals, many literal values are expected; however, as we understand it, LCSH complex subjects — the precoordinated subjects — are in the process of being assigned IRIs. We are anticipating that all LCSH subject headings will be assigned an IRI (although this may be an unrealistic expectation). As a result, we will have entered literals when, in the future, IRIs will exist for all the precoordinated subject headings. This may diminish the usefulness of our data in a future semantic web if we do not replace the literals with IRIs when available.
  4. Additional modeling required: completely modeled
    1. The model will remain the same: when there is an IRI, we enter it; when there is not, we enter a literal. If there is ever reconciliation of our literal values with LCSH, to resolve those literals into LCSH IRIs, there is a place in the model to insert those IRIs.
  5. Additional data required: highly incomplete data
    1. Discussion: we will want to regularly search external data for IRIs to replace our literal values.
  6. Precision of property selected: highly precise
    1. Discussion: The property for the literals, rdaw/datatype:P10256, appears to have been provided precisely for this purpose in RDA. The property for the IRIs, rdaw:P10256, is less precise: the property does not have a range, so its use to exclusively record IRIs is not entirely predictable. This is a potential interoperability problem in the linked data cloud; however, in our local data the values will be highly predictable, increasing its processability through a highly precise use of two properties.
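
The form behavior anticipated for Model 1 (search LCSH for the string, insert the IRI on an exact match, otherwise record a literal) can be sketched in Python. This is a minimal sketch under stated assumptions, not Sinopia's implementation: the `LCSH` dictionary stands in for a real LCSH search, `subject_statement` is a hypothetical name, and the literal-valued property is written as the full IRI we take rdaw/datatype:P10256 to stand for.

```python
# Stand-in for an LCSH search service (an assumption for this sketch):
# it maps an exact heading string to its LCSH IRI.
LCSH = {
    "Linked Data": "http://id.loc.gov/authorities/subjects/sh2013002090",
}

# IRI-valued "has subject" property, using the rdaw: prefix declared earlier.
HAS_SUBJECT = "rdaw:P10256"
# Literal-valued form of the same property, written as a full IRI (assumed).
HAS_SUBJECT_LITERAL = "<http://rdaregistry.info/Elements/w/datatype/P10256>"


def subject_statement(work, heading, lookup=LCSH):
    """Return one Turtle statement describing the subject of `work`."""
    iri = lookup.get(heading)
    if iri is not None:
        # Exact match: record the LCSH IRI as the value.
        return f"{work} {HAS_SUBJECT} <{iri}> ."
    # No match: fall back to a plain literal, escaping backslashes and quotes.
    literal = heading.replace("\\", "\\\\").replace('"', '\\"')
    return f'{work} {HAS_SUBJECT_LITERAL} "{literal}" .'


print(subject_statement("uw:0001", "Linked Data"))
print(subject_statement("uw:0003", "Popcorn\u2014Social aspects"))
```

Because the two cases use different properties, a later reconciliation pass only needs to look at statements that use the literal-valued property.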

 

Model 2
  1. Simplicity: complex.
    1. When the subject heading string matches the LCSH string, the value is very simply the appropriate LCSH IRI. The complexity with this model is for complex subjects lacking an LCSH IRI. Unlike model #1, which requires only direct entry of an IRI or a literal, model #2 requires the creation of a blank node for the subject-concept, which is in turn described using 2 ontologies beyond RDA: namely rdfs and madsrdf. The blank node is classed using a madsrdf class, the subject heading string is the direct value of rdfs:label, and all headings/subheadings identified by an IRI will be inserted into an rdf:List following a search against the LCSH vocabulary for a matching IRI. This complexity requires the form to create a new resource, an unnamed madsrdf:ComplexSubject, that requires further searching in an external vocabulary (LCSH). The data input form will thus inherit a significant degree of complexity.
  2. Scalability as general data: moderately scalable.
    1. Discussion: The use of the blank node can cause ambiguity in the linked data cloud. The use of multiple ontologies may cause confusion; however the ontologies used are well deployed, well constructed and adequately adopted to produce confidence in future processability. The use of rdf:List may cause some difficulty if applications stop accommodating this not widely used, and poorly understood, feature in RDF (although I have no supporting data for this, it’s a fixture in many discussions, like in Joshua Taylor’s March 12, 2015 post at https://stackoverflow.com/questions/29001433/how-rdfbag-rdfseq-and-rdfalt-is-different-while-using-them). The most significant problem of this model-as-data into the future lies in its inclusion of IRIs for subject heading components only as available; when the IRI is available, it is inserted into the List; when it is not, there is no representation of the subheading in the List. This will cause confusion as to which IRI matches which component. The model however seems well designed to accommodate all types of headings.
  3. Scalability as semantic web data: moderately scalable.
    1. The blank node is typed and labeled, increasing our confidence in the usefulness of this data on the semantic web. The value of componentList however has the potential to be highly ambiguous and of limited usefulness. No distinction is made between values of rdaw:P10256 that are IRIs and those that are blank nodes, again diminishing its processability. The use of rdf:List embeds the subject heading components in layers of RDF complexity that make it difficult to query and otherwise process (see Andy Seaborne’s blog entry http://seaborne.blogspot.com/2011/03/updating-rdf-lists-with-sparql.html). Finally, because precoordinated subjects are in the process of being assigned IRIs in LCSH, labels (not components) without IRIs will need to be reconciled regularly to acquire appropriate IRIs. These IRIs will then need to be inserted as the direct value of rdaw:P10256, which leaves open what should happen to the componentList. This significantly reduces the scalability of the expected data as semantic web data, as a good deal of effort will be required when updating.
  4. Additional modeling required: requires more models.
    1. Discussion: as suggested in #3 above, more modeling will be required to manage updates; specifically something like this: search the rdfs:label for a matching label in LCSH; if found, extract the LCSH IRI and insert as the direct value of rdaw:P10256; at that point delete the entire blank node that formerly described the subject/concept. This process may be more appropriately described as requiring additional workflow procedures rather than as additional modeling.
  5. Additional data required: highly incomplete data.
    1. Discussion: we will want to regularly search external data for IRIs to replace our literal values. However, if components are searched in LCSH for identifiers, the data for model #2 can be considered slightly more complete than model #1: it provides a structure for holding component LCSH IRIs when no LCSH IRI is available for the full, complex precoordinated subject string.
  6. Precision of property selected: precise enough.
    1. Discussion: one RDA property is used for all values. Values are always a resource however, either an IRI or blank node. The RDA ontology does not assign a range to rdaw:P10256, so nothing in RDA will inaccurately class the resource/value. The use of rdaw:P10256 seems accurate enough, as does the use of the properties describing the blank node. The node is typed as a madsrdf:ComplexSubject, the domain of the property madsrdf:componentList. The range of componentList is rdf:List or rdf:Seq, and rdf:List is in fact the type of value modeled. The use of rdf:List however leads to ambiguity as described above, when there is no IRI in LCSH for a given component. The use of rdfs:label is stable and sufficiently precise.
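
The update procedure outlined under "Additional modeling required" for Model 2 (match the rdfs:label against LCSH, insert the IRI as the direct value, delete the blank node) might look like the following Python sketch. Everything here is an assumption made for illustration: the dictionary shape of a description, the `reconcile` function, and in particular the placeholder IRI, which is not a real LCSH identifier.

```python
def reconcile(description, lookup):
    """Replace a locally described complex subject with an IRI, if one exists.

    `description` has keys "work" and "subject". The subject is either an IRI
    string or a dict with "label" and "components" (standing for the blank node).
    `lookup` maps heading labels to IRIs and stands in for an LCSH search.
    """
    subject = description["subject"]
    if isinstance(subject, str):
        return description  # already an IRI, nothing to do
    iri = lookup.get(subject["label"])
    if iri is None:
        return description  # still no IRI for the full heading
    # Discard the whole blank node (label and componentList); keep only the IRI.
    return {"work": description["work"], "subject": iri}


before = {
    "work": "uw:0003",
    "subject": {
        "label": "Popcorn\u2014Social aspects",
        "components": [
            "http://id.loc.gov/authorities/subjects/sh85104866",
            "http://id.loc.gov/authorities/subjects/sh85123910",
        ],
    },
}
# Placeholder only: no such LCSH IRI is claimed to exist.
future_lcsh = {"Popcorn\u2014Social aspects": "http://example.org/placeholder-lcsh-iri"}

after = reconcile(before, future_lcsh)
print(after)
```

In this sketch the componentList is simply dropped once the full heading has an IRI, which is one possible answer to the question of what happens to it.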

 

Model 3
  1. Simplicity: complex.
  2. Scalability as general data: moderately scalable.
  3. Scalability as semantic web data: moderately scalable.
  4. Additional modeling required: requires more models.
  5. Additional data required: highly incomplete data.
  6. Precision of property selected: precise enough.
  7. Discussion: Almost all the discussion of model #2 applies to #3; the only difference between the models is that model #3 uses rdf:Seq instead of rdf:List. Like rdf:List, this is not an abundantly used feature of RDF and may cause confusion when applications do not accommodate rdf:Seq. Unlike rdf:List, rdf:Seq is a subclass of rdfs:Container; rdf:List is the class RDF uses for collections. As we understand the difference, an rdf:List is closed only when its final rdf:rest is rdf:nil and is left open-ended if that terminator is missing, whereas an rdf:Seq simply numbers its members with rdf:_1, rdf:_2 and so on. This can cause added ambiguity for rdf:List; it makes the rdf:Seq seem a little more stable in the linked data cloud. The rdf:Seq also does not bury the subject heading components in the complexity of multiple rdf:first and rdf:rest properties, again making model #3 seem slightly more scalable as RDF data than model #2. We have not tested this however on any RDF data stores.
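
To make the structural difference concrete, the Python sketch below expands the same two components into the triples implied by an rdf:List and by an rdf:Seq. It is an illustration only: the function names and blank node labels are hypothetical, and triples are represented as plain tuples rather than through an RDF library.

```python
def list_triples(members):
    """Return (head, triples) for an rdf:List holding `members` in order."""
    triples = []
    head = "rdf:nil"
    # Build from the tail so each cell can point at the cell that follows it.
    for index in range(len(members), 0, -1):
        cell = f"_:cell{index}"
        triples.insert(0, (cell, "rdf:rest", head))
        triples.insert(0, (cell, "rdf:first", members[index - 1]))
        head = cell
    return head, triples


def seq_triples(members, node="_:seq"):
    """Return the triples for an rdf:Seq holding `members` in order."""
    triples = [(node, "rdf:type", "rdf:Seq")]
    triples += [(node, f"rdf:_{i}", m) for i, m in enumerate(members, start=1)]
    return triples


components = [
    "http://id.loc.gov/authorities/subjects/sh85104866",
    "http://id.loc.gov/authorities/subjects/sh85123910",
]
head, as_list = list_triples(components)
for triple in as_list + seq_triples(components):
    print(triple)
```

In the list form, each member hangs off its own cell and is reached by following rdf:rest from the head; in the Seq form, every member is attached directly to one node.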

 

Model 4
  1. Simplicity: complex.
  2. Scalability as general data: moderately scalable.
  3. Scalability as semantic web data: moderately scalable.
  4. Additional modeling required: requires more models.
  5. Additional data required: highly incomplete data.
  6. Precision of property selected: precise enough.
  7. Discussion: again this is largely the same as above models, specifically #2 and #3, except each complex subject is identified by an IRI rather than a blank node. We don’t think this simplifies the requirements of the data entry form. It may improve the scalability ratings, as each subject will be identified by an IRI and will therefore have an identity outside the local context. It may also reduce additional modeling required, as the IRI for a complex subject lacking an LCSH IRI for the full heading can be easily related as a named resource (using for example owl:sameAs) to the IRI to be created in the future by the Library of Congress. It’s possible, in this case, we’ll want to delete the entire madsrdf:ComplexSubject, as above; if so, this task should be simplified when the madsrdf:ComplexSubject is a named resource rather than a bnode.
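
As a small illustration of the owl:sameAs option mentioned for Model 4, the Python sketch below emits the linking statement for a locally named subject. The helper name is hypothetical and the Library of Congress IRI is a placeholder, since no such IRI exists yet for the full heading.

```python
UW = "http://uw.edu/imaginaryResource/"  # namespace of the uw: prefix
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_statement(local_name, lc_iri):
    """Return a Turtle statement equating a local subject IRI with an LC IRI."""
    return f"<{UW}{local_name}> <{OWL_SAME_AS}> <{lc_iri}> ."


# Placeholder IRI only; it does not identify a real LCSH heading.
statement = same_as_statement("sub900001", "http://example.org/placeholder-lcsh-iri")
print(statement)
```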

 

Conclusion

We favor model #1. Its simple design should simplify data input form design. It will be reliable data well into the future. As semantic web data, it will be relatively routine to run reconciliation processes that enable strings to be updated to IRIs. It employs different properties for values that are IRIs and values that are literals, increasing the precision of the properties. More than anything else, however, it allows us to describe bibliographic resources precisely, creating bibliographic data where the subjects of our triples are described without embedding descriptions of additional resources (in this case, concepts that are LCSH subject headings) within the context of those triple-subjects. In other words, when we are describing a work, we only describe the work using model #1, not the subjects as subjects; describing subjects as subjects is performed separately, as a part of libraries’ authority work. The dataset that describes the subject headings (in our case LCSH) is the best place to describe those headings and, if possible, their components. In addition, it seems unnecessary to locally mint IRIs for each complex subject unless we intend to make statements about those complex subjects, which we do not plan to do, and that was not proposed as a solution. Furthermore, as mentioned, Library of Congress is expecting to mint IRIs for all complex subjects, and it seems best to reconcile our strings in a simple data model at a future date. Although our model #1 does nothing to alleviate the difficulty of representing complex subjects in a linked data environment, we still think it is the best solution for our project, although there were some dissenting opinions! The main point of dissent revolved around the notion that madsrdf helps solve the precoordinated subjects problem; why not adopt those features of madsrdf and extend them in the interest of serving the profession? In the end, however, we chose the simpler model.

One thought on “Modeling LCSH Subject Headings as RDA Linked Data in Sinopia”

  1. Our (LC) instructions are basically #2 for complex things and #1 for simpler things. I don’t think #3 is an option because component lists are specifically a collection, not a container. #4 isn’t really different than #2 – ultimately, you need to move away from blank nodes because of technical requirements in certain systems.

    Specifically, this particular usecase: “This resource’s subject is heading/subheading “Popcorn—Green Lake Way (Seattle, Wash.)” where one component is in LCSH, the other is not.” we’re imagining a different implementation than the example in #2/#4.

    More broadly, JSONLD is not well-aligned with the RDF datamodel. You can do #3 in JSONLD but it isn’t an array (list) in JSON because it isn’t part of the spec.

    We’ll go through the use cases and post an update on the LD4P slack channel since there’s been some questions about this in Sinopia.

