Dataset Characteristics Identification for Federated SPARQL Query

Nowadays, the amount of data published in the RDF format is increasing. Federated SPARQL query engines that are able to query multiple distributed SPARQL endpoints have been developed recently. The performance of these federated query engines differs from one engine to another. One of the factors that affect the performance of a query engine is the characteristics of the accessed RDF dataset. The aim of this work is to identify the characteristics of RDF datasets and to create a query set for evaluating a federated engine. The study was conducted by identifying 16 datasets that are used by 10 research papers in the Linked Data area. The metrics used in this work are the number of classes, properties, entities, objects, and subjects, as well as the distribution of classes and properties.


INTRODUCTION
The emergence of the Semantic Web has made the exchange of data on the web easier. The Semantic Web has spawned new standards that make data scattered around the world more structured and connected. It also makes the data understandable by machines, which enables applications to use the data.
The standard data exchange framework used on the Semantic Web is RDF (Resource Description Framework) [1]. Today, many online databases have been created in RDF format, including DBpedia and Freebase. To obtain RDF data from these sources, queries must be written in SPARQL (SPARQL Protocol and RDF Query Language) [2], the query language for RDF. A web service that allows RDF data to be queried via SPARQL is called a SPARQL endpoint. From year to year, the amount of data published in RDF format is increasing. To keep up with this rapid growth, systems that allow RDF data to be queried from multiple data sources have been created. Such a system is called a federated SPARQL query engine; examples include SPLODGE [3], FedX [4], ANAPSID [5], and ADERIS [6].
These query engines perform differently on different datasets [7]. A query engine may perform well for queries on data source A but worse for queries on data source B, while another query engine may behave the other way around and perform well on data source B. Therefore, it is difficult to determine which query engine has the best performance. One of the factors causing such performance differences is the characteristics of the RDF dataset accessed by the query engine. RDF datasets differ in characteristics such as the number of triples and the number of classes. Therefore, this work investigates the characteristics of a set of datasets used in federated SPARQL queries. We use 16 datasets that are used in 10 Linked Data papers. These datasets are characterized using a set of metrics proposed by Duan [13] and Rakhmawati [7]. Moreover, we generate a set of queries from these datasets that can be used for benchmarking federated SPARQL queries. Note that we do not create a new benchmark suite.
The remainder of this paper is organized as follows: Section 2 reviews related works, Section 3 describes the set of dataset characteristics used in our evaluation, Section 4 explains our methodology, Section 5 explains the results of our research, and Section 6 concludes our work.

RELATED WORKS
Various works have focused on benchmarks for federated SPARQL queries [8][9][10]. Schmidt [9] proposed FedBench, a comprehensive benchmark suite for testing and analyzing the performance of federated query processing on semantic data. FedBench is a highly flexible benchmark suite that covers a wide range of semantic data processing strategies and use cases. Montoya [8] evaluated FedBench [9] on three federated query engines, namely ARQ, ANAPSID [5], and FedX [4]. The analysis uncovered hidden properties of these systems. BioBenchmark [10] evaluated whether native RDF stores can meet the needs of a biological database provider. The research evaluated five triple stores with five biological datasets.
In terms of generating a set of queries for a benchmark, several works have proposed ways to generate queries for assessing the performance of federated queries. Rakhmawati [11] introduced QFed, a dynamic SPARQL query set generator that takes into account the characteristics of both datasets and queries along with the cost of data communication [12]. The queries are generated based on logs owned by SPARQL endpoints. The dataset is also considered when assessing federated SPARQL queries. Duan [13] proposed a metric called coherence to measure the structuredness of a dataset. This metric is motivated by the fact that primitive data metrics, such as the number of triples and the number of literals, are not enough to reveal the characteristics of a dataset.
Rakhmawati [7] investigated the relationship between the data distribution and the communication cost in a federated SPARQL query framework. A metric called Spreading Factor is proposed to compute the distribution of classes and properties on a dataset. The investigation showed that the spreading factor is correlated with the communication cost between a federated engine and the SPARQL endpoints.
In this work, we evaluate the characteristics of a set of datasets used in ten papers that work on federated SPARQL queries. The metrics used for this identification are the set of metrics presented by Duan [13] and Rakhmawati [7].

DATASET CHARACTERISTICS
This section describes the characteristics of datasets in terms of federated SPARQL queries.
An RDF triple consists of a subject, a predicate, and an object, while an entity is an instance of a class, declared through the rdf:type property. The following is an example of a triple:
dbr:Indonesia rdf:type dbo:country
where dbr:Indonesia is the subject, rdf:type is the predicate (or property), and dbo:country is the object. In addition, dbr:Indonesia is an entity and dbo:country is a class. Suppose that we have the following dataset from DBpedia, which is divided into two datasets A and B:
The characteristics of the dataset above can be identified as follows:

1) Number of Subjects
There are three subjects in the dataset above, namely dbr:Indonesia, dbr:Malaysia, and dbr:Philippines.

2) Number of Properties
There are four properties in the dataset above, namely rdf:type, dbo:anthem, dbo:capital, and dbo:currency.

3) Number of Classes
There is one class in the dataset above, namely dbo:country.

4) Number of Entities
There are three entities in the dataset above, namely dbr:Indonesia, dbr:Malaysia, and dbr:Philippines.

5) Number of Triples
The number of triples shows the amount of data in a dataset that is expressed in subject-predicate-object form. There are 12 triples in the dataset above.

6) Spreading Factor (SF)
The Spreading Factor is a metric used to identify the distribution of classes and properties in a dataset [7]. There are two types of Spreading Factor, namely the Spreading Factor of the dataset (SF) and the Spreading Factor associated with the queries (SFQ). SFQ only takes into account the distribution of the classes and properties that occur in the queries.
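The counting metrics in items 1 to 5 can be illustrated with a minimal Python sketch. The toy triples below are hypothetical: they are chosen only so that the counts match those stated in the list, and both the object values and the split into datasets A and B are assumptions. Treating a class as any object of rdf:type and an entity as any subject of rdf:type is also an assumption that follows the definitions in this section; the identification queries actually used may differ.

```python
RDF_TYPE = "rdf:type"

# Hypothetical toy data; object values and the A/B split are illustrative only.
dataset_a = [
    ("dbr:Indonesia", "rdf:type", "dbo:country"),
    ("dbr:Indonesia", "dbo:anthem", "dbr:Indonesia_Raya"),
    ("dbr:Indonesia", "dbo:capital", "dbr:Jakarta"),
    ("dbr:Indonesia", "dbo:currency", "dbr:Indonesian_rupiah"),
    ("dbr:Malaysia", "rdf:type", "dbo:country"),
    ("dbr:Malaysia", "dbo:anthem", "dbr:Negaraku"),
]
dataset_b = [
    ("dbr:Malaysia", "dbo:capital", "dbr:Kuala_Lumpur"),
    ("dbr:Malaysia", "dbo:currency", "dbr:Malaysian_ringgit"),
    ("dbr:Philippines", "rdf:type", "dbo:country"),
    ("dbr:Philippines", "dbo:anthem", "dbr:Lupang_Hinirang"),
    ("dbr:Philippines", "dbo:capital", "dbr:Manila"),
    ("dbr:Philippines", "dbo:currency", "dbr:Philippine_peso"),
]


def characteristics(triples):
    """Count distinct subjects, properties, classes, entities, and triples."""
    unique = set(triples)
    return {
        "subjects": len({s for s, _, _ in unique}),
        "properties": len({p for _, p, _ in unique}),
        # Assumption: a class is any object of an rdf:type triple.
        "classes": len({o for _, p, o in unique if p == RDF_TYPE}),
        # Assumption: an entity is any subject of an rdf:type triple.
        "entities": len({s for s, p, _ in unique if p == RDF_TYPE}),
        "triples": len(unique),
    }


summary = characteristics(dataset_a + dataset_b)
print(summary)
# {'subjects': 3, 'properties': 4, 'classes': 1, 'entities': 3, 'triples': 12}
```

The counts are computed over the union of A and B. Counting each partition separately and summing the results would overstate the distinct subjects whenever a subject such as dbr:Malaysia appears in both.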

METHODS
This section describes three parts, namely dataset identification, query generation, and evaluation. In a nutshell, our methodology is presented in Figure 1. First, we collect datasets that are used in papers focusing on federated SPARQL queries. Some of those datasets need to be cleaned up and split up before their characteristics are calculated. Based on the calculated characteristics, queries can be generated.

Datasets
The datasets used in this study were taken from other papers that meet the following criteria:
1) related to Linked Data
2) using a federated SPARQL query
3) freely accessible without permission
The papers were obtained from several sources, including Google Scholar, Science Direct, Semantic Web Journal, and the International Semantic Web Conference.
Some of the downloaded datasets cannot be read properly by RDF data processing applications. This is generally due to inappropriate (bad) characters in some datasets. To solve the problem, we converted the bad characters in those datasets to appropriate characters. In addition, the format of the Geonames dataset is not recognized by the RDF data processing application because the dataset uses an unusual RDF serialization. Therefore, the dataset is converted into the N-Triples format, which can be read by the RDF data processing application. The conversion is done using a Python script [24] and the RDFLib library.
The datasets vary in number of files and in size. A dataset may consist of one to dozens of RDF files, while the size of the RDF files varies from several kilobytes to tens of gigabytes. To run the identification queries [25], we split the large datasets into small partitions. The results of the dataset identification can be found in Table 3.

Set of Queries
At this stage, a set of queries is generated based on the classes and properties that occur most frequently across all datasets. However, two papers ("Luzzu - A Framework for Linked Data Quality Assessment" and "Semantic Hadith: Leveraging Linked Data Opportunities for Islamic Knowledge") are excluded from the evaluation, since their datasets contain only specific classes and properties, most of which are not found in the other papers, so the generated queries cannot be run over those two datasets.
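Selecting the terms that occur in the most datasets can be sketched as below. This is an assumed procedure for illustration, not the authors' exact method: each term is ranked by the number of datasets in which it appears at least once. The dataset names D1 to D3 and their property sets are hypothetical.

```python
from collections import Counter


def most_shared_terms(terms_by_dataset, top_k=5):
    """Rank terms by the number of datasets in which they occur."""
    counts = Counter()
    for terms in terms_by_dataset.values():
        # set() ensures a term is counted once per dataset.
        counts.update(set(terms))
    # Sort by dataset count (descending), then by name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical property inventories for three datasets.
properties_by_dataset = {
    "D1": {"rdf:type", "rdfs:label", "owl:sameAs", "foaf:name"},
    "D2": {"rdf:type", "rdfs:label", "owl:sameAs"},
    "D3": {"rdf:type", "rdfs:label", "dbo:capital"},
}

top_properties = most_shared_terms(properties_by_dataset, top_k=3)
print(top_properties)
# [('rdf:type', 3), ('rdfs:label', 3), ('owl:sameAs', 2)]
```

The same function can be applied to per-dataset class inventories to pick candidate classes for the queries.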
There are two types of queries to be created, namely star queries and chain queries. A star query retrieves one subject that has multiple predicates and objects, while a chain query retrieves several subjects and objects, where an object can in turn be the subject used to obtain other objects [11]. A LIMIT clause is added to each query to reduce the risk of overly long query processing times.
Star Query 1 contains four rows, which together use one class and four properties. It retrieves the person URI, the person name (label), the person description (comment), and any related information (owl:sameAs).
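The two query shapes can be sketched with small Python helpers that assemble SPARQL strings. This is an illustration only and not the query text used in the study: the class foaf:Person, the helper names, the variable names, and the chain example are all assumptions. The star example mirrors the description of Star Query 1, with rdf:type, rdfs:label, rdfs:comment, and owl:sameAs as its four properties.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def _wrap(patterns, limit):
    """Put triple patterns into a SELECT query with prefixes and a LIMIT."""
    prologue = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    body = "\n".join(patterns)
    return f"{prologue}\nSELECT * WHERE {{\n{body}\n}}\nLIMIT {limit}"


def star_query(class_name, properties, limit=100):
    """Star shape: every triple pattern shares the same subject ?s."""
    patterns = [f"  ?s rdf:type {class_name} ."]
    patterns += [f"  ?s {prop} ?o{i} ." for i, prop in enumerate(properties, 1)]
    return _wrap(patterns, limit)


def chain_query(properties, limit=100):
    """Chain shape: the object of one pattern is the subject of the next."""
    patterns = [f"  ?v{i} {prop} ?v{i + 1} ." for i, prop in enumerate(properties)]
    return _wrap(patterns, limit)


# Assumed reading of Star Query 1; foaf:Person is a hypothetical class choice.
star = star_query("foaf:Person", ["rdfs:label", "rdfs:comment", "owl:sameAs"], limit=10)
# Hypothetical chain: follow owl:sameAs, then fetch the label of the target.
chain = chain_query(["owl:sameAs", "rdfs:label"], limit=10)
print(star)
print(chain)
```

In the star query all patterns can be answered around a single subject, whereas each step of the chain query may have to be resolved at a different endpoint, which is why the two shapes are useful for exercising a federated engine in different ways.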

RESULT AND DISCUSSION
The average number of triples and properties in paper 7 [20] is the highest, while the average number of triples in paper 6 is the lowest. The ratio of the average number of classes to the average number of triples in paper 6 is low, which implies that the amount of class diversity is quite high. Moreover, the average numbers of subjects and entities are nearly the same. It can be concluded that most subjects appear in a triple with rdf:type.
40% of the datasets have an SF value of less than 0.35, which means that classes and properties are not distributed evenly. None of the datasets reaches an SF value of more than 0.6, since many classes do not occur in all datasets.
Paper 4, which consists of two datasets, has high SFQ values. The two datasets in that paper have a high number of classes and properties that are included in the set of queries, such as rdfs:label and foaf:name. Although the average number of classes and properties in papers 5 and 7 is high, the SF and SFQ values are low. This indicates that some classes and properties are only populated in certain datasets.
Although paper 10 has a high SF value, its SFQ value is the lowest among the papers. It seems that the classes and properties used in the queries, such as owl:sameAs and rdfs:label, are spread across the datasets, as seen in Table 4.

CONCLUSION
Sixteen datasets that are used in 10 Linked Data papers have been investigated in this paper. The dataset characteristics investigated are the number of classes, properties, entities, objects, and subjects. Moreover, the distribution of classes and properties is also considered as one of the characteristics of a dataset. The main characteristic of these datasets is that rdf:type, owl:sameAs, and rdfs:label occur in all datasets. The distribution of properties and classes is not even. In the future, we need to evaluate the performance of a federated engine over those datasets by using the generated queries.