Are you looking into RDF and the W3C standards for open data, and considering setting up a SPARQL endpoint? Maybe you have a data set or a data model ready, or even a finished ontology in OWL, and you want to publish it? Your goal may be to implement semantic interoperability within your organization or to add analytic capabilities to an application. But is setting up a SPARQL endpoint always the best solution?

What Is a SPARQL Endpoint?

One of the most common and most powerful data exchange services for RDF data is a SPARQL endpoint. With the increasing focus on Linked Open Data, many customers are looking to establish a SPARQL endpoint to expose or analyze their data in this open and standardized format.

So, why a SPARQL endpoint? A SPARQL endpoint is a powerful tool that lets users query RDF data freely and apply SPARQL's aggregation and reasoning capabilities when needed. It offers a tempting vision that rhymes well with the idea of Linked Open Data: breaking down the walls between information silos.
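As a concrete (and deliberately minimal) illustration, the sketch below queries a SPARQL endpoint over HTTP from Python using the SPARQLWrapper library. The endpoint URL is a placeholder of mine, and the query simply fetches a handful of arbitrary triples.

```python
# Minimal sketch: querying a SPARQL endpoint over HTTP from Python.
# The endpoint URL is a placeholder; SPARQLWrapper is one common client library.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```

The appeal is clear: any client that can speak HTTP can send arbitrary SPARQL and get structured results back. The rest of this post is about the consequences of exactly that openness.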

However, there are a number of considerations and limitations to this technology that must be addressed in the context of the domain at hand. Many people trying out RDF aren't aware of these limitations before they start implementing solutions based on a SPARQL endpoint.

Setting up for Success or Disappointment?

In my experience, people often take it for granted that a SPARQL endpoint is easy to set up, works flawlessly out of the box, integrates seamlessly and performantly with every technology, and needs no extra maintenance or special design effort. When it doesn't work this way, the result is disappointment (with regard to performance, effort, results, integrations and so on), and sometimes abandonment of the solution altogether, with the conclusion that RDF and semantic technology themselves are responsible for the shortcomings of the project.

In this post I have listed a couple of areas one should look more closely into before deciding whether to use a SPARQL endpoint. Some of them are basic knowledge, but even for software developers with some experience with semantic technology, it doesn't hurt to be reminded of them from time to time.

Reading Data with a SPARQL Endpoint

Let's consider read operations against a SPARQL endpoint, whether for analytic or integration purposes. Most users associate the concept of a SPARQL endpoint primarily with the ability to perform (read) queries. The same problem areas are also relevant for CRUD (Create, Read, Update and Delete), but a CRUD-enabled RDF server raises many more considerations. Many vendors of SPARQL endpoints separate the functionality for queries and create/delete operations into different HTTP services.

It is obviously wise to think carefully about whether anyone other than the system administrator or carefully controlled services should be able to create or delete data.

Below is a list of some points to consider.

Security and Restricted Access to Data

Not all data are open for public use. For some customers, a large portion of the data carries strong restrictions on usage and on who is allowed to see what. We distinguish between authentication and authorization: once a user is authenticated as a specific user (often by means of a separate login and authentication solution), the question of authorization to view different data sets or data instances remains.

There is no common W3C standard for how to implement access rules, but different vendors have different solutions. It is important to realize that this type of restriction can put a serious strain on data access performance. If the access rules are implemented in SPARQL, note that for every SPARQL query a user performs, the SPARQL access rules (which can be quite complicated) add to the complexity and load.
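To illustrate the point with a deliberately simplified, hypothetical example (the vocabulary and the rewriting mechanism are invented here, not any particular vendor's solution): an access rule implemented in SPARQL typically means that every user query is combined with extra graph patterns that check the user's permissions, so even a trivial query becomes noticeably heavier.

```python
# Hypothetical illustration of how a SPARQL-level access rule inflates a query.
# The vocabulary (ex:visibleTo, ex:member, the user IRI) is invented for the example.

USER_QUERY = """
SELECT ?doc ?title
WHERE {
  ?doc a ex:Document ;
       ex:title ?title .
}
"""

# The same query after a (hypothetical) authorization layer has rewritten it:
# every document must additionally be linked to a group the current user belongs to.
AUTHORIZED_QUERY = """
SELECT ?doc ?title
WHERE {
  ?doc a ex:Document ;
       ex:title ?title ;
       ex:visibleTo ?group .
  ?group ex:member <https://example.org/user/alice> .
}
"""

print(AUTHORIZED_QUERY)
```

Multiply that extra join work by every query from every user, and the performance cost of SPARQL-level access control becomes easier to picture.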

If the access rules are implemented as a rule base, you need to consider the impact on updates, since the entailments (the precomputed combination of rules and data) that are often needed then have to be maintained. There are many reasoning engines, but with large amounts of data, reasoning often does not scale particularly well. Vendor-specific rule implementations can be more efficient, but they quickly create strong vendor lock-in, or in some cases even product lock-in.

Some triple stores or databases with RDF capabilities have other ways to restrict data, such as Oracle Virtual Private Database for RDF, Oracle Label Security or Jena Permissions. Again, these are very specific to a vendor or a technology.

In short, authorization does not come for free with SPARQL endpoint solutions; it requires a lot of work and customization.

Heavy Queries

When querying a database arbitrarily, RDF or not, there is always the risk of long-running or heavy queries that consume a lot of memory or processing power. Running such heavy queries against a production database that also serves other purposes will degrade performance for those other tasks. Depending on the setup of the production environment and the query in question, this drop in performance can be a significant disturbance to the production system.

A separate environment is always preferable when exploring a dataset with (uncontrolled) queries. Restricting the processing time a query is allowed to run is also good practice, as is restricting the number of active queries.
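Server-side query timeouts exist in most triple stores but are vendor-specific configuration. On the client side, a minimal sketch of a timeout, assuming a SPARQLWrapper client and a placeholder endpoint URL, could look like this:

```python
# Sketch: a client-side timeout so a runaway query does not block the caller forever.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
endpoint.setReturnFormat(JSON)
endpoint.setTimeout(30)  # give up waiting after 30 seconds on the client side

endpoint.setQuery("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")

try:
    results = endpoint.query().convert()
    print(results["results"]["bindings"][0]["n"]["value"])
except Exception as exc:  # timeouts surface as socket/HTTP or SPARQLWrapper errors
    print("Query failed or timed out:", exc)
```

Note that a client-side timeout only stops the client from waiting; the query may well keep running on the server, which is exactly why server-side restrictions matter too.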

Large Result Sets

A query can return a very large result set, and one should be aware of this. Depending on the setup of the endpoint and the nature of the query, this can cause the server to crash due to memory problems. Sometimes a query is still active in the database after such an endpoint crash, which in turn can lead to problems if there are no restrictions in place for terminating it. This is especially bad for production systems, but it is irritating even in a test or analytics environment. Sometimes the only way to resolve the problem is to restart both the triple store and the endpoint server. It is better to prevent the problem from arising at all by restricting the size of the result set that can be returned.
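One common client-side mitigation, sketched below under the assumption of a SPARQLWrapper client and a placeholder endpoint, is to page through results with LIMIT and OFFSET instead of pulling everything in one response. Server-side result size limits are vendor-specific and complement this.

```python
# Sketch: fetch a potentially large result set in pages instead of one huge response.
# LIMIT/OFFSET paging is simple, but results are only stable if the query has ORDER BY.
from SPARQLWrapper import SPARQLWrapper, JSON

PAGE_SIZE = 1000
endpoint = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
endpoint.setReturnFormat(JSON)

offset = 0
while True:
    endpoint.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label
        WHERE {{ ?s rdfs:label ?label }}
        ORDER BY ?s
        LIMIT {PAGE_SIZE} OFFSET {offset}
    """)
    rows = endpoint.query().convert()["results"]["bindings"]
    if not rows:
        break
    for row in rows:
        print(row["s"]["value"], row["label"]["value"])
    offset += PAGE_SIZE
```

Paging keeps memory use bounded on both sides, at the cost of more (smaller) requests against the endpoint.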

Heavy Load

If multiple users query a SPARQL endpoint at the same time, performance will drop, and if new queries keep arriving while others are still running, this can ultimately lead to a denial of service. If a system is set up to depend on a SPARQL endpoint delivering a result, the service can then fail even on small, simple tasks when the strain on the system is too large. It is advisable to think through the usage pattern of a SPARQL endpoint and the required performance profile before setting it up: how many reads and writes are expected, whether queries will typically be small or large, the degree of concurrency, and so on. Restricting access, or the number of active queries and/or users, in order to preserve performance for the active users is also a possibility.

Also, if the requirements change, the setup and architecture around the endpoint should be reviewed. One should not expect a (basic) SPARQL endpoint setup to work out of the box and scale to different and changing needs. For example, if requests are expected to arrive in bursts from time to time, one can consider using an asynchronous message queue to handle them and offload the SPARQL endpoint, rather than trying to scale the endpoint up through clustering.
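As a rough illustration of the queue idea (not a production recipe), the sketch below uses only Python's standard library: a burst of incoming requests lands in an in-memory queue, and a single worker drains it at a pace the endpoint can handle. In a real deployment one would use a proper message broker and an actual SPARQL client instead of the placeholder run_query function.

```python
# Minimal sketch of the "offload via a queue" idea: requests arrive in bursts,
# but a single worker drains the queue at a pace the SPARQL endpoint can handle.
import queue
import threading
import time

work_queue: "queue.Queue[str]" = queue.Queue()

def run_query(sparql_query: str) -> None:
    # Placeholder for the actual call to the SPARQL endpoint.
    print("Running against endpoint:", sparql_query.strip())
    time.sleep(0.1)  # simulate query time

def worker() -> None:
    while True:
        q = work_queue.get()
        try:
            run_query(q)
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# A burst of incoming requests simply lands in the queue...
for i in range(5):
    work_queue.put(f"SELECT ?s WHERE {{ ?s a <https://example.org/Thing{i}> }} LIMIT 10")

work_queue.join()  # ...and is processed one query at a time.
```

The design choice here is to accept latency for burst traffic instead of provisioning the endpoint for peak load.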

In short, investigate the expected usage patterns, and customize the setup and restrictions accordingly.

Direct User Access: Competence, Capacity, etc.

The competence of the users running the queries must also be taken into account. If you open up your endpoint to many, perhaps inexperienced, users, you have less control over your environment. Regardless of how experienced the users are with SPARQL queries, you can still run into problems. A less experienced user will often explore the data set by trial and error, with the potential to run into many of the problems stated above: long-running queries that are interrupted but never aborted, large result sets, inefficient queries that waste resources, and so on. A more experienced user will perhaps be more familiar with strategies for aborting queries, will be better at writing efficient queries and will usually take precautions to limit the result set size, but not always. In addition, the chance that an experienced user will attempt complex and demanding tasks is greater, which in turn puts a strain on the server.

Most developers would not allow open and unlimited access to an SQL database, and accessing the RDF data set directly via SPARQL isn't much different. Taking precautions and adding some built-in limitations to the SPARQL endpoint is considered best practice when allowing direct access; a simple sketch of one such limitation follows below.
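As one very naive illustration of such a built-in limitation (my own assumption, not a specific product feature), a thin gatekeeper in front of the endpoint could cap or reject queries that lack a sensible LIMIT. A real solution would parse the query properly or rely on server-side configuration; the regex check below is only meant to show the principle.

```python
# Naive sketch of a gatekeeper that enforces an upper bound on result size
# before a user query is forwarded to the endpoint. A real solution would parse
# the query (or use server-side limits); this string check is only an illustration.
import re

MAX_LIMIT = 1000

def guard_query(user_query: str) -> str:
    match = re.search(r"\bLIMIT\s+(\d+)\b", user_query, re.IGNORECASE)
    if match is None:
        # No LIMIT at all: append a default cap.
        return f"{user_query.rstrip()}\nLIMIT {MAX_LIMIT}"
    if int(match.group(1)) > MAX_LIMIT:
        raise ValueError(f"LIMIT above {MAX_LIMIT} is not allowed")
    return user_query

print(guard_query("SELECT ?s WHERE { ?s ?p ?o }"))
```

Crude as it is, a guard like this catches the most common accident: an exploratory "give me everything" query fired at a shared endpoint.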

In short, a fully open and available SPARQL endpoint will only work well with a small number of experienced users with the relevant security permissions, or on a data set that is completely separate from production environments and has no security concerns. One should, however, be aware that such restrictions can reduce the user experience, at least for users who expect to be able to "do everything" and run advanced, heavy queries.

Alternatives to Direct Access

There are many alternatives to a SPARQL endpoint when it comes to exposing RDF. One simple approach is to export RDF data dumps; users can load these dumps into their own RDF stores and explore the (filtered) data locally. Alternatively, one can offer predefined query templates, which allow a certain amount of freedom but build in all the relevant limitations for the data set in question; a sketch of this idea follows below. Setting up services that extract data with fully predefined queries (services, reports, etc.) is often the way to go when integrating with other systems.
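To make the template idea concrete, here is a small sketch using rdflib and its prepareQuery helper: the SPARQL template is fixed in code, and callers can only supply a parameter value, never arbitrary SPARQL. The vocabulary and data are invented for the example.

```python
# Sketch of the "predefined template" approach: users pick a template and supply
# only a parameter value; they never write arbitrary SPARQL.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.plugins.sparql import prepareQuery

EX = Namespace("https://example.org/")

g = Graph()
g.add((EX.alice, EX.worksFor, EX.acme))
g.add((EX.alice, EX.name, Literal("Alice")))
g.add((EX.bob, EX.worksFor, EX.other))
g.add((EX.bob, EX.name, Literal("Bob")))

# The template is fixed; only ?org can be bound by the caller.
EMPLOYEES_BY_ORG = prepareQuery(
    "SELECT ?name WHERE { ?person ex:worksFor ?org ; ex:name ?name }",
    initNs={"ex": EX},
)

def employees_of(org_uri: str):
    results = g.query(EMPLOYEES_BY_ORG, initBindings={"org": URIRef(org_uri)})
    return [str(row.name) for row in results]

print(employees_of("https://example.org/acme"))  # -> ['Alice']
```

The same pattern works against a remote endpoint behind a small web service: the service owns the query text, so the performance and security characteristics stay predictable.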

Conclusion: SPARQL, Open Standards and Handcrafting

In short, before you set up a SPARQL endpoint, ask yourself whether this really is the best way to serve your needs, or whether another solution is more suitable for getting your data to the users. In particular, consider carefully whether you really need to offer arbitrary query access. If the functionality you're looking for isn't part of the semantic stack in the tool you choose for your implementation, or part of the supported open standards, don't expect it to be handled out of the box. You will probably need to handcraft every exception, service or rule, or add another tool or product to your software stack.

Don't expect wonders; make sure you understand the specific needs of your domain, and don't expect every challenge to solve itself automatically. Be pragmatic and use whatever combination of technologies and standards works best for you, and you will have the best possible starting point for your project.
