That weekend the sun was melting everything alive in St. Petersburg, so I decided to stay at home and experiment with Grakn. Grakn is a knowledge base for intelligent systems. A quick look at their developer site and examples piqued my curiosity, so today we’re going to dig deeper.
Our aim is to:
- Get some sample data
- Create an expressive Grakn data schema
- Import the data into our knowledge base
- Come up with some interesting ways to query it.
Let’s start
Grab yourself a hot cup of ☕...
Let’s check what we’ll need for our experiments.
First, obviously, Grakn. It runs on Mac, Windows, and Linux. It’s Docker friendly, too. To keep things nice and platform agnostic, let’s go with Docker.
Note: If you want to install Grakn locally, follow these instructions instead. You’ll also need OpenJDK or Oracle Java.
- Pull the image from the Docker registry
docker pull graknlabs/grakn:1.5.3
- Start the container with an external volume:
docker run --name grakn -d -v $(pwd)/db/:/grakn-core-all-linux/server/db/ -p 48555:48555 graknlabs/grakn:1.5.3
- Check the server status:
docker exec -ti grakn bash -c '/grakn-core-all-linux/grakn server status'
Perfect, we have Grakn ready to go in just a couple of minutes!
The dataset
Let's grab ourselves some StackOverflow user data. You can read the endpoint documentation and query the StackExchange API over here. An example of JSON formatted data for one user looks like this:
{
"badge_counts": {
"bronze": 3,
"silver": 2,
"gold": 1
},
"view_count": 1000,
"down_vote_count": 50,
"up_vote_count": 90,
"answer_count": 10,
"question_count": 12,
"account_id": 1,
"is_employee": false,
"last_modified_date": 1565470447,
"last_access_date": 1565513647,
"reputation_change_year": 9001,
"reputation_change_quarter": 400,
"reputation_change_month": 200,
"reputation_change_week": 800,
"reputation_change_day": 100,
"reputation": 9001,
"creation_date": 1565470447,
"user_type": "registered",
"user_id": 1,
"accept_rate": 55,
"about_me": "about me block",
"location": "An Imaginary World",
"website_url": "http://example.com/",
"link": "http://example.stackexchange.com/users/1/example-user",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User"
}
I created a simple script in Go to download as much data as the API quota allows. The source code is available here. To save some time, I suggest using this JSON document I compiled overnight.
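My downloader is written in Go, but the same thing in Python takes only a few lines. Here's a rough sketch against the /users endpoint (the endpoint and parameters come from the StackExchange API docs; the page count and output file name are arbitrary):
import json
import time
import requests

# StackExchange /users endpoint; each item looks like the JSON sample above
API_URL = "https://api.stackexchange.com/2.2/users"

def fetch_users(pages=10, pagesize=100):
    users = []
    for page in range(1, pages + 1):
        resp = requests.get(API_URL, params={
            "site": "stackoverflow",
            "order": "desc",
            "sort": "reputation",
            "page": page,
            "pagesize": pagesize,
        })
        resp.raise_for_status()
        payload = resp.json()
        users.extend(payload.get("items", []))
        # stop when there is nothing left or the daily quota runs out
        if not payload.get("has_more") or payload.get("quota_remaining", 0) <= 0:
            break
        time.sleep(payload.get("backoff", 0))  # respect the API's backoff hint
    return users

if __name__ == "__main__":
    with open("users.json", "w") as f:  # arbitrary output file name
        json.dump(fetch_users(), f)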
There isn't a lot of interesting insight to gain from the raw JSON data alone.
The schema
OK – we have some raw data in JSON format. Before it can go into our database, we'll need to model the things we want to know about as a schema – a skeleton structure that represents the logical view of the entire knowledge graph. According to the Grakn docs, a schema is a means to address the problems of managing and handling unstructured or loosely structured data – perfect! Let's take a look at the basics.
The Grakn data model
First of all, everything that describes a domain in a Grakn knowledge graph is a concept, including the elements of the schema (called schema concepts) and the actual data.
What can we have in a schema? There are three types of things:
- Entity — entities are means of classifying the objects in our domain.
- Attribute — think of them as properties. We can assign any number of them to entities, relations, and even to other attributes.
- Relation — relations allow us to connect several things together. Things can play roles in a relation. Each relation is required to have at least one role.
There's a lot more to Grakn data modelling than this. It allows you to define type hierarchies, hyper-entities, hyper-relations, and chainable rules. I won't go into too much detail – if you are curious, you can read all about it here. For now, let's focus on Graql — Grakn's query language that allows you to model, query and reason over data.
Our `schema.gql` file should start with the **define** keyword right at the top.
Next, we're going to describe our StackOverflow users characteristically (with a bunch of attributes like their name and avatar) and contextually, in terms of their location and their achievements as contributors.
define
## ENTITIES ##
user sub entity,
key user-id,
has account-id,
has about,
has age,
has name,
has is-employee,
has user-type,
has created,
has last-accessed,
has last-modified,
has penalty-until,
has url,
has website,
has profile-image,
has reputation,
has accept-rate,
has view-count,
has down-vote,
has up-vote,
has answer-count,
plays located-user,
plays contributor;
As you can see, most of the properties in our JSON user data are mapped as attributes. Let’s walk through the syntax.
The general idea is:
<name> sub [entity|attribute|relation|<element to inherit>]
The statement should end with a semicolon.
Attributes are assigned using the `has` keyword. In the schema, we can define elements in any order, so it's completely fine to define the attribute types later.
To avoid duplication of users, let's make `user-id` a unique attribute using the `key` keyword.
Some things are common to multiple users, like a location (e.g. Austin, TX, USA) or the types of badges they've been awarded (bronze, silver, gold). We'll model locations and badges as separate entities.
location sub entity,
key address,
plays user-location;
badge sub entity,
key color,
plays award;
We've ended up with three entities: user, badge and location. How do we glue them together? With relations.
## RELATIONS ##
location-of-user sub relation,
relates located-user,
relates user-location;
achievements sub relation,
has score,
relates contributor,
relates award;
We are interested in two relations:
- `location-of-user` connects a location entity, which plays the `user-location` role, and a user entity, which plays the `located-user` role
- `achievements` connects user and badge entities. A user plays the `contributor` role; a badge plays the `award` role
Attributes can be assigned to anything, including relations. We'll use an attribute (`score`) to store a user's badge count on the `achievements` relation.
There's just one more step – defining the attribute types:
## ATTRIBUTES ##
name sub attribute,
datatype string;
address sub attribute,
datatype string;
timestamp sub attribute, abstract,
datatype date;
created sub timestamp;
last-modified sub timestamp;
last-accessed sub timestamp;
penalty-until sub timestamp;
url sub attribute,
datatype string;
website sub url;
profile-image sub url;
score sub attribute,
datatype long;
accept-rate sub score;
view-count sub score;
down-vote sub score;
up-vote sub score;
answer-count sub score;
reputation sub score;
user-type sub attribute,
datatype string,
regex "^(unregistered|registered|moderator|team_admin|does_not_exist)$";
color sub attribute,
datatype string,
regex "^(bronze|silver|gold)$";
about sub attribute,
datatype string;
age sub attribute,
datatype long;
identifier sub attribute, abstract,
datatype long;
account-id sub identifier;
user-id sub identifier;
is-employee sub attribute,
datatype boolean;
Graql supports the following data types:
- long: a 64-bit signed integer
- double: a double-precision floating-point number, including a decimal point
- string (which can also be restricted via regexp)
- boolean: true or false
- date: a date or date-time in ISO 8601 format
Grakn doesn't support timezones (but there's an open GitHub issue). If you don’t take care of the timezone offset, Grakn will create date records with the server’s timezone.
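The API gives us Unix epoch timestamps (creation_date, last_access_date and so on), so a simple way to sidestep the server-timezone surprise is to convert them to UTC explicitly before building insert queries. A minimal sketch:
from datetime import datetime, timezone

def epoch_to_graql_date(epoch_seconds):
    # Convert a Unix timestamp from the API into an ISO 8601 date-time, pinned to UTC
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

print(epoch_to_graql_date(1565470447))  # -> 2019-08-10T20:54:07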
An attribute can be abstract if you never assign it directly and use it only as a parent type. Entities can be abstract, too, if they are never instantiated.
The whole schema file is available here.
Now that we have the schema ready, the next step is to load it into Grakn.
First, place the `schema.gql` file in the container volume. In my case it's `db/schema.gql`.
Then run:
docker exec -ti grakn bash -c '/grakn-core-all-linux/grakn console --keyspace experiment --file /grakn-core-all-linux/server/db/schema.gql'
This should result in something like:
Loading: /grakn-core-all-linux/server/db/schema.gql
...
{}
Successful commit: schema.gql
We've just created a Grakn keyspace called `experiment` and defined its schema.
Importing the data
Now that we have modelled our data, it’s time to load the dataset into our knowledge graph.
We will use the python client API to interact with Grakn. Let’s install it:
pip install grakn-client #or pip3 install grakn-client
I had an issue with my `six` package installation, which I solved with the `--ignore-installed six` flag.
The code below instantiates a client, opens a session, and runs an insertion query:
from grakn.client import GraknClient

# `query` (the Graql insert string) and `args` (CLI flags) are defined elsewhere in the script
with GraknClient(uri="localhost:48555") as client:
    with client.session(keyspace="experiment") as session:
        ## session is open
        ## execute query using a WRITE transaction
        with session.transaction().write() as write_transaction:
            insert_iterator = write_transaction.query(query)
            concepts = insert_iterator.collect_concepts()
            if bool(args.verbose):
                print("Inserted a record with ID: {0}".format(concepts[0].id))
            ## to persist changes, write transaction must always be committed (closed)
            write_transaction.commit()
Additionally, it will print the identifier of the inserted record for testing/debugging purposes.
Now let’s focus on queries.
Here is our insert query for a StackOverflow user (generated from our JSON data using this handy transformation function):
insert $u isa user,
has user-id 9515207,
has name "CertainPerformance",
has reputation 123808,
has is-employee false,
has last-modified 2019-08-12T01:02:31.750870,
has last-accessed 2019-08-12T01:02:31.751320,
has created 2019-08-12T01:02:31.751325,
has url "https://stackoverflow.com/users/9515207/certainperformance",
has up-vote 2630,
has down-vote 15027,
has view-count 22457,
has answer-count 4469,
has account-id 13173718,
has user-type "registered",
has profile-image "https://www.gravatar.com/avatar/34932d3e923ffad9a4a1423e30b1d9fc?s=128&d=identicon&r=PG&f=1";
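The actual transformation function lives in the linked script; a stripped-down sketch of the idea, covering only a handful of fields and reusing the epoch_to_graql_date helper from earlier, might look like this:
def user_to_insert_query(user):
    # Build a Graql insert query from one StackExchange user object (only a few fields shown)
    name = user["display_name"].replace('"', "'")  # naive escaping of double quotes
    return (
        'insert $u isa user, '
        'has user-id {user_id}, '
        'has name "{name}", '
        'has reputation {reputation}, '
        'has is-employee {is_employee}, '
        'has created {created}, '
        'has url "{link}";'
    ).format(
        user_id=user["user_id"],
        name=name,
        reputation=user["reputation"],
        is_employee=str(user["is_employee"]).lower(),  # Graql booleans are lowercase
        created=epoch_to_graql_date(user["creation_date"]),
        link=user["link"],
    )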
That was pretty straightforward. Rinse and repeat for locations and badges. What about relations? Let’s look at describing a user’s location:
match
$u isa user, has user-id 9515207;
$l isa location, has address "Austin, TX, USA";
insert
$r (located-user: $u, user-location: $l) isa location-of-user;
This query matches a user instance† that plays the `located-user` role (assigned to variable `$u`), and a location instance that plays the `user-location` role (`$l`). Then it inserts a `location-of-user` relation with `$u` and `$l` as its roleplayers (`$r`).
† The roles of a relation to be inserted are expected to be played by instances that already exist in the knowledge graph.
Creating relations with badges looks similar. Keep in mind that you will first need to insert three badge instances, corresponding to the three StackOverflow badge colors (bronze, silver and gold); there's a sketch of that right after the query below.
match
$u isa user, has user-id 9515207;
$b isa badge, has color "gold";
insert
$award-badge (contributor: $u, award: $b) isa achievements, has score 16;
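Note that the relation can only be inserted once the badge instances exist. A minimal sketch of pre-inserting the three badges through the Python client (reusing the session opened earlier) could be:
# Insert the bronze, silver and gold badge entities once, before any achievements relations
with session.transaction().write() as tx:
    for color in ("bronze", "silver", "gold"):
        tx.query('insert $b isa badge, has color "{0}";'.format(color))
    tx.commit()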
The final version of the python script is available here. It doesn’t pretend to be efficient or optimal, since it’s all just a weekend experiment.
When you are ready to load the dataset, I recommend batching inserts into transactions of 100-200 records at a time to avoid a long wait, roughly as sketched below.
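A rough sketch of that batching, reusing the user_to_insert_query helper from above (the real loading loop is in the linked script):
BATCH_SIZE = 100  # commit every ~100 inserts to keep transactions small

def load_users(session, users):
    tx = session.transaction().write()
    for i, user in enumerate(users, start=1):
        tx.query(user_to_insert_query(user))
        if i % BATCH_SIZE == 0:
            tx.commit()                          # persist this batch
            tx = session.transaction().write()   # open a fresh transaction
    tx.commit()  # commit whatever is left in the final batch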
Time for some fun
We’ve created the schema and hydrated the `experiment` keyspace with StackOverflow user data. Let’s see what we can find out by querying our knowledge graph.
Start the Grakn console:
docker exec -ti grakn bash -c '/grakn-core-all-linux/grakn console --keyspace experiment'
Can we get the names of the top ten users, by reputation?
match $u isa user,
has reputation $r, has name $n;
get $n,$r;
sort $r desc;
limit 10;
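By the way, the console isn't the only option: the same query can be run from Python with a read transaction. A sketch, assuming the map()/value() answer API that the 1.5 Python client examples use:
top_ten_query = (
    'match $u isa user, has reputation $r, has name $n; '
    'get $n, $r; sort $r desc; limit 10;'
)

with session.transaction().read() as read_transaction:
    for answer in read_transaction.query(top_ten_query):
        concepts = answer.map()  # variable name -> concept
        print(concepts.get("n").value(), concepts.get("r").value())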
What about their location?
match
$u isa user, has reputation $r, has name $n;
$location-of-user ($l,$u) isa location-of-user;
$l isa location, has address $addr;
get $n, $addr, $r;
sort $r desc;
limit 10;
What about the median and mean of all users’ reputation?
experiment> compute median of reputation, in user;
11687
experiment> compute mean of reputation, in user;
18798.470588235294
What about...
The only limit is my imagination
# analytics OLAP
compute cluster in location, location-of-user, top-ten-user;
## results
{"Austin, TX, USA", "stackUser", "cleverTexan"}
{"London, UK, GBR", "aLondoner", "Leonid"}
It's really that easy to compute the geographic concentration of top StackOverflow contributors!
💥 Analytics 💥
Distributed analytics is a set of scalable algorithms that let you perform computations over big data in a distributed fashion; implementing them yourself is usually challenging.
In Grakn, these algorithms are built-in as native functionalities of the language.
Conclusion
Grakn is pretty neat for many reasons. I've never worked with graph-like databases before, so I had to make a cognitive jump to understand some of its concepts – and reimagine what’s possible around the idea of connections.
Pros
- Very expressive and infinitely flexible schema, so you can create complex knowledge models. 👍
- Built-in distributed analytics algorithms, so you can analyse multidimensional data interactively. 👍
- Graql's strong abstraction means writing less code – and it also automatically optimises query execution. 👍
- Its automated reasoning sounds great – maybe an experiment for another weekend?
Cons
- Lack of timezone support – so you have to account for the offset on your own. 😒
- Hard-to-grok error messages – unless you're a Java developer. 😑