WARNING: this tutorial is out of date, please check memri.io for the most up-to-date instructions.

Introduction

Memri allows you to have all your personal data in one place under your control: your pod. To get the data into your pod, you can use downloaders to extract your data from a growing collection of resources. The pod can then be used to interact with your data from frontends like an iOS/OSX application or a browser application (still under development).

Now that we truly have control over our data, we can be a bit more creative about what we actually do with it. Most users are not really familiar with this, because the organizations that control their data have little incentive to show what they do with it. As an example, let's have a look at my location data. I can request it via Google Takeout. Once downloaded, I see a file and a folder.

The Location History.json file contains my raw geographical data: the coordinates of the places that I visited. However, the "Semantic Location History" folder is much more interesting; let's open a file from that folder and inspect them side by side.

Now we see that the semantic location history contains much more interesting and readable information: not only the coordinates of the locations, but also what I was doing (walking/cycling/driving/etc.) and the name of the place where I was doing it.

Conceptually, we can think of this kind of information as "inferred information": based on our sensory data (GPS coordinates), we infer other data. This is exactly what indexers do: you can use them to take your data and make inferences about it to enrich it, all in a privacy-preserving way. The difference with Memri, however, is that you own and control all your data, including the inferred data, and you have the freedom to use any indexer you like.

There are many interesting use cases for indexers: your messages could be labeled as work-related or private. Your digital receipts could be categorized. Your email could be sorted into a folder structure you like (work, advertisements, newsletters, etc.). Your browsing history could be clustered while you are looking for that holiday home, so you can easily find it again. There are more use cases than we can list, and with this post we want to start building an infrastructure that makes it easy to create and share indexers as a community.

Building an indexer

In this tutorial we will be building a very simple indexer. Assume we have some location data in the form of GPS coordinates, and we want to augment them with their corresponding cities and countries.

We start backwards: first we look at how we can describe and interact with an indexer in the iOS client, then we see how the client makes a call to the pod to start the indexer, and finally we write the actual indexer.

Let's start by adding the indexer to the iOS application; if you are not familiar with the application, check out these resources to get started. To display our indexer in the application we need three things: 1) a list CVU definition that describes how a list of indexers should be displayed, 2) a single-item CVU definition that shows a single indexer, and 3) an Indexer DataItem. Luckily, we already have 1 and 2, and 3 is super easy to make. If you are interested in modifying the appearance of the indexers, check out the posts about making your own CVU definitions.

To make an Indexer DataItem, we add an entry to the default database file. This file contains all data that is loaded into the app by default, regardless of the user. There are two important fields here: the type and the indexerClass. We use the type to indicate that this item is an indexer and should be visualized as one. We use the indexerClass to link our indexer here to the class in the backend code.

{
    "type": "Indexer",
    "name": "Geolocation indexer",
    "indexerClass": "GeoIndexer",
    "targetDataType": "Address",
    "indexerDescription": "Enriches datatypes with geographic coordinates with their corresponding cities and countries",
    "icon": "mappin.and.ellipse"
}

Memri uses two classes to represent indexers and their usage. The first is the Indexer, which defines the computations that an indexer performs and stores meta information like a name and description. The second is the IndexerRun, which specifies the data on which the indexer is run and stores the progress of the indexing process.
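To make the relation between the two concrete, here is a rough sketch of how an IndexerRun could refer to our Indexer. The field names below are illustrative assumptions, not the exact pod schema:

# Illustration only: the exact schema of an IndexerRun may differ.
geo_indexer = {
    "type": "Indexer",
    "name": "Geolocation indexer",
    "indexerClass": "GeoIndexer",
}

geo_indexer_run = {
    "type": "IndexerRun",
    "progress": 0,                # updated while the run executes
    "targetDataType": "Address",  # the data this particular run operates on (assumed field)
    "indexer": geo_indexer,       # in the pod this is an edge to the Indexer item
}

In run_indexer.py below you will see this "indexer" edge being followed with indexer_run.traverse("indexer").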

Now that we have our Indexer DataItem defined, it will by default use the CVU definitions from 1 and 2 to display our Indexer. Let's check it out:

When we click on the Indexer in the list above, it creates an IndexerRun with the same name and opens it in a new window. Now we can press "start run" to make an API call to the pod to start the indexer.

Now it's time to set up the code for the corresponding indexer in the backend. When the backend receives an API call to start an indexer, it executes run_indexer.py and passes it the id of the IndexerRun.
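Conceptually, the launch step boils down to something like the sketch below (a simplified illustration, not the pod's actual code); the key point is that the id of the IndexerRun is handed to the script via the RUN_UID environment variable.

# Simplified illustration of how the pod could launch the indexer script.
# Not the pod's actual implementation.
import os
import subprocess


def start_indexer_run(run_uid):
    # Pass the IndexerRun id to the script through an environment variable
    env = dict(os.environ, RUN_UID=str(run_uid))
    subprocess.run(["python", "run_indexer.py"], env=env, check=True)

With that mechanism in mind, let's walk through run_indexer.py itself.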

import os

from indexers import *


def run_indexer(run_uid):
    # Connect to the pod running on the local machine
    api = PodAPI(url='http://localhost:3030/v1')

    # Fetch the IndexerRun item, expanded with its connected items
    indexer_run = api.get(run_uid, expanded=True)

    # Follow the "indexer" edge to get the Indexer item itself
    indexer = indexer_run.traverse("indexer")[0]

    # Run the indexer and write the results back to the pod
    updated_items, new_nodes = indexer.index(api, indexer_run)
    indexer.populate(api, updated_items, new_nodes)


if __name__ == '__main__':
    # The pod passes the id of the IndexerRun as an environment variable
    run_uid = os.environ.get('RUN_UID', None)
    run_indexer(run_uid)

First, we get the indexer run id, which the pod has set as an environment variable. Then we initialize an api object to talk to the pod, and use it to get our indexer_run and indexer. The api object tries to cast the data it gets from the pod into the right format: when it gets an item of type "Indexer", it looks at the indexerClass to find a class in the Python code with the same name. In our case that name is "GeoIndexer", so we need to make a class with the same name and a function "index" that takes the api and indexer_run as inputs.
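The name-based lookup can be pictured roughly as follows; this is a simplified sketch, not the actual PodAPI implementation:

# Simplified sketch: mapping an item's "indexerClass" string to a Python class.
# Assumes the indexer classes are defined in a module called "indexers".
import importlib


def resolve_indexer(properties):
    class_name = properties["indexerClass"]         # e.g. "GeoIndexer"
    module = importlib.import_module("indexers")
    indexer_class = getattr(module, class_name)     # raises if no such class exists
    return indexer_class(properties)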

This sounds pretty complicated, but you don't need to remember all of it: all that matters is that we create a class named GeoIndexer with an index function that takes the api and indexer_run as inputs. So let's do that now. We define the GeoIndexer class below; we omit a few functions, but you can find the full code here.

LOCATION_EDGE = "location"


class GeoIndexer(Indexer):

    def __init__(self, properties, edges=None):
        super().__init__(properties, edges)

    def index(self, api, indexer_run):
        # Fetch the target items (Addresses) and load their connected items
        items_expanded = [d.expand(api) for d in indexer_run.get_data(api)]

        # Keep only the items that have a location with geographical coordinates
        items_with_location = [
            x for x in items_expanded
            if any("latitude" in node.properties
                   for node in x.traverse(LOCATION_EDGE))
        ]

        new_nodes = []
        for item in items_with_location:
            latlong = self.get_lat_long(item)

            # Look up the city and country for these coordinates
            city_name, country_name = self.latlong2citycountry(latlong)

            # Add the city as a property of the existing item
            item.add_property("city", city_name)

            # Attach the country as a separate node; create it locally
            # if it does not exist in the pod yet
            country = self.get_country_by_name(api, country_name)
            if country is None:
                country = api.create_local(
                    Node({"_type": "Country", "name": country_name}))
                new_nodes.append(country)
            edge = Edge(item, country, "country", created=True)
            item.add_edge(edge)

        return items_with_location, new_nodes

The GeoIndexer that we create inherits from the base Indexer class, which automatically adds some functionality that every indexer needs. First, the indexer calls get_data, one such piece of functionality, which queries the pod based on the targetDataType we defined earlier: Address. On each of these items we call expand(), which fetches all items connected to that item and loads them into a property called .edges. Next, we filter the locations and keep only the ones that have geographical coordinates attached to them.

When we query the pod from an indexer, the PodAPI class returns Node objects, which have two important fields: properties and edges. Properties can be interpreted just like class properties; edges are instances of the Edge class, which connect one node to another.
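As an illustration, using the same calls as the snippets in this post but with made-up values, an Address connected to a Location could look like this:

# Illustration only: an Address node connected to a Location node.
address = Node({"_type": "Address", "street": "Main Street 1"})
location = Node({"_type": "Location", "latitude": 52.37, "longitude": 4.89})

# Connect the address to its location with a "location" edge
address.add_edge(Edge(address, location, "location"))

# traverse() follows edges with the given label and returns the connected nodes
assert address.traverse("location")[0].properties["latitude"] == 52.37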

We continue with the main loop. Conceptually we do two things for each Address: 1) we look up the city name from the geographical coordinates and add it as a property to the existing Address, and 2) we look up the country from the geographical coordinates, create a Country node locally, and attach it to the Address with a "country" edge, also locally. One might debate whether it is a good choice to model cities as properties and countries as nodes, but that is not the point of this tutorial. When we finish the loop, we return the set of updated Addresses plus the new Country nodes; the edges are contained in the objects. The run_indexer() function takes these objects and writes them to the graph.
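The geographic lookup itself happens in latlong2citycountry, one of the functions we omitted above. As an illustration, here is one way it could be implemented using the third-party reverse_geocoder and pycountry packages; the full code linked above may do this differently:

# One possible implementation of the omitted helper, shown as a standalone function.
import reverse_geocoder as rg
import pycountry


def latlong2citycountry(latlong):
    lat, long = latlong
    match = rg.search((lat, long))[0]                        # offline reverse geocoding
    city_name = match["name"]                                # nearest populated place
    country = pycountry.countries.get(alpha_2=match["cc"])   # ISO country code -> country
    country_name = country.name if country else match["cc"]
    return city_name, country_name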

This was an example of a simple indexer, but indexers can range from a simple set of rules to a big machine learning model. We want to note that indexers should be self-contained pieces of software: they should not make API calls to other services or send the user's data over the network, because that would hurt the user's privacy.