Data Oriented Domain Design

2022-04-05

One of the most difficult parts of developing software for a particular task is understanding the (business) domain in which the software needs to operate. Software engineers often lack expertise in complex areas such as health, law, or finance, in which the software they create needs to solve problems. A similar situation holds for domain experts: while lawyers are by definition well versed in the legal domain, they are not necessarily software engineers as well. This mutual lack of knowledge can make developing software quite difficult. To continue with the legal example, suppose we want to create an application that represents and searches laws. From the software engineer's side, the legal terminology can be burdensome to understand, and it may not be clear which aspects of it are important to implement. On the other hand, legal experts without any software engineering knowledge need to be able to understand what the software is doing in relation to the concepts of their domain.

Various solutions have been proposed for this problem, such as Domain Driven Design and formalizing the domain using Linked Data ontologies. These methodologies can be quite complex, which they need to be in order to capture the nuances of the domain and to describe what is possible. This often makes the important process of domain formalisation quite daunting. In this article, I will show a very lightweight approach that can be applied as a starting point towards formalising the domain. The core idea is to use lightweight data notation languages, such as JSON and EDN, to represent elements of the domain. The elements of these languages, such as the notations for sequences and key-value pairs, are then used as a common language in which the software engineer and the domain expert can explain the problem and how the software is used to solve it.

For the purpose of this article I will call this method Data Oriented Domain Design (DODD). Not coincidentally, the subject we aim to represent consists of a few elements of the Dodd-Frank Wall Street Reform and Consumer Protection Act, which we will simply refer to as the Dodd-Frank law in this context.

In this article I will use the EDN (Extensible Data Notation) language to illustrate the ideas of Data Oriented Domain Design. In EDN, information is represented as values built from a small set of elements. These elements are common to many programming languages. For example, there are elements for representing text, time and numbers, as well as collections of these elements, such as lists and sets.

EDN is a subset of the syntax of the programming language Clojure. A large part of a Clojure program consists of manipulating information expressed in EDN. For this reason Clojure is often called a data-oriented or data-driven language. With Data Oriented Domain Design we are going to use this "data orientation paradigm" not just for organizing the software system but also for creating a common language in which software engineers and domain experts can communicate.

To make things more concrete, let's imagine a scenario in which a small application needs to be built that searches for definitions within legal documents. One of the nice aspects of many legal texts is that they have a section of definitions that the reader can refer to. Our application will aim to retrieve these definitions based on given criteria. As an example use case, given the acronym DODD of our approach, we will use the text of the Dodd-Frank Wall Street Reform and Consumer Protection Act (the Dodd-Frank law).

I am going to preface this by saying that "I am not a lawyer": I am looking at this application from a software and knowledge engineer's perspective, and I will simplify away a lot of the intricacies of legal text search and representation. That said, this perspective should be illustrative of the issues that arise when developing software for a new domain, and of how Data Oriented Domain Design could be a good starting point.

The text of this law has multiple sections for definitions, but here we look in particular at the first ten definitions outlined in Section 2 of the Dodd-Frank Act.

SEC. 2. <<NOTE: 12 USC 5301.>> DEFINITIONS.

    As used in this Act, the following definitions shall apply, except 
as the context otherwise requires or as otherwise specifically provided 
in this Act:
            (1) Affiliate.--The term ``affiliate'' has the same meaning 
        as in section 3 of the Federal Deposit Insurance Act (12 U.S.C. 
        1813).
            (2) Appropriate federal banking agency.--On and after the 
        transfer date, the term ``appropriate Federal banking agency'' 
        has the same meaning as in section 3(q) of the Federal Deposit 
        Insurance Act (12 U.S.C. 1813(q)), as amended by title III.

[[Page 124 STAT. 1387]]

            (3) Board of governors.--The term ``Board of Governors'' 
        means the Board of Governors of the Federal Reserve System.
            (4) Bureau.--The term ``Bureau'' means the Bureau of 
        Consumer Financial Protection established under title X.
            (5) Commission.--The term ``Commission'' means the 
        Securities and Exchange Commission, except in the context of the 
        Commodity Futures Trading Commission.
            (6) Commodity futures terms.--The terms ``futures commission 
        merchant'', ``swap'', ``swap dealer'', ``swap execution 
        facility'', ``derivatives clearing organization'', ``board of 
        trade'', ``commodity trading advisor'', ``commodity pool'', and 
        ``commodity pool operator'' have the same meanings as given the 
        terms in section 1a of the Commodity Exchange Act (7 U.S.C. 1 et 
        seq.).
            (7) Corporation.--The term ``Corporation'' means the Federal 
        Deposit Insurance Corporation.
            (8) Council.--The term ``Council'' means the Financial 
        Stability Oversight Council established under title I.
            (9) Credit union.--The term ``credit union'' means a Federal 
        credit union, State credit union, or State-chartered credit 
        union, as those terms are defined in section 101 of the Federal 
        Credit Union Act (12 U.S.C. 1752).
            (10) Federal banking agency.--The term--
                    (A) ``Federal banking agency'' means, individually, 
                the Board of Governors, the Office of the Comptroller of 
                the Currency, and the Corporation; and
                    (B) ``Federal banking agencies'' means all of the 
                agencies referred to in subparagraph (A), collectively.

Now let's consider the case of an application searching for definitions in legal text. One feature we would like is that, given the exact name of a term, such as "Board of Governors" or "Bureau", the text of its definition is found. The name of the term can be represented as data with a string of characters. In Clojure, and in many other languages, a string is denoted as text between quotation marks (""). For example, the input for our search can be given as:

"Board of Governors"

Next we have to examine how to represent the output (the result) of our search. In most cases we would like to have a sequence of results listing the definitions found. There are two reasons for this. First, a specific term could have multiple definitions across a variety of documents. Second, it provides a straightforward way to represent the case in which no results are found: we can return a sequence of 0 elements. In Clojure and EDN such a sequence, called a vector, is written as elements between square brackets []. Other data interchange formats and languages call such an ordered sequence by different names: lists, arrays, etc. Such representations are very common in programming languages, but in this article we will stick with the EDN terminology.

In the first version of our program we will just return the sequence of the found texts that describe the definition. To give a concrete example, let's assume we search for the term "Board of Governors" in the partial document above. The input for our program is the string of characters indicating this term:

"Board of Governors"

and the output would be:

[ "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System."]

To show an example where we would not find any definitions in the above text, if we search for the term "Central Bank" in the above fragment using the input

"Central Bank"

we would get an empty sequence as a result:

[]

Examples such as these should already give a good indication of how the code could be structured. Just as importantly, given some information on the basic notation of sequences and strings, legal domain experts could understand what the program is aiming to achieve just by looking at a few such examples, and verify whether we are on the right track.
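
To illustrate how directly such examples translate into code, here is a minimal sketch in Python (chosen only for illustration; the article itself stays with EDN). The names DEFINITIONS, QUOTED_TERM and find_definitions are hypothetical, and the sketch assumes that the definitions have already been split into one string each and that every defined term is quoted as ``term'' as in the fragment above.

```python
import re

# Hypothetical in-memory fragment: three of the Section 2 definitions, one string each.
DEFINITIONS = [
    "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System.",
    "(4) Bureau.--The term ``Bureau'' means the Bureau of Consumer Financial Protection established under title X.",
    "(7) Corporation.--The term ``Corporation'' means the Federal Deposit Insurance Corporation.",
]

# Assumption: a defined term is always quoted as ``term'' in the source text.
QUOTED_TERM = re.compile(r"``(.+?)''")


def find_definitions(term, definitions=DEFINITIONS):
    """Return every definition text that defines exactly `term`, or an empty list."""
    return [text for text in definitions if term in QUOTED_TERM.findall(text)]


print(find_definitions("Board of Governors"))  # a sequence with one definition text
print(find_definitions("Central Bank"))        # an empty sequence
```

The match is exact and case-sensitive, which mirrors the "exact name of the term" requirement stated earlier; a real search would likely need to be more forgiving.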

Of course the above example is a very minimal abstraction. Let's try to expand upon it, to capture a few more elements of the legal domain.

The location in which the searched terms are found is also quite important. A program might search for legal terms across multiple legal documents and, when a term is found, people would likely want to know the exact location in the source material to get more context. This means that we want to represent a few additional values in our search result. For example, if the definition was found in a law, we would also want to have the title of the law, the section in which the definition was found, the URL where the law can be read, etc.

In EDN the format for representing such key-value pairs is called a map, written as pairs of elements between curly brackets: {}. For example, if we want to express the term description as a key-value pair, with both the key and the value being a string, we could write:

[{ "description" "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System."}]

In other languages such maps are called objects, records, structs, dictionaries, hash tables, etc., but the general concept is the same.

The main question is of course what additional elements we want to add here and what we would like to name their keys. To reiterate, "I am not a lawyer", but we can look at how laws are cited to figure out what additional information would make sense from a legal perspective. Thankfully there are some descriptions online of how to cite laws, such as those on the sites of the University of Cincinnati and Cornell Law School.

Given this information, we will add the "title of the act", "public law number", "statute", "year of enaction" and "url" of the law where this definition was found to our search result. Below is the expanded example:

[{ "description" "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System."
  "title-of-act" "H.R.4173 - Dodd-Frank Wall Street Reform and Consumer Protection Act"
  "public-law-number" "111-203" 
  "statute" "1387"
  "year-of-enaction" "2010"
  "url" "https://www.congress.gov/bill/111th-congress/house-bill/4173/text?r=1"}]

Now this might be sufficient, but there are some features we could use to make this a better representation. First, we use strings of characters in several places where we ideally want to specify a number. In EDN, as well as in JSON and other formats, you generally gain precision and safety by expressing such elements as numbers (or, more precisely, integers). In the program we intend to build, this helps us automatically reject certain wrong values and better describe the intent of what is allowed. For example, if someone writes the string "start" instead of a number such as 111 for the law number, the program can refuse the value, so that data contradicting the rules of the domain does not get in. The extra precision also lets us store the two numeric parts of the public law number separately, here under the keys "from" and "to" (in a citation such as Public Law 111-203, the first number identifies the Congress and the second the law's sequence number within that Congress).

[{ "description" "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System."
  "title-of-act" "H.R.4173 - Dodd-Frank Wall Street Reform and Consumer Protection Act"
  "public-law-number" {"from": 111 "to": 203} 
  "statute" "1387"
  "year-of-enaction" 2010
  "url" "https://www.congress.gov/bill/111th-congress/house-bill/4173/text?r=1"}]

We are going to add one additional feature to improve this representation. Note that, depending on what we search for, the strings representing the values will vary a lot while those for the keys remain the same. For example, the value of the returned description is "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System." if we search for the term "Board of Governors", while it is "(4) Bureau.--The term ``Bureau'' means the Bureau of Consumer Financial Protection established under title X." if we search for "Bureau". However, the key for both of these values is "description".

The solution in EDN is to use keywords for such commonly used strings. The name of a keyword is prefixed with : instead of being put in quotation marks. While JSON does not have keywords, many other languages do. Notably, JSON-LD can use Uniform Resource Identifiers (URIs) for a similar purpose.

Using keywords we can give an output for our definition search as follows:

[{ :description "(3) Board of governors.--The term ``Board of Governors'' means the Board of Governors of the Federal Reserve System."
  :title-of-act "H.R.4173 - Dodd-Frank Wall Street Reform and Consumer Protection Act"
  :public-law-number {:from 111 :to 203}
  :statute "1387"
  :year-of-enaction 2010
  :url "https://www.congress.gov/bill/111th-congress/house-bill/4173/text?r=1"}]
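
Languages without a keyword type can approximate the same idea with a fixed vocabulary of keys. The Python sketch below uses an Enum for this purpose; the class Key and the helper as_plain_map are hypothetical names, and this is only one possible stand-in for EDN keywords, not an equivalent.

```python
from enum import Enum


class Key(Enum):
    """Hypothetical fixed vocabulary of keys, standing in for EDN keywords."""
    DESCRIPTION = "description"
    TITLE_OF_ACT = "title-of-act"
    PUBLIC_LAW_NUMBER = "public-law-number"
    STATUTE = "statute"
    YEAR_OF_ENACTION = "year-of-enaction"
    URL = "url"


result = {
    Key.DESCRIPTION: "(4) Bureau.--The term ``Bureau'' means the Bureau of "
                     "Consumer Financial Protection established under title X.",
    Key.YEAR_OF_ENACTION: 2010,
}


def as_plain_map(result):
    """Convert enum keys back to their string names, e.g. before writing JSON."""
    return {key.value: value for key, value in result.items()}


print(as_plain_map(result)["year-of-enaction"])
```

A misspelled key such as Key.DESCRIPTON fails immediately with an AttributeError instead of silently creating a new entry, which is part of what a fixed set of keys buys us.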

As one can see from just these few examples, there can be quite a few ways in which domain concepts are represented. A small set of examples that describes the elements of the domain in a way that is understandable to both the domain expert and the software engineer can be invaluable. The great benefit of Data Oriented Domain Design is that it co-opts battle-tested lightweight data representation languages, such as EDN, for this purpose. The resulting examples also make excellent test cases when developing the software with methodologies such as Test Driven Development (TDD).
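
As a sketch of how the agreed examples can double as test cases, the Python snippet below stores input and expected output as plain data and checks any candidate search function against them. The names EXAMPLES and failing_examples are hypothetical, and the expected outputs use the first, string-only version of the result for brevity.

```python
BOARD_TEXT = (
    "(3) Board of governors.--The term ``Board of Governors'' means "
    "the Board of Governors of the Federal Reserve System."
)

# Each example pairs a search input with the output agreed on with the domain expert.
EXAMPLES = [
    ("Board of Governors", [BOARD_TEXT]),
    ("Central Bank", []),
]


def failing_examples(search, examples=EXAMPLES):
    """Return (term, expected, actual) for every example that `search` gets wrong."""
    failures = []
    for term, expected in examples:
        actual = search(term)
        if actual != expected:
            failures.append((term, expected, actual))
    return failures


def always_empty(term):
    """A deliberately incomplete search, as one might start with in TDD."""
    return []


print(len(failing_examples(always_empty)))  # number of examples not yet satisfied
```

Because the examples are plain data, the same list that was discussed with the legal expert can be handed to the test runner unchanged.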

Of course the above approach has some limitations.

A notation such as JSON-LD can be more expressive in representing the domain and is more capable of representing linked data (i.e., data that is interconnected with other data). However, modelling the domain with explicit links to other data sources from the start can be more complex, especially when domain experts formalize the domain for the first time. As JSON-LD is designed to provide a smooth upgrade path from JSON, starting out with a purely JSON-based model of the domain can be a great initial step.

DODD explains the domain through a set of examples, not through a comprehensive set of restrictions that model the domain in a more complete way. However, there are schema and ontology languages, such as clojure.spec or OWL, that can model more of these restrictions and could extend validation a lot further.

Another issue is that these lightweight data languages are often designed from the perspective of the software engineer, who is most likely going to use them. However, with some tooling they could be made more accessible to domain experts, especially as they are relatively straightforward and have fewer elements than more complex representations.

Nonetheless, even with these limitations, Data Oriented Domain Design, in which a lightweight data notation language is used to express examples of a domain to both domain experts and software engineers, can provide a relatively gentle start to modelling the domain. Due to its relative simplicity it could be applied as an initial step, before more "heavyweight" models are brought into the picture. It can provide a great tool for some frank discussions on how software should function in a particular domain.