Vocabulary Types

Published
Last Updated
Author(s)
Enimihil
Tags
#language #philosophy #programming #types

Introduction

"Vocabulary type" is a term I've come across a few times in discussions of programming; it names a category of programming language types.

As I've understood the term, a vocabulary type is one that is appropriate for use in interfaces; it is passed as a parameter, returned, or otherwise visible outside the immediate module (unit of composition of the software). The type must be usable in APIs, and so has some common (language-dependent) properties, which define the category.

I have not seen much discussion of what those properties would be. If any readers have sources that try to piece this together (or even sources that mention vocabulary types in specific contexts), I would be glad to hear of them, because this is a difficult topic to search. (Both "vocabulary" and "type" show up in a lot of research and technical contexts unrelated to the concept of a "vocabulary type", so the search results are messy, at best.)

What properties do vocabulary types share, then?

It should be fairly obvious that a language's primitive types make good candidates for vocabulary types; almost by default, anything the language supplies as built-in (whether as part of the standard library or within the language itself) rises to this level.

These types are generally stable, as robust and well-tested as possible, and strictly defined. The types provided by the language environment are necessarily always available as well. A 'good' vocabulary type would be easily accessible. In the same way that using obscure language makes writing less accessible to a broad audience, using an obscure type as a vocabulary type in your software would make it less accessible to the broadest audience of programmers.

The analogy seems to be that the set of vocabulary types defines the 'dialect' of programming that you are doing. Whatever the specific language (C++, Rust, Python, or Haskell), physics simulations will have a common vocabulary, both in the regular sense of specialized terminology and jargon and in the sense of "vocabulary types" with similar purposes: for example, a 3D vector in a Cartesian space, a 3D point in a Cartesian space, affine transformations, and so on.
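To make the physics example a little more concrete, here is a minimal sketch of what such a domain vocabulary might look like in C++. The names Vec3, Point3, and step are made up for illustration and are not taken from any particular library; the point is only that a position and a displacement are distinct words in the vocabulary, even though they have the same representation.

```cpp
// Hypothetical domain vocabulary for a physics library (illustrative names).
struct Vec3 {    // a displacement or direction
    double x, y, z;
};

struct Point3 {  // a position
    double x, y, z;
};

// Displacements add to give another displacement.
constexpr Vec3 operator+(Vec3 a, Vec3 b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// A displacement can be scaled.
constexpr Vec3 operator*(double s, Vec3 v) {
    return {s * v.x, s * v.y, s * v.z};
}

// The difference of two positions is a displacement.
constexpr Vec3 operator-(Point3 a, Point3 b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

// Moving a position by a displacement gives a position.
constexpr Point3 operator+(Point3 p, Vec3 v) {
    return {p.x + v.x, p.y + v.y, p.z + v.z};
}

// Deliberately no operator+(Point3, Point3): adding two positions
// has no meaning in this vocabulary, so it does not compile.

// An interface written in the vocabulary: advance a position by velocity * dt.
constexpr Point3 step(Point3 position, Vec3 velocity, double dt) {
    return position + dt * velocity;
}
```

A caller who knows the domain can read the signature of step without any further documentation, which is roughly what I take "vocabulary" to mean here.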

Somewhat more fundamental than domain-specific vocabulary is the question: are there vocabulary types that are programming-language-agnostic, but still specific to programming?

Can we then design better languages and software libraries by understanding these commonalities?

Concrete Examples

Details in various programming languages, of course, will still differ, but I think candidates for these sorts of types are:

Simple

fixed-size integers (modular/unsigned)
Examples: uint32_t, unsigned int
fixed-size integers (two's-complement/signed)
Examples: int16_t, int, long int
IEEE754 float32 and float64 (float128? 80-bit x87 floats?)
Examples: float, double
byte (an uninterpreted unit of memory that is 8 bits in size)
Examples: std::byte, char (usually), unsigned char
text string (represents human-language text, whether the encoding is Unicode, ASCII, or unspecified)
Examples: std::string, char const*
any/opaque
Examples: boost::any, void*
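As a rough sketch of how two of these simple types behave in C++: unsigned fixed-size integers wrap around modulo 2^N, while std::byte is uninterpreted and has to be converted explicitly before it can be treated as a number. The helper names wrapping_add and low_nibble are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: unsigned arithmetic is modular, so the result
// wraps around modulo 2^32 instead of overflowing.
constexpr std::uint32_t wrapping_add(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(a + b);
}

// Hypothetical helper: std::byte supports bitwise operations but not
// arithmetic, so std::to_integer is needed to read it as a number.
constexpr unsigned low_nibble(std::byte b) {
    return std::to_integer<unsigned>(b & std::byte{0x0F});
}
```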

Parameterized

optional/maybe/nullable
Examples: std::optional, T*
homogeneous sequence
Examples: int[], std::vector<int>, std::span<std::byte, 16>, std::array<float, 16>, generator<int>
key->value mapping between two specific types
Examples: std::map<int, std::string>, std::unordered_map<std::string, void*>
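A minimal sketch of an interface built only from these parameterized types, assuming C++17. The functions find_name and known_ids are hypothetical; what matters is that a caller can understand both signatures without learning any library-specific type.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical API: look up a display name by numeric id.
// "Not found" is an empty optional rather than a sentinel value.
std::optional<std::string> find_name(const std::map<int, std::string>& names,
                                     int id) {
    auto it = names.find(id);
    if (it == names.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Hypothetical API: list the known ids as a homogeneous sequence.
// std::map iterates in ascending key order, so the result is sorted.
std::vector<int> known_ids(const std::map<int, std::string>& names) {
    std::vector<int> ids;
    ids.reserve(names.size());
    for (const auto& entry : names) {
        ids.push_back(entry.first);
    }
    return ids;
}
```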

I'll admit a C++ bias in the above, as that feels to me like the most precise way to name the ideas.

Constraining further to abstract types that aren't concrete programming language primitives at all can simplify and restrict this idea of universal vocabulary types even more.

Perhaps the success of JSON (now standardized in RFC 8259) is due to just how clearly it constrains data to a universal set of broadly applicable vocabulary types:

number
Integral or floating point. RFC 8259 notes that integers are only interoperable within the range that IEEE 754 float64 represents exactly, [-(2^53)+1, (2^53)-1], and floating point values are likely only compatible if exact float64 is used.
boolean
A value usable in a conditional context selecting between two alternatives true and false.
null
A value that is present but otherwise empty/unspecified. (Analogous to an empty optional<T>, or a nullptr-valued T*.)
string
A value representing a sequence of Unicode code points (control characters must be escaped in the serialized form).
array
A sequence of values of any type (homogeneous only in the sense that every element is represented as the any type).
object
A mapping from string values to any.

Though rarely will a particular use constrain itself so heavily unless the goal is truly broad interoperability.
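The six JSON types listed above map fairly directly onto a closed sum type. Here is a minimal sketch in C++17 using std::variant; the names Json, JsonArray, JsonObject, and type_name are hypothetical and do not come from any existing JSON library. One assumption to flag: the recursive definition relies on std::map accepting an incomplete value type at the point of declaration, which the mainstream standard libraries allow but the standard only guarantees for std::vector.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <variant>
#include <vector>

struct Json;  // forward declaration so the containers can refer to it

using JsonArray = std::vector<Json>;              // array: sequence of any
using JsonObject = std::map<std::string, Json>;   // object: string -> any

// A JSON value is exactly one of the six vocabulary types.
// A default-constructed Json holds the first alternative, i.e. null.
struct Json {
    std::variant<std::nullptr_t,  // null
                 bool,            // boolean
                 double,          // number (float64 only, in this sketch)
                 std::string,     // string
                 JsonArray,       // array
                 JsonObject>      // object
        value;
};

// Name of the JSON type currently held, following the variant's order.
inline const char* type_name(const Json& j) {
    switch (j.value.index()) {
        case 0: return "null";
        case 1: return "boolean";
        case 2: return "number";
        case 3: return "string";
        case 4: return "array";
        default: return "object";
    }
}
```

When building values, it is safest to pass an explicit std::string (for example Json{std::string("text")}) rather than a string literal, since under the original C++17 rules a literal can be converted to the bool alternative instead.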