2013/06/29

Software Rants 13: Python Build Systems

So after delving into pcsx2 for a week and having the wild ride of a mid-sized CMake project, I can officially say that any language that makes conditionals require a repetition of the initial statement is dumb as hell. But CMake demonstrates a more substantial problem - domain languages that leak, a lot.

Building software is a complex task. You want to call external programs, perform a wide variety of repetitious tasks, check and verify the environment, and on top of that you need to keep track of changes to minimize build times.

Interestingly, that last point leads me to a tangent - there are three technologies that are treated pretty much independently of one another but overlap a lot here. Source control, build management, and packaging all involve the manipulation of a code base and its outputs. Source control does a good job managing changes, build systems produce artifacts conditioned on circumstance, and packagers prepare the software for deployment.

I think it would be interesting if a build system took advantage of the presence of the other two dependencies of a useful large software project - maybe using git staging to track changes in the build repository. Maybe the build system can prepare packages directly, rather than having an independent packaging framework - after all, you need to recompile most of the time anyway.
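A tiny sketch of that first idea in plain Python - dirty_files is a hypothetical helper, and git is the only real dependency here:

import subprocess

# Ask git which tracked files differ from the last commit; a build system
# could treat exactly these as dirty instead of stat-ing the whole tree.
def dirty_files():
    out = subprocess.check_output(['git', 'diff', '--name-only', 'HEAD'])
    return out.decode().splitlines()

for path in dirty_files():
    print('needs rebuild:', path)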

But that is beside the point. The topic is build systems - in particular, waf. Qmake is too domain specific and has the exact same issues as make, cmake, autotools, etc. - they all start out as domain languages that mutate into borderline Turing-complete languages, because their domain is hugely broad and complex and has only grown more complex over time. This is why I love the idea of Python-based build systems - though at the same time, it occurs to me that most Python features go unused in a build system and just waste processor cycles.

But I think building is the perfect domain for scripting languages - Python might be slow, but I couldn't care less considering how pretty it is. However, my engagements with waf have made me ask some questions - why does it break with traditional Pythonic software development wholesale, from bundling the library with the source distribution, to expecting fixed-name wscript files that provide functions with some wildcard argument that acts really magical?

What you really want is to write proj.py and use traditional Pythonic coding practices with a build system library, probably from PyPI. You download the library and do an import buildsystem, or from buildsystem import builder, or something - rather than pigeonholing yourself into a two-decade-old philosophy of fixed-name, extensionless files in every directory.

Here is an example I'd like to write in this theoretical build system covering pretty much every aspect off the top of my head:

# You can play waf and just stick the builder.py file with the project,
# without any of the extensionless fixed-name nonsense.
import builder
from builder import find_packages, gcc, clang
from sys import platform

subdirs = ('sources', 'include', ('subproj', 'subbuilder.py'))
name = 'superproj'
version = '1.0.0'
args = (('install', {'ret': 'inst'}),)  # option names plus optional specifiers
pkg_names = ('sdl', 'qt5', 'cpack')
pkgs, libs, utils = {}, [], []

builder.lib_search_path += ('/lib', '/usr/lib', '/usr/local/lib', '~/.lib',
                            '/usr/lib32', '/usr/lib64', './lib')

# Start here; parse the arguments (including the optional specifiers in args).
# A lot of the builder global members can be initialized with this function
# via default arguments.
todo = builder.init('.', opt=args, build_dir='../build')

if todo == 'configure':
    # builder packages are an internal class, providing libraries, versioning,
    # descriptions, and headers. When you call your compiler, you can supply
    # packages to compile with.
    pkgs.update(builder.find_packages(pkg_names))
    pkgs.update(find_packages('kde4'))
    utils += builder.find_progs('gcc', 'ld', 'cpp', 'moc')
    # Find a library by name; it does a case-insensitive search for any library
    # file matching the system's naming scheme, like libpulseaudio.so.0.6 or
    # pulseaudio.dll. Found libraries are cached, so subsequent builds don't
    # repeat the search.
    libs += builder.find_lib('pulseaudio')
    otherFunction()  # any plain Python helper you define yourself
    builder.recurse(subdirs)
elif todo == 'build':
    # You can get environments for various languages from the builder.
    cpp = builder.env.cpp
    py = builder.env.py
    qt = builder.env.qt  # for moc support

    # You can set build dependencies on targets, so if the builder can find
    # these in the project tree it builds them first.
    builder.depends('subproj', 'libproj')

    # builder would be aware of sys.platform
    if platform == 'linux':  # linux building
        qt.srcs('main.cpp', 'main.moc')
        qt.include('global.hpp')
        qt.pkgs = pkgs['qt5']
        qt.jobs = 8  # or use the .compile keyword syntax below
        qt.cc = gcc  # set the compiler
        qt.args = ('-wstring',)
        # qt.compile would always run the MOC
        qt.compile(jobs=8, cc=gcc, args=qt.args + ('-O2', '-pthread'),
                   warn=gcc.warn.all, out='verbose')
        # At this point, your .o files are generated and dropped in the
        # builder.build_dir directory.
        builder.recurse(subdirs, 'build')
    elif platform == 'darwin':  # osx building
        pass
    elif platform == 'win32':  # windows building
        pass
elif todo == 'link':
    pass  # do linking
elif todo == 'install':
    pass  # install locally
elif todo == 'pack':
    pass  # package for installation, maybe using cpack

Basically, you have a library that enables building locally, and you use it in a procedural order of operations to do so, rather than defining black-box functions you want some builder program to run. There could also be prepared build objects you could get from such a library; say, builder.preprocess(builder.defaults.qt) would supply an object that handles whatever operation is being invoked (so you would use it regardless of which phase your script is executing) to do the boilerplate for your chosen platform.
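Something like this, sticking with the imaginary builder library from the example above (preprocess and the handle method are both invented names):

# One prepared object does the qt boilerplate for whichever phase
# the script was invoked with.
qt_boiler = builder.preprocess(builder.defaults.qt)
qt_boiler.handle(todo)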

I imagine it could go as far as to include anything from defaults.vsp to defaults.django or defaults.cpp or defaults.android. It would search on configure, include on build, and package on pack all the peripheral libraries complementing the chosen development platform, from one entry line.

The principal concern with such a schema is performance. You want a dependency build graph in place so you know what you can build in parallel, besides inherently using nprocs forked programs to parse each directory independently. The root script starts the process, so you need builder.init() in any script that is meant to start a project build; if you recurse into a subproject that calls that function, it does nothing the second time.
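A minimal sketch of the dependency-graph half, using only the Python standard library (the target names are placeholders, and compile_target stands in for a real compile step):

import os
from graphlib import TopologicalSorter
from concurrent.futures import ThreadPoolExecutor

# Each target maps to the set of targets it depends on.
targets = {'app': {'libproj', 'subproj'}, 'libproj': set(), 'subproj': set()}

def compile_target(name):
    print('building', name)  # stand-in for the real compile step

sorter = TopologicalSorter(targets)
sorter.prepare()
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    while sorter.is_active():
        # Everything whose dependencies are finished can build in parallel.
        batch = {pool.submit(compile_target, t): t for t in sorter.get_ready()}
        for future, target in batch.items():
            future.result()
            sorter.done(target)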

You would want to support several ways to deduce changes. Besides just hashes, you could use file system modification dates, or maybe even git staging and version differences (i.e., a file that doesn't match the current commit version is assumed changed). You would cache the results afterwards. By default you would probably use all available means, and the user could turn some off for speedups at the cost of potentially redundant recompilation (e.g., if you move a file, its modification date changes and the old cache entry is invalidated, but if the hashes match it is assumed to be the same file moved and isn't recompiled).
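A sketch of that layered check, under the assumption that a flat JSON file is cache enough: the cheap mtime comparison runs first, and hashing only happens when the mtime differs, so a moved-but-identical file is re-cached rather than recompiled:

import hashlib
import json
import os

def load_cache(path='.buildcache'):
    try:
        with open(path) as f:
            return json.load(f)
    except OSError:
        return {}

def save_cache(cache, path='.buildcache'):
    with open(path, 'w') as f:
        json.dump(cache, f)

def changed(src, cache):
    mtime = os.stat(src).st_mtime
    entry = cache.get(src)
    if entry and entry['mtime'] == mtime:
        return False  # fast path: untouched since last build
    with open(src, 'rb') as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    cache[src] = {'mtime': mtime, 'hash': digest}
    # Same content under a new mtime: moved or touched, not really changed.
    return not (entry and entry['hash'] == digest)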

If you support build environments, you can support radically different languages. I just think there are some shortcomings in both scons and waf that prevent them from truly taking advantage of their Pythonic nature, and failing to use all the paradigms available to Python is one of them, I feel.

2013/06/24

Magma Rants 5: Imports, Modules, and Contexts

One of the largest issues in almost any language is the trifecta of imports, packaging, and versioning. For Magma, I want a well-thought-out design that enables portable, compartmentalized code, interoperability between code bases, and the ability to import both source and precompiled object code.

First, we inherit the nomenclature of import <parent>:<child>, where internally referencing such a module goes through the defined <parent>:<child> namespacing. Imports are filesystem-searched, first locally (with a compiler-limited depth, blacklist, and whitelist available), then on the system's import and library paths. You can never use the import clause to define a full pathname import of a static filesystem object, but the internal plumbing in std:module includes the necessary woodwork to do raw module loading.
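A hypothetical usage sketch - gfx:mesh, Mesh, and std:module:load are all invented names here; only the import syntax, the : namespacing, and the fact that std:module exposes raw loading come from the paragraph above:

import gfx:mesh

m = gfx:mesh:Mesh()
raw = std:module:load("/opt/plugins/extra")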

The traditional textual headers and binary libraries process still works. You don't want to bloat deployment libraries with development headers, though if possible I'd make it an option. Magma APIs, with the file suffix of .mapi, are the primary way to provide an abstract view of a library implementation.

In general practice though, we want to avoid the duplication of work in writing headers and source files for every part of a program, to speed up compile times. This is mostly a build system problem, in that you want to keep a historic versioning of each module (via hash), so if a module's hash changes you know to recompile it. This means you should write APIs only for libraries or externalized code - which is what a C++ header really should be for.

In addition, an API only describes public member data - you don't need to describe the memory layout of an object in an API so that the compiler can resolve how to allocate address space; you just specify the public accessors. When you compile a shared object, the public accessors are placed in a forward table that a linker just needs to read out. Note that since a library can contain multiple API declarations in one binary, the format also has a reference table into the API indexing arrays.
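Purely as illustration, a .mapi might read something like the following - the named () scope block, fn, and : namespacing come from the other Magma rants; the member declarations and the f32 type name are invented:

audio:mixer (
    fn volume() f32
    fn setVolume(level f32)
)

The corresponding binary would carry only a forward table of these two accessors; the object's actual layout never leaves the implementation.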

The workflow becomes one of importing APIs where needed, and using compiler flags and environment variables to search for and import the library implementing each API. One interesting prospect might be to go the other way - to require that compiled libraries be named the same as their APIs, and to have one API point to one binary library with one allocator table. It would mean a lot of smaller libraries, but that actually makes some sense. It also means you don't need a separate linker declaration, because any imported API will have a corresponding (for the linker's sake) compiled binary of the same name in the library search path.

I really like that approach - it also introduces the possibility of delayed linking, so that a library isn't linked in until it's accessed, akin to how memory pages work in virtual memory. You could also have asynchronous linking, where accessing the library's faculties before it is pulled into memory causes a lock. Maybe an OS feature?

As a thought experiment I'm going to document what I think are all the various flaws in modern shared object implementations and how to fix them in Altimit / Magma:

  • You need headers to a library to compile with, and a completely foreign binary linkable library or statically included library to link in at build or run time.
  • You need to describe the complete layout of accessible objects and functions in a definition of a struct or class, so that the compiler knows the final size of an object type.
  • You need to make sure the header inclusions and library search paths contain the desired files, even on disparate runtime environments.
  • Symbol tables in binaries can be large and cumbersome to link at runtime and can create sizable load times.

2013/06/06

Magma Rants 4: Containers and Glyphs

Containers are the most pervasive core aspect of any language's long-term success. In Magma, since () denotes scope blocks (and can be named) and [] is only used for template declarations, {} and [] are available standalone to act as container literals like in Python. [] is an std:array, the primitive statically sized array of homogeneous elements. If an array has multiple types in it, it uses std:var as the element type, using the natural from[object] conversion available in var for a user-defined type, or an overridden, more precise conversion function.

{X, X, X} is for unique sets, and {(X,Y), (X,Y)} is for maps. In the same line of thinking, the language tries to find a common conversion type these objects fit in (note: the compiler won't trace the polymorphic inheritance tree to try to find a common ancestor) and casts them, or throws them into vars. The indexing hash functions for sets and maps that determine uniqueness are well defined for std types, and you can implement your own as a template override of std:hash[T](T), which needs to return a uint.

Python (since I love Python) also includes the immutable list type () as a tuple, but since Magma's [] is already a static contiguous std:array and not an std:dynArray, there is no performance benefit to a separate tuple. Note that, like everything in Magma, [] and {} are implied const and can be declared muta {} or muta [] to construct a mutable version.
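A quick sketch of the literals described above, using = for assignment as adopted in the glyph discussion below; the variable names and inferred types here are assumptions:

nums = [1, 2, 3]
mixed = [1, "two", 3.0]
ids = {1, 2, 3}
ages = {("alice", 30), ("bob", 25)}
scratch = muta [0, 0, 0]

nums would infer std:array[int]; mixed would fall back to an std:array of std:var; ids is a unique set; ages is a map; and scratch opts out of the implied const via muta.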

One of the principal goals I had in thinking about Magma is that a majority of languages overload and obfuscate the implications of glyph patterns, which makes compilation time consuming with a complex parser, since syntax is very situational depending on the surrounding text in a source file. Additionally, any time a language uses multiple sequential glyphs to represent some simple concept (equality as ==, scope as ::, // for comments), I feel it has failed to properly balance glyph allocation and behavior. Admittedly, in the current documentation on Magma, I'm using == for logical equality, because I turned = back into the assignment operator instead of : - solely because += is way too baked into my brain to see +: and not think it strange - and that allowed me to use : for scope and . for property access (which are different, Java).

In conceptualizing Magma, I drafted out all the available glyphs on a standard keyboard and assigned them to language functions. As a result, glyphs like $ became available as substitutes for the traditional named return in other languages, and made function declarations more obvious, because you declare a return type in a function definition (or fn).
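For example, a function might look something like this - only fn, the declared return type, and $ as the named return come from this entry; the parameter syntax and body are invented:

fn square(x int) int (
    $ = x * x
)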

2013/06/05

Magma Rants 3: Powerful variance and generics

Magma uses the same compile-time template checking that C++ uses - templates are defined with square braces [] in class and function definitions. The distinction between polymorphism and templates is, I feel, still valuable, and unlike Go, I don't see anything inherently wrong with natively compiled templates in the C++ vein - if a type used with a template doesn't support, at compile time, the functions and typing the template uses, it is a compiler error. The implementation will try to coerce types using the object generic functions object.to[T](T) and object.from[T](T); if either direction is defined (because either class could define the conversion to the other type), the cast is done. This avoids the ambiguity of dynamic casting in C++, because there is a well-defined set of potential casts for every object, and the only difference between static_cast and dynamic_cast is whether the casts themselves are implemented as const or not. Const casting still exists but requires the "undefined" context to allow undefined behavior (i.e., mutating a passed-in const object can be very bad). Const cast is found in std:undefined:ConstCast().
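A hypothetical template sketch, reusing the invented fn syntax from the glyph rant; only the [] template braces and the compile-time rule come from this entry. Instantiating it with a T that doesn't implement < would be a compile error:

fn smaller[T](a T, b T) T (
    if a < b ( $ = a )
    else ( $ = b )
)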

From the other direction, Magma contains std:var, which is the stratos autoboxing container type. It is used pervasively as a stand-in for traditional polymorphic [Object] passing, because you can retrieve the object from a var with compile-time guarantees and type safety, and var includes a lot of additional casting functionality for strings and numbers not found in native Magma casts. If you have a heterogeneous collection, you almost always want the contents to be vars, unless you have a shared common restricting ancestor to denote a subset of objects. You can still query the type of a var, and it delegates all interactions besides the ? operator to the contained object. If you really need to call query() / ? on the var itself, call var:get[T](). You can also reach the contained object's own get function by getting the object and calling get on it.
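A round-trip sketch; the construction syntax is invented, and whether get is reached through . or : is an assumption - the text only fixes the var:get[T]() name:

v = std:var(42)
n = v.get[int]()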

Magma also has the auto keyword as a type specifier that deduces a type from an rvalue expression, in the same way C++ does. The type is statically deduced from the rvalue statement and parsed as such.
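Hypothetically, something like:

auto x = 1.5 * 4

where x statically deduces to the floating-point type of the rvalue.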

2013/06/02

Magma Rants 2: Low level abstractions in the base language

One thing I don't like in modern low-level language design is how easily you hang yourself on aged constructs from the 70s, like jumps, and how valuable glyphs like the colon are consumed to maintain a very niche feature. As a result, the Magma base language is just the core components of modern language paradigms; the traditional behavior that goes rarely used and can easily play gotcha on a developer is moved out of it. Here is a collection of traditionally core language features in C and its ilk that are available only under the System context:
  • std:bit contains the bitwise operations (and, or, left and right shifts, negation). Many std classes like bitfield and flags - and the compiler, in the presence of multiple same-scope booleans - use bitwise arithmetic and bit operations, but they aren't user-facing because a user rarely needs them. The traditional "or" flags syntax of FLAG1 | FLAG2 | FLAG3 is instead a function of flags addition, in the form FLAG1 + FLAG2 + FLAG3, and subtracting a flag removes it from the bitfield.
  • std:flow imports enable (as a compiler feature) goto, continue, break, and label. They take the form of functions, std:goto(label), std:label(name), std:continue(), and std:break().
  • std:ptr contains the raw pointer construct ptr[T], and the alloc() and free() functions.
  • std:asm introduces the asm(TYPE) {} control block to inline assembly instructions in Magma code.
The base language is thus memory safe, doesn't enable bit overflow of variables, and has consistent control flow. This functionality is still available for those who need it, but excluded in consideration of any large project that wants to avoid the debugging nightmares that emerge from using these low-level tools.
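A hypothetical sketch of opting in - the System block syntax and the alloc[int]() signature are invented; std:ptr, ptr[T], alloc(), and free() come from the list above:

System (
    import std:ptr
    p = alloc[int]()
    free(p)
)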