Working behind the corporate proxy

Suppose you are working in an R&D environment with only intranet access. And you can only browse the internet in the browser through a proxy, this article may help you configure your environment to make your daily life easier.

We only consider a Linux environment, or Windows with cygwin.

Check the proxy address in your browser

Let’s use IE as an example. It is under menu [Tools] -> [Internet Options] -> [Connections] -> [Lan settings] -> [proxy server]. If you use a pac file to automatically configure the proxy, you need to check the content in the pac file to find the proxy address out.

Let use as the proxy address, and port 8080 is the proxy’s port. It’s possible the proxy also requires authentificaiton. Let’s suppose the domain name in a Windows domain is CHINA, with the domain username is egUserName and password is egPassword.

Configure http_proxy and https_proxy in Shell

Please add the two lines into your linux’s .bashrc or .bash_profile

export http_proxy="http://CHINA\\"
export https_proxy="https://CHINA\\"

After this, many tools can work properly, like wget, curl, etc.

Configure http_proxy for Git

You can use git commands to do the configuration. Let’s modify ~/.gitconfig file directly.

    proxy = "https://CHINA\\"
    sslVerify = false
    proxy = "https://CHINA\\"

We use https proxy address for both http and https. You can tweak it to see whether it works in your environment. The “sslVerify = false” is used to disable the ssl verification during git data transmission. Otherwise, if your company does deep packet inspection in the https proxy connection, it will cause troubles for the git to verify the data.

After this configuration, all the http/https git address should work, like You may still need to attach the username and password, e.g. yourGitAccout and yourGitPassword in the address, like If you only want to attach the username not the password in the address, you can add this line into your .gitconfig to support cache your password in memory for 1 hour.

        helper = cache --timeout=3600

However, it doesn’t support git@ protocol to clone a project. One way to solve it is to let git automatically replace  “git@” with https:// For example, replace git address could be done with this configuration in the .gitconfig file.

[url ""]
        insteadOf =

If you think this is still not the way you want, let’s switch to the powerful tool corkscrew.

SSH over Http Proxy through corkscrew

You can Google corkscrew and ssh over http to learn more about the mechanism. Here let’s only focus on the usage. Install corkscrew, e.g. use Ubuntu’s “apt-get install” or search it in your Cygwin’s package list. And suppose you want to ssh to and you have a git address , you can create a file ~/.ssh/config with the content

ProxyCommand /usr/bin/corkscrew 8080 %h %p ~/.ssh/authfile

And you create the ~/.ssh/authfile


Now, you can try these commands, like “ssh” or “git clone”.

Enable Ubuntu apt-get to use proxy

Ubuntu apt-get can use http_proxy and https_proxy to download package and updates. However, if you run these commands with sudo, the root’s environment has no such environment variables. Add this line into /etc/sudoers will solve this problem.

Defaults env_keep = "http_proxy https_proxy ftp_proxy"


Last word, please change the mode to 600 of all the files that contains your username and password if you are working in a shared server.


Using State Machine to Solve Problems

There is a popular interview problem in the web “Given a string, find the longest substring that contains at most 2 distinct characters” with the test string “ababcbcbaaabbdefggggg” and answer “baaabb”. Many sample solutions are available if you just search the test string.

But how about the question is changed to finding at most n distinct characters where n is a input? Let’s solve it with a state machine. The basic approach is very similar to the answers you can find online, we use two pointers, right and left, which point to the start and end of the substring. And we use a map to record how many distinct characters are in the substring. Regarding the state machine, it only has two states, valid (the number of distinct characters in the substring is less or equal than n) and invalid.


The iterating of the input string starts from the valid states and both the right and the left pointers are set to the beginning of the input string. In the valid state, we just move the right pointer to increase the substring. As a result, if the new substring is valid, the state is still valid. Otherwise, the state is changed to the invalid. In the invalid state, we can only move the left pointer to remove a character from the substring, and the state is changed accordingly. A record action is triggered each time the state goes to the valid, and the record action update the longest substring if the current substring is longer than the previous recorded one.

Because I’m using Lua recently, and Lua has built-in Map support, I provide Lua implementation. The code should be self-explained. The only special part of Lua is the substring uses inclusive indices.You can read more about the string library of Lua.

local args = {...}
local str = args[1] or "ababcbcbaaabbdefggggg"
local allowed = tonumber(args[2]) or 2

local tbl = {} -- The map to store char and its count

-- The following functions will directly use tbl
local function addchar(c)
  tbl[c] = (tbl[c] or 0 ) + 1

local function rmchar(c)
  assert(tbl[c], "Try to remove a non-existent char:"..c)
  tbl[c] = tbl[c] - 1
  if tbl[c] == 0 then tbl[c] = nil end --make sure move the key

local function isvalid()
  local numkeys = 0
  for k,_ in pairs(tbl) do
    numkeys = numkeys + 1
  return numkeys <= allowed

local function main()
  print(string.format("Find the longest substring in \"%s\" with %d distinct chars:",str, allowed))
  local state = true -- The state valid at the beginning
  local left,right = 0,0

  local recleft, recright = 0,0 --used to record the final result
  local function record(l,r)
    if r-l > recright - recleft then
      recleft,recright = l,r

  while right <= #str do
    if state then --valid state
      right = right + 1
      if isvalid() then --added success
        record(left, right)
        state = false
    else --invalid state
      left = left + 1
      if isvalid() then
        state = true

  local res = str:sub(recleft+1, recright)
  print(string.format("The result substring is \"%s\" with length %d,\n", res, #res))


In fact, there is another coding question. Translating all the CPP style comments “//” into C style “/* */” comments of a input source file. If you took the compiler course before, you should know it’s a parser problem, and you may use state machine to solve it. In fact, the state machine is more complex than the previous problem. For example, there are code state, and comment state, and there are sub states that is in string state of a code state. The state changes are triggered by the input character or characters. But the basic idea is similar to the above question.

Compile Scripting Language into Native Code without Specialization – Another Example

I’m working on optimizing Lua language based applications recently. We can either optimizing the source code of the application, or improving the runtime (Lua interpreter). But there is another approach, just compiling the Lua code into native code (or C code and use C compiler), and hope it can run faster.

There are at least two projects in Lua domain take this approach, llvm-lua, and lua2cee(or lua2c). Let’s use simple example to see how they work.

a = {}
a.c = "HelloWorld"
b = a.c
c = a.c
print(b, c)


llvm-lua’s path is [lua src]->(1)->[lua byte-code]->(2)->[llvm byte code]->(3)->[native code]. Luac compiler does the first transformation, and the project does the second. Then it relies on LLVM to do the optimization in the third. The lua byte-code after the stage one is
PC Line Opcode Operands
1 [1] NEWTABLE0 0 0
2 [1] SETGLOBAL 0 -1 ; a
3 [2] GETGLOBAL 0 -1 ; a
4 [2] SETTABLE0 -2 -3 ; "c" "HelloWorld"
5 [3] GETGLOBAL 0 -1 ; a
6 [3] GETTABLE0 0 -2 ; "c"
7 [3] SETGLOBAL 0 -4 ; b
8 [4] GETGLOBAL 0 -1 ; a
9 [4] GETTABLE0 0 -2 ; "c"
10 [4] SETGLOBAL 0 -2 ; c
11 [5] GETGLOBAL 0 -5 ; print
12 [5] GETGLOBAL 1 -4 ; b
13 [5] GETGLOBAL 2 -2 ; c
14 [5] CALL 0 3 1
15 [5] RETURN 0 1

And the llvm byte-code after the stage two is (with -O0 flag in llvm-lua)

define i32 @_test_lua_0_0(%struct.lua_State* %L) {
%0 = call %struct.LClosure* @vm_get_current_closure(%struct.lua_State* %L)
%1 = call %struct.lua_TValue* @vm_get_current_constants(%struct.LClosure* %0)
call void @vm_OP_NEWTABLE(%struct.lua_State* %L, i32 0, i32 0, i32 0)
call void @vm_OP_SETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 0)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 0)
call void @vm_OP_SETTABLE(%struct.lua_State* %L, %struct.lua_TValue* %1, i32 0, i32 257, i32 258)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 0)
call void @vm_OP_GETTABLE(%struct.lua_State* %L, %struct.lua_TValue* %1, i32 0, i32 0, i32 257)
call void @vm_OP_SETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 3)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 0)
call void @vm_OP_GETTABLE(%struct.lua_State* %L, %struct.lua_TValue* %1, i32 0, i32 0, i32 257)
call void @vm_OP_SETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 1)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 0, i32 4)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 1, i32 3)
call void @vm_OP_GETGLOBAL(%struct.lua_State* %L, %struct.lua_TValue* %1, %struct.LClosure* %0, i32 2, i32 1)
%retval = call i32 @vm_OP_CALL(%struct.lua_State* %L, i32 0, i32 3, i32 1)
%retval1 = tail call i32 @vm_OP_RETURN(%struct.lua_State* %L, i32 0, i32 1)
ret i32 %retval1

llvm-lua project provides all the byte-code functions, such as OP_GETGLOBAL(…). The implementation of these functions is just from the lua interpreter’s code. For example, GETGLOBAL’s llvm-lua implementation is

void vm_OP_GETGLOBAL(lua_State *L, TValue *k, LClosure *cl, int a, int bx) {
TValue *base = L->base;
TValue *ra = base + a;
TValue *rb = k + bx;
TValue g;
sethvalue(L, &g, cl->env);
luaV_gettable(L, &g, rb, ra);

while the lua interpreter’s implementation is
TValue g;
TValue *rb = KBx(i);
sethvalue(L, &g, cl->env);
Protect(luaV_gettable(L, &g, rb, ra));

It uses clang to transform the llvm-lua opcodes functions into llvm byte-code, too. All these byte-codes are sent to opt to do optimization and llc to generate the final native. However, due to the complexity of the llvm byte-code, all function calls and very complex data structures, opt cannot do a lot of optimizations. For example the -O3’s result of the previous code is just some function inlining.


lua2c takes a slightly different architecture. The flow is [lua src]->(1)->[lua ast]->(2)->[c src]->(2)->[native code]. Metalua does the first transformation, and the project does the second. Any C compiler can do the last. The result after the second transform is

/* a = {} */
assert(lua_gettop(L) - lc_nextra == 0);

/* a.c = "HelloWorld" */
assert(lua_gettop(L) - lc_nextra == 0);

/* b = a.c */
assert(lua_gettop(L) - lc_nextra == 0);

/* c = a.c */
assert(lua_gettop(L) - lc_nextra == 0);

/* print(b, c) */

You can see from the C code that lua2c always uses standard Lua standard C API to perform the operation. Regarding control flow related, lua2c generates C control code. Still, it’s a one to one mapping of the original code to the C code, no any complex optimizations. And still similar to the llvm-lua, all the code are just function calls over complex data structure, the Lua state.


You may have expected that the final performance of this kind of compiling to native is not good, because the traditional static compiler is not good at analyzing the boxed data structures of dynamic scripting language, and the function calls to implement the behaviors. llvm-lua is much slower than LuaJIT, which uses trace info and does specialization. The author of lua2c mentioned lua2c is slower than the standard lua interpreter. In fact, some research paper pinpointed the issue before, for example “On the benefits and pitfalls of extending a statically typed language JIT compiler for dynamic scripting languages” in OOPSLA2012.

Specialization is required and a must in optimizing dynamic scripting language. Basically, we should understand all the variable’s type and do many of the optimization before we lower the code representation to the lower IR. These are the stuff I’m currently working on.

The Dual-Stack Design of ORBIT (Optimized R Byte-code InterpreTer) VM

I have been working on improving the running speed of R programming language for a while. The current research result is the ORBIT (Optimized R Byte-code InterpreTer) VM, which includes a JIT compiler and a specialized runtime. We published a paper about the design and the performance result. Due to the page limit, we only described the implementation of the components, but did not explain the reason of some design decisions, for example the ORBIT interpreter’s dual-stack design. I got a question about it recently. I replied the mail and also post the reason here.

The GNU R byte-code interpreter uses a stack VM based interpreter. Because all the objects in GNUR VM is boxed, the elements in the stack are all pointers.


One common optimization for dynamic scripting language is using unboxed representation to express the original boxed objects. ORBIT also uses the approach. The size one boolean, integer and real vectors of R are expressed as unboxed raw values. After the optimization, these raw values will be stored in the stack. Because there are still many boxed objects used, the stack holds mixed pointers and raw values.

The VM must know the exact locations of these unboxed raw values and their types in the stack. Firstly, these raw values should be ignored in the garbage collection traversing. Secondly, if the execution context is turned back to the original full boxed context, the VM need to use the types to transform these raw values back to the boxed format. Then, the VM must store the locations and types somewhere.

The first solution is to store the raw values with a tag. Then we also need to add all the pointers with the tag. Adding a tag to a pointer is not complex. Typically we can use the last two bits or three bits of the pointer’s value. But adding a tag to a raw integer or double value means that we had to use more space. In order to keep each stack slot the same size, all the pointers will also use more space. Basically we need to double the stack space usage.

The second solution is purely based on static information. Because the byte-code is fixed, we could know the exact stack shape at every PC in the interpretation with static analysis. Then we can store the stack shape (unboxed values’ locations and types) with the code. There are two problems. First, still the space issue. we need another array space to store the unboxed objects’ locations and types for each PC. Because the unboxed values’ count varies at different PC, we also need to record the count values. Secondly, ORBIT uses a type system that has a number type below the logical, integer and real types. The reason is that the type inference based on runtime profiling feedback can only reason some operands are numbers, but not the exact type. Then we can still use number based operands to manipulate the unboxed values in the real execution with better performance. But as a result, we don’t know the exact types at the compiling time, and we cannot use the number type for boxing usage in a context transformation. 


The third solution uses a dual-stack system. The original VM stack holds both the original pointers and the unboxed values, and a new type stack is used to hold the tags ( address of the unboxed values and the real type). All the unboxed operands related byte-code manipulate the type stack during the execution. And the types in the type stack is from the real execution and precise. The drawbacks is the overhead to maintain the additional stack. However, the stack operations are quite simple, and our experiment shows that the additional operations add very little overhead. So finally we take this design.




Closure implementation in R and Lua

My current research area is dynamic scripting language, mainly focusing on R programming language in my thesis research. I’m studying Lua language in my summer intern project recently. Both of them are dynamic scripting languages, and they share many common features, such as supporting closure.

A simple example of Lua

> A = function() 
    local a = 100
    local B = function() print(a) end
    a = 200
    return B
> aA= A()
> aA()

And the similar code in R

> A = function() {
  a = 100
  B = function() {
  a = 200
> aA = A()
> aA()
[1] 200

Both of them have the function as the first class object. So we can create a function, assign it to a variable, and return it. And we can invoke it later. At the same time, the closure captures the value of the local variable (variable a in the example), and we can use it even the lexical parent function (function A in the example). The variable’s value reflects the last update in the closure’s lexical parent scope (a was assigned to 200 before function A returns).

In the above example, both Lua and R have the same behavior. But their underlying implementations are totally different.

Closure implementation in R

The local frame environment frame of for a function call is just a linked list allocated on heap (you may refer R code src/include/Rinternal.h).  The function define operation (B = function() {…}) will invoke the R’s internal function mkCLOSXP (source code at src/main/distruct.c) with the current frame. And the new closure will have the current frame set to its env field through the SET_CLOENV macro. In a function invocation, the argument list (a linked list) of a call will be changed to the new local frame environment of the new function invocation (source code at src/main/eval.c applyClosure function), and the new local frame’s enclosing frame (parent frame) will be set to the closure object’s env field (the lexical scope parent frame). Finally, in the variable lookup process, if a variable is not found in the new local frame ( the a variable in the example), the R interpreter will search the current frame’s enclosing frame (lexical parent frame). Even after the parent function (function A in the example)returns, because the frame is allocated on heap, and the child function (function B in the example) is still alive, the parent’s frame will not be garbage collected. So we can still get the variable a‘s value.

This implementation has two disadvantages. First, using the whole parent’s local frame to store the out scope variable access in the child closure. The parent’s local frame will be alive if the child closure is alive. If there are a lot of local definitions in the parent frame, even if they will never be used in the child frame, all these values will be alive as the parent’s local frame. It’s some kind of memory leak. Second, all the child closures defined in the frame will share the same parent frame, and they cannot capture their private parent local variable. Here is one question on stackoverflow about this issue. I use a simplified example to show it.

> getfuns = function() {
    fs = list()
    for(i in 1:5) { fs[[i]] = function() { print(i) }}
> funs = getfuns()
> funs[[1]]()
  [1] 5
> funs[[2]]()
  [1] 5

Because all the child closures share the same parent frame, the enclosed variable’s values are the same (i is 5 in the above example). The stackoverflow question’s answer provided a solution to create the private variable. The idea is to insert another local frame to store each i here.

Closure implementation in Lua

The closure implementation in Lua was described in this paper. The basic idea is upvalue. All the variables defined out the current function scope are either global variables or upvalues. If the variable is defined somewhere in the lexical parent functions, the variable is an upvalue, for example, the variable a referenced inside function B in the first example. All the upvalues can be identified statically at the compile time. So the function knows the number of upvalue it has. Let’s use the byte-code of the the first example to show how it works.

function <stdin:1,6> (6 instructions, 24 bytes at 0x8004ed78)
0 params, 2 slots, 0 upvalues, 2 locals, 2 constants, 1 function
        1       [2]     LOADK           0 -1    ; 100
        2       [3]     CLOSURE         1 0     ; 0x8004ef70
        3       [3]     MOVE            0 0
        4       [4]     LOADK           0 -2    ; 200
        5       [5]     RETURN          1 2
        6       [6]     RETURN          0 1
constants (2) for 0x8004ed78:
        1       100
        2       200
locals (2) for 0x8004ed78:
        0       a       2       6
        1       B       4       6
upvalues (0) for 0x8004ed78:

function <stdin:3,3> (4 instructions, 16 bytes at 0x8004ef70)
0 params, 2 slots, 1 upvalue, 0 locals, 1 constant, 0 functions
        1       [3]     GETGLOBAL       0 -1    ; print
        2       [3]     GETUPVAL        1 0     ; a
        3       [3]     CALL            0 2 1
        4       [3]     RETURN          0 1
constants (1) for 0x8004ef70:
        1       "print"
locals (0) for 0x8004ef70:
upvalues (1) for 0x8004ef70:
        0       a

The first chunk is function A’s byte-code, and the second chunk is function B’s byte-code.

The function’s prototype data structure stores the number of upvalues (src/lobject.h struct Proto, nups field). In the example, function B has one upvalue, a. In the instantization of the proto struct into a Lua closure (src/lobject.h struct LClosure), nups of Upvalue objects (src/lobject.h struct UpVal) will be located (shared, or previously created) or created (private, or shared but not created yet). In the example, when executing “local B = function …”, the closure of B will create an upvalue for accessing upper frame’s variable a.

Lua’s local frame is purely stored in a real stack, each local variable occupies one slot (a register) in the stack. (Note, let’s ignore the case the number of local variables is very large that name based lookup is required.) So one the upvalue should point to one stack slot by default. The upvalue’s pointer must be initialized correctly. The initialization process is very interesting. You can check the code at case OP_CLOSURE in src/lvm.c. OP_CLOSURE’s first operand is the dst register for the closure, 1 here, which is variable B. The second operand is the location index of the child protos, 0, here, which means the first child proto of function A’s child protos. The following instructions after OP_CLOSURE are used to set the upvalue. They can only be OP_GETUPVAL or OP_MOVE. In fact these instructions will not be executed directly by the Lua interpreter. They are used to tell the OP_CLOSURE interpretation process where to find the upvalue. For example, here (MOVE 0,0)’s second operand is 0, which indicates the upvalue of closure B should be pointed to A’s local stack’s position 0, which is the variable a.

If the function A is not return, the upvalue is called open. The value is always on stack. After function A returns, the upvalue is called closed. In this case, the stack element’s value will be copied to the upvalue. So that the function B can still access the variable a. During the initialization of the upvalue, the open upvalues are stored in a linked list “L->openupval” (src/lfunc.c luaF_findupval). The closing logic is written in the source file src/lfunc.c luaF_close, which traverses the linked list, and closes all the open upvalues.

Although the logic looks like very complex, the Lua closure’s implementation is very efficient. It overcomes the two problems of the R’s closure. The parent’s local frame will be freed just by stack pop. And the closure can support enclosing private variables, like this.

> function getfuns()
    local fs = {}
    for i = 1, 5 do fs[#fs+1] = function() print(i) end end
> funs = getfuns()
> funs[1]()
> funs[2]()

The byte-code for the above code is here. The most important instruction is PC=11, CLOSE for end of the for loop. In the OP_CLOSE implementation of src/lvm.c, the luaF_close() is also called. So the variable i is copied out to the private upvalue for one instance of the function “function() print(i) end”.

function <stdin:1,5> (14 instructions, 56 bytes at 0x80061978)
0 params, 7 slots, 0 upvalues, 5 locals, 2 constants, 1 function
        1       [2]     NEWTABLE        0 0 0
        2       [3]     LOADK           1 -1    ; 1
        3       [3]     LOADK           2 -2    ; 5
        4       [3]     LOADK           3 -1    ; 1
        5       [3]     FORPREP         1 6     ; to 12
        6       [3]     LEN             5 0
        7       [3]     ADD             5 5 -1  ; - 1
        8       [3]     CLOSURE         6 0     ; 0x80061ad8
        9       [3]     MOVE            0 4
        10      [3]     SETTABLE        0 5 6
        11      [3]     CLOSE           4
        12      [3]     FORLOOP         1 -7    ; to 6
        13      [4]     RETURN          0 2
        14      [5]     RETURN          0 1
constants (2) for 0x80061978:
        1       1
        2       5
locals (5) for 0x80061978:
        0       fs      2       14
        1       (for index)     5       13
        2       (for limit)     5       13
        3       (for step)      5       13
        4       i       6       11
upvalues (0) for 0x80061978:

function <stdin:3,3> (4 instructions, 16 bytes at 0x80061ad8)
0 params, 2 slots, 1 upvalue, 0 locals, 1 constant, 0 functions
        1       [3]     GETGLOBAL       0 -1    ; print
        2       [3]     GETUPVAL        1 0     ; i
        3       [3]     CALL            0 2 1
        4       [3]     RETURN          0 1
constants (1) for 0x80061ad8:
        1       "print"
locals (0) for 0x80061ad8:
upvalues (1) for 0x80061ad8:
        0       i

Use Git in a sneakernet environment with git-bundle

I’m working in a project that requires a synchronized source code control across two separated local networks. The only communication channel between the two networks is e-mail. I plan to use git to do the source code management.

There are some discussions on stackoverflow about how to solve the similar problems, like Using GIT on USB stick for “travelling code” and How to use Git in a sneakernet environment. One user mentioned git bundle. But there is no full examples. Here I posted my steps just for reference.

Suppose machine A and machine B are in the two networks. Machine A has the project directory AppProj managed by git.

# Create the initial bundle from A. Suppose A's current version is 123456
@A ~$ cd AppProj
@A ~/AppProj$ git bundle create ../A2B.bundle master

# Now transfer/e-mail A2B.bundle file to machine B

# Create the project on machine B from the bundle
@B ~$ git clone -b master A2B.bundle AppProj
@B ~$ cd AppProj
# Now B has the local project directory AppProj.
# AppProj's remote is pointed to A2B.bundle

# Some modifications on B. Local git add/commit. And the current version is abcdef
# Create the change bundle on B
@B ~/AppProj$ git bundle create ../B2A.bundle 123456..master

# Now transfer/e-mail B2A.bundle file to machine A

# A has its own remote url. So add the bundle as another remote
@A ~/AppProj$ git remote add B2A ../B2A.bundle
# Pull content from the remote B2A.
@A ~/AppProj$ git pull B2A master
# Or if you don't want to the merge.
@A ~/AppProj$ git fetch B2A
# Now A has the modifications from B. And A can push these changes to its original remote.

# Some modification on A. Local git add/commit. And the current version abc123.
# Package the modifications into A2B.bundle again
@A ~/AppProj$ git bundle create ../A2B.bundle abcdef..master

# Now transfer/email- A2B.bundle file to B, and overwrite the original A2B.bundle

# B pulls the new changes
@B ~/AppProj$ git pull

We can add some variances, such as adding another remote on B so that B can push to its local network’s server git. Another point here is I always use version number, and the git bundle documents uses tag. Both ways are fine.

BTW, found the post How to use git-bundle for keeping development in sync. The example is very similar to my steps.

A faster OpenMP Radix Sort implementation

I was the TA of CS484 Parallel Programming at UIUC. One of the course machine problem is implementing a OpenMP version radix sort. There are many implementations in the web. And the top one result from google search “openmp radix sort” is the version by Alexis Daboville on Github.

The code is pretty code. It uses MSD based radix sort. It parallelized the counting part, and uses the parallel recursive call. And it switches to the quick sort when the bin is small enough. It also uses a lot of coding tricks to make it run faster.

But the absolute performance of Alexis’s implementation is not good. The reasons include limited parallel portions, the MSD algorithm, and recursive call. In fact, there are three major phases in a radix sort, counting, prefix, and data moving. We should try to parallel all the three portions, not just one. Furthermore, a LSD based algorithm can remove the overhead from the recursive call. By the way, I’m a compiler guy, and I don’t like the programming tricks here and there in the code to make it run faster. A compiler may do better than these tricks most of the time. And some of these tricks may interfere the compiler analysis.

Here is my implementation.

#define BASE_BITS 8
#define BASE (1 << BASE_BITS)
#define MASK (BASE-1)
#define DIGITS(v, shift) (((v) >> shift) & MASK)

void omp_lsd_radix_sort(size_t n, unsigned data[n]) {
    unsigned * buffer = malloc(n*sizeof(unsigned));
    int total_digits = sizeof(unsigned)*8;

    //Each thread use local_bucket to move data
    size_t i;
    for(int shift = 0; shift < total_digits; shift+=BASE_BITS) {
        size_t bucket[BASE] = {0};

        size_t local_bucket[BASE] = {0}; // size needed in each bucket/thread
        //1st pass, scan whole and check the count
        #pragma omp parallel firstprivate(local_bucket)
            #pragma omp for schedule(static) nowait
            for(i = 0; i < n; i++){
                local_bucket[DIGITS(data[i], shift)]++;
            #pragma omp critical
            for(i = 0; i < BASE; i++) {
                bucket[i] += local_bucket[i];
            #pragma omp barrier
            #pragma omp single
            for (i = 1; i < BASE; i++) {
                bucket[i] += bucket[i - 1];
            int nthreads = omp_get_num_threads();
            int tid = omp_get_thread_num();
            for(int cur_t = nthreads - 1; cur_t >= 0; cur_t--) {
                if(cur_t == tid) {
                    for(i = 0; i < BASE; i++) {
                        bucket[i] -= local_bucket[i];
                        local_bucket[i] = bucket[i];
                } else { //just do barrier
                    #pragma omp barrier

            #pragma omp for schedule(static)
            for(i = 0; i < n; i++) { //note here the end condition
                buffer[local_bucket[DIGITS(data[i], shift)]++] = data[i];
        //now move data
        unsigned* tmp = data;
        data = buffer;
        buffer = tmp;

The above code can only support unsigned data type. But it parallelized the counting phase and the data moving phase. Regarding the prefix phase, the total sequence is small, and the sequential version is still faster. There is a minor issue. It relies on the OpenMP’s static scheduling partition the task in the same way in the counting phase and the data moving phase. I believe the OpenMP compiler will do it in this way, but I’m not sure. Another limitation here is the code uses OpenMP runtime function. But I don’t know other ways to pre-allocate data movement buffer for different threads without calling OpenMP runtime functions.

My quick performance testing result. On a server with one Intel E31245, I turned off hyper-threading(only use four threads), turned off turbo boost (fix to 3.3GHz). In sorting 16M unsigned integers, the above code can achieve ~218 M/s.

Can we do better? Yes. Intel published a paper about the implementation in SIGMOD2010. There is the latest report. The performance is about 250M/s on a slower (3.2GHz i7) processor. One of the important implementation is optimizing for memory bandwidth. There is no code available.

But the CPU version is still much slower than a CUDA implementation. There are many gather/scatter like operations in the parallel radix sort, which limit the usage of CPU SIMD. Maybe we can get a better CPU peak performance if we have an efficient gather / scatter SIMD operations available.