OCaml User Survey Analysis¶

Some simple analysis of the OCaml User Survey 2020. I am neither a data scientist nor a Python developer (where are my types :'( ) so take all of this with a pinch of salt and double-check my code!

For the most part this analysis is about uncovering differences by proficiency and by years spent using OCaml. Do experts care as much about documentation, or does everybody just want multi-core? Let's find out :))

Some questions were single-answer, some were multi-answer. Single-answer questions are plotted as the proportion of users (of a specific proficiency etc.) who gave each answer. Multi-answer questions are plotted as raw totals of how many times each answer appeared.

Filtering and Plotting Functions
A simple look at the data
What could be made state of the art
What new features users want
What other languages and application domains are being used
What tools are users using (opam, dune, editors...) and what flavour of OCaml
How different user types interact with the OCaml community

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np

df = pd.read_csv("ocs.csv")

Filtering by Experience and Proficiency ¶

The first interesting thing I thought to look at was what the community wants when we filter by experience and self-perceived "expertise". Do beginners crave multicore as much as everyone else? Do people with less than a year's experience want better documentation or a state-of-the-art website? Let's see!

years_of_ocaml = "For how long have you been using OCaml?"
# Ignoring blanks and 'you are not using OCaml' for these ones
year_vals = ["less than 1 year", "2-5 years", "5-10 years", "more than 10 years"]
installation = "Which installation methods do you use?"
welcome = "I feel welcome in the OCaml community"
lib_doc = "OCaml libraries are well documented."
impl = "Which of these language implementations are you actively using?"
proficiency = "How do you rate your OCaml proficiency?" 
pain_point = "What is a pain point when learning the OCaml language?"
new_feature = "If I was granted one new language feature today, I would ask for:"
state_of_the_art = "If one piece of the ecosystem could magically be made state-of-the-art, I would ask for:"
langs = "Which of these other programming languages are you fluent in?"
domains = "Which types of software do you develop with OCaml?"
interaction = "Where do you interact with the OCaml community?"
pain = "What do you think are the main pain points that prevent OCaml adoption for new projects?"
editor = "Which editors do you use?"
build = "Which build tools are you actively using?"

agree_index = ["Strongly disagree", "Disagree", "I don't know", "Neutral", "Agree", "Strongly agree"]

def filter(df, col, value): 
    f = df[col] == value
    return df[f]

def totals_and_proportions(df, col): 
    print("Totals")
    print(df[col].value_counts())

    print("\nProportions")
    print(df[col].value_counts(normalize=True))

def plot(df, col, values, plot_col, title_prefix, ylabel="", agree=False, normalize=True, drop=0):
    cols = 2 
    rows = len(values) // cols  # integer division: plt.subplots() and range() need ints
    fig, axs = plt.subplots(rows, cols, figsize=(16, 16))
    for r in range(rows):
        for c in range(cols):
            ax = axs[r, c]
            value = values[r * cols + c]
            ax.set_xticklabels([], ha='right')
            ax.set_ylabel(ylabel)
            if agree: 
                filter(df, col, value)[plot_col].value_counts(sort=False, normalize=normalize)[df[plot_col].value_counts() >= drop].reindex(agree_index).plot.bar(rot=45, title=title_prefix + value, ax=ax, fontsize=12, colormap="autumn")
                labels = [l.get_text() for l in ax.get_xticklabels()]
                ax.set_xticklabels(labels, ha='right')
            else:
                filter(df, col, value)[plot_col].value_counts(normalize=normalize)[df[plot_col].value_counts() >= drop].plot.bar(rot=45, title=title_prefix + value, ax=ax, fontsize=12, colormap="autumn")
                labels = [l.get_text() for l in ax.get_xticklabels()]
                ax.set_xticklabels(labels, ha='right')
    plt.tight_layout()
    plt.show()
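To make the `agree=True` branch of `plot` a little less opaque, here is a minimal sketch of the same three steps (filter to one group, count answers as proportions, reorder into Likert order) without any plotting. The `toy` frame and its column names are invented for illustration and are not survey data.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "beginner", "expert", "expert"],
    "welcome": ["Agree", "Agree", "Neutral", "Strongly agree", "Agree"],
})
agree_index = ["Strongly disagree", "Disagree", "I don't know",
               "Neutral", "Agree", "Strongly agree"]

# Keep only one group, as filter(df, col, value) does.
beginners = toy[toy["proficiency"] == "beginner"]

# Proportion of that group giving each answer, then put the answers in
# Likert order. Answers nobody gave show up as NaN (an empty bar).
shares = beginners["welcome"].value_counts(normalize=True).reindex(agree_index)
print(shares)
```

Because the proportions are computed after filtering, each subplot is normalised within its own group, so bars are comparable across groups of different sizes.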

Poking around the Data ¶

This first section is exploratory: we'll just look at some counts and some correlations to get an idea of the data. First of all, what is the spread of proficiency in the survey? It's important to bear these numbers in mind when looking at data plotted as "Proportion of Users". The smaller groups are probably a little less representative.

totals_and_proportions(df, proficiency)

Totals
advanced        262
intermediate    255
expert          110
beginner        110
Name: How do you rate your OCaml proficiency?, dtype: int64

Proportions
advanced        0.355495
intermediate    0.345997
expert          0.149254
beginner        0.149254
Name: How do you rate your OCaml proficiency?, dtype: float64
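As a quick sanity check on the output above, the proportions are simply each total divided by the number of people who answered the question (`value_counts` skips blanks by default). The sketch below only re-does that arithmetic on the printed totals; it does not read the survey file.

```python
# Totals copied from the output above.
totals = {"advanced": 262, "intermediate": 255, "expert": 110, "beginner": 110}

# Number of respondents who answered this question (blanks are not counted).
answered = sum(totals.values())

# Each proportion is total / answered, matching value_counts(normalize=True).
proportions = {level: count / answered for level, count in totals.items()}

print(answered)
for level, share in proportions.items():
    print(f"{level:<13}{share:.6f}")
```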

There's much more representation for the middle chunk of people, but how does this compare with how long people have been using OCaml?

totals_and_proportions(df, years_of_ocaml)

Totals
2-5 years                  273
more than 10 years         166
less than 1 year           148
5-10 years                 128
you are not using OCaml     26
Name: For how long have you been using OCaml?, dtype: int64

Proportions
2-5 years                  0.368421
more than 10 years         0.224022
less than 1 year           0.199730
5-10 years                 0.172740
you are not using OCaml    0.035088
Name: For how long have you been using OCaml?, dtype: float64

Ignoring the missing 1-2 years category, where I would assume people rounded up or down, there is much more representation for... older... OCaml programmers. This likely impacts things like "New features" when looking at the data as a whole (i.e. not subdividing by proficiency).

Communities can sometimes take better care of users depending on their proficiency. The next graph shows how welcome users of different proficiency feel in the OCaml community. Overall I would say it is fairly good, with perhaps a slight bias towards expert users. There are still enough "Disagree" answers to warrant some more community investigation, I think.

plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], welcome, "How welcome users feel for ", agree=True, ylabel="Proportion of users", normalize=True)

State of the Art ¶

ylabel = "Proportion of user type"
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], state_of_the_art, "State of the art for ", ylabel)

Different, more useful, patterns emerge from the data when filtered by user type (beginner, intermediate, advanced and expert). Some of the key points include:

Documentation for user libraries is top 3 for all of the different users
More proficient users care more about build tools and compiler, whereas less proficient users care about documentation a little more
Documentation for the core language progressively moves further and further down the list. One possible interpretation is that people eventually learn the language, but at the start it would be nicer to have a better way to do it. Perhaps the "learning OCaml" journey is too difficult at the moment.

In fact let's see how users feel about how well libraries are documented when split across these categories as well.

plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], lib_doc, "Are user libraries well documented for ", ylabel="Proportion of Users", agree=True)

Hmm, yeah. Less proficient OCamlers find library documentation worse. Few think it is either dreadful or great. One problem, anecdotally, has been a lack of full application-oriented documentation, as opposed to API documentation.

New Features ¶

plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], new_feature, "Desired new feature for ", ylabel="Proportion of Users", normalize=True)

Well, no real surprises here:

Everybody wants multi-core!
Interestingly, the top features tend to be the same across the board, with just their order shifting slightly
Namespaces seem more desirable amongst the expert community... which makes some sense anecdotally as I'm not entirely sure what they are and would class myself somewhere near beginner and intermediate!

plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], years_of_ocaml, "How long does it take to get to the OCaml level of a ", ylabel="Proportion of Users", normalize=True)

These plots can be read as: x% of intermediate users have been using OCaml for z years. For example, roughly 60% of beginners have been using OCaml for less than a year.

This interesting plot looks at how long it takes OCaml developers to reach different stages of proficiency. Of course, nothing here is definitive: OCaml isn't necessarily people's first language, so some have been using it for 10+ years but are still intermediate (plus the proficiencies are subjective).
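The "x% of a proficiency group has used OCaml for z years" reading corresponds to a row-normalised cross-tabulation. A minimal sketch using `pd.crosstab`, on a made-up `toy` frame rather than the survey data, might look like this:

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "beginner",
                    "intermediate", "intermediate"],
    "years": ["less than 1 year", "less than 1 year", "2-5 years",
              "2-5 years", "5-10 years"],
})

# normalize="index" divides each row by its own total, so every
# proficiency row sums to 1, like each subplot in the figure above.
table = pd.crosstab(toy["proficiency"], toy["years"], normalize="index")
print(table)
```

Reading across a row gives the same information as one subplot, which can be easier to compare when the bars are close in height.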

Fluency in other languages & Application domains ¶

Looking at fluency in other languages might hint at where people have come from and/or what other languages they use in conjunction with OCaml.

def split_plot(col, new_col_name):
    lists = df[col].str.split(';', expand=True)
    split_df = lists.stack().to_frame().reset_index()
    split_df.columns = ["index", "", new_col_name]
    split_df = split_df.join(df, on="index")
    return split_df
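For readers unfamiliar with this reshaping step: multi-answer cells hold several answers separated by `;`, and the helper above turns them into one row per answer while keeping the respondent's other columns. The sketch below shows the same idea on invented data, using `DataFrame.explode` (pandas 0.25 or later) instead of the split/stack/join above. It is an alternative formulation for illustration, not the code used for the plots.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({
    "proficiency": ["beginner", "expert"],
    "langs": ["Python;C", "C"],
})

# Split each cell into a list, then give every list element its own row.
# The other columns (here "proficiency") are repeated for each answer.
long_df = toy.assign(languages=toy["langs"].str.split(";")).explode("languages")
print(long_df[["proficiency", "languages"]])
```

Once the data is in this long form, `value_counts` on the new column counts answers rather than respondents, which is why the multi-answer plots show raw totals.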

# My lack of pandas and Python skills is showing here
lang_df = split_plot(langs, "languages")
plot(lang_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "languages", "Other fluent languages for ", ylabel="Occurrences of language within a specific user proficiency", normalize=False)

Some initial thoughts:

Coq and friends progressively move up the chart, as a proportion of how many people mentioned them, as proficiency increases. More formal methods and theorem-proving material is needed for the less proficient people!
JS, Python and C battle it out for the top place each time, JS taking it for less proficient (likely younger or with less experience in OCaml) and C for more proficient.

dom_df = split_plot(domains, "domains")
print("Note: occurrences of less than 3 have been dropped for easier viewing")
plot(dom_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "domains", "Application domains for ", ylabel="Occurrences of application domain for specific user", normalize=False, drop=3)

Note: occurrences of less than 3 have been dropped for easier viewing

A not so surprising pattern emerges here about what the most common applications are:

Amongst less proficient users we see a trend of using OCaml in web backend & frontend and data processing
More proficient users are using it for programming language implementations, systems and tooling
Formal methods slowly creeps its way upwards as proficiency increases.

This is quite interesting. Is this reflected in the years spent using OCaml?

plot(dom_df, years_of_ocaml, year_vals, "domains", "Application domains for OCaml users of ", ylabel="Occurrences of application domain", normalize=False, drop=3)

Hmm, kind of. Except now formal methods has bubbled up much closer to the top. Although it is hard to say definitively, if at all, there does seem to be a slight trend that beginners/people who have used OCaml for less time report more application domains at once.
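One way to put a number on "more application domains at once" would be the average number of domains ticked per respondent in each group. The sketch below does this on an invented `toy` frame. The column names and answers are assumptions for illustration, and nothing here has been run against the survey file.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({
    "years": ["less than 1 year", "less than 1 year", "more than 10 years"],
    "domains": ["web backend;web frontend;data processing",
                "web backend",
                "compilers"],
})

# Number of ';'-separated answers each respondent ticked.
n_domains = toy["domains"].str.split(";").str.len()

# Average number of domains per respondent within each experience group.
mean_per_group = n_domains.groupby(toy["years"]).mean()
print(mean_per_group)
```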

Also, web backend and web frontend almost always share the same number of users. This could indicate that people are going the full way and doing full-stack OCaml (cf. discuss thread?)
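Equal totals do not by themselves show that the same people ticked both boxes, so the full-stack reading would need a co-occurrence check. A minimal sketch on invented data follows. The labels "web backend" and "web frontend" are assumed spellings and may not match the survey's exact answer text.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({"domains": [
    "web backend;web frontend",
    "web backend",
    "web frontend",
    "compilers",
]})

# Split each respondent's answers, then test membership per respondent.
# (Blank answers would need dropping first; the toy data has none.)
answers = toy["domains"].str.split(";")
backend = answers.apply(lambda xs: "web backend" in xs)
frontend = answers.apply(lambda xs: "web frontend" in xs)

# Respondents who selected both, versus each total on its own.
both = int((backend & frontend).sum())
print(int(backend.sum()), int(frontend.sum()), both)
```

In this toy example both domains have the same total, yet only one respondent selected both, which is exactly the case the raw bar heights cannot distinguish.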

OCaml Tools ¶

Another interesting thing to look at is how people are installing OCaml, especially the divide between opam and esy.

install_df = split_plot(installation, "installation")
plot(install_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "installation", "Installation methods for ", ylabel="Occurrences of installation method", normalize=False, drop=3)

As probably expected, opam with the public repository is by far the most used method, but there seem to be some trends:

Esy (as a proportion of the users) is more popular amongst less proficient users.
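Since these plots show raw occurrences, a statement "as a proportion of the users" means dividing by the size of each proficiency group. A small sketch of that normalisation on invented data is below. The answer labels "esy" and "opam" are simplified placeholders, not the survey's exact wording.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the survey).
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner",
                    "expert", "expert", "expert", "expert"],
    "installation": ["esy", "opam", "opam", "opam;esy", "opam", "opam"],
})

# True for respondents who listed esy among their installation methods.
uses_esy = toy["installation"].str.split(";").apply(lambda xs: "esy" in xs)

# Raw count per group (what the bar chart shows) ...
raw = uses_esy.groupby(toy["proficiency"]).sum()
# ... versus the share of each group, which accounts for group size.
share = uses_esy.groupby(toy["proficiency"]).mean()
print(raw)
print(share)
```

Here both groups have one esy user, but the share is twice as high among the (smaller) beginner group.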

How does proficiency impact which implementation people are using? Is there a correlation with the application domain and packaging solution?

impl_df = split_plot(impl, "implementation")
plot(impl_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "implementation", "Implementation methods for ", ylabel="Occurrences of implementation", normalize=False)

This is probably quite in line with the application domains of the different users: there are more Reason implementations for beginners, who are also building more web-based applications. Interestingly, the Reason implementations slip as proficiency rises, with js_of_ocaml much more prominent for experts.

Continuing in this vein, let's look at build tools:

build_df = split_plot(build, "build")
plot(build_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "build", "Build tools for ", ylabel="Occurrences of build tool", normalize=False)

And editors?

editor_df = split_plot(editor, "editor")
plot(editor_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "editor", "Editors for ", ylabel="Occurrences of editor", normalize=False, drop=2)

With more time, one could clean up the "other" choices to join things together, like "onivim 2", but I don't have the time. Even without this, we see greater popularity for VSCode among less proficient users, with Emacs taking over as proficiency increases.
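For what it's worth, a rough sketch of that clean-up could normalise case and whitespace and then map known spellings onto one name. The `aliases` table and the sample answers below are hypothetical; a real table would have to be built by reading through the free-text answers.

```python
import pandas as pd

# Hypothetical alias table: maps variant spellings onto one canonical name.
aliases = {"onivim 2": "onivim", "onivim2": "onivim", "vs code": "vscode"}

# Invented free-text answers, for illustration only.
raw = pd.Series(["Onivim 2", "onivim", " VS Code ", "Emacs"])

# Trim whitespace, lowercase, then replace exact matches using the table.
cleaned = raw.str.strip().str.lower().replace(aliases)
print(cleaned.value_counts())
```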

Community Interaction ¶

A look at how different segments of the community interact with the OCaml community.

inter_df = split_plot(interaction, "interaction")
plot(inter_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "interaction", "Interaction methods for ", ylabel="Occurrences of interaction medium", normalize=False)

pain_df = split_plot(pain, "pain")
plot(pain_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "pain", "Pain points for ", ylabel="Occurrences of pain points", normalize=False)