Some simple analysis of the OCaml User Survey 2020. I am neither a data scientist nor a Python developer (where are my types :'( ) so take all of this with a pinch of salt and double-check my code!
For the most part this analysis is about uncovering differences by proficiency and by years spent using OCaml. Do experts care as much about documentation, or does everybody want multicore? Let's find out :))
Some questions were single-answer, some were multi-answer. Single-answer questions are given as the proportion of users (of a specific proficiency etc.) who gave that answer. Multi-answer questions give the raw total of times each answer appeared.
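To make the two kinds of summary concrete, here is a minimal sketch on made-up data (the `toy` frame and its column names are hypothetical, not taken from the survey). It assumes multi-answer responses are stored as `;`-separated strings, which is how they are handled later in this notebook.

```python
import pandas as pd

# Toy data for illustration only, not real survey answers
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "expert", "expert"],
    "feature": ["multicore", "typeclasses", "multicore", "multicore"],
    "editors": ["vim;emacs", "vscode", "emacs", "emacs;vscode"],
})

experts = toy[toy["proficiency"] == "expert"]

# Single answer: proportion of experts who gave each answer
single = experts["feature"].value_counts(normalize=True)

# Multi answer: split on ';' and count every time an answer appears
multi = experts["editors"].str.split(";").explode().value_counts()

print(single)
print(multi)
```

Note that the multi-answer totals can add up to more than the number of respondents, which is why they are left as raw counts.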
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("ocs.csv")
The first interesting thing I thought to look at was what the community wants when we filter by experience and self-perceived "expertise". Do beginners crave multicore as much as everyone else? Do people with less than a year's experience want better documentation or a state-of-the-art website? Let's see!
years_of_ocaml = "For how long have you been using OCaml?"
# Ignoring blanks and 'you are not using OCaml' for these ones
year_vals = ["less than 1 year", "2-5 years", "5-10 years", "more than 10 years"]
installation = "Which installation methods do you use?"
welcome = "I feel welcome in the OCaml community"
lib_doc = "OCaml libraries are well documented."
impl = "Which of these language implementations are you actively using?"
proficiency = "How do you rate your OCaml proficiency?"
pain_point = "What is a pain point when learning the OCaml language?"
new_feature = "If I was granted one new language feature today, I would ask for:"
state_of_the_art = "If one piece of the ecosystem could magically be made state-of-the-art, I would ask for:"
langs = "Which of these other programming languages are you fluent in?"
domains = "Which types of software do you develop with OCaml?"
interaction = "Where do you interact with the OCaml community?"
pain = "What do you think are the main pain points that prevent OCaml adoption for new projects?"
editor = "Which editors do you use?"
build = "Which build tools are you actively using?"
agree_index = ["Strongly disagree", "Disagree", "I don't know", "Neutral", "Agree", "Strongly agree"]
def filter(df, col, value):
    # Keep only the rows where `col` equals `value` (note: shadows the built-in filter)
    f = df[col] == value
    return df[f]
def totals_and_proportions(df, col):
    # Print the raw counts and the normalised proportions for one column
    print("Totals")
    print(df[col].value_counts())
    print("\nProportions")
    print(df[col].value_counts(normalize=True))
def plot(df, col, values, plot_col, title_prefix, ylabel="", agree=False, normalize=True, drop=0):
    # One bar chart per entry in `values`, laid out two per row
    cols = 2
    rows = len(values) // cols  # integer division: plt.subplots needs ints
    fig, axs = plt.subplots(rows, cols, figsize=(16, 16), squeeze=False)
    # Only keep answers that occur at least `drop` times across the whole survey
    overall = df[plot_col].value_counts()
    keep = overall[overall >= drop].index
    for r in range(rows):
        for c in range(cols):
            ax = axs[r, c]
            value = values[r * cols + c]
            counts = filter(df, col, value)[plot_col].value_counts(sort=not agree, normalize=normalize)
            counts = counts[counts.index.isin(keep)]
            if agree:
                # Fixed ordering for agree/disagree questions
                counts = counts.reindex(agree_index)
            counts.plot.bar(rot=45, title=title_prefix + value, ax=ax, fontsize=12, colormap="autumn")
            ax.set_ylabel(ylabel)
            labels = [l.get_text() for l in ax.get_xticklabels()]
            ax.set_xticklabels(labels, ha='right')
    plt.tight_layout()
    plt.show()
This first section is exploratory: we'll just look at some counts and some correlations to get an idea of the data. First of all, what is the spread of proficiency in the survey? It's important to bear these numbers in mind when looking at data plotted as "Proportion of Users". The smaller groups are probably a little less representative.
totals_and_proportions(df, proficiency)
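As a rough way to see why smaller groups give noisier proportions, the sketch below computes the usual binomial standard error of a proportion. The helper name `proportion_se` and the group sizes are hypothetical, and the formula assumes respondents are a simple random sample, which a voluntary survey is not, so treat it as an illustration rather than a real error bar.

```python
import math

def proportion_se(p, n):
    """Standard error of a proportion p estimated from n respondents."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical group sizes, not the survey's real counts
for n in (20, 200):
    print(n, round(proportion_se(0.5, n), 3))
```

With ten times as many respondents the uncertainty shrinks by a factor of about three (the square root of ten).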
There's much more representation for the middle chunk of people, but how does this compare with how long people have been using OCaml?
totals_and_proportions(df, years_of_ocaml)
Ignoring the missing 1-2 years category, where I would assume people rounded up or down, there is much more representation for... older... OCaml programmers. This likely affects things like "New features" when viewed holistically (i.e. not subdividing by proficiency).
Communities can sometimes take better care of users depending on their proficiency. The next graph shows how welcome users of different proficiency feel in the OCaml community. Overall I would say it is fairly good, with perhaps a slight bias towards expert users. There are still enough disagrees to warrant some more community investigation, I think.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], welcome, "How welcome users feel for ", agree=True, ylabel="Proportion of users", normalize=True)
ylabel = "Proportion of user type"
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], state_of_the_art, "State of the art for ", ylabel)
Different, more useful, patterns emerge from the data when filtered by user type (beginner, intermediate, advanced and expert). Some of the key points include:
In fact let's see how users feel about how well libraries are documented when split across these categories as well.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], lib_doc, "Are user libraries well documented for ", ylabel="Proportion of Users", agree=True)
Hmm, yeah. Less proficient OCamlers find library documentation worse. Not many think it is either dreadful or great. One problem, anecdotally, has been the lack of full application-oriented documentation, as opposed to API documentation.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], new_feature, "Desired new feature for ", ylabel="Proportion of Users", normalize=True)
Well, no real surprises here:
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], years_of_ocaml, "How long does it take to get to the OCaml level of a ", ylabel="Proportion of Users", normalize=True)
These plots can be read as: x% of intermediate users have been using OCaml for z years. For example, roughly 60% of beginners have been using OCaml for less than a year.
This interesting plot looks at how long it takes OCaml developers to reach different stages of proficiency. Of course there is nothing definitive here: OCaml isn't necessarily people's first language, so some have been using it for 10+ years but are still intermediate (plus the proficiencies are subjective).
Looking at fluency in other languages might hint at where people have come from and/or what other languages they use in conjunction with OCaml.
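The same "x% of a proficiency group has used OCaml for z years" reading can be tabulated directly. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical), using `pd.crosstab` with `normalize="index"` so that each proficiency row sums to 1.

```python
import pandas as pd

# Toy data for illustration only, not real survey answers
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "beginner", "intermediate", "intermediate"],
    "years": ["less than 1 year", "less than 1 year", "2-5 years", "2-5 years", "5-10 years"],
})

# Rows: proficiency; columns: years of use; each row is normalised to sum to 1
table = pd.crosstab(toy["proficiency"], toy["years"], normalize="index")
print(table)
```

Reading along a row gives the share of that proficiency group in each experience bracket, which is what each subplot shows as bars.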
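def split_plot(col, new_col_name):
    # Split the ';'-separated answers into one row per (respondent, answer)
    lists = df[col].str.split(';', expand=True)
    split_df = lists.stack().to_frame().reset_index()
    # Columns are: respondent index, position within the answer, the answer itself
    split_df.columns = ["index", "", new_col_name]
    # Re-attach the respondent's other answers
    split_df = split_df.join(df, on="index")
    return split_df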
# My lack of pandas and Python skills is showing here
lang_df = split_plot(langs, "languages")
plot(lang_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "languages", "Other fluent languages for ", ylabel="Occurrences of language within a specific user proficiency", normalize=False)
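For what it's worth, the one-row-per-answer table can probably be built more directly with `Series.str.split` followed by `DataFrame.explode`. This is only a sketch of an alternative: `split_answers` is a hypothetical name, it takes the frame as an argument instead of using the global `df`, and the toy data is made up.

```python
import pandas as pd

def split_answers(frame, col, new_col_name):
    """Return one row per (respondent, answer) for a ';'-separated column."""
    out = frame.copy()
    out[new_col_name] = out[col].str.split(";")
    # explode repeats the other columns for every answer; drop blank responses
    return out.explode(new_col_name).dropna(subset=[new_col_name])

# Toy data for illustration only
toy = pd.DataFrame({"who": ["a", "b", "c"], "langs": ["C;Rust", "Python", None]})
long = split_answers(toy, "langs", "languages")
print(long)
```

Respondents who left the question blank are dropped, which matches what `stack()` does in the version above.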
Some initial thoughts:
dom_df = split_plot(domains, "domains")
print("Note: occurrences of fewer than 3 have been dropped for easier viewing")
plot(dom_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "domains", "Application domains for ", ylabel="Occurrences of application domain for specific user", normalize=False, drop=3)
A not so surprising pattern emerges here about what the most common applications are:
This is quite interesting. Is this reflected in the years spent using OCaml?
plot(dom_df, years_of_ocaml, year_vals, "domains", "Application domains for OCaml users of ", ylabel="Occurrences of application domain", normalize=False, drop=3)
Hmm, kind of. Except now formal methods has bubbled up much closer to the top. Although it is hard to say definitively, if at all, there does seem to be a slight trend that beginners and people who have used OCaml for less time work across more application domains at once.
Also, web backend and web frontend almost always share the same number of users. This could indicate that people are going the full way and doing full-stack OCaml (cf. discuss thread?)
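Equal totals alone don't show that the same people picked both options. A minimal sketch of how the overlap could be checked is below; the toy answers and the exact labels `"web backend"` and `"web frontend"` are assumptions, so the real strings in the survey column would need to be substituted.

```python
import pandas as pd

# Toy multi-answer responses for illustration only
toy = pd.Series([
    "web backend;web frontend",
    "web backend",
    "compilers;web frontend;web backend",
    "formal methods",
])

answers = toy.str.split(";")
backend = answers.apply(lambda a: "web backend" in a)
frontend = answers.apply(lambda a: "web frontend" in a)

# Respondents who selected both domains
both = (backend & frontend).sum()
print(backend.sum(), frontend.sum(), both)
```

If `both` is close to each individual total, that would support the full-stack reading; if it is much smaller, the matching totals are more of a coincidence.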
Another interesting thing to look at is how people are installing OCaml, especially the divide between opam and esy.
install_df = split_plot(installation, "installation")
plot(install_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "installation", "Installation methods for ", ylabel="Occurrences of installation method", normalize=False, drop=3)
As probably suspected, opam with the public repository is by far the most used method but some trends seem to be here:
How does proficiency affect which implementation people are using? Is there a correlation with the application domain and packaging solution?
impl_df = split_plot(impl, "implementation")
plot(impl_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "implementation", "Implementation methods for ", ylabel="Occurrences of implementation", normalize=False)
This is probably quite in line with the application domains of the different users: there are more Reason implementations among beginners, who are also building more web-based applications. Interestingly, the Reason implementations slip as proficiency rises, with js_of_ocaml much more prominent for experts.
Continuing in this vein, let's look at build tools:
build_df = split_plot(build, "build")
plot(build_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "build", "Build tools for ", ylabel="Occurrences of build tool", normalize=False)
And editors?
editor_df = split_plot(editor, "editor")
plot(editor_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "editor", "Editors for ", ylabel="Occurrences of editor", normalize=False, drop=2)
If I had more time I could clean up the "other" choice to join together entries like "onivim 2", but I don't have the time. Even without this we see greater popularity for VSCode among less proficient users, with Emacs taking over as proficiency rises.
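The clean-up mentioned above might look something like this sketch, which lower-cases and trims each free-text answer and then maps known variants onto one name. The `aliases` table and the `normalise` helper are hypothetical; the real variant spellings would have to be read off the survey data.

```python
import pandas as pd

# Hypothetical alias table: variant spelling -> canonical name
aliases = {"onivim 2": "onivim", "onivim2": "onivim", "vs code": "vscode"}

def normalise(answer):
    """Trim and lower-case an answer, then collapse known variants."""
    key = answer.strip().lower()
    return aliases.get(key, key)

# Toy free-text answers for illustration only
raw = pd.Series(["Onivim 2", "onivim2", "VS Code", "Emacs "])
cleaned = raw.map(normalise)
print(cleaned.value_counts())
```

Applied to the split editor column before plotting, this would merge the scattered "other" entries into single bars.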
A look at how different segments of the community interact with the OCaml community.
inter_df = split_plot(interaction, "interaction")
plot(inter_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "interaction", "Interaction methods for ", ylabel="Occurrences of interaction medium", normalize=False)
pain_df = split_plot(pain, "pain")
plot(pain_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "pain", "Pain points for ", ylabel="Occurrences of pain points", normalize=False)