Some simple analysis of the OCaml User Survey 2020. I am neither a data scientist nor a Python developer (where are my types :'( ), so take all of this with a pinch of salt and double-check my code!
For the most part, this analysis is about uncovering differences across proficiency levels and years spent using OCaml. Do experts care as much about documentation, or does everybody want multicore? Let's find out :))
Some questions were single-answer, some were multi-answer. Single-answer questions are shown as the proportion of users (of a specific proficiency, etc.) who gave each answer. Multi-answer questions show the raw total of times each answer appeared.
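To make the two kinds of counting concrete, here is a minimal sketch on made-up toy data. The short column names (`proficiency`, `feature`, `editors`) and the rows are assumptions for illustration only, not values from the survey; the `;` separator is the one the multi-answer columns are split on later in this notebook.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from the survey.
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "expert", "expert"],
    "feature": ["multicore", "typeclasses", "multicore", "multicore"],  # single-answer
    "editors": ["vim;emacs", "vscode", "emacs", "vim;vscode"],          # multi-answer
})

# Single-answer: proportion of each answer within one group of users.
experts = toy[toy["proficiency"] == "expert"]
single = experts["feature"].value_counts(normalize=True)

# Multi-answer: split on ';' and count every occurrence (raw totals).
multi = toy["editors"].str.split(";").explode().value_counts()

print(single)
print(multi)
```

Note that the multi-answer totals can add up to more than the number of respondents, since one person can give several answers.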
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("ocs.csv")
The first interesting thing I wanted to look at was what the community wants when we filter by experience and self-perceived "expertise". Do beginners crave multicore as much as everyone else? Do people with less than a year's experience want better documentation or a state-of-the-art website? Let's see!
years_of_ocaml = "For how long have you been using OCaml?"
# Ignoring blanks and 'you are not using OCaml' for these ones
year_vals = ["less than 1 year", "2-5 years", "5-10 years", "more than 10 years"]
installation = "Which installation methods do you use?"
welcome = "I feel welcome in the OCaml community"
lib_doc = "OCaml libraries are well documented."
impl = "Which of these language implementations are you actively using?"
proficiency = "How do you rate your OCaml proficiency?"
pain_point = "What is a pain point when learning the OCaml language?"
new_feature = "If I was granted one new language feature today, I would ask for:"
state_of_the_art = "If one piece of the ecosystem could magically be made state-of-the-art, I would ask for:"
langs = "Which of these other programming languages are you fluent in?"
domains = "Which types of software do you develop with OCaml?"
interaction = "Where do you interact with the OCaml community?"
pain = "What do you think are the main pain points that prevent OCaml adoption for new projects?"
editor = "Which editors do you use?"
build = "Which build tools are you actively using?"
agree_index = ["Strongly disagree", "Disagree", "I don't know", "Neutral", "Agree", "Strongly agree"]
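def filter(df, col, value):
    # Keep only the rows whose `col` equals `value`.
    # Note: this name shadows Python's built-in filter.
    f = df[col] == value
    return df[f]
def totals_and_proportions(df, col):
    # Print the raw counts and the normalized shares for one column.
    print("Totals")
    print(df[col].value_counts())
    print("\nProportions")
    print(df[col].value_counts(normalize=True))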
def plot(df, col, values, plot_col, title_prefix, ylabel="", agree=False, normalize=True, drop=0):
    # One bar chart per entry in `values`, laid out on a two-column grid.
    cols = 2
    rows = len(values) // cols  # integer division: plt.subplots and range need ints
    fig, axs = plt.subplots(rows, cols, figsize=(16, 16), squeeze=False)
    # Keep only answers that appear at least `drop` times in the whole data set.
    keep = df[plot_col].value_counts() >= drop
    for r in range(rows):
        for c in range(cols):
            ax = axs[r, c]
            value = values[r * cols + c]
            ax.set_ylabel(ylabel)
            counts = filter(df, col, value)[plot_col].value_counts(sort=not agree, normalize=normalize)
            counts = counts[keep]
            if agree:
                # Show the agreement scale in a fixed order.
                counts = counts.reindex(agree_index)
            counts.plot.bar(rot=45, title=title_prefix + value, ax=ax, fontsize=12, colormap="autumn")
            labels = [l.get_text() for l in ax.get_xticklabels()]
            ax.set_xticklabels(labels, ha='right')
    plt.tight_layout()
    plt.show()
This first section is exploratory: we'll just look at some counts and some correlations to get an idea of the data. First of all, what is the spread of proficiency in the survey? It's important to bear these numbers in mind when looking at data plotted as "Proportion of Users". The smaller groups are probably a little less representative.
totals_and_proportions(df, proficiency)
There's much more representation for the middle chunk of people, but how does this compare with how long people have been using OCaml?
totals_and_proportions(df, years_of_ocaml)
Ignoring the missing 1-2 years category, where I would assume people rounded up or down, there is much more representation for... older... OCaml programmers. This likely impacts things like "New features" when looking at the data from a holistic point of view (i.e. not subdividing by proficiency).
Communities can sometimes take better care of some users than others depending on their proficiency. The next graph shows how welcome users of different proficiency feel in the OCaml community. Overall I would say it is fairly good, with perhaps a slight bias towards expert users. There are still enough disagrees to warrant some more community investigation, I think.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], welcome, "How welcome users feel for ", agree=True, ylabel="Proportion of users", normalize=True)
ylabel = "Proportion of user type"
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], state_of_the_art, "State of the art for ", ylabel)
Different, more useful, patterns emerge from the data when filtered by user type (beginner, intermediate, advanced and expert). Some of the key points include:
In fact let's see how users feel about how well libraries are documented when split across these categories as well.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], lib_doc, "Are user libraries well documented for ", ylabel="Proportion of Users", agree=True)
Hmm, yeah. Less proficient OCamlers rate library documentation lower. Not many really think it is either dreadful or great. One problem, anecdotally, has been the lack of full application-oriented documentation, as opposed to API documentation.
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], new_feature, "Desired new feature for ", ylabel="Proportion of Users", normalize=True)
Well, no real surprises here:
plot(df, proficiency, ["beginner", "intermediate", "advanced", "expert"], years_of_ocaml, "How long does it take to get to the OCaml level of a ", ylabel="Proportion of Users", normalize=True)
These plots can be read as: x% of intermediate users have been using OCaml for z years. For example, roughly 60% of beginners have been using OCaml for less than a year.
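The same reading can be reproduced as a table rather than a plot. A minimal sketch, assuming made-up toy data and short column names (`proficiency`, `years`) that stand in for the survey's long question strings:

```python
import pandas as pd

# Toy data, made up for illustration; not taken from the survey.
toy = pd.DataFrame({
    "proficiency": ["beginner", "beginner", "beginner", "intermediate", "intermediate"],
    "years": ["less than 1 year", "less than 1 year", "2-5 years", "2-5 years", "5-10 years"],
})

# normalize="index" makes every row sum to 1, so each cell reads as
# "share of users at this proficiency who have used OCaml for this long".
table = pd.crosstab(toy["proficiency"], toy["years"], normalize="index")
print(table.round(2))
```

Each row then corresponds to one subplot: the beginner row of this toy table says two thirds of the toy beginners have used OCaml for less than a year.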
This interesting plot looks at how long it takes OCaml developers to reach different stages of proficiency. Of course, nothing here is definitive: OCaml isn't necessarily people's main language, so some have been using it for 10+ years but are still intermediate (plus the proficiencies are subjective).
Looking at fluency in other languages might hint at where people have come from and/or what other languages they use in conjunction with OCaml.
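def split_plot(col, new_col_name):
    # Split the ';'-separated answers into one column per answer.
    lists = df[col].str.split(';', expand=True)
    # Stack into one row per (respondent, answer); missing answers are dropped.
    split_df = lists.stack().to_frame().reset_index()
    split_df.columns = ["index", "", new_col_name]
    # Bring the respondent's other columns back alongside each answer.
    split_df = split_df.join(df, on="index")
    return split_df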
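For comparison, the same reshaping can be sketched with `Series.str.split` followed by `DataFrame.explode`. This is an alternative under stated assumptions, not the notebook's method: `split_rows` is a hypothetical helper, and the toy frame and its column names are made up for illustration.

```python
import pandas as pd

def split_rows(frame, col, new_col_name, sep=";"):
    """Return one row per individual answer (hypothetical helper)."""
    out = frame.copy()
    out[new_col_name] = out[col].str.split(sep)
    # explode repeats the other columns once for each list element.
    out = out.explode(new_col_name)
    # Drop respondents who left the question blank, as stack() does by default.
    return out.dropna(subset=[new_col_name]).reset_index(drop=True)

# Toy data, made up for illustration; not taken from the survey.
toy = pd.DataFrame({
    "proficiency": ["beginner", "expert"],
    "langs": ["Python;C", "Haskell"],
})
long_df = split_rows(toy, "langs", "languages")
print(long_df[["proficiency", "languages"]])
```

Taking the frame as a parameter, instead of reading a global `df`, also makes the helper easier to try on small examples.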
# My lack of pandas and python skills are showing here
lang_df = split_plot(langs, "languages")
plot(lang_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "languages", "Other fluent languages for ", ylabel="Occurrences of language within a specific user proficiency", normalize=False)
Some initial thoughts:
dom_df = split_plot(domains, "domains")
print("Note: answers with fewer than 3 occurrences have been dropped for easier viewing")
plot(dom_df, proficiency, ["beginner", "intermediate", "advanced", "expert"], "domains", "Application domains for ", ylabel="Occurrences of application domain for specific user", normalize=False, drop=3)
A not so surprising pattern emerges here about what the most common applications are:
This is quite interesting. Is this reflected in the years spent using OCaml?
plot(dom_df, years_of_ocaml, year_vals, "domains", "Application domains for OCaml users of ", ylabel="Occurrences of application domain", normalize=False, drop=3)