<p>Daoud Clarke - <a href="http://daoudclarke.github.com">daoudclarke.github.com</a> - last updated 2023-12-25</p>
<h1>Why the world needs a non-profit search engine</h1>
<p>2022-07-10 - <a href="http://daoudclarke.github.com/search%20engines/2022/07/10/non-profit-search-engine">permalink</a></p>
<p>Sometimes I forget why I’ve taken on this crazy, huge task. Why am I <a href="https://github.com/mwmbl/">building</a> a <a href="https://mwmbl.org">search engine</a>? Will
it really be better than Google one day? Will people support it? Will people even use it?</p>
<p>And then I read something like <a href="https://pxlnv.com/blog/bullshit-web/">The Bullshit Web</a> and I remember that, yes, there
is a point. Even if I make the web better for one person, it’s worth it. Because the way things are is just wrong.</p>
<p>Search engines are in a unique position to fix the situation. Not only do we create a view on the world’s knowledge, we
influence it too. If we promote bullshit-free sites, then people will create more bullshit-free sites.</p>
<p>More importantly, search engines are a filter on the world’s
knowledge. Do you really want your filter to be “whatever makes
$SEARCH_ENGINE more money”, particularly when that means, “show ads
instead of search results, and prioritise search results that also
make us more money”? We can and should do better.</p>
<h3 id="what-will-it-mean-in-practice">What will it mean in practice?</h3>
<p>What would the ideal non-profit search engine look like in practice,
and what would make it better than what we have now?</p>
<ol>
<li><strong>Ad-free.</strong> Ad-blockers can achieve this now, but they
are clearly not a sustainable solution if everyone were to use
them.</li>
<li><strong>Open source.</strong> The technology for organising the world’s
knowledge should be owned by everyone.</li>
<li><strong>Profit-agnostic ranking.</strong> Google has an incentive to
rank pages that contain Google ads highly, because those pages earn
it more revenue. More generally, Google has an incentive to rank
profit-making sites higher so that they make more money. This both
gives those sites more money to spend on advertising, and makes them
dependent on their Google ranking, so they are more likely to
spend on advertising should their ranking get worse.</li>
<li><strong>Community powered ranking.</strong> Google tries to work out
which sites are interesting by how long you spend on them. This has
an unfortunate side effect for e.g. recipe sites, which have an
absurd incentive to bury the actual recipe under a ton of
background and repetitive description, to make it more likely that
you get distracted on the way there. Instead of looking at how long
people spend on a site, we would encourage users to give explicit
feedback on rankings and use this to improve our ranking system.</li>
<li><strong>Fast.</strong> Google search is surprisingly slow, taking up to half a
second for a page load to complete in my measurements. It doesn’t
need to be this way. In 2010, Google announced <a href="https://searchengineland.com/google-instant-complete-users-guide-50136">Instant
Search</a>,
which would search as you typed. This was meant to <a href="https://www.theatlantic.com/technology/archive/2010/09/the-pros-and-cons-of-google-instant/62666/">save users two
to five seconds per
search</a>. Yet
Google quietly <a href="https://searchengineland.com/google-dropped-google-instant-search-279674">dropped the
feature</a>
in 2017, ostensibly to bring search more in line with mobile. I do
wonder, though, whether the change was motivated more by some
requirement around adverts. It must be hard to auction
adverts in real time as users type, particularly if you want the
adverts to blend into the search results.</li>
<li><strong>Frictionless.</strong> Google has an incentive to show you a results
page, so that you see some adverts and are thus more likely to
click on them. But often you don’t need a results page at all:
if you’re typing “facebook” or “hmrc login” you <em>could</em> go
straight there from the address bar. Google, though, wants you to
see a results page first.</li>
</ol>
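<p>To make point 4 concrete, here is one way explicit feedback could feed into ranking. This is a hypothetical sketch, not how Mwmbl is implemented: it blends a base relevance score with a Laplace-smoothed approval rate from user votes, so a handful of votes nudges the ranking without dominating it.</p>

```python
def rerank(results, votes, weight=0.3):
    """Re-order search results by blending a base relevance score
    with explicit user feedback (upvotes and downvotes).

    results: list of (url, base_score) pairs, base_score in [0, 1]
    votes:   dict mapping url -> (upvotes, downvotes)
    """
    def adjusted(url, base):
        up, down = votes.get(url, (0, 0))
        # Laplace-smoothed approval rate, so a single vote
        # can't dominate the base relevance score
        approval = (up + 1) / (up + down + 2)
        return (1 - weight) * base + weight * approval

    return sorted(results, key=lambda r: adjusted(*r), reverse=True)

results = [("https://example.com/recipe", 0.9),
           ("https://example.org/recipe", 0.8)]
votes = {"https://example.org/recipe": (40, 2)}  # strongly upvoted
print(rerank(results, votes))  # the upvoted page now ranks first
```

<p>With no votes at all, every page gets the neutral approval rate of 0.5, so the base ranking is preserved.</p>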
<p>Our current implementation of <a href="https://mwmbl.org">Mwmbl</a> is a long way
from doing all these things well, but this is what we’re aiming
towards.</p>
<h3 id="the-funding-question">The funding question</h3>
<p>Search funded by advertising is a recipe for disaster because there will always be a conflict of interest. Get the user
to the site as quickly as possible, or show them some ads on the way? Guess which one you will choose if you care about
revenue.</p>
<p>There are two ways forward that I can see:</p>
<ul>
<li>The paid subscription model, like <a href="https://kagi.com">kagi.com</a></li>
<li>Donation funded, non-profit model, like Mwmbl - <a href="https://github.com/sponsors/mwmbl">donate here!</a></li>
</ul>
<p>There is no guarantee that either approach will work - but it’s got to be worth a try.</p>
<h3 id="the-donation-model">The donation model</h3>
<p>While I wish Kagi the best, we have chosen the donation model because
we want to make the best search engine possible available to
everyone. Not everyone can afford to pay for search.</p>
<p>When I read the numbers it makes me feel a little sick. Google’s revenue for search was around <a href="https://searchengineland.com/alphabet-q1-microsoft-q3-earnings-search-advertising-383869">$40 billion in Q1 2022</a>.
A number so large, I can’t even conceive how big it is. Just 1% of 1% of this would be more money than I’d know what to
do with ($4m).</p>
<p>But it also makes me hopeful. If I can create something that just a
tiny fraction of people find useful, then I can create a huge amount
of value. If there is value, then people will pay for it, if we find
the right way to ask. Our current plan is to offer different
sponsorship tiers with intangible rewards, for example virtual badges
displayed on the user’s profile page.</p>
<p>Of course, there are plenty of non-profits in adjacent spaces that have been successful. I think Wikimedia is the best
example to look up to. Also, I believe our values are very closely aligned. If they were open to collaborating or taking
on this project then I would seriously consider it, because I think it would give it a much greater chance of success.</p>
<h3 id="what-next">What next?</h3>
<p>Our current goals are:</p>
<ul>
<li>Index 1 billion pages a month. Help us by <a href="https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/">installing our Firefox
extension</a>
to crawl the web. Our ranking evaluations have shown that the
biggest improvements come from indexing more pages. So that is our
first priority.</li>
<li>Raise enough money to form an official non-profit
organisation. This will be the first step in making Mwmbl
sustainable, beyond being a side-project for a few people.</li>
<li>Get to £50 monthly recurring revenue to enable us to upgrade our
server (currently costing under €5 a month) - <a href="https://github.com/sponsors/mwmbl">donate
here</a>! This will allow us to
increase the size of our index, improving our search results.</li>
</ul>
<p>If you’re interested in helping out, we’re <a href="https://github.com/mwmbl/mwmbl/wiki/Open-positions">recruiting
volunteers</a>, or if
you’re a developer, check out the <a href="https://github.com/mwmbl/mwmbl/issues">open issues</a>.</p>
<h3 id="thanks">Thanks</h3>
<p>Thank you so much to all those that have helped out so far, whether by <a href="https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/">donating
your CPU and bandwidth</a> to crawl the web, <a href="https://github.com/sponsors/mwmbl">giving money</a> to cover our
costs, giving your time and skills to <a href="https://github.com/mwmbl/mwmbl/issues">fix issues</a> or giving feedback on our
<a href="https://matrix.to/#/#mwmbl:matrix.org">Matrix server</a>.</p>
<p>In particular, <a href="https://matrix.to/#/#mwmbl:matrix.org">Colin Espinas</a>
has been instrumental in designing and building the new front end, and
supporting the development of the extension - thanks Colin!</p>
<h1>Sonnet for Mum in Hospital</h1>
<p>2018-08-21 - <a href="http://daoudclarke.github.com/poetry/2018/08/21/sonnet-for-mum-in-hospital">permalink</a></p>
<p>My love she lies alone, she cannot move
<br />And yet her spirit soars across the sky
<br />She’s held aloft by words enough to prove
<br />Sustainer’s breath can make a woman fly</p>
<p>When I am gone I feel I’m still with you
<br />I leave a piece of me when I depart
<br />This truth - how could it not be true
<br />A part of you could never be apart</p>
<p>What gift I have to give is given now
<br />What song I have to sing already sung
<br />You are the only qibla when I bow
<br />You are the dhikr moistening my tongue</p>
<p>O Lord! Shine down on her Your loving light!
<br />Let angels keep her comfort in the night</p>
<h1>Everything you wanted to know about chatbot platforms, and some things you didn't</h1>
<p>2018-03-21 - <a href="http://daoudclarke.github.com/chatbots/2018/03/21/chatbot-platform-reviews">permalink</a></p>
<p>It’s quite overwhelming how many chatbot platforms there are. How
should I know which one to choose? I created
<a href="https://chatbottech.io/">ChatbotTech</a> to solve this problem. It’s a
site that reviews and rates all* the chatbot platforms so you don’t
have to.</p>
<p>*some. In fact it’s quite hard to know which ones to review, there are
so many. So far, I’ve opted for this strategy: if I see it, I put it
on my list to review. If you spot one I’ve missed that you think I
should review, let me know.</p>
<h1>Manifesto for an Intelligent Chatbot Platform</h1>
<p>2018-02-06 - <a href="http://daoudclarke.github.com/chatbots/2018/02/06/manifesto-for-a-new-chatbot-platform">permalink</a></p>
<p>I’ve always found it frustrating that chatbot developers seem to be
satisfied with frameworks that don’t even attempt to mimic anything
close to real intelligence. There seem to be two basic approaches:</p>
<ul>
<li>Intent-based approaches, in which a query is mapped to a template,
perhaps with some slots to be filled, for example, a query like “I
need a train to Norwich” would prompt the chatbot to question the
user with the goal of filling in slots relating to the departure
location and the desired arrival time.</li>
<li>Tree-based approaches, where there is a tree of possibilities the
user can explore, kind of like the “choose your own adventure” books
I used to read as a kid. This is useful in informational settings
like customer support, where giving a useful response depends on
exploring a tree of possibilities to determine the user’s problem.</li>
</ul>
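<p>The slot-filling loop behind the intent-based approach can be sketched in a few lines. The intent and slot names below are invented for illustration; real frameworks differ in the details:</p>

```python
# An intent is a set of slots; the bot asks questions until every
# slot is filled. Slot extraction itself (pulling "Norwich" out of
# the query) is assumed to happen elsewhere.
INTENTS = {
    "book_train": {
        "destination": "Where would you like to go?",
        "departure": "Where are you travelling from?",
        "arrival_time": "When do you need to arrive?",
    }
}

def next_prompt(intent, filled):
    """Return the question for the first unfilled slot, or None if done."""
    for slot, question in INTENTS[intent].items():
        if slot not in filled:
            return question
    return None

# "I need a train to Norwich" fills the destination slot...
filled = {"destination": "Norwich"}
print(next_prompt("book_train", filled))  # ...so the bot asks about departure
```

<p>Once <code>next_prompt</code> returns <code>None</code>, the template is complete and the bot can act on it.</p>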
<p>Sometimes these are augmented with features like the ability to
remember past slot values, which improves the perception that the
chatbot knows what’s going on. But as soon as the user steps off the
beaten track, the chatbot gets confused and the user experience
suffers. And what if you want to combine the tree-based approach
with an intent-based one? So far there’s no clean way of doing
this.</p>
<p>I believe there is a better way, and that’s why I’ve started working
on my own chatbot framework. These are the design goals of the
framework:</p>
<ol>
<li>The chatbot should automatically choose the next best action out
of all possible actions</li>
<li>The chatbot should learn which responses are most likely, and
optimise its behaviour accordingly</li>
<li>The chatbot behaviour should be specified by independent modules
that can be combined freely</li>
</ol>
<p>As an example, I’ll describe how these goals could work out in
practice for a bot to allow users to make purchases on an e-commerce
website. Imagine a user is buying their weekly supermarket shop from
BigMart. The conversation might go something like this:</p>
<p>“I’d like some apples.”</p>
<p>“How about 6 Russet apples for £1.20 or 12 Golden Delicious for
£1.70?”</p>
<p>“I’ve just remembered I need milk”</p>
<p>“1 litre of whole milk like last time?”</p>
<p>“Yes”</p>
<p>“Do you still want apples?”</p>
<p>“Yes, the Golden Delicious.”</p>
<p>“Ok.”</p>
<p>In this conversation, the bot has remembered that the user wanted
apples, even after the distraction of buying milk. The point is that
this behaviour shouldn’t need to be explicitly planned by the bot
designer: the bot should automatically know that the user has a goal
of buying apples that needs to be fulfilled. Also, note that the bot
has learnt the type of milk that the user likes to buy, which saves
the user time. Again, this behaviour should be built into the platform
rather than needing to be programmed by the bot designer.</p>
<p>If the bot designer does not need to program these behaviours, what
would bot development look like? We envisage three types of bot
“modules” that can be developed:</p>
<ul>
<li>Modules that specify the <em>style of conversation</em> between the bot
and the user. This allows the bot designer to specify the preferred
expressions to be used by the bot when interacting with the
user. For example, you could write a module for bots to talk like
pirates, perhaps restricted to a specific domain.</li>
<li>Modules that describe <em>world knowledge</em>. For example, you might try
and write a bot that helps people choose the correct visa for a
journey to the UK (this is actually something I’ve done before, and
it’s a non-trivial problem). Such a bot would need to know about
the different types of visa available, the conditions for each one,
their cost and so on.</li>
<li>Modules that endow the bot with <em>new abilities</em>. For example, a
module may allow a bot to interact with a specific API. Different
e-commerce bots could then choose the correct API module for their
e-commerce platform, while re-using the same style and world
knowledge modules as other bots.</li>
</ul>
<p>The three goals combined should make it very easy for a bot designer
to create a bot: in the most common case, their job would be to simply
choose the best modules for their application, customizing each one
according to their needs.</p>
<h2 id="is-it-possible">Is it possible?</h2>
<p>I can hear you thinking, “It’s all very well having such lofty goals,
but is it achievable?” I believe it is, and in this section I will
outline my proposed solution.</p>
<p><img src="/img/chatbot-platform.png" alt="Chatbot platform architecture" /></p>
<p>The above diagram is a very rough idea of what the new platform might
consist of. My goal is just to show how I think the proposed goals can
be achieved using existing technology. I’ll try and flesh out in
future posts what each component might look like, but for now, here’s
a high level summary, following the diagram anti-clockwise from the
user:</p>
<ul>
<li>A natural language query or response from the user is
received. This is parsed by a <em>semantic parser</em>. I’ve not found a
good concise description of semantic parsing on the interwebs,
which is strange, but it’s not the same as parsing (although
similar) and it’s not the same as (traditional) semantics. A
semantic parser takes a natural language expression and translates
it to some “logical form” where the logical form is anything that a
computer would naturally understand, such as a SQL query, an
expression in first order logic, a JSON string or an “intent”. The
typical application is to use natural language to perform database
queries. Anyway, this is a well studied sub-field of natural
language processing (despite its lack of a Wikipedia page). An
example of an almost-state-of-the-art system is the
<a href="https://nlp.stanford.edu/software/sempre/">SEMPRE system</a> from
Stanford.</li>
<li>A <em>planning system</em> then chooses the next best action to take given
its knowledge of the current state of the world and the latest user
input. This problem is also a well studied one. A very general way
of describing planning problems is something called a
<a href="https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process">Partially Observable Markov Decision Process</a>,
or POMDP (pronounced “pom dee pee”) for short. In fact, POMDPs have
been used to plan dialogue, as described in
<a href="http://mi.eng.cam.ac.uk/~sjy/papers/ygtw13.pdf">this overview by Steve Young at Cambridge</a>.
My idea is to use
<a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte-Carlo tree search</a>
to solve our planning problem, an approach described in
<a href="https://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-pomdps">this paper from NIPS 2010</a>.
I’m really excited about the potential for Monte-Carlo tree search
to do something other than playing games really well (in case you
didn’t know it’s a large component of
<a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a>). The planning
system makes use of the Knowledge Modules provided by the bot
designer to inform the decisions it makes.</li>
<li>Once an action has been decided upon, an <em>action interpreter</em> makes
use of the ability modules provided by the bot designer to perform
actions on external APIs, or passes on a logical form to the next
system to send a response to the user.</li>
<li>A <a href="https://en.wikipedia.org/wiki/Natural_language_generation"><em>natural language generator</em></a>
interprets the logical forms and sends the response back to the
user. The generator can make use of the style modules to determine
the best expression for each logical form.</li>
</ul>
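<p>To make the anti-clockwise flow concrete, here is a toy skeleton of the four components wired together. Every function is a stub standing in for the real system described above; the names and logical forms are made up for illustration:</p>

```python
# Toy pipeline: semantic parser -> planner -> action interpreter -> generator.

def semantic_parser(utterance):
    # A real parser (e.g. SEMPRE) maps text to a logical form;
    # here we fake it with keyword spotting.
    if "apples" in utterance:
        return {"intent": "add_item", "item": "apples"}
    return {"intent": "unknown"}

def planner(logical_form, state):
    # A real planner would search a POMDP (e.g. with Monte-Carlo tree
    # search); here we just pick the obvious next action.
    if logical_form["intent"] == "add_item":
        state["basket"].append(logical_form["item"])
        return {"action": "confirm", "item": logical_form["item"]}
    return {"action": "clarify"}

def action_interpreter(action, state):
    # Would invoke ability modules / external APIs; a no-op in this sketch.
    return action

def generator(action):
    # Style modules would choose the phrasing; one template suffices here.
    if action["action"] == "confirm":
        return f"Added {action['item']} to your basket."
    return "Sorry, could you rephrase that?"

def respond(utterance, state):
    logical_form = semantic_parser(utterance)
    action = action_interpreter(planner(logical_form, state), state)
    return generator(action)

state = {"basket": []}
print(respond("I'd like some apples", state))  # Added apples to your basket.
```

<p>The point of the structure is that each stub can be swapped for a serious component independently, which is exactly the modularity the three design goals demand.</p>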
<p>Hopefully this is enough to convince you that the plan is not entirely
crazy. Each component is well studied (at least in a research
setting), so it is not too far-fetched to assume that they can be put
together into something useful. The biggest uncertainty in my mind is
around the planning system, and exactly how this will work
effectively. I plan to flesh that out in a future blog post.</p>
<p>Some readers may be disappointed that I’m not proposing some
new-fangled deep learning technique to solve this humongous
problem. In fact, I’m pretty much proposing the same good old
fashioned AI techniques that were popular in the 70s and 80s. Actually
I think systems built in that time period got a lot of things right,
but the individual components were not developed enough to make the
system as a whole a success, at least when applied to a general
setting. In fact, in some cases, the improvements in the individual
components are because of algorithmic developments like deep learning,
along with the abundance of data and computing power. There is
definitely potential for making use of deep learning to improve the
three major components of the system:</p>
<ul>
<li><a href="https://arxiv.org/abs/1706.04326">Here’s a paper</a> on using deep
learning for semantic parsing</li>
<li>Deep learning was a large part of AlphaGo’s success so it can
definitely be used to improve
planning. <a href="https://arxiv.org/pdf/1507.06527.pdf">Here’s a paper</a> on
using deep learning to solve POMDPs which happen to be Atari games
(what is it with the games?).</li>
<li>And <a href="http://www.cs.umd.edu/~miyyer/pubs/2014_nips_generation.pdf">here’s a paper from NIPS 2014</a>
on natural language generation using deep learning. Also
<a href="https://en.wikipedia.org/wiki/Language_model">language modeling</a>
is often an important component in natural language generation, and
neural networks have been very successful at this task.</li>
</ul>
<p>It’s almost inevitable that deep learning will take over most
components of my proposed system at some point. But they are not
essential, at least initially.</p>
<p>But still, I should probably try and answer the question of why not
build a single big deep net to rule them all? One answer is that we
don’t know how to do this yet. But even if it were possible, I do not
think I would want to try and do this. The answer is engineering. When
I know how each component is supposed to work, I can fix it. When a
deep net doesn’t work, all I can do is add more data and tweak the
algorithm, which may or may not solve the problem (and may introduce
new ones).</p>
<p>The argument I’m trying to make here is that natural language
interfaces should be a solved problem, given that we have such
sophisticated components around now, and all it requires is putting
them together in the right way and engineering the thing correctly. Of
course, that’s still a huge challenge, but one I’m quite excited about
undertaking. I like big challenges.</p>
<h1 id="practical-considerations">Practical considerations</h1>
<p>Now I can hear you thinking “It’s all very well taking on such a grand
challenge, but who’s going to pay for it?” One option would be to try
and build this thing in academia; after all, I’ve done the academic
thing, so it should be possible to follow that route. The problem is,
speaking with my metaphorical mortarboard on and only for myself, we
academics tend not to build useful things. You see, our motivation is
naturally skewed towards publishing papers, which is what academics
do, rather than building something that people actually want. And if
we can squeeze a few percentage points of improvement out of a
problem, we can publish a paper.</p>
<p>So if not academia, then what? I happen to find myself in the lucky
situation of having some spare time at the moment. My current contract
requires me to work only 15 hours a week, so that leaves plenty of
spare time. I could try and build this just for fun, as a side
project. However, at some point, my spare time will run out and I
suspect this is going to take a lot longer than the three months I
have left on my contract. The next obvious option is to try and build
a company from it. This would either be a traditional startup, with
funding and all the craziness that goes along with it, or a
bootstrapped company. Actually a startup would not be a bad vehicle
for something as ambitious as this. However, there are at least two
reasons I don’t want to go down the traditional startup route:</p>
<ul>
<li>I’m not convinced that this can be a billion dollar business
(yet). The thing is, people want chatbots, but they don’t know they
want what I’m building (although they may well <em>need</em> it). At some
point in the future that may change.</li>
<li>I don’t personally enjoy the pressure to grow quickly that comes
with a startup. I would rather build a successful and sustainable
business slowly. That’s particularly true because I think it’s
going to take a long time to build this properly. Also, I don’t
hold with “stealth mode” - I’d rather do this out in the open.</li>
</ul>
<p>So the current plan is for a simple bootstrapped company selling
chatbots as a service. Yes, I know, there are a lot already, but I
think there is space for another. The market is predicted to grow
quickly in the next few years - we’ll see whether this turns out to be
true or not. Of course I won’t be building my full crazy idea above
straight away; I will only build each component properly as it is
needed, and instead focus on building something that people want,
preferably in a focused niche.</p>
<p>One niche that I think is likely to be profitable is
chatbots for marketing, specifically,
<a href="https://blog.whatshelp.io/3-strategies-to-use-messenger-bot-and-facebook-ads-for-lead-generation-8a40f033510c">Facebook Messenger bots as a landing point for Facebook ads</a>.
So, if you’re interested in this idea, please get in touch! I think it
has a lot of potential for increasing the return on investment of
Facebook ads.</p>
<h1>Benefit People: Thinking Through Culture and Values</h1>
<p>2016-12-22 - <a href="http://daoudclarke.github.com/startups/2016/12/22/culture-and-values">permalink</a></p>
<p>After watching <a href="https://www.youtube.com/watch?v=EMIa3XhQpnk">Michael Skok’s excellent talk on Culture, Vision and
Mission</a>, I was tempted
to have a bash at defining the values that I care about. Here’s my
attempt.</p>
<h3 id="benefit-people">Benefit People</h3>
<p>Benefit as many people as much as possible. The best kind of benefit
is that which helps people benefit people.</p>
<h3 id="make-it-easy">Make it Easy</h3>
<p>Make everything easy. Everything is a conversation. Make conversations
easy. Use simple words and small sentences.</p>
<h3 id="do-it-with-love">Do it with Love</h3>
<p>Love is real. If you do something with love, it will be different,
better.</p>
<h3 id="try-it">Try it</h3>
<p>Every idea is possible. Experiment.</p>
<h3 id="strength-in-diversity">Strength in Diversity</h3>
<p>Be yourself. Encourage diversity. Accommodate diverse needs.</p>
<h3 id="family-before-work">Family Before Work</h3>
<p>Spend time with your family. Take time for yourself.</p>
<h3 id="trust">Trust</h3>
<p>Evaluate yourself. Trust others.</p>
<h1>My favourite ICML 2015 papers - part two</h1>
<p>2015-07-08 - <a href="http://daoudclarke.github.com/machine%20learning%20in%20practice/2015/07/08/icml2015-favourite-papers-day2">permalink</a></p>
<p>Yesterday I <a href="/machine%20learning%20in%20practice/2015/07/07/icml2015-favourite-papers-day1">posted</a>
on my favourite papers from the beginning of ICML (some of those
papers were actually presented today, although the posters were
displayed yesterday). Here’s today’s update, which includes some
papers to be presented tomorrow, because the posters were on display
today…</p>
<h2 id="neural-nets">Neural Nets</h2>
<h3 id="unsupervised-domain-adaptation-by-backpropagation"><a href="http://jmlr.org/proceedings/papers/v37/ganin15.pdf">Unsupervised Domain Adaptation by Backpropagation</a></h3>
<p><em>Yaroslav Ganin, Victor Lempitsky</em></p>
<p>Imagine you have a small amount of labelled training data and a lot of
unlabelled data from a different domain. This technique will allow you
to build a neural network model that fits the unlabelled domain. The
key idea is super cool and really simple to implement. You build a
network that optimises features such that it is difficult to
distinguish which domain the data came from.</p>
<h3 id="weight-uncertainty-in-neural-networks"><a href="http://jmlr.org/proceedings/papers/v37/blundell15.pdf">Weight Uncertainty in Neural Networks</a></h3>
<p><em>Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra</em></p>
<h3 id="probabilistic-backpropagation-for-scalable-learning-of-bayesian-neural-networks"><a href="http://jmlr.org/proceedings/papers/v37/hernandez-lobatoc15.pdf">Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks</a></h3>
<p><em>Jose Miguel Hernandez-Lobato, Ryan Adams</em></p>
<p>These papers have a very similar goal, namely making neural networks
probabilistic. This is cool because it allows you to not only make a
decision, but know <em>how sure you are about the decision</em>. There are a
bunch of other benefits: you don’t need to worry about regularisation,
hyperparameter tuning is easier etc.</p>
<p>Anyway, the two papers achieve this in two different ways. The first
uses Gaussian scale mixtures together with a clever trick to
backpropagate expectations. The second one computes the distribution
after rectifying and then approximates this with a Gaussian
distribution. Either way, this is an exciting development for neural
networks.</p>
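<p>The shared idea, a prediction plus a measure of confidence from a distribution over weights, can be illustrated with a toy one-layer network. The posterior means and standard deviations below are made up; this samples networks by brute force rather than implementing either paper’s actual method:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Imaginary learned posterior over two weights: a mean and an
# uncertainty for each, instead of a single point estimate.
w_mean = np.array([0.5, -1.0])
w_std = np.array([0.05, 0.3])

def predict(x, n_samples=1000):
    # Sample whole networks from the weight posterior and run each one;
    # the spread of the outputs tells us how sure the model is.
    w = rng.normal(w_mean, w_std, size=(n_samples, 2))
    outputs = w @ x
    return outputs.mean(), outputs.std()

mean, std = predict(np.array([1.0, 1.0]))
print(f"prediction {mean:.2f} with uncertainty {std:.2f}")
```

<p>The second weight has a wider posterior, so inputs that lean on it produce less confident predictions, which is exactly the behaviour a point-estimate network can’t give you.</p>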
<h3 id="training-deep-convolutional-neural-networks-to-play-go"><a href="http://jmlr.org/proceedings/papers/v37/clark15.pdf">Training Deep Convolutional Neural Networks to Play Go</a></h3>
<p><em>Christopher Clark, Amos Storkey</em></p>
<p>Although I’ve never actually played the game, I have an interest in AI
Go players, because it’s such a hard game for computers, which still
can’t reach the level of human players. The current state of the art
uses <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a>
which is a really cool technique. The authors of this paper use neural
networks to play the game but don’t quite achieve the same level of
performance. I asked the author whether the two approaches could be
combined, and they think they can! Watch this space for a new state of
the art Go player.</p>
<h2 id="natural-language-processing">Natural Language Processing</h2>
<h3 id="phrase-based-image-captioning"><a href="http://jmlr.org/proceedings/papers/v37/lebret15.pdf">Phrase-based Image Captioning</a></h3>
<p><em>Remi Lebret, Pedro Pinheiro, Ronan Collobert</em></p>
<p>This is a new state of the art in this very interesting task of
labelling images with phrases. The clever bit is in the syntactic
analysis of the phrases in the training set, which often follow a
similar pattern. The authors use this to their advantage: the model is
trained on the individual sub-phrases that are extracted, which allows
it to behave compositionally. This means that it can describe, for
example, both the fact that a plate is on a table, and that there is
pizza on the plate. Unlike previous approaches, the sentences that are
generated are not often found in the training set, which shows
that it is doing real generation and not retrieval. Exciting stuff!</p>
<h3 id="bimodal-modelling-of-source-code-and-natural-language"><a href="http://jmlr.org/proceedings/papers/v37/allamanis15.pdf">Bimodal Modelling of Source Code and Natural Language</a></h3>
<p><em>Miltos Allamanis, Daniel Tarlow, Andrew Gordon, Yi Wei</em></p>
<p>Another fun paper; this one tries to generate source code given a
natural language query, quite an ambitious task! It is trained on
snippets of code extracted from StackOverflow.</p>
<h2 id="optimisation">Optimisation</h2>
<h3 id="gradient-based-hyperparameter-optimization-through-reversible-learning"><a href="http://jmlr.org/proceedings/papers/v37/maclaurin15.pdf">Gradient-based Hyperparameter Optimization through Reversible Learning</a></h3>
<p><em>Dougal Maclaurin, David Duvenaud, Ryan Adams</em></p>
<p>Hyperparameter optimisation is important when training neural networks
because there are so many of the things floating around. How do you
know what to set them to? Normally you have to perform some kind of
search on the space of possible parameters, and Bayesian techniques
have been very helpful at doing this. This paper suggests something
entirely different and completely audacious. The authors are able to
compute gradients for hyperparameters using automatic differentiation
<em>after going through a whole round of stochastic gradient descent
learning</em>. That’s quite a feat. What this means is that we can answer
questions about what the optimal hyperparameter settings look like in
different settings - and makes a whole set of things that was
previously a “black art” a lot more scientific and
understandable.</p>
<h2 id="and-more">And more…</h2>
<p>There were many more interesting papers - too many to write up
here. Take a look at the <a href="http://icml.cc/2015/?page_id=825">schedule</a>
and find your favourite! Let me know on <a href="https://twitter.com/daarkecloud">Twitter</a>.</p>
<h1>My favourite papers from day one of ICML 2015</h1>
<p>2015-07-07 - <a href="http://daoudclarke.github.com/machine%20learning%20in%20practice/2015/07/07/icml2015-favourite-papers-day1">permalink</a></p>
<p>Aargh! How can I possibly keep all the amazing things I learnt at ICML
today in my head?! Clearly I can’t. This is a list of pointers to my
favourite papers from today, and why I think they are cool. This is
mainly for my benefit, but you might like them too!</p>
<h2 id="neural-nets--deep-learning">Neural Nets / Deep Learning</h2>
<h3 id="bilbowa-fast-bilingual-distributed-representations-without-word-alignments"><a href="http://jmlr.org/proceedings/papers/v37/gouws15.pdf">BilBOWA: Fast Bilingual Distributed Representations without Word Alignments</a></h3>
<p><em>Stephan Gouws, Yoshua Bengio, Greg Corrado</em></p>
<p><strong>Why this paper is cool:</strong> It simultaneously learns word vectors for
words in two languages without having to learn a mapping between
them.</p>
<h3 id="compressing-neural-networks-with-the-hashing-trick"><a href="http://jmlr.org/proceedings/papers/v37/chenc15.pdf">Compressing Neural Networks with the Hashing Trick</a></h3>
<p><em>Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, Yixin Chen</em></p>
<p><strong>Why this paper is cool:</strong> Gives a huge reduction (32x) in the amount
of memory needed to store a neural network. This means you can
potentially use it on low memory devices like mobile phones!</p>
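<p>The trick itself is simple to sketch (hypothetical code, not the authors’): hash each virtual connection into a small shared buffer of real weights, so an arbitrarily large weight matrix only ever costs <code>K</code> stored parameters:</p>

```python
import numpy as np
import zlib

K = 1000  # real parameters shared by all virtual weights
real_w = np.random.default_rng(0).normal(size=K)

def virtual_weight(layer, i, j):
    # deterministically map connection (layer, i, j) to one of K buckets
    # (crc32 rather than hash(), which is salted per-process for strings)
    key = f"{layer}:{i}:{j}".encode()
    return real_w[zlib.crc32(key) % K]

# a 1000x1000 "weight matrix" now needs only K stored values
w_00 = virtual_weight(0, 0, 0)
```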
<h3 id="batch-normalization-accelerating-deep-network-training-by-reducing-internal-covariate-shift"><a href="http://jmlr.org/proceedings/papers/v37/ioffe15.pdf">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a></h3>
<p><em>Sergey Ioffe, Christian Szegedy</em></p>
<p><strong>Why this paper is cool:</strong> Makes deep neural network training super
fast, giving a new state of the art for some datasets.</p>
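<p>The core operation is easy to write down. A minimal numpy sketch of the normalisation step (the learned scale and shift, gamma and beta, are as in the paper; in a real network they are trained per feature):</p>

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalise each feature over the mini-batch, then scale and shift
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

batch = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 10))
normed = batch_norm(batch)
```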
<h3 id="deep-learning-with-limited-numerical-precision"><a href="http://jmlr.org/proceedings/papers/v37/gupta15.pdf">Deep Learning with Limited Numerical Precision</a></h3>
<p><em>Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan</em></p>
<p><strong>Why this paper is cool:</strong> Train neural networks with very limited
fixed precision arithmetic instead of floating points. The key
insight is to use randomness to do the rounding. The goal is to
eventually build custom hardware to make learning much faster.</p>
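<p>The rounding trick is easy to sketch in numpy (hypothetical code): round up with probability equal to the fractional part, which makes the quantisation unbiased in expectation:</p>

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    # fixed-point quantisation: round up with probability equal to the
    # fractional remainder, so E[stochastic_round(x)] = x
    rng = rng or np.random.default_rng(0)
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    frac = scaled - floor
    return (floor + (rng.random(np.shape(x)) < frac)) / scale

# averaging many quantisations of the same value recovers it
x = np.full(100_000, 0.12345)
approx = stochastic_round(x).mean()
```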
<h2 id="recommendations-etc">Recommendations etc.</h2>
<h3 id="fixed-point-algorithms-for-learning-determinantal-point-processes"><a href="http://jmlr.org/proceedings/papers/v37/mariet15.pdf">Fixed-point algorithms for learning determinantal point processes</a></h3>
<p><em>Zelda Mariet, Suvrit Sra</em></p>
<p><strong>Why this paper is cool:</strong> If you want to recommend a set of things,
rather than just an individual thing, how do you choose the best
set? This will tell you.</p>
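<p>To get a feel for why determinants capture diversity, score a candidate set by the determinant of the similarity kernel restricted to that set (a toy sketch; the kernel values here are made up):</p>

```python
import numpy as np

# similarity kernel: items 0 and 1 are near-duplicates, item 2 is different
L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

def set_score(L, S):
    # determinant of the kernel restricted to the set S
    return np.linalg.det(L[np.ix_(S, S)])

redundant = set_score(L, [0, 1])  # 1 - 0.9**2 = 0.19
diverse = set_score(L, [0, 2])    # 1 - 0.1**2 = 0.99
```

<p>The redundant pair scores low and the diverse pair scores high; the paper is about learning the kernel <code>L</code> itself from observed sets.</p>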
<h3 id="surrogate-functions-for-maximizing-precision-at-the-top"><a href="http://jmlr.org/proceedings/papers/v37/kar15.pdf">Surrogate Functions for Maximizing Precision at the Top</a></h3>
<p><em>Purushottam Kar, Harikrishna Narasimhan, Prateek Jain</em></p>
<p><strong>Why this paper is cool:</strong> If you only care about the top <em>n</em> things
you recommend, this technique works faster and better than other
approaches.</p>
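<p>Precision at the top is simple to <em>evaluate</em>, even though it is hard to optimise directly. A minimal sketch with made-up scores:</p>

```python
import numpy as np

def precision_at_k(scores, labels, k):
    # fraction of relevant items among the k highest-scoring ones
    top = np.argsort(scores)[::-1][:k]
    return np.asarray(labels)[top].mean()

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 0, 1, 0])
p_at_2 = precision_at_k(scores, labels, 2)
```

<p>The paper’s contribution is a surrogate loss that makes this non-smooth objective trainable.</p>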
<h2 id="and-finally">And Finally…</h2>
<h3 id="learning-to-search-better-than-your-teacher"><a href="http://jmlr.org/proceedings/papers/v37/changb15.pdf">Learning to Search Better than Your Teacher</a></h3>
<p><em>Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, John Langford</em></p>
<p><strong>Why this paper is cool:</strong> A new, general way to do structured
prediction (tasks like dependency parsing or semantic parsing) which
works well even when there are errors in the training set. Thanks to
the authors for talking me through this one!</p>
So you want to be a data scientist? (Part 1)2014-10-11T00:00:00+00:00http://daoudclarke.github.com/machine%20learning%20in%20practice/2014/10/11/data-science-skills
<p>“Data Scientist” is definitely the hot new job
description. It is such a new title that the roles and
responsibilities associated with it are still not clearly
defined. What skills do you need to be a data scientist? Well, not
everyone agrees. In this article we will try to be objective by
looking at statistics on what skills employers recruiting data
scientists are looking for. We will use data from
<a href="http://www.itjobswatch.co.uk/jobs/uk/data%20scientist.do">IT Jobswatch</a>,
which is a fantastic resource for those looking to upskill themselves:
it provides comprehensive statistics on the keywords that recruiters
mention in job advertisements. I will also give my own opinions based
on my experience recruiting data scientists and working as one at
<a href="http://lumi.do">Lumi</a>.</p>
<p>Some caveats:</p>
<ul>
<li>The data is from jobs advertised in the UK, so your mileage may
vary.</li>
<li>Recruiters might not necessarily know what the employer really
<em>needs</em>.</li>
<li><em>Employers</em> might not know what they really need.</li>
</ul>
<p>The last point is important because it is easy to fall prey to the
bandwagon phenomenon: “Everyone is doing <em>big data</em> so we need
to do it too”. The company may not actually have enough data to
benefit from big data technology, and may benefit more from a careful
statistical analysis of the data that they do have.</p>
<p>So, what do you need to know to be a data scientist? Here’s what
recruiters are asking for, broken down into the following sections:</p>
<ul>
<li>Qualifications</li>
<li>Applied skills</li>
<li>Knowledge-based skills</li>
<li>Programming languages</li>
<li>Technologies</li>
</ul>
<h2 id="qualifications">Qualifications</h2>
<p>If you know you want to be a data scientist, and you’re trying to
decide whether to go to university, or what course you might do, then
the answer is quite clear. 37% of data science jobs advertised mention
the word “degree”, and 34% mention “PhD”. A PhD
involving experimentation, numerical analysis or computer programming
may well be beneficial to getting a data scientist position, but
anyone who can demonstrate analytical ability and knowledge of how to
run experiments is in a good position.</p>
<p>Keywords mentioned relating to degrees are (predictably) Mathematics
(50%), Computer Science (37%) and Physics (25%), and any of these
would provide a good foundation for a career in data science. However,
it is not enough to know just Mathematics or Computer Science - as a
data scientist you will need to combine many skills. As someone who
studied physics myself, I am biased, but a typical physics course will
cover many of the things you need to know: how to run experiments, how
to analyse results and how to program - a very good starting point.</p>
<p>Some universities are now offering courses in data science - these
could be a good choice, but are certainly not necessary to get into
the field.</p>
<h2 id="skills">Skills</h2>
<p>Just as for any other type of scientist, data scientists need to be
inquisitive and enjoy solving hard problems. You need to be able to
truly understand and characterise a problem, perhaps describe it
mathematically, break it down, and come up with a plan for a
solution. 48% of job adverts cite “Analytical Skills” - a
catch-all phrase for this type of ability.</p>
<p>Other skills mentioned in data science job adverts are:</p>
<ul>
<li>Data Mining (38%) - a very general area relating to the use of
statistics and machine learning techniques to extract information
from typically unstructured sources of content such as web pages or
log files.</li>
<li>Statistics (37%) - undoubtedly, every data scientist needs to have
a basic grasp of statistics and know best practices for statistical
analysis of data.</li>
<li>Machine Learning (28%) - the science of analysing data to find
patterns and make predictions. This is a very broad area that is
long established in the academic research community; its usage is
only just starting to become widespread in industry.</li>
<li>Visualisation (23%) - in general, communication skills are very
important for a data scientist, and being able to visualise data
in ways that draw out the patterns of interest is a very useful
skill to help communicate ideas.</li>
<li>Finance (19%) - many jobs in data science are in finance - if you
are interested in working in this area, then any knowledge you have
will be beneficial.</li>
<li>Information Retrieval (10%) - the science of search, typically text
documents (think Google search), but also images and sound.</li>
<li>Natural Language Processing (5%) - technologies related to natural
language such as part of speech tagging, parsing, named entity
recognition, machine translation and question answering.</li>
</ul>
<p>I would add to this that it is generally important for all data
scientists to have some general business acumen, so that they can
focus their efforts on tasks that are strategically beneficial to the
company.</p>
<p>Other skills mentioned are: Analytics, Predictive Modelling, Data
Modelling and Data Analysis (66%, 24%, 23% and 19%
respectively). These are either synonyms for, or applications of the
above skills to specific areas, each with a different emphasis.</p>
<h2 id="technologies">Technologies</h2>
<p>56% of jobs mention “Big Data”, a term which basically
means the ability to analyse and work efficiently with very large
datasets. Practically this means experience with Hadoop (49%) and
MapReduce (26%) and perhaps Mahout for machine learning (17%). Other
big data technologies include NoSQL databases (9%) such as MongoDB
(7%), Cassandra (3%) and search technologies such as Elasticsearch
(0.25%) and Solr (0.25%).</p>
<p>Not all data is big however. It is also important to know how to work
with relational databases (16%) such as Oracle (5%), Postgres (3%) and
MySQL (3%), and you will need to have at least a basic knowledge of
SQL (43%).</p>
<p>Other technologies mentioned are more business oriented, such as the
statistics packages SAS (25%) and SPSS (20%); however, the overall
demand trend for both of these technologies seems to be downward.</p>
<h2 id="programming-languages">Programming Languages</h2>
<p>If you want to be a data scientist, you will have to be able to
program, which means you will need to know at least one language. Here
are your options:</p>
<ul>
<li>By far the most popular is R (55%) because of its very comprehensive
set of libraries for scientific computing;</li>
<li>This is followed by Java (45%), because it’s so darned popular
(personally I’m not a fan);</li>
<li>Python (43%) is my favourite because of its resources for machine
learning
(<a href="http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/09/18/why-i-love-scikit-learn/">I wrote a whole article about it</a>):
it’s fast to write, the code is generally very readable, numerical
code runs quickly because the core libraries call into C internally,
and it can be used for production code as well as quick analysis
scripts;</li>
<li>MATLAB (33%) is an old favourite for scientific computing; whilst
it is powerful, unlike other options it is not freely available.</li>
<li>Other popular languages for data scientists are C++
(21%), Scala (17%), C# (7%), Visual Basic (7%), Clojure (7%) and Ruby
(6%).</li>
</ul>
Hyperparameter - Data Science Training2014-09-11T00:00:00+00:00http://daoudclarke.github.com/machine%20learning%20in%20practice/2014/09/11/hyperparameter
<p>Do you want to be a data scientist? Are you a data scientist looking
to gain some new skills?</p>
<p>I am very happy to announce the launch of
<a href="http://www.hyperparameter.com">Hyperparameter</a>, my new company
offering training for data scientists in London. Upcoming courses:</p>
<ul>
<li>Introduction to Python for Data Scientists (20th November 2014)</li>
<li>Introduction to Machine Learning (21st November 2014)</li>
</ul>
<p>We also offer bespoke training courses to meet your needs. Please get
in touch if there is anything we can help you with - email daoud (dot)
clarke (at) gmail (dot) com.</p>
Fear2013-10-29T00:00:00+00:00http://daoudclarke.github.com/poetry/2013/10/29/fear
<p>Did Fear betwixt the winters hide,
<br />In Autumn, donned his mangy hide,
<br />“Hail all, all’s well!” he ravenously cried.
<br />He lied.</p>
<p>And now he darkens darkness skyed,
<br />And now the hands of cold untied,
<br />And now his hunger grows. With every stride
<br />They died.</p>
<p>The mother begs a piece of bread;
<br />She claims the cold’s gone to his head.
<br />“Or maybe you could spare a quid instead,”
<br />She said.</p>
<p>How art thou, banker, in thy lair?
<br />Dost thou know that it’s unfair?
<br />Canst thou see (wouldst thou give the merest care)
<br />Her stare?</p>
<p>When all’s repaid that has been spent
<br />That day there will be no relent
<br />Wouldst thou, before then, offer some consent,
<br />Repent?</p>
17 Great Machine Learning Libraries2013-10-08T00:00:00+00:00http://daoudclarke.github.com/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries
<p><em>After wonderful feedback on my
<a href="/machine%20learning%20in%20practice/2013/09/18/why-i-love-scikit-learn">previous post on Scikit-learn</a>
from the guys at
<a href="http://www.reddit.com/r/MachineLearning/comments/1mq8fb/why_i_love_scikitlearn/">/r/MachineLearning</a>,
I decided to collect the list of machine learning libraries into this
separate note. Let me know if there’s a library that should be
included here.</em></p>
<hr />
<p><strong>Update (15 May 2014):</strong> <em>thanks to Djalel Benbouzid and Dwayne Campbell
for additional suggestions. Sorry it’s taken me so long to add them…</em></p>
<hr />
<h3 id="python">Python</h3>
<ul>
<li><strong><a href="http://scikit-learn.org">Scikit-learn</a></strong>: comprehensive and easy
to use, I wrote <a href="/machine%20learning%20in%20practice/2013/09/18/why-i-love-scikit-learn">a whole article</a>
on why I like this library.</li>
<li><strong><a href="http://pybrain.org/">PyBrain</a></strong>: Neural networks are one thing
that are missing from SciKit-learn, but this module makes up for
it.</li>
<li><strong><a href="http://nltk.org/">nltk</a></strong>: really useful if you’re doing
anything NLP or text mining related.</li>
<li><strong><a href="http://www.deeplearning.net/software/theano/">Theano</a></strong>:
efficient computation of mathematical expressions using
GPU. Excellent for deep learning.</li>
<li><strong><a href="http://deeplearning.net/software/pylearn2/">Pylearn2</a></strong>: machine
learning toolbox built on top of Theano - in very early stages of
development.</li>
<li><strong><a href="http://mdp-toolkit.sourceforge.net/">MDP</a></strong> (Modular toolkit for
Data Processing): a framework that is useful when setting up
workflows.</li>
</ul>
<h3 id="java">Java</h3>
<ul>
<li><strong><a href="http://spark.apache.org/">Spark</a></strong>: Apache’s new upstart,
supposedly up to a hundred times faster than Hadoop, now includes
MLLib, which contains a good selection of machine learning
algorithms, including classification, clustering and recommendation
generation. Currently undergoing rapid development. Development can
be in Python as well as JVM languages.</li>
<li><strong><a href="https://mahout.apache.org/">Mahout</a></strong>: Apache’s machine learning
framework built on top of Hadoop, this looks promising, but comes
with all the baggage and overhead of Hadoop.</li>
<li><strong><a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a></strong>: this is a Java
based library with a graphical user interface that allows you to
run experiments on small datasets. This is great if you restrict
yourself to playing around to get a feel for what is possible with
machine learning. However, I would avoid using this in production
code at all costs: the API is very poorly designed, the algorithms
are not optimised for production use and the documentation is often
lacking.</li>
<li><strong><a href="http://mallet.cs.umass.edu/">Mallet</a></strong>: another Java based library
with an emphasis on document classification. I’m not so familiar
with this one, but if you have to use Java this is bound to be
better than Weka.</li>
<li><strong><a href="https://code.google.com/p/java-statistical-analysis-tool/">JSAT</a></strong>:
stands for “Java Statistical Analysis Tool” - created by Edward
Raff and was born out of his frustration with Weka (I know the
feeling). Looks pretty cool.</li>
</ul>
<h3 id="net">.NET</h3>
<ul>
<li><strong><a href="http://accord-framework.net/intro.html">Accord.NET</a></strong>: this
seems to be pretty comprehensive, and comes recommended by
<a href="http://www.reddit.com/user/primaryobjects">primaryobjects</a> on
Reddit. There is perhaps a slight slant towards image processing
and computer vision, as it builds on the popular library
<a href="http://www.aforgenet.com/">AForge.NET</a> for this purpose.</li>
<li>Another option is to use one of the Java libraries compiled to .NET
using <a href="http://www.ikvm.net/">IKVM</a> - I have used this approach
with success in production.</li>
</ul>
<h3 id="c">C++</h3>
<ul>
<li><strong><a href="https://github.com/JohnLangford/vowpal_wabbit">Vowpal Wabbit</a></strong>:
designed for very fast learning and released under a BSD license,
this comes recommended by
<a href="http://www.reddit.com/user/terath">terath</a> on Reddit.</li>
<li><strong><a href="http://www.multiboost.org/">MultiBoost</a></strong>: a fast C++ framework
implementing some boosting algorithms as well as some cascades
(like the Viola-Jones cascades). It’s mainly focused on AdaBoost.MH
so it is multi-class/multi-label.</li>
<li><strong><a href="http://www.shogun-toolbox.org/">Shogun</a></strong>: large machine
learning library with a focus on kernel methods and support vector
machines. Bindings to Matlab, R, Octave and Python.</li>
</ul>
<h3 id="general">General</h3>
<ul>
<li><a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/"><strong>LibSVM</strong></a> and
<a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/"><strong>LibLinear</strong></a>:
these are C libraries for support vector machines; there are also
bindings or implementations for many other languages. These are the
libraries used for support vector machine learning in Scikit-learn.</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>This article is a work in progress, so please send me your comments or
criticisms!</p>
Why I Love Scikit-learn2013-09-18T00:00:00+00:00http://daoudclarke.github.com/machine%20learning%20in%20practice/2013/09/18/why-i-love-scikit-learn
<p><em>Scikit-learn is great because it has a clean API, is robust, fast,
easy to use, comprehensive, and well documented and supported,
released under a permissive license and the developers are cool. If
you can implement your project in Python and you don’t need massively
scalable algorithms, then it is probably for you.</em></p>
<hr />
<p>Choosing a library is often a crucial task. In the case of machine
learning, it is likely that the library you choose will form the core
of your project, and your choice will impact on many other decisions
you will make when building your software. If you choose the wrong
library, you may spend weeks wrapping a poorly designed API,
inspecting source code to understand undocumented features and working
around bugs and limitations. If you get it right, you will be able to
write clean, bug free code with a minimum of effort.</p>
<p>I have seen this effect first hand. In this article, I want to talk
about my favourite machine learning library, Scikit-learn, and why I
think it is currently one of the best libraries around for doing
machine learning, both for academic work and in production.</p>
<!-- Scikit-learn is a python library -->
<h2 id="1-clean-api">1. Clean API</h2>
<p>The importance of a clean API cannot be overstated. It is much easier
to write clean code if the underlying API is cleanly designed. Your
code will have to conform to the vision of the library writer, and
they can force you to write convoluted code if they want to. Complex
design may sometimes be justified by increased generality, but if it
is hard to implement the common use cases, then the API is poorly
designed.</p>
<p>The objects provided by the library are forced upon you, and they will
litter your code. Well designed objects will lead to terse, readable
code, while poorly designed objects will have you scratching your head
six months down the line trying to remember how the code you wrote
works.</p>
<p>You may be tempted to take a machine learning library that has a poor
API but more algorithms and wrap it in a clean API, but beware!
Creating a good wrapper for a library is no mean feat. Doing machine
learning properly requires a variety of tools that will need to be
wrapped, and you may find that it’s not worth the overhead (I learnt
this lesson the hard way). In addition, a library with a poor API is
likely to be lacking in other important qualities such as robustness
and good documentation.</p>
<h2 id="2-robust">2. Robust</h2>
<p>If you are planning to use a machine learning library in production
code, then robustness will be a high priority. One of the differences
between Scikit-learn and other machine learning libraries is that the
authors are explicitly targeting not just academic use, but use in
industry as well. They have concentrated on doing a few things really
well, rather than trying to do everything.</p>
<p>Scikit-learn is unit tested, with around 80% unit test coverage,
giving us confidence that old features will not break as new ones are
implemented and bugs are fixed.</p>
<p>UPDATE: <a href="http://www.reddit.com/r/MachineLearning/comments/1mq8fb/why_i_love_scikitlearn/">Edward Raff noted on
/r/MachineLearning</a>
that his experience with SciKit-learn hasn’t been so rosy when the
datasets are large or poorly behaved, so your mileage may vary…</p>
<!-- In my experience, upgrading -->
<!-- Scikit-learn has occasionally broken my code -->
<h2 id="3-fast">3. Fast</h2>
<p>If speed is important to you, Scikit-learn is fast. Despite being
implemented in an interpreted language, Python, its foundations are
the compiled libraries NumPy and SciPy, and in addition, the authors
have implemented a lot of tools in Cython, which compiles to C,
giving blazing fast Python-like code.</p>
<p>The authors have also built on top of existing machine learning
libraries, such as LibLinear and LibSVM for support vector machines;
they didn’t stop there, however, optimising the algorithms to make
them even faster.</p>
<h2 id="4-easy-to-use">4. Easy to Use</h2>
<p>Being a fan of the Python language, I am undoubtedly a little biased,
however, it is arguably one of the easier languages to learn and
use. The Scikit-learn team have followed Python conventions as much as
possible, which makes using it a joy if you know Python. There are
several methods which Scikit-learn classes can implement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fit
transform
fit_transform
predict
decision_function
</code></pre></div></div>
<p>Each type of object will implement a subset of these, and duck typing
determines which objects are appropriate in each circumstance. For
example, classifiers are expected to implement the <code class="language-plaintext highlighter-rouge">fit</code> and <code class="language-plaintext highlighter-rouge">predict</code>
methods.</p>
<p>Here’s an example from the documentation for the Multinomial Naive
Bayes classifier:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> import numpy as np
>>> X = np.random.randint(5, size=(6, 100))
>>> Y = np.array([1, 2, 3, 4, 5, 6])
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> clf.fit(X, Y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> print(clf.predict(X[2]))
[3]
</code></pre></div></div>
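<p>A transformer, by contrast, implements <code>fit</code> and <code>transform</code> (or <code>fit_transform</code> to do both at once). For example, with <code>CountVectorizer</code>:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # learn the vocabulary and vectorise in one go
# X is a sparse document-term count matrix
```

<p>Because both kinds of object share this small vocabulary of methods, they compose naturally, for instance in a pipeline that vectorises text and then classifies it.</p>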
<!-- ## 4. Comprehensive -->
<!-- Machine learning requires a variety of tools for different situations -->
<!-- and purposes, for example, feature extraction, feature selection, -->
<!-- dimensionality reduction, classification and clustering. Scikit-learn -->
<!-- provides most of these tools, while remaining strictly a -->
<!-- general-purpose machine learning library. -->
<h2 id="5-well-documented">5. Well Documented</h2>
<p>I have found the
<a href="http://scikit-learn.org/stable/documentation.html">Scikit-learn documentation</a>
to be comprehensive, readable, and easy to understand. When doing
something new with Scikit-learn, I have quickly been able to get to
get to grips with how to do it after a quick peruse of the
documentation, either using Python’s <code class="language-plaintext highlighter-rouge">help()</code> function, or the
excellent online documentation, which includes tutorials as well as
documenting the API.</p>
<p>Of course, it also helps that the API is well designed: a lot of the
time you can guess the correct usage of a new class once you get to
know a few of the classes.</p>
<p>Only occasionally have I had to fall back to reading the source to
understand a feature (or, more often, a bug in my own code). Since the
code is mainly fairly clean Python, even this is not much of a chore.</p>
<h2 id="6-permissive-license">6. Permissive License</h2>
<p>Scikit-learn is released under the liberal
<a href="http://opensource.org/licenses/BSD-3-Clause">BSD License</a> so you can
use it freely in commercial applications.</p>
<h2 id="7-well-supported">7. Well Supported</h2>
<p>Scikit-learn must be one of the most actively developed open source
machine learning projects. Check out the
<a href="https://github.com/scikit-learn/scikit-learn/pulse/monthly">github stats for the last month</a>:
at the time of writing, there were 734 commits by 42 authors.</p>
<h2 id="and-the-downsides">…And the Downsides</h2>
<p>As well as the benefits of being implemented in a dynamic language,
you also get the downsides: refactoring is potentially tedious, and
because there’s no static typing, it is easy to break something
without realising it, which is where good unit test coverage becomes
crucial.</p>
<h2 id="alternatives">Alternatives</h2>
<p>Unfortunately, you can’t always have the best. There are numerous
factors to bear in mind when choosing a library that may impact your
decision on what to use:</p>
<ul>
<li><strong>Language</strong>: if you have to integrate your machine learning
functionality with legacy code, then this may restrict your choice
of language, although it is often possible to avoid this by using a
service oriented architecture. Alternatively, you may have to stick
to a particular language because of company policy, or because
the developers in your team don’t want to abandon their favourite
language for something new.</li>
<li><strong>Performance</strong>: for many applications, performance is critical, but
if it is not, then this gives you more freedom in which machine
learning tools you can use.</li>
<li><strong>Scalability</strong>: if you need something that is massively scalable
(which in my opinion is fairly rare), then you might want to
consider something like <a href="http://mahout.apache.org/">Mahout</a> which
is not as comprehensive as Scikit-learn, but is scalable to very
large datasets as it is implemented on top of Hadoop.</li>
</ul>
<p>You may want to consider <a href="/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries">some of these alternatives</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Choose your library carefully</li>
<li>Scikit-learn is robust, with a clean API, and fast implementation</li>
<li>It may not suit every application</li>
</ul>