📊 It's Thurs-Fri-day [citation needed], time for a (bi)weekly #quiz! It's been a while, so let's take it slowly... lest we make mistakes. Perfect segue: today, Online #Learning Mistake Bounds!

What?

Well, you want to learn a classifier, here a function f: {0,1}ⁿ→{0,1}...

1/ #weaeklyquiz

... and you observe a sequence of inputs, in arbitrary, possibly adversarial order (no distributional assumption, nothing!). At time t, you're given xₜ in {0,1}ⁿ, need to predict f(xₜ).

If you're wrong? You lose one point (that counts as "one mistake"), but you are immediately given the true value f(xₜ).

2/

Your goal is to minimize the total number of mistakes M you will ever make, on any sequence of inputs, possibly going on forever. Again, no assumption on the distribution of the inputs, nothing!

On the bright side? You don't have computational limitations, time or memory-wise.

3/

So, the first and obvious question: can we do anything AT ALL? After all, the "challenge" xₜ∈{0,1}ⁿ at each step can be chosen adversarially based on the past, and the game goes on forever...

Q1: If f: {0,1}ⁿ→{0,1} is arbitrary, can we achieve any finite mistake bound M?

4/

No, M=∞ 🥔 (9.5%)
Yes, M=2ⁿ 🥥 (81%)
Yes! M=O(n) 🍰 (0%)
Show me the stats 🥦 (9.5%)

Fine. But what if f was not arbitrary? If I promised you the unknown classifier had some reasonable structure (not some completely horrendous, pathological Boolean function), could we do something good?

For instance, if I told you f was just an OR of some of the n variables?

5/

Let's just suppose f is a monotone disjunction (a giant OR ⋁ of up to n of the variables, without negating any of them). Should be easier? Maybe?

Q2: If f: {0,1}ⁿ→{0,1} is a monotone disjunction, can we achieve any finite mistake bound M?

6/

Yes, M=2ⁿ 🥥 (0%)
Yes! M=O(n) 🍰 (100%)
Yes! M=O(log n) 🍦 (0%)
Show me the stats 🥦 (0%)

Alright, alright. But even monotone disjunctions are a bit much: what are the odds our classifier could actually use ALL n variables, when n is big?

What if I promised you it was sparse, i.e., depended on only k << n of the variables (we just don't know which)?

7/

If f is promised to be sparse, can we guarantee we'll only ever make a very small number of mistakes, no matter what, in this online learning framework?

Q3: If f: {0,1}ⁿ→{0,1} is a k-sparse monotone disjunction, can we achieve any good mistake bound M?

8/

Yes! M=O(kn) 🍰 (7.7%)
Yes! M=O(k log n) 🍦 (69.2%)
Yes! M=O(k) 🍪 (7.7%)
Show me the stats 🥦 (15.4%)

To conclude: let's go beyond monotone disjunctions, sparse or not. Let's do... lists! Lists are good!

A decision list (DL, but not Deep Learning 👀) is a classifier of the form "if x₄=1 then true; otherwise if x₃₉=0 then false; otherwise if [...]; otherwise true."

9/

(Technically, that's a 1-DL since each "if" involves only 1 literal, but let's not nitpick). Every disjunction is a DL, but DLs are more expressive.

The length of a DL is the number of if/then clauses in it.

So, let's see...

Q4: If f: {0,1}ⁿ→{0,1} is a length-ℓ DL, can we achieve any finite mistake bound M?

10/

Yes! M=O(ℓn) 🍰 (23.1%)
Yes! M=O(ℓ log n) 🍦 (38.5%)
Yes! M=O(ℓ) 🍪 (7.7%)
Show me the stats 🥦 (30.8%)

Alright, that's all for today! Hope you enjoyed this #weaeklyquiz. As usual, answers and discussions in ≈3 days.

In the meantime, feel free to comment or ask questions below!

11/end

📊 Answers and discussion for last week's quiz on online learning: mistake bounds, and where to get them!

Link to the quiz: https://mathstodon.xyz/@ccanonne/109757821363734340

You want to learn some classifier f: {0,1}ⁿ→{0,1} given an (arbitrary) sequence of inputs, making few mistakes *in total* over this sequence, no matter how long it is (even infinite): you get an input x, you predict f(x); if you're right, good; if you're wrong, you're told you made a mistake (and given the true value). Rinse and repeat.

1/


Also, you're not guaranteed to learn f exactly! All you are asked to do is answer according to some g which agrees with f on the sequence of inputs you're given. E.g., if the sequence is *always* the same x, repeatedly... you won't learn all of the 2ⁿ values f takes on {0,1}ⁿ, but you sure can guarantee to make ≤ 1 mistake on this sequence!

2/

So, Q1: given that the sequence of queries (inputs) can be infinite and arbitrary, and so is the classifier (Boolean function) f to learn... can we guarantee anything at all on the # of mistakes we'll make?

https://mathstodon.xyz/@ccanonne/109757852367750506

As 81% of you answered: yes! 🥥 At most 2ⁿ which, admittedly, isn't GREAT, but is finite.

There are two ways to see it:

3/


First, well, given that you have no computational (time or memory) limits, you can just store all *distinct* inputs xₜ seen so far and the value f(xₜ) (which you know afterwards, mistake or not).

This means that you won't make a mistake on the same input twice. But those inputs are from {0,1}ⁿ, so there are 2ⁿ distinct choices of them at most. So M(f)≤2ⁿ mistakes (but a LOT of memory used to remember all that).

4/
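In code, this "remember everything" strategy could look like the sketch below (my own rendering, not from the thread; names are mine):

```python
# A minimal sketch of the memorization learner: store every input seen
# with its true label, so you never err twice on the same input.
def memorizing_learner():
    seen = {}                      # maps inputs (tuples in {0,1}ⁿ) to f(x)
    def predict(x):
        return seen.get(x, 0)      # arbitrary default guess on unseen inputs
    def update(x, label):
        seen[x] = label            # record the true value revealed after the round
    return predict, update
```

At most one mistake per distinct input, hence ≤ 2ⁿ mistakes in total (at the price of up to 2ⁿ memory).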

Second approach? Forget about # of possible inputs: consider the number of possible *functions.*

There are 2^(2ⁿ) possible functions (classifiers). Now, here's the claim:

If you are promised that the classifier f belongs to a function class C, then one can always achieve mistake bound M(f) ≤ log₂ |C|

Yup.🤯 And log₂ (2^(2ⁿ)) = 2ⁿ.

Now, how? With the "halving" algorithm.

5/

What's the halving algo, you ask? Simple:
- maintain a set H of "still OK" classifiers, initially H=C (recall: you're promised f∊C)
- on input x, take a vote among H: compute g(x)∊{0,1} for all g∊H, output the majority value.
- if that was a mistake? Remove all g that erred from H.

6/

Why does that work? Well, we can get our mistake bound by combining 3 facts:

* at every mistake, at least 50% of the g's from H must have been wrong (we took the majority!), so |H| is at least halved
* the "true" f will never be removed from H (it doesn't err!), so |H|≥1 always
* at first, H=C

so... we can make at most log₂ |C| mistakes!

7/

This is, of course (1) computationally inefficient (go over all g in H at every step), and (2) brittle to errors (if f is not really in C as promised, then doom).

This can be fixed: (1) randomization (don't take the majority, but instead answer according to a randomly chosen g from H), and... (2) the Multiplicative Weights Update (MWU) algorithm! See, e.g., these slides: https://unisyd-my.sharepoint.com/:p:/g/personal/clement_canonne_sydney_edu_au/EaQorLRv5g9FqJzPF2k0Z5UBr0hGGvWfoaJNmkkZGHBbSw?e=9kMTQ7

8/


Great, we're done with Q1: to summarize, we can always guarantee a mistake bound

M(f) ≤ min(|X|, log₂ |C|)

if f: X→{0,1} is promised to be in some class \(C \subseteq \{0,1\}^X\).

This may not be tight, of course. But what can you do? 🤷

9/

Onto Q2! Q2 asked about a specific class of functions C, the monotone disjunctions: f is of the form f(x) = \(\bigvee_{k∊S} x_k\), with S⊆{1,..,n}.

https://mathstodon.xyz/@ccanonne/109757866288172941

As 100% (holy cow!) of you answered, under this particular promise, we can achieve M(f) = O(n). Namely, M(f) ≤ n.

10/


Now, based on what we saw, this shouldn't be too surprising: |C|=2ⁿ (as many possible functions as subsets S), so the halving algo achieves mistake bound log₂ |C|=n.

But that's computationally inefficient.

Turns out, we can achieve the same bound w/ an efficient algo!

11/

The algorithm is very simple: keep a set of all variables (initially T = {1,...,n}). Answer according to \(g(x)= \bigvee_{k∊T} x_k\).

Any time you make a mistake (can only be saying "1" but it should be "0": can you see why?), remove from T every variable that is 1 in the input.

Clearly, that's efficient (evaluating g takes O(n) time, not exponential time as with the halving algo). But why does it work?

12/

Again, we can analyze this by combining a few simple facts.

If S corresponds to the *true* f:
- no variable from S is ever removed
- initially |T|=n
- each mistake removes at least one variable from T

This gives the bound!

13/
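This elimination learner fits in a few lines (again my own rendering, names mine):

```python
# Sketch of the elimination learner for monotone disjunctions: predict with
# the OR over a shrinking set T of candidate variables, pruning T on every
# (necessarily false-positive) mistake.
def disjunction_learner(n):
    T = set(range(n))                        # initially, all n variables
    def predict(x):
        return int(any(x[i] for i in T))
    def update(x, true_label):
        # the only possible mistake: we said 1 but the truth is 0
        if true_label == 0 and predict(x) == 1:
            T.difference_update(i for i in range(n) if x[i] == 1)
    return predict, update
```

Each mistake removes at least one variable, and no variable of the true S is ever removed (f(x)=0 forces every variable of S to be 0 in x), hence ≤ n mistakes.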

Cool. Onto Q3: what if the monotone disjunction is promised to be *sparse* (|S|≤k for some k<<n)? Can we achieve a better mistake bound than O(n)?

https://mathstodon.xyz/@ccanonne/109757871611250656

Answer: Yes! 🍾 As 69% of you answered, one can get M(f)=O(k log n). Much better, if k << n/log n!

How?

14/


Dang, looks like I broke this branch of the thread.

Here is the rest: https://mathstodon.xyz/@ccanonne/109787226544370495

Sadly, the efficient algo above doesn't give that. 🧐 But the halving algo does! 🎉 Since log₂|C| = log₂ (n choose k) = O(k log n).

Sadly, that's computationally inefficient! 🧐 But there is an EFFICIENT algo which does it! 🎉

It's the so-called WINNOW algo, due to Littlestone: https://en.wikipedia.org/wiki/Winnow_(algorithm)

💡 Interestingly, the Winnow algo is a special case of the [wait for it]... ... MWU algorithm mentioned a few toots earlier. Yes, multiplicative weight updates for the win! 🏆

15/
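For the curious, a sketch of Winnow in its standard textbook form (promotion by doubling, elimination on false positives, threshold n; this is the usual presentation, not code from the thread):

```python
def winnow(n):
    # Multiplicative weights over the n variables (Littlestone's Winnow1).
    w = [1.0] * n
    def predict(x):
        # fire iff the total weight of active variables reaches the threshold n
        return int(sum(wi for wi, xi in zip(w, x) if xi) >= n)
    def update(x, true_label):
        guess = predict(x)
        if guess == 0 and true_label == 1:       # missed a relevant variable
            for i in range(n):
                if x[i]:
                    w[i] *= 2                    # promote active variables
        elif guess == 1 and true_label == 0:     # fired on irrelevant variables
            for i in range(n):
                if x[i]:
                    w[i] = 0                     # demote (eliminate) them
    return predict, update
```

Each relevant weight can only double O(log n) times before reaching the threshold, which is where the O(k log n) mistake bound for k-sparse disjunctions comes from.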