idk if this will help:
ELI-5 version (super simple)
Imagine you have a magic LEGO machine.
1. First level – You tell the machine, “Build me a LEGO car.”
2. Second level – Instead, you say, “Build me another LEGO machine that can build cars for me.”
3. Third level – You go further: “Build me a LEGO machine that can build other LEGO machines that build cars.”
Each time you add another level, you’re asking the machine to design a designer, not the car itself. Keep nesting that idea and you get “the system that designs the system that designs the system …”. It’s just layers of builders-of-builders.
⸻
How a grown-up might phrase the same idea
1. Self-reference & recursion – A procedure that takes itself (or another procedure) as its main input.
2. Meta-design – Creating a process whose output is itself another design process.
3. Practical example – A compiler that generates a compiler (bootstrapping), or an AI that writes code to improve the AI that will write the next version, and so on (a toy sketch follows this list).
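A minimal Python sketch of the same nesting, using nothing beyond the standard library; names like make_car_factory are purely illustrative. Each level returns a function whose only job is to build the level below it.

```python
def make_car(color: str) -> dict:
    # Level 0: the product itself.
    return {"type": "car", "color": color}

def make_car_factory(default_color: str):
    # Level 1: a builder that builds cars.
    def car_factory():
        return make_car(default_color)
    return car_factory

def make_factory_factory(default_color: str):
    # Level 2: a builder that builds builders of cars.
    def meta_factory():
        return make_car_factory(default_color)
    return meta_factory

meta = make_factory_factory("red")  # designs the designer
factory = meta()                    # designs the car
print(factory())                    # {'type': 'car', 'color': 'red'}
```
Wrapping this in yet another make_... function is exactly "another meta-level": the leverage grows, and so does the depth you have to debug through.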
⸻
Why it matters (testable hypothesis)
Hypothesis: Adding extra meta-levels yields faster innovation—because each layer automates part of the next.
Failure mode: Complexity grows faster than the benefit; with too many layers, no one can debug the stack.
Test: Measure time-to-new-feature vs. number of meta-levels in real projects (e.g., compiler bootstraps, AutoML pipelines).
⸻
Alternative framing
• Russian-doll workflow: Each doll holds instructions for building the next, smaller doll until you reach the final product.
• Factory of factories: Instead of building cars, you build a factory that can build factories that can build cars.
(Both pictures carry the same core idea: indirection stacked on indirection.)
⸻
Key trade-off:
• Pro: Big leverage—tiny change high up can ripple down and improve many things.
• Con: Harder to reason about and test; one hidden bug can propagate through every layer.
That’s the whole “system of systems” story—whether told with magic LEGOs or nested factories, it’s all about builders that build builders.

part 2:
Why LLMs drift into loops — a layered view
1. Token-level dynamics
• Mechanism: Autoregressive decoding feeds yesterday’s output into today’s input. If the model assigns very high probability to a word it just produced, the softmax gets “sharpened” → the same word becomes even more likely at the next step (positive feedback).
• Evidence: Sentence-level probabilities rise with every repetition; see the quantitative analysis and the “self-reinforcement effect” in the DITTO experiments.
• Failure modes & trade-offs: Loop lock-in: once entropy drops below a threshold, the model cannot escape without external noise (temperature ↑, nucleus sampling, etc.). Trade-off: high entropy gives diversity but risks incoherence.
2. Decoding algorithms
• Mechanism: Greedy search / low-temperature top-k truncate the tail of the distribution. Diversity dies, leaving a narrow set of “safe” tokens that cycle.
• Evidence: Classic degeneration demo in The Curious Case of Neural Text Degeneration.
• Failure modes & trade-offs: Beam width ↑ reduces loops but boosts verbatim training-set copying; nucleus sampling ↓ loops but ↑ hallucinations. (A toy sketch of layers 1–2 follows this list.)
3. Model architecture
• Mechanism: Specific attention heads and MLP neurons copy the previous token (mechanistic “repetition features”). Activating them induces the Repeat Curse; silencing them restores diversity.
• Evidence: Sparse-autoencoder intervention study (“Duplicatus Charm”) shows causal features.
• Failure modes & trade-offs: Fine-grained surgery is brittle; patching one head can just move the loop elsewhere.
4. Training mismatch (exposure bias)
• Mechanism: During training the model always sees gold prefixes; at inference it must condition on its own tokens. Small early errors snowball into low-entropy contexts that favour loops.
• Evidence: Scheduled sampling and the follow-up EMNLP study on exposure bias.
• Failure modes & trade-offs: Mitigations (scheduled sampling, RLHF) can hurt perplexity or inject other artifacts.
5. Generation of training data (model collapse)
• Mechanism: If synthetic text from earlier LLMs pollutes the next generation’s corpus, the tails of the distribution vanish. Each retrain tightens the loop until only bland, high-frequency patterns remain.
• Evidence: Nature study on “model collapse” with recursively generated data.
• Failure modes & trade-offs: Watermark filtering and human-only data slow but don’t eliminate tail loss; economic cost ↑.
6. Uncertainty fallback hierarchy
• Mechanism: Under high uncertainty, models regress: hallucination → degenerate paraphrase → verbatim repetition. Repetition is the “lowest-energy” fallback.
• Evidence: Controlled experiments mapping this ladder (“Loops → Oops”).
• Failure modes & trade-offs: Decoding tricks that suppress repetition can simply shift the model up the ladder to hallucinations.
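The first two layers are easy to see in a toy simulation. The sketch below (pure NumPy; the 10-token vocabulary, boost size, and temperatures are illustrative, not measured values) adds a small bonus to the logit of each token right after it is emitted, mimicking the self-reinforcement effect: under greedy decoding the entropy collapses and the output locks into a one-token loop, while a higher sampling temperature delays or escapes the lock-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def generate(steps=30, temperature=1.0, boost=0.5, greedy=False):
    # Toy autoregressive loop over a 10-token vocabulary: emitting a token
    # adds `boost` to its own logit (positive feedback on the chosen token).
    logits = rng.normal(size=10)
    tokens, entropies = [], []
    for _ in range(steps):
        p = softmax(logits, temperature)
        entropies.append(float(-(p * np.log2(p + 1e-12)).sum()))  # bits
        tok = int(p.argmax()) if greedy else int(rng.choice(len(p), p=p))
        tokens.append(tok)
        logits[tok] += boost
    return tokens, entropies

toks_g, H_g = generate(greedy=True)
toks_s, H_s = generate(temperature=1.5)
print("greedy   tail:", toks_g[-8:], " final entropy:", round(H_g[-1], 2))
print("temp 1.5 tail:", toks_s[-8:], " final entropy:", round(H_s[-1], 2))
```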
⸻
Testable hypotheses & quick experiments
1. Entropy threshold hypothesis
Claim: When the conditional entropy H_t drops below ≈1 bit, the probability of a repetition loop rises sharply.
Test: Measure H_t during generation; inject controlled noise to push H_t above/below the threshold and record loop frequency (see the sketch after this list).
2. Repetition-feature causality
Claim: Deactivating top-k repetition neurons cuts loop rate by >50 % without hurting BLEU on non-repeating tasks.
Test: Use SAE masking as in the Repeat-Curse paper; run ablation on Wikitext-103 and evaluate loop length distribution.
3. Model-collapse curve
Claim: Training on ≥30 % synthetic tokens halves tail coverage within two successive generations.
Test: Re-train a lightweight Llama derivative on mixed human/synthetic corpora; compute kurtosis of token frequency spectrum across generations.
(All three are small-scale GPU experiments an engineer can run in a week.)
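A minimal sketch of the hypothesis-1 measurement, assuming the Hugging Face transformers library; “gpt2” is only a stand-in small model, and the 1-bit cut-off is exactly the number under test, not an established constant. It tracks the conditional entropy of the next-token distribution at every step of greedy decoding and reports when it first dips below the threshold.

```python
# Per-step conditional entropy under greedy decoding (no KV cache, so it
# recomputes the full forward pass each step; fine for a short probe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def entropy_trace(prompt: str, steps: int = 60):
    ids = tok(prompt, return_tensors="pt").input_ids
    H = []
    with torch.no_grad():
        for _ in range(steps):
            logits = model(ids).logits[0, -1]
            p = torch.softmax(logits, dim=-1)
            H.append(float(-(p * torch.log2(p.clamp_min(1e-12))).sum()))
            nxt = logits.argmax().view(1, 1)         # greedy: loop-friendly worst case
            ids = torch.cat([ids, nxt], dim=1)
    return H, tok.decode(ids[0])

H, text = entropy_trace("The system that designs the system")
onset = next((i for i, h in enumerate(H) if h < 1.0), None)  # the ≈1-bit threshold
print("first step with H_t < 1 bit:", onset)
print(text[-120:])
```
The noise-injection arm of the test would re-run the same loop with temperature sampling and compare loop frequency above and below the threshold.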
⸻
Mitigation toolbox (with trade-offs)
• Temperature ↑ / nucleus-p sampling. Idea: re-inject entropy into the distribution to escape local attractors. Side-effects: can drift off-topic; p must be tuned per task.
• Repetition penalty / DITTO training. Idea: down-weight logits for n-grams already seen in the recent context. Side-effects: over-penalisation forces thesaurus-like paraphrase.
• Dynamic top-k. Idea: increase k when entropy falls; shrink it when entropy rises. Side-effects: adds latency; still a heuristic. (Both this and the repetition penalty are sketched after this list.)
• Feature surgery. Idea: identify and damp repetition heads/neurons. Side-effects: requires model access; brittle across checkpoints.
• RLHF with a diversity reward. Idea: directly optimise for non-repetition. Side-effects: the designer must balance fluency vs. novelty; risk of adversarial exploits.
• Data hygiene (human tails). Idea: filter out synthetic text; add long-tail human corpora. Side-effects: expensive; curation overhead.
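Two of the rows above (repetition penalty and dynamic top-k) are pure logit post-processing, so they are cheap to prototype. A minimal NumPy sketch; the penalty value, k range, and 1-bit entropy floor are illustrative defaults, not tuned settings.

```python
import numpy as np

def apply_repetition_penalty(logits, recent_tokens, penalty=1.2):
    # Down-weight tokens already seen in the recent context (CTRL-style rule:
    # divide positive logits, multiply negative ones). Too large a penalty
    # forces awkward paraphrase instead of natural repetition.
    out = logits.copy()
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def dynamic_top_k(logits, k_low=5, k_high=50, entropy_floor=1.0):
    # Widen the candidate set when the distribution is too peaked:
    # below `entropy_floor` bits, keep k_high tokens instead of k_low.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    H = float(-(p * np.log2(p + 1e-12)).sum())
    k = k_high if H < entropy_floor else k_low
    keep = np.argsort(logits)[-k:]              # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    return masked, H

logits = np.random.default_rng(1).normal(size=100)
masked, H = dynamic_top_k(apply_repetition_penalty(logits, recent_tokens=[3, 7]))
print(f"entropy {H:.2f} bits, {int(np.isfinite(masked).sum())} candidates kept")
```
Both functions operate on the raw logits before sampling, so they compose with whatever decoder loop is already in place.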
⸻
Alternative framing
• Dynamical-systems view: The decoder is a high-dimensional nonlinear system whose state is the hidden vector. Repetition loops are attractor cycles; decoding heuristics change the basin of attraction. (A minimal cycle-detector sketch follows this list.)
• Information-theoretic view: Repetition is the entropy-minimising default when mutual information between new token and context approaches zero.
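Under the dynamical-systems framing, the simplest observable diagnostic is to watch the token stream itself for attractor cycles. A small sketch; the maximum period and repeat count are illustrative choices.

```python
def detect_cycle(tokens, max_period=8, min_repeats=3):
    # Flag an attractor cycle: the last `period` tokens repeat at least
    # `min_repeats` times at the tail of the sequence.
    n = len(tokens)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if n < needed:
            continue
        tail = tokens[-needed:]
        if all(tail[i] == tail[i % period] for i in range(needed)):
            return period, n - needed   # cycle length and where the tail starts
    return None

print(detect_cycle([4, 1, 2, 3, 1, 2, 3, 1, 2, 3]))  # (3, 1): period-3 cycle
```
A detector like this could be the trigger for the entropy re-injection tricks in the mitigation toolbox above.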
⸻
Hidden assumptions & uncertainties
• The entropy threshold likely varies by architecture size and training mix (uncertain, needs measurement).
• Mechanistic-feature findings may not transfer across families (e.g., Transformer-XL vs. Mamba).
• Synthetic-data collapse assumes indiscriminate scraping; curated synthetic data could behave differently (provisional claim).
⸻
Bottom line: Repeating patterns are not a glitch but an emergent property of (i) maximum-likelihood training, (ii) low-entropy decoding, and (iii) feedback from the model’s own outputs. Each mitigation shifts, rather than erases, the attractor; choosing the right trade-off depends on whether your project fears loops, hallucinations, or cost spikes more.