agusti
@bleu.eth
if we need 100 privy wallets, do privy's investors think i'm 100 different users?
3 replies
1 recast
17 reactions

with HR
@wtfken
pls explain to me like i’m 5
1 reply
0 recast
1 reaction

agusti
@bleu.eth
idk if this will help:

ELI-5 version (super simple)
Imagine you have a magic LEGO machine.
1. First level – You tell the machine, "Build me a LEGO car."
2. Second level – Instead, you say, "Build me another LEGO machine that can build cars for me."
3. Third level – You go further: "Build me a LEGO machine that can build other LEGO machines that build cars."
Each time you add another level, you're asking the machine to design a designer, not the car itself. Keep nesting that idea and you get "the system that designs the system that designs the system …". It's just layers of builders-of-builders.
⸻
How a grown-up might phrase the same idea
1. Self-reference & recursion – A procedure that takes itself (or another procedure) as its main input.
2. Meta-design – Creating a process whose output is itself another design process.
3. Practical example – A compiler that generates a compiler (bootstrapping), or an AI that writes code for improving the AI that will write the next version, and so on.
⸻
Why it matters (testable hypothesis)
Hypothesis: Adding extra meta-levels yields faster innovation, because each layer automates part of the next.
Failure mode: Complexity grows faster than the benefit: too many layers → no one can debug the stack.
Test: Measure time-to-new-feature vs. number of meta-levels in real projects (e.g., compiler bootstraps, AutoML pipelines).
⸻
Alternative framing
• Russian-doll workflow: Each doll holds instructions for building the next, smaller doll until you reach the final product.
• Factory of factories: Instead of building cars, you build a factory that can build factories that can build cars.
(Both pictures carry the same core idea: indirection stacked on indirection.)
⸻
Key trade-off:
• Pro: Big leverage: a tiny change high up can ripple down and improve many things.
• Con: Harder to reason about and test; one hidden bug can propagate through every layer.
That's the whole "system of systems" story: whether told with magic LEGOs or nested factories, it's all about builders that build builders.
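To make the "builders that build builders" idea concrete, here is a minimal Python sketch. The names (make_car, make_car_factory, make_factory_factory) are made up for illustration only; the point is just that each level returns the level below it.

```python
# A toy sketch of "builders that build builders" (all names are illustrative).
# Level 0: a product. Level 1: a factory that makes products.
# Level 2: a meta-factory that makes factories.

def make_car(color: str) -> dict:
    """Level 0: the product itself."""
    return {"kind": "car", "color": color}

def make_car_factory(default_color: str):
    """Level 1: builds a builder of cars."""
    def factory() -> dict:
        return make_car(default_color)
    return factory

def make_factory_factory(product_builder):
    """Level 2: builds a builder of factories, for any product builder."""
    def factory_builder(setting):
        def factory():
            return product_builder(setting)
        return factory
    return factory_builder

# Usage: a tiny change high up (one argument) changes every downstream product,
# at the cost of extra indirection at every level.
red_factory = make_car_factory("red")
print(red_factory())                      # {'kind': 'car', 'color': 'red'}

meta = make_factory_factory(make_car)     # factory-of-factories
blue_factory = meta("blue")
print(blue_factory())                     # {'kind': 'car', 'color': 'blue'}
```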
2 replies
0 recast
0 reaction

agusti
@bleu.eth
part 2: Why LLMs drift into loops: a layered view

1. Token-level dynamics
Mechanism: Autoregressive decoding feeds yesterday's output into today's input. If the model assigns very high probability to a word it just produced, the soft-max gets "sharpened" → the same word becomes even more likely at the next step (positive feedback).
Evidence: Sentence-level probabilities rise with every repetition; see the quantitative analysis and the "self-reinforcement effect" in the DITTO experiments.
Failure modes & trade-offs: Loop lock-in: once entropy drops below a threshold the model cannot escape without external noise (temperature ↑, nucleus sampling, etc.). Trade-off: high entropy gives diversity but risks incoherence.

2. Decoding algorithms
Mechanism: Greedy search / low-temperature top-k truncate the tail of the distribution. Diversity dies, leaving a narrow set of "safe" tokens that cycle.
Evidence: Classic degeneration demo in The Curious Case of Neural Text Degeneration.
Failure modes & trade-offs: Beam width ↑ reduces loops but boosts verbatim training-set copying; nucleus sampling ↓ loops but ↑ hallucinations.

3. Model architecture
Mechanism: Specific attention heads and MLP neurons copy the previous token (mechanistic "repetition features"). Activating them induces the Repeat Curse; silencing them restores diversity.
Evidence: Sparse-autoencoder intervention study ("Duplicatus Charm") shows causal features.
Failure modes & trade-offs: Fine-grained surgery is brittle; patching one head can just move the loop elsewhere.

4. Training mismatch (exposure bias)
Mechanism: During training the model always sees gold prefixes; at inference it must condition on its own tokens. Small early errors snowball into low-entropy contexts that favour loops.
Evidence: Scheduled sampling & the follow-up EMNLP study on exposure bias.
Failure modes & trade-offs: Mitigations (scheduled sampling, RLHF) can hurt perplexity or inject other artifacts.

5. Generation-of-training-data (model collapse)
Mechanism: If synthetic text from earlier LLMs pollutes the next generation's corpus, tails of the distribution vanish. Each retrain tightens the loop until only bland, high-frequency patterns remain.
Evidence: Nature study on "model collapse" with recursively generated data.
Failure modes & trade-offs: Watermark filtering & human-only data slow but don't eliminate tail loss; economic cost ↑.

6. Uncertainty fallback hierarchy
Mechanism: Under high uncertainty, models regress: hallucination → degenerate paraphrase → verbatim repetition. Repetition is the "lowest-energy" fallback.
Evidence: Controlled experiments mapping this ladder ("Loops → Oops").
Failure modes & trade-offs: Decoding tricks that suppress repetition can simply shift the model up the ladder to hallucinations.

⸻
Testable hypotheses & quick experiments
1. Entropy threshold hypothesis
Claim: When conditional entropy H_t drops below ≈1 bit, the probability of a repetition loop rises sharply.
Test: Measure H_t during generation; inject controlled noise to push H_t above/below the threshold and record loop frequency.
2. Repetition-feature causality
Claim: Deactivating the top-k repetition neurons cuts loop rate by >50 % without hurting BLEU on non-repeating tasks.
Test: Use SAE masking as in the Repeat-Curse paper; run ablation on WikiText-103 and evaluate the loop-length distribution.
3. Model-collapse curve
Claim: Training on ≥30 % synthetic tokens halves tail coverage within two successive generations.
Test: Re-train a lightweight Llama derivative on mixed human/synthetic corpora; compute kurtosis of the token-frequency spectrum across generations.
(All three are small-scale GPU experiments an engineer can run in a week.)
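A minimal sketch of the instrumentation for hypothesis 1, assuming you can read the next-token distribution at each decoding step from your own decoder; the function names (conditional_entropy_bits, has_repeating_ngram, loop_risk_trace) and the 1-bit default are illustrative, not any library's API.

```python
# Sketch: log per-step conditional entropy and flag when a repetition loop starts.
import numpy as np

def conditional_entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy (in bits) of one next-token distribution."""
    p = probs[probs > 0]
    return float(-(p * np.log2(p)).sum())

def has_repeating_ngram(tokens: list[int], n: int = 4) -> bool:
    """True if the most recent n-gram already appeared earlier in the sequence."""
    if len(tokens) < 2 * n:
        return False
    tail = tuple(tokens[-n:])
    earlier = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return tail in earlier

def loop_risk_trace(step_probs, tokens, threshold_bits: float = 1.0):
    """Per-step record of entropy, the below-threshold flag, and loop onset."""
    trace = []
    for t, probs in enumerate(step_probs):
        h = conditional_entropy_bits(np.asarray(probs))
        trace.append({
            "step": t,
            "entropy_bits": h,
            "below_threshold": h < threshold_bits,
            "in_loop": has_repeating_ngram(tokens[: t + 1]),
        })
    return trace
```

Correlating the first below_threshold step with the first in_loop step over many generations is the cheapest version of the threshold test described above.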
⸻
Mitigation toolbox (with trade-offs)
• Temperature ↑ / nucleus-p sampling – Idea: re-inject entropy into the distribution to escape local attractors. Side-effects: can drift off-topic; must tune p per task.
• Repetition penalty / DITTO training – Idea: down-weight logits if n-grams were seen in the recent context. Side-effects: over-penalisation → forced thesaurus-like paraphrase.
• Dynamic top-k – Idea: increase k when entropy falls; shrink it when entropy rises. Side-effects: adds latency; still heuristic.
• Feature surgery – Idea: identify and damp repetition heads/neurons. Side-effects: requires model access; brittle across checkpoints.
• RLHF with diversity reward – Idea: directly optimise for non-repetition. Side-effects: designer must balance fluency vs. novelty; risk of adversarial exploits.
• Data hygiene (human tails) – Idea: filter out synthetic text; add long-tail human corpora. Side-effects: expensive; curation overhead.
⸻
Alternative framing
• Dynamical-systems view: The decoder is a high-dimensional nonlinear system whose state is the hidden vector. Repetition loops are attractor cycles; decoding heuristics change the basin of attraction.
• Information-theoretic view: Repetition is the entropy-minimising default when mutual information between the new token and the context approaches zero.
⸻
Hidden assumptions & uncertainties
• The entropy threshold likely varies with architecture size and training mix (uncertain, needs measurement).
• Mechanistic-feature findings may not transfer across model families (e.g., Transformer-XL vs. Mamba).
• Synthetic-data collapse assumes indiscriminate scraping; curated synthetic data could behave differently (provisional claim).
⸻
Bottom line: Repeating patterns are not a glitch but an emergent property of (i) maximum-likelihood training, (ii) low-entropy decoding, and (iii) feedback from the model's own outputs. Each mitigation shifts, rather than erases, the attractor; choosing the right trade-off depends on whether your project fears loops, hallucinations, or cost spikes more.
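A numpy sketch of two rows from the toolbox: a context-based repetition penalty and an entropy-adaptive nucleus cutoff. Function names, the penalty value, and the p_low/p_high thresholds are illustrative assumptions, not the API of any decoding library.

```python
# Sketch: two loop mitigations applied to a raw logits vector before sampling.
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, recent_tokens: list[int],
                             penalty: float = 1.3) -> np.ndarray:
    """Down-weight logits of tokens already seen in the recent context."""
    out = logits.copy()
    for tok in set(recent_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def entropy_adaptive_top_p(logits: np.ndarray, p_low: float = 0.8,
                           p_high: float = 0.98) -> np.ndarray:
    """Widen the nucleus when the distribution is too peaked (low entropy)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy_bits = -(probs * np.log2(probs + 1e-12)).sum()
    # Low entropy -> risk of loop lock-in -> keep a larger nucleus for diversity.
    p = p_high if entropy_bits < 1.0 else p_low
    order = np.argsort(probs)[::-1]
    keep = np.cumsum(probs[order]) <= p
    keep[0] = True                        # always keep the top token
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[keep]] = True
    probs = np.where(mask, probs, 0.0)
    return probs / probs.sum()            # renormalised sampling distribution
```

Both transformations illustrate the trade-offs listed above: the penalty can push the model into forced paraphrase if set too high, and widening the nucleus trades loop risk for drift.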
1 reply
0 recast
0 reaction

agusti
@bleu.eth
"Design-the-designer" stack ↔ LLM-loop stack
(Each rung "designs the system below it"; peel it like an onion.)

L0 – …the text
Concrete LLM analogue: the final token stream.
How repetition emerges: loops = identical n-grams.
Test / falsifier: compute conditional entropy H_t; loop length spikes once H_t < 1 bit (threshold hypothesis).

L1 – …the token chooser
Concrete LLM analogue: decoding algorithm (greedy, top-k, nucleus-p).
How repetition emerges: low-entropy heuristics narrow choices until only the previous token is left (degeneration).
Test / falsifier: vary p dynamically; verify the loop rate drops without BLEU loss.

L2 – …the chooser factory
Concrete LLM analogue: forward pass of the neural network (attention heads, MLPs).
How repetition emerges: specific "repeat heads" copy prior-token logits.
Test / falsifier: mask heads ↔ measure the cut in loop rate (feature-ablation test).

L3 – …the factory blueprint
Concrete LLM analogue: model weights & architecture.
How repetition emerges: maximum-likelihood training over-rewards high-frequency tokens.
Test / falsifier: retrain with a token-level anti-repeat penalty; compare perplexity vs. loops.

L4 – …the blueprint generator
Concrete LLM analogue: training protocol (loss, hyper-params, RLHF).
How repetition emerges: exposure bias: the model never conditions on its own tokens during training.
Test / falsifier: scheduled sampling vs. teacher forcing; evaluate the degradation curve.

L5 – …the data curator
Concrete LLM analogue: corpus assembly (human vs. synthetic mix).
How repetition emerges: "model collapse": self-generated text cannibalises the tails, shrinking diversity.
Test / falsifier: incrementally raise the synthetic-token share; plot the tail-kurtosis fall-off.

L6 – …the economics & governance layer
Concrete LLM analogue: org incentives, cost ceilings, policy.
How repetition emerges: pressure for cheaper data/compute ⇒ shortcuts across L4–L5, accelerating collapse.
Test / falsifier: track budget cuts vs. diversity metrics across model generations.

⸻
Tying back to the original post
"Design the system that designs the system …" ⇒ each level above literally builds or configures the next one down.
• Meta-design leverage: A tiny tweak at L5 (data policy) reshapes everything beneath, exactly the "LEGO machine that builds LEGO machines" metaphor.
• Degenerative attractors: Positive feedback at L1–L2 forms repetitions; feedback at L5–L6 forms model collapse, a macroscopic analogue of the same phenomenon.
⸻
Hidden assumptions & uncertainties
• The entropy threshold (L0) likely shifts with model scale; unmeasured for >70 B parameters (uncertain).
• The causal importance of repeat heads (L2) is shown for the GPT-2 class; unclear for state-space models (provisional).
• Model-collapse severity (L5) depends on the quality of synthetic text, not only the quantity.
⸻
Alternative framing
1. Control-theory: L0–L2 form a fast inner loop, L3–L5 a slow outer loop. Repetition ≈ limit cycle of the inner loop; model collapse ≈ long-term drift of the outer loop.
2. Thermodynamic: Each layer dissipates uncertainty; repetition is the entropy floor. Raising temperature or adding noise resets the floor but costs energy/compute.
Use whichever framing makes failure modes easier to instrument in your pipeline.
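A toy simulation of the L5 test (raise the synthetic-token share, watch the tail disappear), shrunk to a unigram "model" so it runs in seconds. The vocabulary size, token counts, synthetic_share knob, and tail_coverage metric are all illustrative assumptions; real collapse dynamics depend on the full model, not a unigram refit.

```python
# Toy model-collapse loop: refit a unigram distribution each "generation"
# on a human/synthetic mix and track how much of the human tail survives.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5_000

def sample_corpus(probs: np.ndarray, n_tokens: int) -> np.ndarray:
    """Draw a corpus of token ids from a unigram distribution."""
    return rng.choice(VOCAB, size=n_tokens, p=probs)

def refit_unigram(corpus: np.ndarray) -> np.ndarray:
    """Maximum-likelihood refit with no smoothing: unseen tokens drop to zero."""
    counts = np.bincount(corpus, minlength=VOCAB).astype(float)
    return counts / counts.sum()

def tail_coverage(model: np.ndarray, human: np.ndarray, q: float = 0.5) -> float:
    """Fraction of the rarest-q human tokens the model still assigns mass to."""
    tail_ids = np.argsort(human)[: int(q * VOCAB)]
    return float((model[tail_ids] > 0).mean())

human = rng.dirichlet(np.full(VOCAB, 0.1))   # heavy-tailed "human" distribution
model = human.copy()
synthetic_share = 0.5                        # the knob from the L5 test
for gen in range(1, 6):
    mix = np.concatenate([
        sample_corpus(model, int(200_000 * synthetic_share)),
        sample_corpus(human, int(200_000 * (1 - synthetic_share))),
    ])
    model = refit_unigram(mix)
    print(f"gen {gen}: tail coverage = {tail_coverage(model, human):.3f}")
```

Sweeping synthetic_share and plotting the coverage curves per generation gives the fall-off described in the L5 row.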
0 reply
0 recast
0 reaction