Hacker News | alexpovel's comments

Agree!

For fun, I put together a GitHub bot for this purpose a while ago. It indexes all existing issues in a repo, creates embeddings, and stores them in a vector DB. When a new issue is created, the bot comments with the three most similar existing issues, ranked by vector similarity.

In theory, that empowers users to close their own issues without maintainer intervention, provided an existing & solved issue covers their use case. In practice, the project never made it past PoC.

The mechanism works okay, but I've found available (cheap) embedding models to not be powerful enough. For GitHub, technology-wise, it should be easy to implement though.

https://github.com/alexpovel/issuedigger
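The core loop is tiny. Here's a toy sketch of the idea, where a bag-of-words vector stands in for a real embedding model and a plain dict stands in for the vector DB (all names and issue texts below are made up):

```python
import math
from collections import Counter

# Stand-in for a real embedding model: bag-of-words over a shared vocabulary.
# The actual bot calls an embedding model and stores vectors in a vector DB.
def embed(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

issues = {
    101: "crash when opening large file",
    102: "feature request: dark mode",
    103: "typo in readme",
}
new_issue = "app crash on large file"

vocab = sorted({w for t in [new_issue, *issues.values()] for w in t.lower().split()})
query = embed(new_issue, vocab)
# Rank existing issues by similarity to the new one, most similar first.
ranked = sorted(issues, key=lambda i: cosine(query, embed(issues[i], vocab)), reverse=True)
print(ranked)
```

The hard part in practice is exactly what's described above: with cheap embedding models, the similarity scores of related and unrelated issues aren't separated cleanly enough to pick a useful threshold.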


We made a similar thing for our community Discord: you can add an emoji reaction to a message and it will look for similar issues with a simple RAG setup. That saves us so much time when a user asks whether a feature is planned. In the response, we also ask them to upvote the existing issue or create a new one.

Not open source right now but if people are interested I could clean up the code.


Microsoft seems to use a similar bot themselves; not sure what it's called or whether it's OSS: https://github.com/microsoft/winget-cli/issues/4765#issuecom...


Oh yeah, that looks super similar. I remember the similarity score being tricky to get useful signal out of, for the underlying model I had used back then. Similar and dissimilar issues all hovered around the 0.80 mark. But surely not hard to improve on, with larger models and possibly higher-dimension vectors.


If only Microsoft was interested in finding actual useful use-cases for their machine learning tech instead of constantly selling everyone on their chat bot...


Not the OP, but you raise good points. Performance might also be a concern, thinking of languages like Python and its ast package (not sure that’s accessible without going through the interpreter).

For a tool I'm writing, the tree-sitter query language is a core piece of the puzzle as well. Once you have only JSON of the concrete syntax trees, you're back to implementing a query language yourself. Not that OP needs it, but ast-grep might?
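As a rough sketch of what such a nested, syntax-aware query looks like ("docstrings inside class definitions" matching a pattern), Python's stdlib ast module can express it directly; all names below are made up for illustration:

```python
import ast
import re

source = '''
class Database:
    """Talks to the database backend."""

    def connect(self):
        """A method docstring, not a class docstring."""


def helper():
    """Mentions database, but is not inside a class."""
'''

# A nested, syntax-aware "grep": only class docstrings matching the pattern.
pattern = re.compile(r"database")
matches = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.ClassDef):
        doc = ast.get_docstring(node)  # first statement of the class body, if a string
        if doc and pattern.search(doc):
            matches.append((node.name, doc))

print(matches)
```

This does go through the Python interpreter, which is exactly the performance concern raised above; tree-sitter sidesteps that with a standalone C parser.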


Yes, ast-grep already has its own rule syntax [1]. But parser performance and binary size are still critical factors for a general tool.

https://ast-grep.github.io/guide/rule-config.html


These sorts of cases are why I wrote srgn [0]. It's based on tree-sitter too. Calling it as

     cat file.py | srgn --py def --py identifiers 'database' 'db'
will replace all mentions of `database` inside identifiers inside (only!) function definitions (`def`) with `db`.

An input like

    import database
    import pytest


    @pytest.fixture()
    def test_a(database):
        return database


    def test_b(database):
        return database


    database = "database"


    class database:
        pass

is turned into

    import database
    import pytest


    @pytest.fixture()
    def test_a(db):
        return db


    def test_b(db):
        return db


    database = "database"


    class database:
        pass

which seems roughly like what the author is after. Mentions of "database" outside function definitions are not modified. That sort of logic I always found hard to replicate in basic GNU-like tools. If run without stdin, the above command runs recursively, in-place (careful with that one!).

Note: I just wrote this, and version 0.13.2 is required for the above to work.

[0]: https://github.com/alexpovel/srgn


This is super cool! I wish I'd known about this.


How does this compare to https://github.com/ast-grep/ast-grep


Regarding your third point, I put together a tool capable of that to some degree.

It allows you to grep inside source code, but limit the search to e.g. “only docstrings inside class definitions”, among other things. That is, it allows nesting and is syntax aware. That example is for Python, but the tool speaks more languages (thanks to treesitter).

https://github.com/alexpovel/srgn/blob/main/README.md#multip...


The tool you are describing is what I am trying to build at https://github.com/alexpovel/srgn . The idea is a compromise between regex (think ripgrep) and grammar awareness (through tree-sitter).


https://github.com/alexpovel/srgn

It grew out of a niche, almost historical need: using a QWERTY keyboard, but needing access to German Umlauts (ä, ö, ü, as well as ß). Switching keyboard layouts is possible but exhausting (it's much more pleasant sticking to one); using modifier keys is similarly tedious, and custom setups break and aren't portable.

So this tool can do:

    $ echo 'Gruess Gott, Poeten und Abenteuergruetze!' | srgn --german
    Grüß Gott, Poeten und Abenteuergrütze!
meaning it not only replaces Umlauts and Eszett, it also knows when not to (Poeten), and handles arbitrary compound words. Write your text, slap it all into the tool, it spits out results instantly. The original text can use alternative spellings (ou, ae, ue, ss), which is ergonomic. Combined with tools like AutohotKey, GUI integration through a single keyboard shortcut is possible. See [0] for a similar example.

A niche need I haven't yet come across anyone else having! (Just the amount of text it takes to explain what it's all about says a lot about how specific it is...)

The tool now grew into a tree-sitter based (== language grammar-aware) text manipulation thing, mostly for fun. The bizarre German core is still there however.

[0]: https://github.com/alexpovel/betterletter/blob/c19245bf90589...


I also use a QWERTY keyboard and I use a custom keyboard layout that maps alt-a to ä, alt-u to ü, alt-o to ö, alt-s to ß (plus the same for uppercase for the first 3). That works well for me without the need to post-process.

On macOS it's relatively easy to create using a tool called Ukelele (https://software.sil.org/ukelele/). You can also download my layout here: https://alex.kirk.at/USUmlaut.keylayout


On macOS you can also access related symbols by long-pressing keys. Obviously for large blocks of text you're still going to want to switch layouts but for a quick IM reply or typing a couple of characters I think it's faster.


The faster way is to press Option + U and then the letter u for ü or a for ä. Have been using a US keyboard since before this long press feature came over from iOS and it’s way faster that way.


I just use "US international PC" which allows using ä etc. by typing " and then combining it with a, u etc.


You can do the same on Windows with an AutoHotkey solution


I’ve done the same exact thing but for Latvian diacriticals (āčēģīķļņšūž).

The default behaviour of the Latvian layout in macOS is to make apostrophe a dead key, which really grinds my gears. So I made it alt+whatever letter instead. As a dev I use apostrophe way too much to be okay with typing '+space for it.


Would you mind sharing it?


On different flavors of *nix you have a "compose key" that helps with that. In my sway config I have

    xkb_options grp:win_space_toggle,compose:caps
and then I can do compose ss for ß or compose SS for ẞ.

https://en.wikipedia.org/wiki/Compose_key


WinCompose nicely covers the lack of a *nix compose key on Windows.

https://github.com/samhocevar/wincompose


Now that is a tool I never knew I needed, great idea! I get by with the compose key or just googling the Umlaut if I have not set it up.

I guess I have just nerdsniped myself into building a script/hack that can automatically edit the clipboard and/or selected text on button press. There goes my weekend, but thank you for building this part!


> custom setups break and aren't portable

But this is a custom setup that's not portable?

And how is it less tedious to have to select previously typed text all the time? Is it mainly better because then you don't need to do anything on individual chars, but do it in a batch?


> But this is a custom setup that's not portable?

Caught me.

> And how is it less tedious to have to select previously typed text all the time?

This is tedious, but I have that automated (AutoHotKey). So a single, AHK-managed hotkey does the equivalent of:

    CTRL + SHIFT + HOME
    CTRL + C
    feed into tool, paste back
    CTRL + V
So once done writing, I press that single button and it's done (CTRL+SHIFT+HOME selects all text from the cursor to the beginning). To me, that's a better tradeoff than fiddling with compose keys, which I find break flow. For very short text, a compose key is possibly better; but again, once in AHK, it's a single shortcut. So as soon as more than one compose-key combination is needed, it's "worth it". But you're right: this is a custom setup and might not work for everyone.


Oh, I didn't see the "CTRL + SHIFT + HOME" part in the linked AHK script, only the later steps. That's definitely less tedious, though also more dangerous (it wouldn't work in rich-text apps with tables and pictures etc.? though maybe your tool is smart enough to deal with rich text in the clipboard), and it also pollutes the undo buffer.

Agreed re. the flow-breaking potential of compose keys in general (all these grammar things should be automated away). Though in the case of needing just a single diacritic, the least fiddly alternative could be something like an AHK hotstring mapping ",u" to "ü" (comma plus a letter without spaces, which doesn't happen in regular typing), or a dead key in a keyboard layout with a similar effect.


I had this need!

Did you opt to blacklist words with vowel combinations that should not be transformed, or whitelist those that should? Or something else entirely?


It's currently whitelist-based [0]. The downside is larger (code) size. The upside is simplicity. I imagine a blacklist could also work well, at smaller size but with more preprocessing needed.

[0]: https://github.com/alexpovel/srgn/blob/0008cce1c71f0d83f6a31...
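A minimal sketch of the whitelist idea, where KNOWN is a tiny made-up stand-in for the shipped word list (the real tool's handling of compound words and partial replacements is far more involved):

```python
SUBS = {"ae": "ä", "oe": "ö", "ue": "ü", "ss": "ß"}
KNOWN = {"grüße", "fähre"}  # stand-in for the real German word list

def naturalize(word):
    # Naively apply every substitution, then only accept the result if it is
    # a whitelisted word; otherwise keep the original spelling.
    replaced = word
    for ascii_pair, special in SUBS.items():
        replaced = replaced.replace(ascii_pair, special)
    return replaced if replaced.lower() in KNOWN else word

print(naturalize("Gruesse"))  # -> Grüße
print(naturalize("Poeten"))   # -> Poeten ("pöten" is not whitelisted, so "oe" stays)
```

The whitelist is what lets "Poeten" survive untouched while "Gruesse" gets all of its substitutions.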


I've been thinking about a tool to format a number whose common display format is abc.defg.hijk.lmn (but which I'd be retrieving from e.g. a database without the periods); the sample AutoHotkey script is a great starting point!


This is why I stick to layouts with separate keys for " and the dead key ¨. There are plenty of European layouts with QWERTY and the necessary dead keys that work on every OS.


The greeting seems Austrian?


Wow! What a coincidence. Just the other day I finished "v1" of a similar tool: https://github.com/alexpovel/srgn , calling it a combination of tr/sed, ripgrep and tree-sitter. It's more about editing code in-place, not finding matches.

I've spent a lot of time trying to find similar tools, and even list them in the README, but `AST-grep` did not come up! I was a bit confused, as I was sure such a thing must exist already. AST-grep looks much more capable and dynamic, great work, especially around the variable syntax.


This looks really interesting, thank you for putting this together! I’ll likely give it a go today. I say that as someone who has explored quite a few of these and found them mostly quite basic. srgn looks like more than the usual.

One minor comment: I personally found the first Python example involving a docstring a little hard to parse (ha). It may show a variety of features, but in particular I found that it was hard to spot at a glance what had changed.

If you could use diff formatting or a screenshot with color to show the differences it would make it much easier to follow. If I get around to using it later today, I might submit a PR for that. :)


> diff formatting

Thank you for the feedback! That sounds good, I'll add that.


I'm working on an another similar tool! https://github.com/bablr-lang

A lot like your project but with more of a focus on supporting data structures for incremental editing of programs. Kind of a DOM for code.


I'll post my own crappy one called oak which uses templates to render the result of tree-sitter queries.

https://github.com/go-go-golems/oak

I initially hoped the queries would be more powerful, but they are really not. You can write queries and a resulting template in a YAML file. The program will scan a list of repositories for all these YAML files and expose them as command-line verbs.

Here is one to find go definitions:

https://github.com/go-go-golems/oak/blob/main/cmd/oak/querie...

This can then be run as:

        oak go definitions /home/manuel/code/wesen/corporate-headquarters/geppetto/pkg/cmds/cmd.go
        type GeppettoCommandDescription struct {
                Name      string                            `yaml:"name"`
                Short     string                            `yaml:"short"`
                Long      string                            `yaml:"long,omitempty"`
                Flags     []*parameters.ParameterDefinition `yaml:"flags,omitempty"`
                Arguments []*parameters.ParameterDefinition `yaml:"arguments,omitempty"`
                Layers    []layers.ParameterLayer           `yaml:"layers,omitempty"`
 
                Prompt       string                      `yaml:"prompt,omitempty"`
                Messages     []*geppetto_context.Message `yaml:"messages,omitempty"`
                SystemPrompt string                      `yaml:"system-prompt,omitempty"`
        }
        type GeppettoCommand struct {
                *glazedcmds.CommandDescription
                StepSettings *settings.StepSettings
                Prompt       string
                Messages     []*geppetto_context.Message
                SystemPrompt string
        }
While I can use it for good effect for LLM prompting as is, I really would like to add a unification algorithm (like the one in Peter Norvig's Prolog compiler) to get better queries, and connect it to LSP as well.


Such an awesome idea and useful tool!

Do you use tree-sitter for the AST part also?


Exactly, all the parsing is done by tree-sitter. The Rust bindings to the tree-sitter C lib are a "first-class consumer".


ancv: https://github.com/alexpovel/ancv/

Idea: renders your resume as pretty terminal output. Others can view it in their own terminals:

    curl -L ancv.io/heyho
Pipe to a pager for best viewing. Yes, it's just a nerdy gimmick with almost no real use!

I provide a GCP-hosted server that works off GitHub gists (where your resume can live in JSONResume form). However, self-hosting is a first-class citizen and easy to use as well.


A couple years ago, I switched from German QWERTZ to a UK QWERTY keyboard (wouldn't have minded US QWERTY but the differently shaped return key seemed too foreign). I am not looking back: for programming but also general tasks, having keys like

    ` [ ] \ / { }
very easily available is a blessing. The German QWERTZ keyboard has triple occupation on some keys, which is not ergonomic and harder to type fast with.

Anyway, both Linux and Windows offer fast switching between installed keyboard layouts/languages using SUPER+SPACE. This is needed in e.g. emails, where I still need Umlauts; it's just much easier to read that way. However, switching back and forth constantly is completely overwhelming and not viable. Luckily, in German there are perfectly (and officially?) acceptable alternative spellings for our special "Unicode" characters. They can be typed using plain ASCII, aka a QWERTY keyboard.

So, I wrote a script to read in any text, combined it with AutoHotkey on Windows and now have a tool that, at the touch of a button, replaces selected text using alternative spellings (gruen, Duebel, Faehre) with their correct versions (grün, Dübel, Fähre). The tool could be extended for other languages rather easily. I've been using it for over a year now and recently got to release it properly on the cheese shop:

    pip install betterletter
(https://pypi.org/project/betterletter/)

Before putting this together, I had looked around for an existing tool. To my surprise (there's always something!), I found nothing. I guess this scratches too specific an itch: using QWERTY but wanting proper spelling quickly, while remaining on QWERTY so as not to have a mental breakdown and to stay at full typing speed.

After writing, select everything (CTRL+SHIFT+HOME works well), hit shortcut, text will be replaced. This takes about 2 seconds, much faster than switching keyboard layouts back and forth. If this ran as a daemon with the dictionary loaded into RAM already, the script could run almost instantaneously (most of the 2 seconds is IO, reading from disk), in linear time according to the text input size.


Check out Neo [0]. It has important symbols under your strong fingers. Every time I am forced to type on QWERTY/QWERTZ, I am reminded how bad it is.

[0] https://neo-layout.org


Neat! I'm using QWERTY International layouts myself, where you can type umlauts and ß with special keys for modifiers (e.g. alt+u on Mac for ¨), but I still think this is a cool tool.

Looking through the repo I wondered why you would commit the complete German dictionary weighing in at over 30 MB, whereas you only need a small fraction, the words containing the umlauts (or their false matches). Surely this would be a huge performance boost?

Turns out: a whopping 30% of that dictionary consists of words containing "ae|oe|ue|ss|ä|ö|ü|ß". Crazy. I would not have guessed that, at all.
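That fraction is easy to check yourself. A sketch with a toy word list standing in for the real dictionary; note that false matches count too, e.g. "Mauer" contains "ue":

```python
import re

# The same pattern as above: ASCII digraphs plus the actual special letters.
SPECIAL = re.compile(r"ae|oe|ue|ss|ä|ö|ü|ß")

def special_fraction(words):
    hits = sum(1 for w in words if SPECIAL.search(w.lower()))
    return hits / len(words)

# Toy sample: Fähre, Poeten ("oe"), gruessen, Straße, Tür, Mauer ("ue") match.
sample = ["Fähre", "Poeten", "Haus", "gruessen", "Straße",
          "Wald", "Tür", "Berg", "Mauer", "Boot"]
print(special_fraction(sample))  # -> 0.6
```

On the toy list, 6 of 10 words match, which is in the same surprising ballpark as the real dictionary.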


> Neat! I'm using QWERTY International layouts myself, where you can type umlauts and ß with special keys for modifiers (e.g. alt+u on Mac for ¨), but I still think this is a cool tool.

Yeah, I had looked into these but for some reason that didn't work. Don't remember why.

> Looking through the repo I wondered why you would commit the complete German dictionary weighing in at over 30 MB, whereas you only need a small fraction, the words containing the umlauts (or their false matches). Surely this would be a huge performance boost?

Yes! It would be a performance boost. In fact, the tool had a "caching" sort of functionality before. The whole dictionary is shipped (because that makes it much easier and there's almost no risk of getting it wrong when just copy-pasting a word list, plus it compresses well enough), but then a list containing only words with special characters would be generated on first use if it didn't exist yet.

As you noted, a lot of words do contain special letters, so the "complexity" wasn't worth it to me and I removed that. Could be brought back anytime, but it's fine for now.


Just FYI, Windows will let you set a keyboard layout per window. If you like writing programs, you can write one to switch the Linux keyboard layout based on the active window.


This looks amazing and like everything I always wanted.

Sadly, I think basing off XeTeX and not LuaTeX is a mistake. Certainly renders it unusable for me. Having Lua integration is just great.

Also, `lualatex` does not have some of the limitations of `xelatex` (memory limitations, the `contours` package, ...), but I guess this XeTeX reimplementation can work on removing those limitations, so that only the lack of Lua integration remains.

Also, like another person said, not having biber breaks my workflow as well, which specifically tries to leverage the "latest and greatest" of what LaTeX has to offer [0]: `pdflatex` is obsolete, so `lualatex` it is. `nomencl`, `makeindex` etc. are obsolete, so `glossaries-extra` it is. `bibtex` is obsolete, so `biber` it is. Throw in `latexmk` for automatic compilation (which the tool presented here does too, which is a biggie! [1]) and CI/CD and you have a 1970s tool in 2020s attire. Lua rounds off the picture.

Among other things, this gives Unicode-native (gasp) code/documents, and great automation capabilities (`latexmk`, CI/CD, Lua).

I think a modern TeX engine reimplementation should support all of the above, which are arguably the best modern options there are.

[0]: https://collaborating.tuhh.de/alex/latex-git-cookbook

[1]: I wonder if the logs are available though? aux, blg etc. are important for debugging and shouldn't be dropped outright.


I don't understand either; I thought all development effort went into LuaTeX now and that XeTeX had been obsolete for, like, 5-6 years.


> Also, `lualatex` does not have some of the limitations of `xelatex` (memory limitations, `contours` package, ...)

Are there some more details of the memory limitations you can share with us?


LuaLaTeX allocates memory as-needed, see section 3.4.1 in the manual [0] (and comments/answers in this thread [1]). Base TeX has an arbitrary, by modern standards low memory limit, leading to a whole class of errors plaguing unsuspecting users [2], and spawning entire extensions to deal with these limitations [3].

This is simply an artefact of times past and has no technical relevance nowadays. LuaTeX allows dynamic allocation, with the available system RAM as the upper limit (so effectively, no limitations in everyday usage).

Now, I could not find a mention of memory handling in the XeTeX reference manual [4]. People are using tricks like `tikzexternalize` with xelatex [5, 6]. Especially the first point makes me think XeLaTeX inherits base TeX memory handling/limits, but I cannot confirm this.

I just know that all my problems disappeared when switching from XeLaTeX to LuaLaTeX.

Lastly, see here [7] for a comprehensive (albeit somewhat anecdotal) list of advantages of LuaTeX over XeTeX. Of that list, `microtype` is another significant functionality I rely on.

[0]: http://www.tug.org/texlive//devsrc/Master/texmf-dist/doc/con...

[1]: https://tex.stackexchange.com/q/7953/

[2]: https://tex.stackexchange.com/search?q=tex+capacity+exceeded

[3]: https://tex.stackexchange.com/a/482560/

[4]: http://mirrors.ctan.org/info/xetexref/xetex-reference.pdf

[5]: https://tex.stackexchange.com/q/438131/

[6]: https://tex.stackexchange.com/q/334250/

[7]: https://tex.stackexchange.com/q/126206/


I’ve been a luatex advocate in the past¹, but I use xetex instead, unless I need the Lua integration. The memory handling is the reason. I find that for documents with a lot of fonts, luatex eats all the memory available and then crashes, taking a huge amount of time to do so, whereas xetex just breezes through the same document.

[1]: https://lwn.net/Articles/731581/


What's troublesome for me is that I have been using

* xetex when I needed a font that was not easily achievable in pdftex over the past decade

* pdftex for everything else because microtype(TM) just works(TM) (even though kerning can be done using fontspec and font features in xetex).

I've tried luatex multiple times over the past decade, it was mostly just too slow. Now luatex is fast. But I have no idea if I now "should" use luatex over pdftex for best out-of-the-box results or not.

Unfortunately, switching to luatex is not zero-effort (moving to polyglossia, using fontspec, maybe removing some magic in many-line private templates, and so on).

For all I know, because I'm always curious and peek at PDF file properties as a hobby (if only to check which cool font that is), basically every scientific paper I read is set using pdftex. luatex usage in the wild is, as far as I perceive it, nil, outside of enthusiast luatex user spheres. I don't think this will change unless texlive drops pdftex (as it still ships ptex and even uptex, it probably won't for a very long time).


> basically every scientific paper I read is set using pdftex

That is only because their templates are years behind the curve and they are slow to update. It is not an argument for the advantages of pdftex, aside from its stability, gained over many decades.

LuaTeX has been nothing but stable for me, so from a technical standpoint, there is no reason not to switch.

As far as scientific papers go, the publishers and editors probably value stability and backward-compatibility (I would).


Officially, luatex is the future. ConTeXt is based on it. I’ve heard that the kinds of problems I’m having are caused by its font-loading routines, and not the core parts of luatex, but without further research that doesn’t really help me.

