Learn more about Titus Wormer's open-source natural language processing ecosystem!
Titus Wormer does a lot of work with natural language processing. He is the creator of Retext, Remark, and many, many more open-source projects.
In this episode Titus chats about his extensive open-source work, and goes a bit deeper into his ecosystem of natural language processing tools and plugins. You'll also learn about abstract syntax trees and their practical applications!
Kent C. Dodds (00:00):
Hello friends. This is your friend Kent C. Dodds, and I am joined by my friend Titus Wormer. Say hi, Titus.
Titus Wormer (00:07):
Kent C. Dodds (00:08):
Well, Titus, I did ask you to say hi Titus, but that's okay. I'm just kidding. All right. So I am excited to chat with Titus. This is actually the first time I've seen Titus, not in person, we're virtual, but basically, and it's just a pleasure to chat with Titus over the internet tubes. So yeah, I've been following you, Titus, for, god, a couple of years, on Twitter, following your stuff on GitHub. You're very active. Probably everybody listening has used software that has used your software if they haven't directly used your software. So I'm just really excited to talk with you about that. But before we get too far into this, I'd love for our audience to get to know you a little bit. So could you introduce yourself to us please?
Titus Wormer (00:58):
Yeah. Yeah. Thanks for the introduction. I'm Titus. I live in Amsterdam, the Netherlands, and I do a lot of the choice. And yeah, most people know me from my penguin avatar. So not a lot of people have seen how I actually look. And probably also not a lot of people know about the work that I do, because a lot of it is very low level. So it's often in the things that you use and those are used in other things. And yeah, so it's very low level stuff often, but it's used a lot. I think it's like 400, 500 GitHub repos and a lot of packages come out of that. And those are downloaded, I think right now, above 10 billion times a year. It's used a lot and fairly low level and yeah, that's what I do.
Kent C. Dodds (02:04):
Oh, that's awesome. Do you have a job? Do you make money?
Titus Wormer (02:11):
I used to be a teacher, as in a professor, so I've made courses.
Kent C. Dodds (02:15):
Titus Wormer (02:15):
Yeah. And I taught folks how to do front end and backend and data visualization and stuff like that. And I quit, I think, like November 2018 or a bit later. Then we raised some money for open source on Open Collective to continue the work in open source, because it was just a lot of issues in open source. So it's a lot of work answering all those issues and still writing code and then having a job that you also really enjoy where you have a lot of students that are asking questions and yeah, so that was a lot.
Yeah. And then we got some money in open source and then you have some money in open source and then you kind of need to use it. So, and it's, we got, I think, 20K in half a year, which is a lot of money, way more than you can print stickers with. And that's a lot of stickers. But it's also not really enough to live on, actually. Yeah. So in the last couple of years I've been sometimes doing some freelancing, some contracting on top of my open source. And in other cases, recently Salesforce has actually paid me to work on an open-source project that I was already working on. But I couldn't work full time because I had to contract next to it. And they were like, "Okay, what if you work full-time on that new thing and we'll pay you for it". And that was great. More companies should do that. Yeah. That's awesome. And then now for the last half year, I've been living off of that and I'm almost broke, so I need to figure out something again. And, yeah.
Kent C. Dodds (04:12):
It sounds like it's a constant hustle.
Titus Wormer (04:16):
Kind of, yeah. Well, I'm fortunate enough to live in Europe where security is better. So I do have some health stuff, whereas in the US that's a bit worse. And I'm fortunate enough that I build a lot of important infrastructure. So companies do sometimes know me and reach out. And so, but it's also only a couple of years, so we'll see what the future holds, but it's going okay for now.
Kent C. Dodds (04:53):
Yeah. Well, I know that pretty much everybody who uses Gatsby is using your software. Maybe everybody, I don't know if they have Markdown parsing built in, but I think they do. And yeah, pretty much everybody who has a developer blog is using Remark and Unified. So why don't you give us a little idea behind? Oh, and one of the things I wanted to mention is you do have a sponsors page on GitHub. So if anybody listening would like to sponsor you, I sponsor you. And yeah. So there's opportunity there. If you want to support a fellow developer or get your company to support, that would be even better. But yeah, I'd love to hear, for those who don't know, about the Unified project and Remark and Rehype, and maybe they know those things, but they're not sure which is which. Could you give us a rundown? I know you said you had 400 to 500 repos, so you don't have to tell us about all of them, but just kind of a lay of the land there.
Titus Wormer (05:55):
It originally kind of started with Retext, and that's a natural language parser. And you give it a sentence and it'll split that sentence up, or that text up, into... Okay, this is a paragraph, it has a couple of sentences, and those sentences have words and punctuation and symbols. And that was useful, too, for spell checking or for readability checking. So yeah, or lots of other things. And then I built Remark, and that's sort of the same idea, but then for Markdown. So it'll take a Markdown document, it'll parse that into a syntax tree. So that's the representation so that computers can kind of understand what's going on.
Kent C. Dodds (06:48):
That's just like a big object.
Titus Wormer (06:52):
Yeah. A big JSON object, a tree structure. So, and tree here means that you have headings, and headings contain text, and you have bold, and you have lots of things. And then later I made Rehype, and that's for HTML. So kind of a similar thing. So most of my work revolves around natural language and content. It's modular like Lego bricks and layered like a cake. So if you don't like one piece, so for example, Remark, the Markdown parser, is too high level for you and you don't like the interface, that's fine. You can use the underlying tools and you can combine all of those tools with each other in different ways.
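To make the "big JSON object" concrete, here is a minimal sketch of what such a tree might look like for a two-line Markdown document. The node shapes are simplified and hand-written, modeled on the kind of tree Remark produces (position information is omitted), and the `toPlainText` helper is an illustrative assumption, not part of any library.

```typescript
// Simplified node shapes modeled on a Markdown syntax tree.
// Real trees carry more fields (for example source positions); omitted here.
type TextNode = { type: "text"; value: string };
type StrongNode = { type: "strong"; children: InlineNode[] };
type InlineNode = TextNode | StrongNode;
type HeadingNode = { type: "heading"; depth: number; children: InlineNode[] };
type ParagraphNode = { type: "paragraph"; children: InlineNode[] };
type RootNode = { type: "root"; children: Array<HeadingNode | ParagraphNode> };
type AnyNode = RootNode | HeadingNode | ParagraphNode | InlineNode;

// Hand-written tree for the Markdown source:
//   # Hello
//
//   Some **bold** text.
const tree: RootNode = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Some " },
        { type: "strong", children: [{ type: "text", value: "bold" }] },
        { type: "text", value: " text." },
      ],
    },
  ],
};

// Hypothetical helper: walk the tree depth-first and concatenate the value
// of every text node it finds.
function toPlainText(node: AnyNode): string {
  if (node.type === "text") return node.value;
  const children: AnyNode[] = node.children;
  return children.map((child) => toPlainText(child)).join("");
}

const headingText = toPlainText(tree.children[0]); // "Hello"
const paragraphText = toPlainText(tree.children[1]); // "Some bold text."
```

Because every node is plain data with a `type` and, usually, `children`, a tool can work at whatever layer it likes: the whole document, one heading, or a single text node.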
Kent C. Dodds (07:40):
Yeah. And it's all very composable and that's what makes that possible. And then there are even higher level tools, like there's the Gatsby Remark plugin and different things like that. I have a thing called mdx-bundler that composes some of these tools together for bundling MDX. Actually, MDX is an interesting topic in itself. Do you want to talk about the origin stories of MDX and your role in that?
Titus Wormer (08:08):
Yeah. The mdx-js project is built on these tools. So it's one of the higher level projects, and one of the highest level projects that I work on. And what it does is it parses a kind of mix between Markdown and JSX. So, you have your JSX with your angle, your pointy brackets, and then you have H1 and you close those nicely, that's JSX, and that's really nice if you're going to do components. So if you're in React or Preact or whatever, then you really enjoy components, right? So that's where JSX is great. But if you are going to do paragraphs or strong, it's just a lot of typing to do all of those pointy brackets. And Markdown is much nicer, or a lot of people think it is, for paragraphs and strong and links and those typical content things. MDX allows the combination of the two. So there is Markdown syntax for most of the prose, most of the text.
And then it has JSX for the components. And I think the idea was originally from Guillermo, from Vercel, and a couple of other people had similar ideas, and that kind of became MDX. And then John Otander, he wrote the mdx-js project on top of Unified, so Remark and Rehype and those things, to actually make it into reality.
Kent C. Dodds (09:59):
Titus Wormer (10:40):
And to add, so I think like two months ago I made a React Server Components demo on MDX, and React Server Components is something that Facebook is working on. And it's a really nice way to combine static things with interactive things and splitting those bundles. So, that one part is running on the server, right? And it's already, like, it's pre-rendered, it's sent to the client, and then only those interactive components are sent with it to update things. And that also works really well when combined with MDX. So MDX is already cool. And I think this will make it even cooler.
Kent C. Dodds (11:24):
Yeah. Yeah. For real, it's outrageous. I don't know if regular content creators are still using WYSIWYGs, or if people are starting to author things in Markdown. But for me, I love writing things in Markdown and I think it's just a really nice way to express that sort of content in general. It's just really nice and terse, and I know that what ends up being generated isn't just like weird divs and spans all over the place.
Titus Wormer (11:53):
Yeah. Markdown is also, so like, we're developers and we really love Markdown and also JSX, but there are some people, like normal people, that are pretty confused by Markdown. So the image syntax is kind of annoying. So I'm not sure if it works for everybody in the world, but it works really well for developers. And I've also recently set up a pipeline where you have a giant Markdown document and it's turned into books. So it's turned into beautiful digital books on the web, but also e-books and Kindle and even print books. So it's automatically laid out in a beautiful, yeah, they're really cool. Oh, and it's called Holloway, by the way. So check it out if you're interested in that. They're kind of a modern publisher and they're tiny, but they have some cool books. Yeah.
Kent C. Dodds (12:57):
What was the company called again?
Titus Wormer (12:59):
Kent C. Dodds (13:00):
Holloway. Cool. That sounds awesome.
Titus Wormer (13:03):
And the reason I wanted to mention it is that they work with editors and authors, and you can teach people the Markdown syntax in like an hour. It may not work for your dad or for, name a random person, but an editor or an author could learn the syntax of Markdown. And yeah.
Kent C. Dodds (13:33):
That's awesome. So you said there are 400 to 500 repos for the whole Unified organization. I'm guessing that there are more than just those three projects. So, what makes up the majority of those repos?
Titus Wormer (13:51):
So you kind of have the main projects of these nine ecosystems. So you have Markdown, you have HTML, you have natural language, you have a couple of others, and MDX is one. And so you have those main projects in each organization. And then there are a lot of plugins. So, I'm not sure how many we maintain of the Remark plugins, but that's probably like 50. And then I think in total, there may be like 150 on GitHub. But I haven't really looked at this recently. So, we maintain a few. And under those plugins are utilities, and that's also about 120 of them. And then there are even lower level things sometimes. So, there are lists of HTML tags that are known, or lists of CSS vendor prefixes, for example. And these are tiny modules, but they sometimes change. And I don't want to update all my manual lists of HTML element names in a lot of projects. So those are in separate packages, and then you update one and it bubbles through the whole ecosystem.
Kent C. Dodds (15:19):
Got it. Yeah. It sounds like a lot to manage. So these plugins, I'm curious to dive deeper into that. What do plugins do that Remark doesn't do by itself?
Titus Wormer (15:31):
So Remark basically doesn't do anything. And these ecosystem tools, Remark, Rehype, Retext, they do the following. They take a document, text, and they parse it into a syntax tree. Then there are a lot of plugins, and then they serialize that syntax tree back into text. In some cases it doesn't end up as text, but it's going to React elements or some other things. But that's the typical case and that's all these projects do. And plugins do everything in between, and plugins can inspect or can transform content. So they can look at, okay, here is emphasis and it's using an asterisk. Okay, here is emphasis and it's using an underscore. You should use one character for these things, just be consistent. So that's checking, but it can also generate stuff. So again, that's the transforming. So it could find all the headings throughout your Markdown document and then find, okay, here's a heading called table of contents, and then it will inject a nice table of contents there, or yeah. A lot of stuff can happen. Basically anything can happen in plugins.
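The table-of-contents idea Titus describes can be sketched as a function that receives the parsed tree, changes it, and hands it back for serializing. This is a self-contained illustration under stated assumptions: the node shapes are simplified, and `injectTableOfContents` is a hypothetical name, not the API of any published plugin. Parsing and serializing are left out; only the "in between" step is shown.

```typescript
// Simplified node shapes; real syntax trees carry more fields.
type TextNode = { type: "text"; value: string };
type HeadingNode = { type: "heading"; depth: number; children: TextNode[] };
type ParagraphNode = { type: "paragraph"; children: TextNode[] };
type ListItemNode = { type: "listItem"; children: ParagraphNode[] };
type ListNode = { type: "list"; ordered: boolean; children: ListItemNode[] };
type BlockNode = HeadingNode | ParagraphNode | ListNode;
type RootNode = { type: "root"; children: BlockNode[] };

const titleOf = (heading: HeadingNode): string =>
  heading.children.map((child) => child.value).join("");

// Hypothetical transformer: look for a heading titled "Table of contents"
// and insert a list of every heading that follows it. Headings before the
// marker (such as the document title) are skipped. Mutates the tree.
function injectTableOfContents(tree: RootNode): RootNode {
  let marker = -1;
  const items: ListItemNode[] = [];

  tree.children.forEach((node, index) => {
    if (node.type !== "heading") return;
    const title = titleOf(node);
    if (marker === -1) {
      if (title.toLowerCase() === "table of contents") marker = index;
      return;
    }
    items.push({
      type: "listItem",
      children: [{ type: "paragraph", children: [{ type: "text", value: title }] }],
    });
  });

  if (marker !== -1 && items.length > 0) {
    tree.children.splice(marker + 1, 0, { type: "list", ordered: false, children: items });
  }
  return tree;
}

const doc: RootNode = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Guide" }] },
    { type: "heading", depth: 2, children: [{ type: "text", value: "Table of contents" }] },
    { type: "heading", depth: 2, children: [{ type: "text", value: "Install" }] },
    { type: "paragraph", children: [{ type: "text", value: "Run the installer." }] },
    { type: "heading", depth: 2, children: [{ type: "text", value: "Usage" }] },
  ],
};

// After this call, doc.children[2] is a list with the entries "Install" and "Usage".
const result = injectTableOfContents(doc);
```

A checking plugin would have the same shape, except that it would collect messages about the tree instead of changing it.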
Kent C. Dodds (16:54):
All your dreams can come true. Yeah. So Remark plugins kind of serve the same role as Babel plugins, or even ESLint plugins. You can do linting and things just for Markdown and HTML. Right?
Titus Wormer (17:11):
Kent C. Dodds (18:08):
Yeah. And I know that Sebastian McKenzie is working on Rome and he's going to do lots of the same thing. So yeah. Having it only parsed. And when we say parsed, that's taking it from the text that you write into this big JSON object that is a tree structure, and that process. And then you take that tree structure that maybe your plugins have modified or something. And then you stringify it or serialize it back into text. And Babel does this. ESLint never stringifies it, well, I guess they do. Yeah. Because they have their auto fix and stuff and yeah. And then Remark does this, but Remark allows you to not only transform, but also lint the output.
Titus Wormer (18:55):
Kent C. Dodds (19:57):
Cool. Yeah. So I'll just mention a couple of the plugins that I've done. Because making plugins is, it's both easier and harder than you think. I think it's mostly easier than people think. For me, when I was starting to get into abstract syntax trees, just the name, abstract syntax trees scared me. But once I got into it, I realized there was more to it than I thought, but it wasn't as magic as I thought. So for some Remark plugins that I've done, I have one on my site that automatically turns any Amazon link to an affiliate link. So it just adds the affiliate tag on there. It does the same for any egghead link that I have on my site. So, that's really nice. So I don't have to remember to add that all the time, it's just added automatically for me.
And when I moved from medium.com to my own website, one thing that I really missed about Markdown versus the Medium editor was I could take a URL for something, I could paste it in a Medium article and just hit enter, and it would embed. It was magical and I didn't have to go and click around and find the iframe code or whatever.
And so I made a plugin that did exactly that. I just take the URL, I'd stick it in my Markdown. That's all I had to do. And then my plugin would turn that into the embed code. And sometimes you have to make a request, like for Twitter, for example, you make a request to get what the embed code would be for a particular tweet or something. And so, the benefit of that is you get SEO juice from other people's tweets. And so, because it's all rendered at build time. And so it shows up in the HTML that comes back. My site's a Gatsby site right now, moving over to Remix. But anyway, so that was another one that's really useful. And that one actually has turned into a much bigger project called remark-embedder, which people should check out.
So that's another cool one. I've got another one where I transform my Cloudinary images. Any image that is hosted on Cloudinary, I'll add different transforms to it, so that it's optimized for my website and stuff. Because they have all sorts of really cool transforms you can add. So I don't have to worry about adding those myself. I just, here's the link, and then it gets transformed for me automatically. So yeah, lots of really cool things that you can do just for your own use cases. I haven't published most of these. They're just little things that I add to where I compile my MDX stuff, and most of them are just like 15 lines of code. It's not that much. So are there any other interesting use cases that you've seen of people building plugins with Unified?
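As an illustration of how small such a plugin can be, here is a rough sketch of the affiliate-link idea from this turn. It is not Kent's actual code: the node shapes are simplified, the `tag` query parameter and the `example-20` value are assumptions made for the example, and `withAffiliateTag` and `addAffiliateTags` are hypothetical names.

```typescript
// Simplified node shapes; real syntax trees carry more fields.
type TextNode = { type: "text"; value: string };
type LinkNode = { type: "link"; url: string; children: TextNode[] };
type ParagraphNode = { type: "paragraph"; children: Array<TextNode | LinkNode> };
type RootNode = { type: "root"; children: ParagraphNode[] };

// Assumed affiliate tag; a real one would come from configuration.
const AFFILIATE_TAG = "example-20";

// Add the tag to a URL when it points at amazon.com and has no tag yet.
function withAffiliateTag(href: string): string {
  let url: URL;
  try {
    url = new URL(href);
  } catch {
    return href; // relative or malformed URLs are left alone
  }
  const host = url.hostname.replace(/^www\./, "");
  if (host !== "amazon.com") return href;
  if (!url.searchParams.has("tag")) url.searchParams.set("tag", AFFILIATE_TAG);
  return url.toString();
}

// Hypothetical transformer: rewrite the url of every link node in place.
function addAffiliateTags(tree: RootNode): RootNode {
  for (const paragraph of tree.children) {
    for (const node of paragraph.children) {
      if (node.type === "link") node.url = withAffiliateTag(node.url);
    }
  }
  return tree;
}

const post: RootNode = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "I recommend " },
        {
          type: "link",
          url: "https://www.amazon.com/dp/B000123",
          children: [{ type: "text", value: "this book" }],
        },
        { type: "text", value: " and " },
        {
          type: "link",
          url: "https://example.com/post",
          children: [{ type: "text", value: "this article" }],
        },
      ],
    },
  ],
};

// The first link becomes https://www.amazon.com/dp/B000123?tag=example-20;
// the second link is left unchanged.
addAffiliateTags(post);
```

The embed and Cloudinary plugins mentioned in the same turn would follow the same pattern: find the nodes of interest, then rewrite or replace them.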
Titus Wormer (22:49):
Yeah. So many, there are so many cool ideas. I sometimes look through the code, and there are a lot of hits on GitHub for this code. So it's a lot to sift through, but it's just really cool to see tiny plugins that people wrote themselves to do fun stuff. Yeah, just a lot of examples. I don't have anything off the top of my head.
Kent C. Dodds (23:16):
Yeah. There's about a bazillion. If you check on npm, like find all the dependents. Yeah, people publish this stuff. But like I said, one of the coolest things about plugins isn't the ability to publish them. But actually it's the ability to just make a short 15 line thing that just makes your life a little easier when you're creating this content. And I haven't tried any linting or anything like that for my stuff. But I imagine that would probably be a really useful feature if I had other content writers. And you even have stuff like, I think this is for Retext maybe, but little checkers for inclusive language and stuff like that too. Right?
Titus Wormer (24:01):
Yeah. So there's this project called retext-equality. It's a very old project, the name isn't perfect, but it kind of, yeah, makes sense. And that's used inside alex, and alex is kind of the famous thing that people know. And what alex tries to do is help you check your text. Yeah. Whether it's inclusive and whether there are some words that you haven't thought about, but maybe there are better alternatives. And a simple example is "simple". If you write "simple" in your blog post, "just write this simple line", then that line is maybe simple for you, but for a lot of other people, especially if they're Googling for how to do this thing, well, apparently it isn't simple. And it's not just that, but it's also, yeah, checking for master and slave and stuff like that.
And that gets, I guess, a lot of attention from a lot of people that also aren't super good at programming, and I'm another white guy. So I am definitely not the person to decide whether these words are to be included or not. But a lot of people will suggest things there, and there are a couple of people that find a source and then add those terms, yeah. And I also sometimes use that on the Unified website itself for the articles to make sure that I don't write "just" or other things.
Kent C. Dodds (25:48):
Yeah. Yeah. And what's cool is that whether you want to use alex or not, which, definitely give that a look, you can make your own list of words that like... At this company, we just don't use these words because it refers to our ex founder and he's mean, or I don't know, whatever, whatever you want it to be. That's the cool thing about this, is that you can build your own things to just make this tool do what you want it to do. And plugin systems are notoriously difficult to create and to work with. But I feel I've never felt any friction working with the Remark plugin system. Any plugin that I've written, I haven't felt like I've been fighting with the plugin system, which has been really quite nice.
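The "make your own list of words" idea can be approximated in a few lines. The sketch below is not how alex or retext-equality is implemented; it works on plain text instead of a syntax tree, and `houseList`, `checkWords`, and the messages are hypothetical names and wording chosen for illustration.

```typescript
type Warning = { word: string; line: number; column: number; message: string };
type WordList = { [word: string]: string };

// Hypothetical house list: flagged word -> note shown to the writer.
const houseList: WordList = {
  simple: "may not be simple for every reader",
  just: "can read as dismissive; consider removing it",
};

// Scan the text line by line and report every listed word with its
// 1-based line and column, similar in spirit to what a linter prints.
function checkWords(text: string, list: WordList): Warning[] {
  const found: Warning[] = [];
  text.split("\n").forEach((line, lineIndex) => {
    const wordPattern = /[A-Za-z']+/g;
    let match: RegExpExecArray | null;
    while ((match = wordPattern.exec(line)) !== null) {
      const word = match[0].toLowerCase();
      // hasOwnProperty avoids false hits on inherited keys like "constructor".
      if (Object.prototype.hasOwnProperty.call(list, word)) {
        found.push({
          word,
          line: lineIndex + 1,
          column: match.index + 1,
          message: list[word],
        });
      }
    }
  });
  return found;
}

// Reports "just" at line 1, column 1 and "simple" at line 1, column 17.
const warnings = checkWords("Just write this simple line.\nIt works.", houseList);
```

Swapping in a different list changes what gets flagged without touching the checker, which is the kind of customization described above.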
Titus Wormer (26:41):
Oh, that's great to hear. Yeah. That's awesome.
Kent C. Dodds (26:45):
Cool. Well, we're coming down on our time now. Is there anything else that you'd like to mention before we wrap up?
Titus Wormer (26:51):
Kent C. Dodds (26:53):
Titus Wormer (28:38):
No, that's perfect. And yeah, just there are so many possibilities. It's really cool what you can do with ASTs. It can really help your, like the, yeah.
Kent C. Dodds (28:50):
Absolutely. And another thing that I've noticed when I'm working with ASTs is that it makes me a better programmer, because I understand what my syntax actually means. And so I say, oh, okay. So this is an expression. So I can actually stick that right here. And I don't have to worry about making a variable out of it or whatever. So, yeah, it's pretty cool. All right. Hey, thanks everybody. It's been a pleasure, Titus. What's the best way for people to connect with you?
Titus Wormer (29:16):
Yeah. You can follow me on Twitter at W-O-O-O-R-M. So triple O, so wooorm. And I'm also under the same name and that same penguin on GitHub. Yeah. I have more followers on GitHub than on Twitter. So follow me on GitHub, and if you enjoy Twitter, both are fine, or not, also fine.
Kent C. Dodds (29:42):
Very good. Cool. Well, hey, thanks so much. It's been a pleasure to chat with you. We are going to do another episode. So if you're listening to this in order, your next episode should be another interesting one with Titus. We're going to talk about native ESM, which is a topic that's been causing me a lot of pain recently. So it's going to be fun as well. See you all on the next one.
Titus Wormer (30:02):