Titus Wormer Chats About Natural Language Processing Tools

Learn more about Titus Wormer's open-source natural language processing ecosystem!

Titus Wormer does a lot of work with natural language processing. He is the creator of Retext, Remark, and many, many more open-source projects.

In this episode, Titus chats about his extensive open-source work and goes a bit deeper into his ecosystem of natural language processing tools and plugins. You'll also learn about abstract syntax trees and their practical applications!

Homework

Guests

Titus Wormer

Transcript

Kent C. Dodds (00:00):
Hello friends. This is your friend Kent C. Dodds, and I am joined by my friend Titus Wormer. Say hi, Titus.

Titus Wormer (00:07):
Hi Kent.

Kent C. Dodds (00:08):
Well, Titus, I did ask you to say hi Titus, but that's okay. I'm just kidding. All right. So I am excited to chat with Titus. This is actually the first time I've seen Titus. Not in person, we're virtual, but basically. And it's just a pleasure to chat with Titus over the internet tubes. So yeah, I've been following you, Titus, for, god, a couple of years on Twitter, following your stuff on GitHub. You're very active. Probably everybody listening has used software that has used your software, if they haven't directly used your software. So I'm just really excited to talk with you about that. But before we get too far into this, I'd love for our audience to get to know you a little bit. So could you introduce yourself to us, please?

Titus Wormer (00:58):
Yeah. Yeah. Thanks for the introduction. I'm Titus. I live in Amsterdam, the Netherlands, and I do a lot of open source. And yeah, most people know me from my penguin avatar, so not a lot of people have seen how I actually look. And probably also not a lot of people know about the work that I do, because a lot of it is very low level. So it's often in the things that you use, and those are used in other things. And yeah, so it's very low-level stuff often, but it's used a lot. I think it's like 400, 500 GitHub repos, and a lot of packages come out of that. And those are downloaded, I think, right now above 10 billion times a year. So it's used a lot and fairly low level, and yeah, that's what I do.

Kent C. Dodds (02:04):
Oh, that's awesome. Do you have a job? Do you make money?

Titus Wormer (02:11):
I used to be a teacher, as in a professor, so I've made courses.

Kent C. Dodds (02:15):
Really?

Titus Wormer (02:15):
Yeah. And I taught folks how to do front end and back end and data visualization and stuff like that. And I quit, I think, like November 2018 or a bit later. Then we raised some money for open source on Open Collective to continue the work in open source, because there were just a lot of issues in open source. So it's a lot of work answering all those issues and still writing code, and then having a job that you also really enjoy, where you have a lot of students that are asking questions, and yeah, so that was a lot.
Yeah. And then we got some money in open source, and then you have some money in open source and you kind of need to use it. So we got, I think, 20K in half a year, which is a lot of money, way more than you can print stickers with. And that's a lot of stickers. But it's also not really enough to live on, actually. Yeah. So in the last couple of years I've sometimes been doing some freelancing, some contracting on top of my open source. And in other cases, recently, Salesforce has actually paid me to work on an open-source project that I was already working on. But I couldn't work full time because I had to contract next to it. And they were like, "Okay, what if you work full time on that new thing and we'll pay you for it?" And that was great. More companies should do that. Yeah. That's awesome. And then now, for the last half year, I've been living off of that, and I'm almost broke, so I need to figure out something again. And, yeah.

Kent C. Dodds (04:12):
It sounds like it's a constant hustle.

Titus Wormer (04:16):
Kind of, yeah. Well, I'm fortunate enough to live in Europe, where security is better. So I do have some health stuff, whereas in the US that's a bit worse. And I'm fortunate enough that I build a lot of important infrastructure, so companies do sometimes know me and reach out. But it's also only a couple of years, so we'll see what the future holds, but it's going okay for now.

Kent C. Dodds (04:53):
Yeah. Well, I know that pretty much everybody who uses Gatsby is using your software. Maybe everybody. I don't know if they have Markdown parsing built in, but I think they do. And yeah, pretty much everybody who has a developer blog is using Remark and Unified. So why don't you give us a little idea behind? Oh, and one of the things I wanted to mention is you do have a sponsors page on GitHub. So if anybody listening would like to sponsor you, I sponsor you. And yeah. So there's opportunity there. If you want to support a fellow developer or get your company to support, that would be even better. But yeah, I'd love to hear, for those who don't know, about the Unified project and Remark and Rehype. And maybe they know those things, but they're not sure which is which. Could you give us a rundown? I know you said you had 400 to 500 repos, so you don't have to tell us about all of them, but just kind of a lay of the land there.

Titus Wormer (05:55):
It originally kind of started with Retext, and that's a natural language parser. You give it a sentence and it'll split that sentence up, or that text up, into... Okay, this is a paragraph, it has a couple of sentences, and those sentences have words and punctuation and symbols. And that was useful, too, for spell checking or for readability checking. So yeah, or lots of other things. And then I built Remark, and that's sort of the same idea, but then for Markdown. So it'll take a Markdown document and it'll parse that into a syntax tree. So that's the representation so that computers can kind of understand what's going on.

Kent C. Dodds (06:48):
That's just like a big object.

Titus Wormer (06:52):
Yeah. A big JSON object, a tree structure. And tree here means that you have headings, and headings contain text, and you have bold, and you have lots of things. And then later I made Rehype, and that's for HTML. So kind of a similar thing. So most of my work revolves around natural language and content. It's modular like Lego bricks and layered like a cake. So if you don't like one piece, so for example, Remark, the Markdown parser, is too high level for you and you don't like the interface, that's fine. You can use the underlying tools, and you can combine all of those tools with each other in different ways.
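
To make the "big JSON object" concrete, here is a minimal sketch of what such a tree could look like for a two-line Markdown document. The node shapes are simplified and only loosely modeled on Markdown syntax trees; the type names and the `toPlainText` helper are assumptions made for illustration, not part of Remark.

```typescript
// Simplified node shapes, assumed for illustration only.
type TextNode = { type: "text"; value: string };
type Strong = { type: "strong"; children: Inline[] };
type Inline = TextNode | Strong;
type Heading = { type: "heading"; depth: number; children: Inline[] };
type Paragraph = { type: "paragraph"; children: Inline[] };
type Root = { type: "root"; children: (Heading | Paragraph)[] };
type AnyNode = Root | Heading | Paragraph | Inline;

// A hand-written tree for this document:
//
//   # Hello
//
//   Some **bold** text.
const tree: Root = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Some " },
        { type: "strong", children: [{ type: "text", value: "bold" }] },
        { type: "text", value: " text." },
      ],
    },
  ],
};

// Walk the tree depth-first and concatenate the value of every text node.
function toPlainText(node: AnyNode): string {
  if (node.type === "text") return node.value;
  return (node.children as AnyNode[]).map(toPlainText).join("");
}

const headingText = toPlainText(tree.children[0]);
const paragraphText = toPlainText(tree.children[1]);
```

Because the tree is plain data, anything that can walk an object can inspect or change the document.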

Kent C. Dodds (07:40):
Yeah. And it's all very composable and that's what makes that possible. And then there are even higher level tools, like there's the Gatsby Remark plugin and different things like that. I have a thing called MDX bundler that composes some of these tools together for bundling MDX. Actually, MDX is an interesting topic in itself. Do you want to talk about the origin story of MDX and your role in that?

Titus Wormer (08:08):
Yeah. The mdx-js project is built on these tools. So it's one of the higher level projects, one of the highest level projects that I work on. And what it does is it parses a kind of mix between Markdown and JSX. So, you have your JSX with your angle, your pointy brackets, and then you have H1 and you close those nicely. That's JSX, and that's really nice if you're going to do components. So if you're in React or Preact or whatever, then you really enjoy components, right? So that's where JSX is great. But if you are going to do paragraphs or strong, it's just a lot of typing to do all of those pointy brackets. And Markdown is much nicer, or a lot of people think it is, for paragraphs and strong and links and those typical content things. MDX allows the combination of the two. So there is Markdown syntax for most of the prose, most of the text.
And then it has JSX for the components. And I think the idea was originally from Guillermo, from Vercel, and a couple of other people had similar ideas, and that kind of became MDX. And then John Otander, he wrote the mdx-js project on top of Unified, so Remark and Rehype and those things, to actually make it into reality.

Kent C. Dodds (09:59):
And it's awesome. I love using MDX because it's just as you described. It's a perfect marriage of content and interactivity, right? There's like MDX Deck for slide decks that are interactive and stuff. So if you ever want to have content, but also interactive pieces in there. And before, if you wanted to write your content in Markdown, then you'd have to put an iframe in there for something that's interactive, or maybe a div with an ID and then have JavaScript come in later or something. But MDX just made that a whole lot easier.

Titus Wormer (10:40):
And to add, so I think like two months ago I made a React Server Components demo on MDX, and React Server Components is something that Facebook is working on. And it's a really nice way to combine static things with interactive things and splitting those bundles. So that one part is running on the server, right? And it's pre-rendered, it's sent to the client, and then only those interactive components are sent with it to update things. And that also works really well when combined with MDX. So MDX is already cool. And I think this will make it even cooler.

Kent C. Dodds (11:24):
Yeah. Yeah. For real, it's outrageous. I don't know if regular content creators are still using WYSIWYGs, or if people are starting to author things in Markdown. But for me, I love writing things in Markdown, and I think it's just a really nice way to express that sort of content in general. It's just really nice and terse, and I know that what ends up being generated isn't just weird divs and spans all over the place.

Titus Wormer (11:53):
Yeah. Markdown is also, so like, we're developers and we really love Markdown and also JSX, but there are some people, like normal people, that are pretty confused by Markdown. So the image syntax is kind of annoying. So I'm not sure if it works for everybody in the world, but it works really well for developers. And I've also recently set up a pipeline where you have a giant Markdown document and it's turned into books. So it's turned into beautiful digital books on the web, but also e-books and Kindle and even print books. So it's automatically laid out in a beautiful, yeah, they're really cool. The company is called Holloway, by the way. So check it out if you're interested in that. They're kind of a modern publisher and they're tiny, but they have some cool books. Yeah.

Kent C. Dodds (12:57):
What was the company called again?

Titus Wormer (12:59):
Holloway.

Kent C. Dodds (13:00):
Holloway. Cool. That sounds awesome.

Titus Wormer (13:03):
And the reason I wanted to mention it is that they work with editors and authors, and you can teach people the Markdown syntax in like an hour. It may not work for your dad or for, name a random person, but an editor or an author could learn the syntax of Markdown. And yeah.

Kent C. Dodds (13:33):
That's awesome. So you said there are 400 to 500 repos for the whole Unified organization. I'm guessing that there are more than just those three projects. So, what makes up the majority of those repos?

Titus Wormer (13:51):
So you kind of have the main projects of these nine ecosystems. So you have Markdown, you have HTML, you have natural language, you have a couple of others, and MDX is one. And so you have those main projects in each organization. And then there are a lot of plugins. So, I'm not sure how many of the Remark plugins we maintain, but that's probably like 50. And then I think in total there may be like 150 on GitHub. But I haven't really looked at this recently. So, we maintain a few. And under those plugins are utilities, and that's also about 120 of them. And then there are even lower level things sometimes. So, there is a list of HTML tags that are known, or a list of CSS vendor prefixes, for example. And these are tiny modules, but they sometimes change. And I don't want to update all my manual lists of HTML element names in a lot of projects. So those are in separate packages, and then you update one and it bubbles through the whole ecosystem.

Kent C. Dodds (15:19):
Got it. Yeah. It sounds like a lot to manage. So these plugins, I'm curious to dive deeper into that. What do plugins do that Remark doesn't do by itself?

Titus Wormer (15:31):
So Remark basically doesn't do anything. And these ecosystem tools, Remark, Rehype, Retext, they do the following. They take a document, text, and they parse it into the syntax tree. Then there are a lot of plugins, and then they serialize that syntax tree back into text. In some cases it doesn't end up as text, but it's going to React elements or some other things. But that's the typical case, and that's all these projects do. And plugins do everything in between, and plugins can inspect or can transform content. So they can look at: okay, here is emphasis and it's using an asterisk. Okay, here is emphasis and it's using an underscore. You should use one character for these things, just be consistent. So that's checking, but it can also generate stuff. So again, that's the transforming. So it could find all the headings throughout your Markdown document and then find, okay, here's a heading called table of contents, and then it will inject a nice table of contents there, or yeah. A lot of stuff can happen. Basically anything can happen in plugins.
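
The parse, plugins, serialize flow described here can be sketched with a toy pipeline. This is not Remark's API: the parser below only understands headings and paragraphs separated by blank lines, and every name (`parse`, `serialize`, `run`, `injectToc`) is an assumption made for illustration. The one plugin is a transforming plugin that looks for a heading called "Table of contents" and injects a list of the headings that follow it.

```typescript
// Toy node shapes, assumed for illustration only.
type Heading = { type: "heading"; depth: number; text: string };
type Paragraph = { type: "paragraph"; text: string };
type List = { type: "list"; items: string[] };
type Block = Heading | Paragraph | List;
type Root = { type: "root"; children: Block[] };
type Plugin = (tree: Root) => void;

// Parse: blocks are separated by blank lines; a leading run of "#" marks a heading.
function parse(doc: string): Root {
  const children = doc
    .split(/\n{2,}/)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk): Block => {
      const match = /^(#{1,6})\s+(.*)$/.exec(chunk);
      return match
        ? { type: "heading", depth: match[1].length, text: match[2] }
        : { type: "paragraph", text: chunk };
    });
  return { type: "root", children };
}

// Serialize: turn each block back into text.
function serializeBlock(node: Block): string {
  switch (node.type) {
    case "heading":
      return "#".repeat(node.depth) + " " + node.text;
    case "list":
      return node.items.map((item) => "- " + item).join("\n");
    case "paragraph":
      return node.text;
  }
}

function serialize(tree: Root): string {
  return tree.children.map(serializeBlock).join("\n\n") + "\n";
}

// Transforming plugin: inject a list of later headings after the
// "Table of contents" heading, if the document has one.
const injectToc: Plugin = (tree) => {
  const index = tree.children.findIndex(
    (node) => node.type === "heading" && node.text.toLowerCase() === "table of contents"
  );
  if (index === -1) return;
  const items = tree.children
    .filter((node, i): node is Heading => i > index && node.type === "heading")
    .map((heading) => heading.text);
  tree.children.splice(index + 1, 0, { type: "list", items });
};

// The pipeline: parse once, let every plugin work on the same tree, serialize once.
function run(doc: string, plugins: Plugin[]): string {
  const tree = parse(doc);
  for (const plugin of plugins) plugin(tree);
  return serialize(tree);
}

const output = run(
  "# Guide\n\n## Table of contents\n\n## Install\n\nRun the installer.\n\n## Use\n\nCall it.",
  [injectToc]
);
```

A checking plugin would have the same shape, except that it would record messages about the tree instead of changing it.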

Kent C. Dodds (16:54):
All your dreams can come true. Yeah. So Remark plugins kind of serve the same role as Babel plugins, or even ESLint plugins. You can do linting and things just for Markdown and HTML. Right?

Titus Wormer (17:11):
Yeah. And I think Unified here is different than the other ones, because for Markdown and HTML, Unified allows one pipeline that does everything. Whereas in the JavaScript world, you have Babel for transforming things. You have ESLint for checking and formatting things. And there's Prettier, of course, as well. There's Terser to minify JavaScript. So these are all separate projects that all have to parse and then transform and then serialize. And if you do those steps a lot, you lose time. And I think Unified here is doing it properly, in that it allows all these things on that syntax tree and only parses at the start and serializes at the end.

Kent C. Dodds (18:08):
Yeah. And I know that Sebastian McKenzie is working on Rome and he's going to do lots of the same thing. So yeah, having it only parsed once. And when we say parsed, that's taking it from the text that you write into this big JSON object that is a tree structure. And then you take that tree structure that maybe your plugins have modified or something, and then you stringify it or serialize it back into text. And Babel does this. ESLint never stringifies it, well, I guess they do. Yeah. Because they have their auto fix and stuff, and yeah. And then Remark does this, but Remark allows you to not only transform, but also lint the output.

Titus Wormer (18:55):
Yeah. And well, I never tried minifying Markdown. That might be possible, if people need to do that. But for HTML, that works as well. And I think one part that I didn't add before is that these are separate JSON structures, separate trees, one for Markdown, one for HTML, but it's also possible to transform from one to the other, so that you have your Markdown and turn it into an HTML syntax tree without having to serialize in between. And the inverse is also true. So you can have your HTML and turn it into Markdown, and same with natural language. And yeah, I'm also adding recently some more things around HTML and also doing some work on JavaScript syntax trees. And other people have made Org mode, which is kind of an alternative to Markdown, with similar ecosystems on these tools. Yeah.
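
As a rough sketch of going from one tree to the other without serializing in between, the function below maps a simplified Markdown tree onto a simplified HTML tree. Both sets of node shapes are assumptions made for illustration (for instance, the HTML elements here carry no attributes), and `toHtmlTree` is a hypothetical helper, not a function from the ecosystem.

```typescript
// Markdown-side nodes (simplified, assumed for illustration).
type MdText = { type: "text"; value: string };
type MdStrong = { type: "strong"; children: MdInline[] };
type MdInline = MdText | MdStrong;
type MdHeading = { type: "heading"; depth: number; children: MdInline[] };
type MdParagraph = { type: "paragraph"; children: MdInline[] };
type MdChild = MdHeading | MdParagraph | MdInline;
type MdRoot = { type: "root"; children: (MdHeading | MdParagraph)[] };

// HTML-side nodes (simplified, assumed for illustration).
type HtmlText = { type: "text"; value: string };
type HtmlElement = { type: "element"; tagName: string; children: HtmlNode[] };
type HtmlNode = HtmlText | HtmlElement;
type HtmlRoot = { type: "root"; children: HtmlNode[] };

// Map one Markdown node (and, recursively, its children) to an HTML node.
function toHtmlNode(node: MdChild): HtmlNode {
  switch (node.type) {
    case "text":
      return { type: "text", value: node.value };
    case "strong":
      return { type: "element", tagName: "strong", children: node.children.map(toHtmlNode) };
    case "paragraph":
      return { type: "element", tagName: "p", children: node.children.map(toHtmlNode) };
    case "heading":
      return { type: "element", tagName: "h" + node.depth, children: node.children.map(toHtmlNode) };
  }
}

// Tree in, tree out: no text is produced or re-parsed along the way.
function toHtmlTree(root: MdRoot): HtmlRoot {
  return { type: "root", children: root.children.map(toHtmlNode) };
}

const markdownTree: MdRoot = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Some " },
        { type: "strong", children: [{ type: "text", value: "bold" }] },
        { type: "text", value: " text." },
      ],
    },
  ],
};

const htmlTree = toHtmlTree(markdownTree);
```

Plugins written for HTML trees could then keep working on `htmlTree` directly, and serialization to an HTML string would happen only once at the end.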

Kent C. Dodds (19:57):
Cool. Yeah. So I'll just mention a couple of the plugins that I've done. Because making plugins is, it's both easier and harder than you think. I think it's mostly easier than people think. For me, when I was starting to get into abstract syntax trees, just the name, abstract syntax trees scared me. But once I got into it, I realized there was more to it than I thought, but it wasn't as magic as I thought. So for some Remark plugins that I've done, I have one on my site that automatically turns any Amazon link to an affiliate link. So it just adds the affiliate tag on there. It does the same for any egghead link that I have on my site. So, that's really nice. So I don't have to remember to add that all the time, it's just added automatically for me.
And when I moved from medium.com to my own website, one thing that I really missed about Markdown versus the Medium editor was I could take a URL for something, I could paste it in a Medium article and just hit enter, and it would embed. It was magical, and I didn't have to go and click around and find the iframe code or whatever.
And so I made a plugin that did exactly that. I just take the URL, I'd stick it in my Markdown. That's all I had to do. And then my plugin would turn that into the embed code. And sometimes you have to make a request, like for Twitter, for example, you make a request to get what the embed code would be for a particular tweet or something. And so, the benefit of that is you get SEO juice from other people's tweets, because it's all rendered at build time. And so it shows up in the HTML that comes back. My site's a Gatsby site right now, moving over to Remix. But anyway, so that was another one that's really useful. And that one actually has turned into a much bigger project called remark-embedder, which people should check out.
So that's another cool one. I've got another one where I transform my Cloudinary images. Any image that is hosted on Cloudinary, I'll add different transforms to it, so that it's optimized for my website and stuff. Because they have all sorts of really cool transforms you can add. So I don't have to worry about adding those myself. I just, here's the link, and then it gets transformed for me automatically. So yeah, lots of really cool things that you can do just for your own use cases. I haven't published most of these. They're just little things that I add to where I compile my MDX stuff, and most of them are just like 15 lines of code. It's not that much. So are there any other interesting use cases that you've seen of people building plugins with Unified?
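
A small transform of the affiliate-link kind might look roughly like the sketch below. It is not Kent's actual plugin: the node shapes are simplified, and the `addAffiliateTags` function and the `example-20` tag are hypothetical names chosen for illustration. It uses the standard `URL` class to add a `tag` query parameter to links that point at an Amazon host.

```typescript
// Simplified node shapes, assumed for illustration only.
type Text = { type: "text"; value: string };
type Link = { type: "link"; url: string; children: Text[] };
type Paragraph = { type: "paragraph"; children: (Text | Link)[] };
type Root = { type: "root"; children: Paragraph[] };

// Hypothetical affiliate tag, used only in this sketch.
const AFFILIATE_TAG = "example-20";

function isAmazonHost(hostname: string): boolean {
  return hostname === "amazon.com" || hostname.endsWith(".amazon.com");
}

// Visit every link in the tree and add the tag to Amazon URLs in place.
function addAffiliateTags(tree: Root, tag: string): void {
  for (const paragraph of tree.children) {
    for (const node of paragraph.children) {
      if (node.type !== "link") continue;
      let url: URL;
      try {
        url = new URL(node.url);
      } catch {
        continue; // Relative or malformed URLs are left untouched.
      }
      if (!isAmazonHost(url.hostname)) continue;
      url.searchParams.set("tag", tag);
      node.url = url.toString();
    }
  }
}

const tree: Root = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "link", url: "https://www.amazon.com/dp/B000000000", children: [{ type: "text", value: "a book" }] },
        { type: "text", value: " and " },
        { type: "link", url: "https://example.com/post", children: [{ type: "text", value: "a post" }] },
      ],
    },
  ],
};

addAffiliateTags(tree, AFFILIATE_TAG);
```

The same shape (find the nodes you care about, rewrite one field) covers the Cloudinary image case too, which is part of why such plugins can stay short.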

Titus Wormer (22:49):
Yeah. So many, there are so many cool ideas. I sometimes look through the code, and there are a lot of hits on GitHub for this code. So it's a lot to sift through, but it's just really cool to see tiny plugins that people wrote themselves to do fun stuff. Yeah, just a lot of examples. I don't have anything off the top of my head.

Kent C. Dodds (23:16):
Yeah. There's about a bazillion. If you check on npm, like find all the dependents. Yeah, people publish this stuff. But like I said, one of the coolest things about plugins isn't the ability to publish them. It's actually the ability to just make a short 15-line thing that makes your life a little easier when you're creating this content. And I haven't tried any linting or anything like that for my stuff. But I imagine that would probably be a really useful feature if I had other content writers. And you even have stuff like, I think this is for Retext maybe, but little checkers for inclusive language and stuff like that too. Right?

Titus Wormer (24:01):
Yeah. So there's this project called retext-equality. It's a very old project. The name isn't perfect, but it kind of, yeah, makes sense. And that's used inside alex, and alex is kind of the famous thing that people know. And what alex tries to do is help you check your text: whether it's inclusive, and whether there are some words that you maybe haven't thought about, but where there are better alternatives. And a simple example is "simple". If you write "simple" in your blog posts, "just write this simple line", then it may be simple for you, but for a lot of other people, especially if they're Googling for how to do this thing, well, apparently it isn't simple. And it's not just that. It's also, yeah, checking for "master" and "slave" and stuff like that.
And that gets a lot of attention from a lot of people that also aren't super good at programming. And I'm another white guy, so I am definitely not the person to decide whether these words are to be included or not. But a lot of people will suggest things there, and there are a couple of people that find a source and then add those terms, yeah. And I also sometimes use that on the Unified website itself, for the articles, to make sure that I don't write "just" or other things.

Kent C. Dodds (25:48):
Yeah. Yeah. And what's cool is that whether you want to use alex or not, which, definitely give that a look, you can make your own list of words that, like... At this company, we just don't use these words because it refers to our ex-founder and he's mean, or I don't know, whatever you want it to be. That's the cool thing about this, is that you can build your own things to just make this tool do what you want it to do. And plugin systems are notoriously difficult to create and to work with. But I've never felt any friction working with the Remark plugin system. With any plugin that I've written, I haven't felt like I've been fighting with the plugin system, which has been really quite nice.
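
A "bring your own word list" checker can be sketched in a few lines. This version scans plain text rather than a syntax tree, so it is a simplification of how the tools discussed here work; the `checkWords` function, the `Message` shape, and the word list are all hypothetical and chosen only for illustration.

```typescript
// One finding: which word, where it was found, and why it was flagged.
type Message = { word: string; line: number; column: number; note: string };

// Hypothetical word list; a team would supply its own.
const discouraged: Record<string, string> = {
  simple: "it may not be simple for the reader",
  just: "it can make a task sound easier than it is",
};

// Scan the text line by line and report every listed word with its position.
function checkWords(text: string, words: Record<string, string>): Message[] {
  const messages: Message[] = [];
  text.split("\n").forEach((lineText, lineIndex) => {
    const wordPattern = /[A-Za-z']+/g;
    let match: RegExpExecArray | null;
    while ((match = wordPattern.exec(lineText)) !== null) {
      const word = match[0].toLowerCase();
      if (Object.prototype.hasOwnProperty.call(words, word)) {
        messages.push({
          word,
          line: lineIndex + 1, // 1-based line number
          column: match.index + 1, // 1-based column of the word's first character
          note: words[word],
        });
      }
    }
  });
  return messages;
}

const messages = checkWords("Just write this simple line.\nThat is all.", discouraged);
```

Here `messages` holds two findings, for "just" and "simple" on the first line. Nothing in the text is changed, which is the difference between a checking step and a transforming one.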

Titus Wormer (26:41):
Oh, that's great to hear. Yeah. That's awesome.

Kent C. Dodds (26:45):
Cool. Well, we're coming down on our time now. Is there anything else that you'd like to mention before we wrap up?

Titus Wormer (26:51):
No.

Kent C. Dodds (26:53):
Perfect. Okay. So we do have homework for folks. We want you to write your own plugin, but that's kind of a big ask. And so all that we want you to do is to go to astexplorer.net, that's A S T explorer.net, and spend a few minutes playing around. So AST is short for abstract syntax tree, and there are a bunch of parsers available on there. So you can look at Markdown, but you can also look at JavaScript and TypeScript and CSS and HTML. There's everything, there's even like SQL and all sorts of stuff. I don't, it's big. It's really, really cool. And Remark is on there, of course, but yeah, just play around on there for a few minutes, and actually just spending a few minutes on there will hopefully make it seem less scary to you.
At least it did for me. It was like, oh, okay. So it's just like, I've got some code over here and some magical thing turns it into this JavaScript object, and okay, I can work with objects. But then once you're done with that, I have a blog post called "Write your own code transform for fun and profit", and you can follow along in that blog post to write your own Babel macro, which is also supported in AST Explorer. And Babel macro, it's a project that I created that makes it a lot easier to write code transformations for JavaScript. So we just want you to play around with ASTs and see that there's more to them than you think. But they're also not as magical as you think. It's all just JavaScript at the end of the day. So, anything to add there?

Titus Wormer (28:38):
No, that's perfect. And yeah, just there are so many possibilities. It's really cool what you can do with ASTs. It can really help your, like the, yeah.

Kent C. Dodds (28:50):
Absolutely. And another thing that I've noticed when I'm working with ASTs is that it makes me a better programmer, because I understand what my syntax actually means. And so I say, oh, okay. So this is an expression. So I can actually stick that right here. And I don't have to worry about making a variable out of it or whatever. So, yeah, it's pretty cool. All right. Hey, thanks everybody. It's been a pleasure, Titus. What's the best way for people to connect with you?

Titus Wormer (29:16):
Yeah. You can follow me on Twitter at W O O O R M. So triple O, so wooorm. And I'm also under the same name and that same penguin on GitHub. Yeah. I have more followers on GitHub than on Twitter. So follow me on GitHub, and if you enjoy Twitter, both are fine, or not, also fine.

Kent C. Dodds (29:42):
Very good. Cool. Well, hey, thanks so much. It's been a pleasure to chat with you. We are going to do another episode. So if you're listening to this in order, your next episode should be another interesting one with Titus. We're going to talk about native ESM, which is a topic that's been causing me a lot of pain recently. So it's going to be fun as well. See you all on the next one.

Titus Wormer (30:02):
Awesome.
