When I read, I like to highlight stuff (I use a Kindle). I feel like I retain no more than 10% of the information I consume from reading alone; it’s re-reading the highlights, or summarizing the book from them, that makes me actually understand what I read.
The problem is that, sometimes, I end up highlighting a lot.
And by a lot I mean A LOT. We cannot even call them “key notes.”
So in those cases, after reading the book, I end up either wasting a lot of time summarizing or just quitting altogether (the latter being the more common).
I recently read a book that I enjoyed quite a lot and would like to fully retain what struck me the most. But, again, it was one of those books I over-highlighted.
And I didn’t want to spend a lot of my scarce free time on it. So I decided to automate the process using my tech/data skills. Because I’m happy with the result, I thought I’d share it so anyone interested can take advantage of this tool as well.
Disclaimer: my Kindle is quite old, so this approach should work on newer ones as well. In fact, there’s a slightly better approach for newer Kindle versions (also explained in this post).
The Project
Let’s define the goal: generate a summary from our Kindle highlights.
As I thought about it, I imagined the following simple pipeline for a single book:
- Get the book highlights
- Create a RAG or something similar
- Export the summary
The final implementation differs only in the first part, mostly because of the preprocessing needed to account for how the data is structured.
So I’ll structure this post into two main sections:
- Data retrieval and processing
- AI model and output
1. Data Retrieval and Processing
My intuition told me there was a way to extract highlights from my Kindle. In the end, they’re stored there, so I just need a way to get them out.
There are several ways to do it, but I wanted an approach that worked both with books bought on the official Kindle store and with PDFs or files I sent from my laptop.
And I also decided I wouldn’t use any existing software to extract the data. Just my ebook and my laptop (and a USB connecting them both).
Luckily for us, no jailbreak is needed and there are two ways of doing so depending on your Kindle version:
- All Kindles (presumably) have a file in the documents folder named My Clippings.txt. It literally contains any clipping you’ve made at any point and any book.
- New Kindles also have a SQLite file in the system directory named annotations.db. This has your highlights in a more structured way.
In this post I’ll be using method 1 (My Clippings.txt), mainly because my Kindle doesn’t have the annotations.db database. But if you’re lucky enough to have the DB, use it: it’ll be more straightforward and higher quality (most of the preprocessing we’ll see next probably won’t be needed).
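If you do have the database, reading it is a few lines of stdlib sqlite3. Here’s a sketch — I don’t have the file myself, so the table and column names are assumptions; inspect your own annotations.db (e.g. with `.schema` in the sqlite3 CLI) before relying on them:

```python
import sqlite3

def read_annotations(db_path):
    # NOTE: the table and column names below are assumptions --
    # check your own annotations.db with `.schema` before
    # trusting this query.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT booktitle, startposition, selectedtext "
            "FROM annotations"
        )
        return [
            {"book": book, "location": loc, "text": text}
            for book, loc, text in cur.fetchall()
        ]
    finally:
        conn.close()
```

The upside is that the DB gives you one clean row per highlight, so the deduplication and sorting steps below become largely unnecessary.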
So getting the clippings is as easy as reading the TXT. Here are some key aspects and problems I encountered using this method:
- All books are on the same file.
- I’m not sure about Amazon’s exact definition of a “clipping,” but the way I’ve seen it work is: anything you highlight at any point. Even if you delete or expand it, the original remains in the TXT. I assume that’s because we’re working with a plain TXT file, and it’s very hard to delete content that isn’t indexed in any way.
- There is a limit to clippings: I’m not aware of the exact threshold, but once we cross it, we can’t retrieve any more clippings. Presumably this is because someone could otherwise highlight the full book, extract it and share it illegally.
And this is the anatomy of a clipping:
==========
Book Name (Author Name)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM

transparency problem ends up in the same place as
==========
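That metadata line is itself structured, so if you ever want more than the single location number — the page, the full location range, the timestamp — a regex plus strptime gets you there. A sketch, with caveats: the field names are mine, not every clipping carries a page, and the date format may differ with your Kindle’s language settings:

```python
import re
from datetime import datetime

META_RE = re.compile(
    r"Your Highlight on page (?P<page>\d+) \| "
    r"Location (?P<start>\d+)-(?P<end>\d+) \| "
    r"Added on (?P<added>.+)$"
)

def parse_meta(line):
    # Returns None when the line doesn't match (e.g. a note,
    # or a clipping without page information).
    m = META_RE.search(line)
    if not m:
        return None
    return {
        "page": int(m.group("page")),
        "start": int(m.group("start")),
        "end": int(m.group("end")),
        # e.g. "Sunday, August 30, 2020 11:25:29 PM"
        "added": datetime.strptime(
            m.group("added"), "%A, %B %d, %Y %I:%M:%S %p"
        ),
    }
```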
So the first step is parsing the highlights, and this is where we start seeing Python code:
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights
Given the path of the clippings file, all this function does is split the text into entries and loop through them. For each entry, it extracts the book title, the location and the highlighted text.
This final structure (a list of dictionaries) makes it easy to filter by book:
[
    h for h in highlights
    if book_name.lower() in h["book"].lower()
]
Once filtered, we must order the highlights. Since clippings are appended to the TXT file, the order is based on when we highlight, not on the text’s location.
And I personally want my results to appear as they do in the book, so ordering is necessary:
sorted(highlights, key=lambda x: x["location"])
Now, if you check your clippings file, you might find duplicated clippings (or duplicated sub-clippings). This happens because any time you edit a highlight (because you failed to include all the words you wanted, for example), it’s recorded as a new one. So there will be two very similar clippings in the TXT, or even more if you edit it several times.
We need to handle this by applying some deduplication. It’s easier than it sounds:
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean
It’s very simple and could be refined, but it basically checks whether each clipping’s text matches, contains, or is contained in one we’ve already kept, and keeps the longest version.
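One possible refinement: if you re-highlight with slightly different word boundaries, neither equality nor containment catches the duplicate. A fuzzier variant using the stdlib’s difflib handles that case too — the 0.9 threshold is my own guess, so tune it to your data:

```python
from difflib import SequenceMatcher

def deduplicate_fuzzy(highlights, threshold=0.9):
    # Keep the longest version among near-duplicate highlights.
    clean = []
    for h in highlights:
        merged = False
        for c in clean:
            similar = (
                SequenceMatcher(None, h["text"], c["text"]).ratio() >= threshold
                or h["text"] in c["text"]
                or c["text"] in h["text"]
            )
            if similar:
                if len(h["text"]) > len(c["text"]):
                    c["text"] = h["text"]  # keep the longer edit
                merged = True
                break
        if not merged:
            clean.append(h)
    return clean
```

It’s O(n²) like the original, which is fine for a single book’s worth of highlights.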
Right now we have the book highlights properly sorted and deduplicated, and we could stop the preprocessing here. But I can’t do that: I always highlight section titles because, when summarizing, it lets me assign each highlight to its proper section.
Our code can’t yet tell a real highlight from a section title. See below:
def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter-style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2
It can seem pretty arbitrary, and it’s not the best solution to this problem, but it does work quite well. It uses a heuristic based on capitalization, length, stopwords and prefixes.
This function is called inside a loop over all the highlights, as in the previous functions, to check whether each highlight is a title. The result is a list of section dictionaries, each with two keys:
- title: the section title.
- highlights: the section’s highlights.
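That grouping loop is short enough to sketch here — parameterized on the title predicate so it’s easy to test in isolation (the full script below simply calls is_probable_title):

```python
def group_by_sections(highlights, is_title):
    """Split an ordered list of highlights into sections.

    `is_title` is a predicate deciding whether a highlight
    is a section title.
    """
    sections = []
    current = {"title": "Introduction", "highlights": []}
    for h in highlights:
        text = h["text"]
        if is_title(text):
            # close the running section and open a new one
            sections.append(current)
            current = {"title": text, "highlights": []}
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections
```

Anything highlighted before the first detected title lands in a default “Introduction” section.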
Right now, yes, we’re ready to summarize.
2. AI Model and Output
I wanted this to be a free project, so we need an AI model that’s open source.
I thought Ollama [1] was one of the best options for running a project like this locally. Plus, our data always stays ours and the models can run offline.
Once installed, the code was easy. I’m not a prompt engineer so anyone with the know-how would get even better results, but this is what works for me:
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.
Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout
Simple, I know. But it works, partly because the data preprocessing has been thorough, and partly because we’re leveraging models that others have already built.
But what do we do with the summary? I like using Obsidian [2], so exporting a Markdown file makes the most sense. Here it is:
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")
Et voilà.
And this is how I go from highlights to a full Markdown summary (straight to Obsidian if I want to) with less than 300 lines of Python code!
Full Code and Test
Here’s the full code, just in case you want to copy-paste it. It contains what we’ve seen plus some helper functions and argument parsing:
import re
import argparse
from pathlib import Path
import subprocess

# ---------- PARSE CLIPPINGS ----------
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights
# ---------- FILTER BOOK ----------
def filter_book(highlights, book_name):
    return [
        h for h in highlights
        if book_name.lower() in h["book"].lower()
    ]

# ---------- SORT ----------
def sort_by_location(highlights):
    return sorted(highlights, key=lambda x: x["location"])

# ---------- DEDUPLICATE ----------
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean
# ---------- TITLE DETECTION ----------
STOPWORDS = {
    "the", "and", "or", "but", "of", "in", "on", "at", "for", "to",
    "is", "are", "was", "were", "be", "been", "being",
    "that", "this", "with", "as", "by", "from"
}

def has_chapter_prefix(text):
    return bool(
        re.match(
            r"^(chapter|part|section)\s+\d+|^\d+[\.\)]|^[ivxlcdm]+\.",
            text.lower()
        )
    )

def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter-style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2
# ---------- GROUP SECTIONS ----------
def group_by_sections(highlights):
    sections = []
    current = {
        "title": "Introduction",
        "highlights": []
    }
    for h in highlights:
        text = h["text"]
        if is_probable_title(text):
            sections.append(current)
            current = {
                "title": text,
                "highlights": []
            }
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections
# ---------- SUMMARY ----------
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.
Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout
# ---------- EXPORT MARKDOWN ----------
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")
# ---------- MAIN ----------
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--book", required=True)
    parser.add_argument("--output", required=False, default=None)
    parser.add_argument(
        "--clippings",
        default="Data/My Clippings.txt"
    )
    parser.add_argument(
        "--model",
        default="mistral"
    )
    args = parser.parse_args()

    highlights = parse_clippings(args.clippings)
    highlights = filter_book(highlights, args.book)
    highlights = sort_by_location(highlights)
    highlights = deduplicate(highlights)
    sections = group_by_sections(highlights)

    all_text = "\n".join(
        h["text"] for h in highlights
    )
    summary = summarize_with_ollama(all_text, args.model)

    if args.output:
        export_markdown(
            args.book,
            sections,
            summary,
            args.output
        )
    else:
        print("\n---- HIGHLIGHTS ----\n")
        for h in highlights:
            print(f"{h['text']}\n")
        print("\n---- SUMMARY ----\n")
        print(summary)

if __name__ == "__main__":
    main()
But let’s see how it works! The code itself is useful, but I bet you want to see the results. The output is long, so I’ve cut the first part, which just reproduces the highlights.
I randomly chose a book I read about six years ago (2020): Talking to Strangers by Malcolm Gladwell (a bestseller, a pretty enjoyable read). See the model’s printed output (not the Markdown):
$ python3 kindle_summary.py --book "Talking to Strangers"
---- HIGHLIGHTS ----
…
---- SUMMARY ----
Title: Talking to Strangers: What We Should Know About Human Interaction
Main Thesis: The book explores the complexities and paradoxes of human
interaction, particularly in conversations with strangers, and emphasizes
the importance of caution, humility, and understanding the context in
which these interactions occur.
Brief Summary: The author delves into the misconceptions and shortcomings
in our dealings with strangers, focusing on how we often make incorrect
assumptions about others based on limited information or preconceived
notions. The book offers insights into why this happens, its consequences,
and strategies for improving our ability to understand and communicate
effectively with people we don’t know.
Key Ideas:
1. The transparency problem and the default-to-truth problem: People often
assume that others are open books, sharing their true emotions and
intentions, when in reality this is not always the case.
2. Coupling: Behaviors are strongly linked to specific circumstances and
conditions, making it essential to understand the context in which a
stranger operates.
3. Limitations of understanding strangers: There is no perfect mechanism
for peering into the minds of those we do not know, emphasizing the need
for restraint and humility when interacting with strangers.
Important Concepts:
1. Emotional responses falling outside expectations
2. Defaulting to truth
3. Transparency as an illusion
4. Contextual understanding in dealing with strangers
5. The paradox of talking to strangers (need versus terribleness)
6. The phenomenon of coupling and its influence on behavior
7. Blaming the stranger when things go awry
Practical Takeaways:
1. Recognize that people may not always appear as they seem, both
emotionally and behaviorally.
2. Understand the importance of context in interpreting strangers’
behaviors and intentions.
3. Be cautious and humble when interacting with strangers, acknowledging
our limitations in understanding them fully.
4. Avoid jumping to conclusions about strangers based on limited
information or preconceived notions.
5. Accept that there will always be some degree of ambiguity and
complexity in dealing with strangers.
6. Avoid penalizing others for defaulting to truth as a defense mechanism.
7. When interactions with strangers go awry, consider the role one might
have played in contributing to the situation rather than solely blaming
the stranger.
And all this within a few seconds. Quite cool in my opinion.
Conclusion
And that’s basically how I’m now saving a lot of free time (that I can use to write posts like this one) by leveraging my data skills and AI.
I hope you enjoyed the read and feel motivated to give it a try! It won’t be better than the summary you’d write from your own perception of the book… but it won’t be far from it!
Thank you for your attention, feel free to comment if you have any ideas or suggestions!
Resources
[1] Ollama. (n.d.). Ollama. https://ollama.com
[2] Obsidian. (n.d.). Obsidian. https://obsidian.md
