Introduction
This is a written version of a presentation I gave to a working group of researchers studying the impact of institutional investors. The session was organized by Anthony Damiano, PhD and hosted by the Center for Urban and Regional Affairs at the University of Minnesota Twin Cities.
A need for accessible and robust methods to identify institutional investors, especially single-family rental (SFR) investors, motivates this discussion. Considering the lack of rental registries in the United States, researchers must bear significant technical burdens in identifying these investors from messy parcel and sales records. The challenges include, but are not limited to: inconsistent naming, corporations with multiple business addresses, and difficulty in acquiring data, especially historical data.
I refer to algorithms that identify institutional investors as entity resolution (fuzzy matching) algorithms or clustering algorithms, depending on the approach in question. I use these terms to reflect the general goal of grouping same-owners (such as subsidiary corporations) within the data, even if the exact definitions of these algorithms differ. Classification algorithms may also be relevant, but we generally do not have a complete list of investors with which to supervise learning. I discuss machine learning (clustering, classification, and deep learning) algorithms in more detail later.
Tradeoffs
We should first analyze the fundamental challenges of this topic. All methods must make a trade-off between accuracy, runtime, and technical complexity.
Accuracy
For accuracy, we can prioritize the absolute number of matches, the share of reported matches that are correct, the balance of true versus false matches, or some combination of these criteria.
Generally, most situations aim to maximize the number of correct matches without crossing a threshold for percent false matches. Essentially, we want as many matches as possible and most of them need to be correct.
While accuracy is most important, an ideal algorithm should also be precise, that is, invariant to minor changes in data format. Some data may contain, for instance, street postfix abbreviations (e.g., "STREET" vs. "ST") within the owner address, whereas other data will not. An ideal algorithm should retain its accuracy and precision regardless of these differences.
Runtime (Time Complexity)
For execution time, or runtime, needs vary. For practitioners, waiting days for an algorithm to run is not feasible; researchers are generally more tolerant of long-running algorithms. We must also understand the limitations of the hardware each party has access to.
In this context, there are three potential target categories of runtime:
- Good algorithms that can be run quickly and easily on any hardware,
- Better algorithms that experienced professionals can run if necessary,
- Optimal algorithms that may require offsite, vast infrastructure.
It is unlikely that the marginal improvement from the second category to the third is worth the additional effort. Training a deep learning model may be an exception, as it requires extensive infrastructure and expertise only once, during training. We will discuss this further in the following sections.
Technical and Implementation Complexity
Technical complexity is often related to algorithmic efficiency; however, some fast algorithms might be very difficult to implement. For instance, concurrent processing may make an algorithm fast enough for practitioners to use, but they cannot be expected to implement this themselves.
This issue can be resolved with the creation of programmatic tools or applications that facilitate the use of these algorithms.
Theoretical Foundations: Time Complexity
Let us further analyze algorithm efficiency, otherwise known as time and space complexity. This will help us compare potential approaches later.
We are generally more concerned with time complexity. It is relatively cheap to buy RAM but impossible to buy time.
We can categorize algorithms with Big-O notation: we write time complexity as O(f(n)), where f(n) is an upper-bound function for the runtime of the algorithm given an input of size n.
Runtime is not how long an algorithm takes to run on any individual computer, but rather the number of operations in the worst case. We focus on how quickly the runtime of an algorithm grows with respect to its input size.
With this understanding, let us analyze the runtime of algorithms related to entity resolution and clustering.
Time Complexity of String Comparisons
These comparisons are the building blocks of entity resolution algorithms.
Vectorized Comparison (Equals Operator): O(1)
This is an exact comparison between two encoded values (in this case, strings). A computer can take two binary encodings from memory and compare them exactly with a single operation.
Similarity Metrics or Other Iterative Comparisons: O(n)
These are metrics that calculate a similarity score based on aspects of two strings. Many string distance metrics calculate the “distance” between corresponding letters in each string. This requires a comparison for each letter in the shortest string.
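To make the contrast concrete, here is a minimal Python sketch (illustrative only, not a production metric): an exact comparison of encoded values, versus a toy position-wise similarity that must touch each letter of the shorter string.

def exact_match(a: str, b: str) -> bool:
    # Encoded/hashed values can be compared in effectively constant time
    return a == b

def positional_similarity(a: str, b: str) -> float:
    # One comparison per letter of the shorter string: O(n)
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(a[i] == b[i] for i in range(shorter))
    return matches / max(len(a), len(b))

exact_match("AMHERST", "AMHERST")            # True
positional_similarity("AMHERST", "AMHURST")  # ~0.86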
Time Complexity of Entity Resolution Algorithms
Aggregate / GroupBy: O(n)
These functions achieve the same efficiency as vectorized comparison by hashing values and placing them into blocks/buckets.
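As a brief illustration, a pandas groupby hashes each key into a bucket in a single pass over the data. The column names and records below are hypothetical.

import pandas as pd

# Hypothetical parcel records with a pre-cleaned address key
parcels = pd.DataFrame({
    "addr_key": ["5001 PEACHTREE 30303", "5001 PEACHTREE 30303",
                 "PO BOX 1234 30310"],
    "owner_name": ["ABC LLC", "ABC HOLDINGS LLC", "XYZ LP"],
})

# Hash-based grouping: one pass over the data, O(n)
owner_scale = (
    parcels.groupby("addr_key")
    .agg(count_owned=("owner_name", "size"),
         assoc_owner_names=("owner_name", lambda s: sorted(set(s))))
    .reset_index()
)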
Similarity Threshold Grouping: O(n^2)
These functions typically compare each string to every other string using a similarity metric, grouping strings that are above a certain threshold.
Runtime may be improved with blocking or concurrency; OpenRefine’s clustering function appears to use an optimized implementation.
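A naive version of this grouping, sketched below with Python's standard-library SequenceMatcher, makes the pairwise comparisons explicit; the threshold and the greedy grouping rule are illustrative choices, not a recommended implementation.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def threshold_groups(names, threshold=0.85):
    # Each name is compared against existing groups, so the worst
    # case is n*(n-1)/2 comparisons: O(n^2)
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups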
Time Complexity of Machine Learning Algorithms
K-Nearest Neighbors (Clustering, Unsupervised): O(n^2)
This algorithm encodes strings as vectors in a multi-dimensional space. Then, it compares each vector to every other vector using a vector distance metric (such as cosine distance), finding the k closest vectors, where k is user-defined. It is the foundation of clustering algorithms.
Some specialized versions of this algorithm, like metric trees, are more optimized.
Approximate K-Nearest Neighbors (Clustering, Unsupervised): generally at or below O(n)
This is a modified form of K-Nearest Neighbors that trades perfect accuracy for speed. It uses an efficient algorithm to split the vector space into subspaces. Rather than comparing a vector to every other vector, it utilizes the subspaces to quickly identify where the most similar vectors likely are.
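For a rough sense of the workflow, the sketch below encodes names as character n-gram vectors and runs an exact neighbor search with scikit-learn; an approximate (ANN) index from a specialized library would replace the brute-force search. The sample names are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = ["AMHERST RESIDENTIAL LLC", "ARVM 5 LLC",
         "INVITATION HOMES LP", "INVITATION HOMES 2 LP"]

# Encode each name as a character n-gram vector (one way to vectorize)
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(names)

# Exact neighbor search with cosine distance; brute force here is O(n^2).
# Metric trees or ANN indexes reduce this cost substantially.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)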
Deep Learning (Classification, Supervised):
Deep learning uses a series of successive layers to transform data into a final layer (categories). Each layer contains nodes, each applying learned weights and an activation function. The model is trained by feeding it input data and comparing its output to the desired output; the node weights are adjusted with each data point to optimize the result.
Training a deep learning model typically requires dedicated GPUs, but once the model is trained, it can be used much more efficiently.
Runtime complexity depends on the implementation. However, the entire dataset will not be used during training, and training only needs to occur once, not every time the model is used.
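For completeness, here is a minimal PyTorch sketch of such a classifier. The feature dimension, labels, and data are all placeholders; a real labeled dataset would be required.

import torch
from torch import nn

# Hypothetical: 64-dimensional name features, two output categories
model = nn.Sequential(
    nn.Linear(64, 32),   # layer of nodes with learned weights
    nn.ReLU(),           # activation function
    nn.Linear(32, 2),    # final layer: two categories
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(100, 64)        # placeholder input data
labels = torch.randint(0, 2, (100,))   # placeholder labels

for _ in range(10):
    optimizer.zero_grad()
    # Compare the model's output to the desired output
    loss = loss_fn(model(features), labels)
    loss.backward()   # weights are adjusted to optimize the result
    optimizer.step()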
Current and Potential Approaches
Simple Address Key
In my paper with Brian An, Uncovering Neighborhood-level Portfolios of Corporate Single-Family Rental Holdings and Equity Loss, our priority was developing a simple method to accurately quantify ownership scale, rather than identify the portfolios of specific owners (Polimeni and An 2024). However, this is possible with an additional querying method. This section details these methods.
Motivation and Design
We rely on a modified version of the owner address to group the same-owners together. Owner address has a lower potential for false positive matches than owner name while also being easier to standardize and correctly match. Moreover, owner name has greater potential for inconsistency between subsidiary corporations, but many subsidiary or shell SFR corporations use the same address. In this way, we account for many cases of nested ownership structure, where different holding corporations may own properties, but report the same business address for records purposes. We favor this approach as it narrows focus to our research question without opaque or complex methodological underpinnings.
In particular, by using an encodable key for each address, this method takes advantage of vectorized comparison. Therefore, it can use an aggregate or groupby operation for a runtime complexity of O(n).
Method Details
For each parcel, we construct an owner address key by concatenating the address number, address string (street name or PO Box), and zip code. Because the address string excludes postfixes, we can avoid some missed matches due to labeling inconsistencies, such as "STREET" vs. "ST". Before creating the key, we eliminate various other inconsistencies by removing periods, commas, and repeated spaces, and by uppercasing all characters. Finally, to resolve inconsistencies for PO boxes, we take any address string with at least one number, extract the numbers, and append them to the string literal "PO BOX".
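One way to implement this key construction in Python (a sketch of the steps above, not the exact code from the paper; field names are illustrative):

import re

def simple_address_key(addr_number: str, addr_string: str,
                       zipcode: str) -> str:
    # Uppercase, remove periods/commas, collapse repeated spaces
    s = addr_string.upper()
    s = re.sub(r"[.,]", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    # PO Box normalization: if the address string contains any digits,
    # rebuild it as "PO BOX" plus those digits
    if re.search(r"\d", s):
        s = "PO BOX " + "".join(re.findall(r"\d+", s))
    return f"{addr_number} {s} {zipcode}"

simple_address_key("5001", "Peachtree", "30303")  # "5001 PEACHTREE 30303"
simple_address_key("", "P.O. Box 1234", "30310")  # " PO BOX 1234 30310"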
Querying: Motivation
Because the same corporation may use multiple addresses, the address key method is not intended to robustly describe the portfolios of specific corporate owners. However, it is possible to query the resulting data (in this example, we label the table owner_scale) for this information.
Steps
- Produce an owner table by aggregating on address key
- Create a list of substrings associated with the desired owner (for instance, “AMHERST” and “ARVM” for “Amherst”).
- Identify rows where at least one associated owner name contains the substring
- Optional: apply a threshold to avoid small, unrelated owners / individuals
A threshold may help prevent accidental matches. For instance, if a query produces three owners, each with over 50 properties, and another with only one property, it is likely the latter is not associated with the former.
If a complete list of substrings for each owner is not known, it is possible to begin querying with the base name or other known substrings, then add more potential substrings from the results. For instance, query with only "AMHERST" and look through the resulting associated names; this is how we identified "ARVM" as a substring key. In our experience, most substrings or other patterns appear across multiple addresses, so it is possible to quickly identify the most common ones.
Code Example (Python)
import re  # needed for re.search below

# Keywords to query for each owner
# For short strings, like "IH", that might accidentally appear,
# a space is added to reduce this possibility.
owner_keywords = {
    "Amherst": ["AMHERST", "ARVM"],
    "Cerberus": ["CERBERUS", "FKH", "RM1 ", "RMI "],
    "Progress": ["PROGRESS", "FYR"],
    "Invitation": ["INVITATION", "IH "],
    "Colony": ["COLONY", "STARWOOD", "CSH", "CAH "],
    "Sylvan": ["SYLVAN", "RNTR"],
    "Tricon": ["TRICON", "TAH"]
}

# Query for each owner described above
for owner in owner_keywords:
    query_str = "|".join(owner_keywords[owner])

    # Find rows for TAXYR 2020 where at least one associated name
    # contains a keyword matching the owner
    owned_by_given_corp = owner_scale[
        (owner_scale["TAXYR"] == 2020)
        & owner_scale["assoc_owner_names"].apply(
            lambda x: any(re.search(query_str, name) for name in x)
        )
    ]

    # Only retain matched subsidiary owners over a threshold,
    # to prevent false positive matches (for instance,
    # if a single parcel was owned by someone with the last name "SYLVAN")
    owned_by_given_corp = owned_by_given_corp[
        owned_by_given_corp["count_owned_fulton_yr"] > 49
    ]

    # Sum total over all matched subsidiaries
    total_owned = owned_by_given_corp["count_owned_fulton_yr"].sum()
Rather than manually building a set of substrings for each corporation, it may be possible to start with a single substring (say, “Amherst”) and find all associated owner names based on the address key method. Then, using an NLP technique, automatically identify the most unique substrings (“ARVM” should appear as this is an unusual combination of letters) and query for those.
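A simple frequency-based heuristic could stand in for that NLP step: score tokens by how concentrated they are among the matched names relative to the whole corpus. This is a rough sketch; the function and scoring rule are hypothetical.

from collections import Counter

def candidate_substrings(all_names, matched_names, max_candidates=5):
    # Token frequency across the full corpus vs. within matched names
    corpus_counts = Counter(tok for n in all_names for tok in n.split())
    matched_counts = Counter(tok for n in matched_names for tok in n.split())
    # Favor tokens common among matches but rare overall ("ARVM"-like);
    # ubiquitous tokens such as "LLC" score near zero
    scored = {
        tok: count / corpus_counts[tok]
        for tok, count in matched_counts.items()
        if corpus_counts[tok] > 0
    }
    return sorted(scored, key=scored.get, reverse=True)[:max_candidates]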
We utilized the simple address key method to create a straightforward framework quantifying equity loss from communities due to institutional investors (Polimeni and An 2024).
The analysis was conducted on an extensive dataset of parcel and sale records from 2010 to 2022, revealing $1.25B in lost financial equity in Atlanta’s neighborhoods, concentrated in predominantly African American neighborhoods.
Advanced Address Key
The Simple Address Key has difficulty with precision: it may struggle on differently formatted data, for instance when street postfixes such as "ST" or "STREET" are included in the address string. It also does not account for different suite numbers. Additionally, there may be other minor inconsistencies due to unforeseen issues, like spelling or word order, which are impossible to clean.
A more complex address key can be formulated with this in mind, but both the technical complexity and runtime complexity increase.
The simplest form of an Advanced Address Key extracts the address number, the suite number, the zip code, and the last two letters of the longest word in the street address. Such features are almost always invariant to changes in spelling, word order, street postfixes, or other inconsistencies.
Address number and suite number can be identified as the first and last series of numbers, separated by spaces. A REGEX pattern can be used to extract these numbers.
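A sketch of this extraction, with the regex details and example left illustrative:

import re

def advanced_address_key(raw_address: str, zipcode: str) -> str:
    s = re.sub(r"[.,]", " ", raw_address.upper())
    s = re.sub(r"\s+", " ", s).strip()
    # First and last space-separated number series
    numbers = re.findall(r"\b\d+\b", s)
    addr_num = numbers[0] if numbers else ""
    suite_num = numbers[-1] if len(numbers) > 1 else ""
    # Last two letters of the longest word in the street address
    words = [w for w in s.split() if not w.isdigit()]
    longest = max(words, key=len) if words else ""
    return f"{addr_num} {suite_num} {longest[-2:]} {zipcode}"

advanced_address_key("5001 Peachtree Street SUITE 200", "30303")
# "5001 200 EE 30303"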
Availability of usable state business registry data may vary, so I have not included it in the base algorithm. However, if available, it has the potential to aggregate same-owners that use multiple addresses. The exact implementation may vary with different data schemas.
For instance, an owner might record multiple owner addresses for different parcel records but have a single business address registered with the state. This offers the possibility of exactly matching the owner name to business registry data, although this has the same challenges of spelling and naming inconsistencies.
In some cases, business registry data may list owner, or beneficial, corporations and their addresses. These can be used to aggregate subsidiary addresses and names at the state level.
Regardless, considering data inconsistency, matching even clean business registry records to parcel or sales records requires the fuzzy matching algorithms discussed here.
Using an address key approach, we can construct a list of owners associated with a business address. To create a comprehensive list that identifies the portfolios of specific investors, we need to aggregate these lists with querying (see Simple Address Key querying).
If researchers complete this step within their own metro areas, we can combine this data to create a robust, crowdsourced ownership database. This database may consist of two tables, one linking address keys to an owner index, and another linking owner indexes to all associated corporate names. With this method, we can efficiently create a linkage between all of a corporate owner’s addresses and corporate names.
Appendix C of Polimeni and An (2024) contains this data for Fulton County, GA.
String Similarity Metrics
While key methods are ideal for applications that require fast computation or lack computational power, some use of string similarity metrics may be advantageous for those with the capability.
String similarity metrics are most useful for ambiguous cases, or for grouping owners with similar names but different addresses. Efficiency can also be improved with blocking; for instance, using a vectorized aggregation to create buckets of owners that share at least one word substring drastically reduces the number of necessary comparisons.
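As a sketch of blocking (here bucketing on each name's longest token, a simpler variant than matching on all shared substrings), comparisons happen only within buckets:

from collections import defaultdict
from difflib import SequenceMatcher

def blocked_matches(names, threshold=0.85):
    # Block on each name's longest token; only names sharing that
    # token are compared, avoiding most of the n^2 comparisons
    blocks = defaultdict(list)
    for name in names:
        tokens = name.split()
        if tokens:
            blocks[max(tokens, key=len)].append(name)
    matches = []
    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    matches.append((a, b))
    return matches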
However, most string distance metrics are not optimized for address strings or corporate owner names. They are typically best for spelling inconsistencies. Therefore, developing a string distance metric optimized for corporate names has potential.
My experience indicates that such a string distance metric should prioritize token comparison over letter comparison.
For instance, “2018 3 IH BORROWER LP” and “HOME SFR BORROWER IV LLC” have many letters in common, but they are different entities. More weight should be placed on the substrings, like “IH” and “HOME”, since common terms like “SFR” and “BORROWER” appear frequently.
This warrants further investigation as the most common string distance metrics have a high number of false positives in our use case.
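One direction, sketched below, is a token-weighted (IDF-style) similarity: tokens that are rare across the corpus, like "ARVM" or "IH", contribute more to a match than ubiquitous terms like "LLC" or "BORROWER". This is a toy illustration, not a validated metric.

import math
from collections import Counter

def idf_weights(names):
    # Document frequency of each token across all owner names
    df = Counter()
    for name in names:
        df.update(set(name.split()))
    n = len(names)
    return {tok: math.log(n / count) for tok, count in df.items()}

def weighted_token_similarity(a, b, weights):
    # Weighted Jaccard: rare (high-IDF) shared tokens dominate the score
    ta, tb = set(a.split()), set(b.split())
    shared = sum(weights.get(t, 0.0) for t in ta & tb)
    total = sum(weights.get(t, 0.0) for t in ta | tb)
    return shared / total if total else 0.0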
Clustering (KNN or ANN)
Clustering requires that the data can be vectorized. We can encode owner name strings into vectors using a string embedding model. However, these models are generally intended to encode semantic meaning; they are not well suited to identifying strings that merely look similar.
Address keys are also vectorized, but we do not gain anything from using a clustering technique versus a simple aggregation.
Clustering only has potential if we consider other dimensions in the data, such as property characteristics: for instance, the same owner might own many properties in the same geographic area. However, this method may introduce false positives.
Clustering is likely not a solution to our problem.
Deep Learning
Deep learning requires a correct, labeled dataset to train on, which takes additional effort to create. Furthermore, there are many instances of subsidiary corporations with no discernible pattern (see more in the next section).
Deep learning is also likely not a solution.
Is This a Solvable Problem? If Not, What is the Best Solution?
Ultimately, the problem is solvable from a computational perspective (there exists a polynomial time algorithm), but it is not perfectly solvable from a data perspective.
The challenge comes from cases where both the owner names and addresses are different but they represent the same owner. For instance, how is it possible that any algorithm or model can correctly classify “Jeff 1 LLC” as a subsidiary of Amherst Capital if the owner address is different?
Alternatively, how can an algorithm distinguish “HOME BORROWER I LLC” from “HOME SFR BORROWER LLC” if these are owned by different corporations? Or, if they are owned by the same corporation, but use different addresses, we run into the same issue.
If we list out all possible combinations, we see that some are difficult or impossible to capture. There is no string distance metric that will resolve this issue and no discernible pattern for deep learning to learn.
Without rental registries, or some other complete ground truth, I argue that a perfect algorithm is not possible. It may not even be worth the time to aim for this. The best solution is likely an optimized fuzzy address key, coupled with additional data steps to aggregate specific portfolios. This solution finds an ideal balance between accuracy, runtime, and implementation complexity. Additionally, it generally avoids false positives, which is ideal for most purposes.
A Methods Paper to Reduce These Challenges
For each research question that necessitates the classification or identification of property ownership, researchers must spend valuable time wrangling with these challenges. Practitioners can also gain valuable insights from this analysis but lack the resources or knowledge to overcome these barriers.
To reduce the burden on researchers and increase the speed at which the impact of SFR investment can be studied, a methods paper should concretely compare all current methods.
The goal of this speculative project is to produce:
- A sample of labeled data (ground truth) across multiple metro areas to train and/or test models against. This is related to the data crowdsourcing mentioned previously.
- A paper comparing all current (and potential) methods, highlighting the tradeoffs of each and which methods are best under specific circumstances.
- Optimized coding tools or an application to facilitate the best procedures.
Conclusion
Until rental registries are established, these challenges will continue to limit actionable and evidence-based policymaking for the housing market. It is essential that we quickly uncover significant evidence to urge policymakers to act before corporations find new ways to obfuscate ownership.
Q&A from the Live Session
Will be recorded here after the conclusion of the session.